Local Context Normalization: Revisiting Local Normalization

12/12/2019 · by Anthony Ortiz, et al. · Microsoft

Normalization layers have been shown to improve convergence in deep neural networks. In many vision applications the local spatial context of the features is important, but most common normalization schemes, including Group Normalization (GN), Instance Normalization (IN), and Layer Normalization (LN), normalize over the entire spatial dimension of a feature. This can wash out important signals and degrade performance. For example, in applications that use satellite imagery, input images can be arbitrarily large; consequently, it is nonsensical to normalize over the entire area. Positional Normalization (PN), on the other hand, only normalizes over a single spatial position at a time. A natural compromise is to normalize features by local context, while also taking into account group-level information. In this paper, we propose Local Context Normalization (LCN): a normalization layer where every feature is normalized based on a window around it and the filters in its group. We propose an algorithmic solution to make LCN efficient for arbitrary window sizes, even if every point in the image has a unique window. LCN outperforms its Batch Normalization (BN), GN, IN, and LN counterparts for object detection, semantic segmentation, and instance segmentation on several benchmark datasets, while keeping performance independent of the batch size and facilitating transfer learning.


1 Introduction

Figure 1: Proposed Local Context Normalization (LCN) layer. LCN normalizes each value in a channel according to the values in its feature group and spatial neighborhood. The figure shows how our proposed method compares to other normalization layers in terms of which features are used in normalization (shown in blue), where H, W, and C are the height, width, and number of channels in the output volume of a convolutional layer.

It is well-known that input normalization is important for faster convergence in neural networks. Many normalization layers have been proposed in the literature with different advantages and disadvantages. Batch Normalization (bn) is a subtractive and divisive feature normalization scheme widely used in deep learning architectures [12]. Recent research has shown that bn facilitates convergence of very deep learning architectures by smoothing the optimization landscape [31]. bn normalizes the features by the mean and variance computed within a mini-batch. Using the batch dimension while calculating the normalization statistics has two main drawbacks:

  • Small batch sizes affect model performance because the mean and variance estimates are less accurate.

  • Batches might not exist during inference, so the mean and variance are pre-computed from the training set and used during inference. Therefore, changes in the target data distribution lead to issues while performing transfer learning, since the model assumes the statistics of the original training set [26].

To address both of these issues, Group Normalization (gn) was recently proposed by Wu and He [38]. gn divides channels into groups and normalizes the features using the statistics within each group. gn does not exploit the batch dimension, so the computation is independent of the batch size and model performance does not degrade when the batch size is reduced. gn shows competitive performance with respect to bn when the batch size is small; consequently, gn is being quickly adopted for computer vision tasks like segmentation and video classification, since batch sizes are often restricted for those applications. When the batch size is sufficiently large, however, bn still outperforms gn. bn, gn, in, and ln all perform "global" normalization, where spatial information is not exploited and all features are normalized by a common mean and variance. We argue that for the aforementioned applications, local context matters. To incorporate this intuition, we propose Local Context Normalization (lcn): a normalization layer which takes advantage of the context of the data distribution by normalizing each feature based on the statistics of its local neighborhood and corresponding feature group. lcn is in fact inspired by computational neuroscience, specifically the contrast normalization approach leveraged by the human visual system [20]. lcn provides a performance boost over all previously-proposed normalization techniques, independent of the batch size, while keeping the advantages of being computationally agnostic to the batch size and suitable for transfer learning. We empirically demonstrate the performance benefit of lcn for object detection as well as semantic and instance segmentation.

Another issue with gn is that, because it performs normalization over the entire spatial dimension of the features, when it is used for inference in applications where input images need to be processed in patches, shifting the input patch by just a few pixels produces different predictions. This is a common scenario in geospatial analytics and remote sensing applications, where the input tends to cover an immense area [27, 22]. Interactive fine-tuning applications like [28] become infeasible with gn, since a user cannot tell whether changes in the predictions come from fine-tuning or simply from changes in the input statistics. With lcn, predictions depend only on the statistics within the feature neighborhood; inference does not change when the input is shifted.

2 Related Work

Normalization in Neural Networks.

Since the early days of neural networks, it has been understood that input normalization usually improves convergence [16, 17]. LeCun et al. showed that convergence in neural networks is faster if the average of each input variable to any layer is close to zero and their covariances are about the same [17]. Many normalization schemes have been proposed in the literature since then [20, 13, 15, 12, 18, 36, 38]. A Local Contrast Normalization layer was introduced by [13], later referred to as lrn. A modification of this original version of lrn was used by the original AlexNet paper, which won the ImageNet [7] challenge in 2012 [15], as well as by the 2013 winning entry [41]. Most popular deep learning architectures until 2015, including OverFeat and GoogLeNet [33, 35], also used lrn, which normalizes based on the statistics in a very small window around each feature. After Ioffe et al. proposed bn in 2015, the community moved towards global normalization schemes where the statistics are computed along entire spatial dimensions [12]. bn normalizes the feature maps of a given mini-batch along the batch dimension. For convolutional layers, the mean and variance are computed over both the batch and spatial dimensions, meaning that each location in the feature map is normalized in the same way. Mean and variance are pre-computed on the training set and used at inference time, so when presented with any distribution shift in the input data, bn produces inconsistency at the time of transfer or inference [26]. Reducing the batch size also affects bn performance, as the estimated statistics are less accurate. Other normalization methods [36, 38, 18] have been proposed that avoid exploiting the batch dimension. ln [18] performs normalization along the channel dimension, in [36] performs normalization for each sample, and gn uses the mean and variance from the entire spatial dimension and a group of feature channels. See Figure 1 for a visual representation of different normalization schemes. Instead of operating on features, wn normalizes the filter weights [30]. These strategies do not suffer from the issues caused by normalizing along the batch dimension, but they have not been able to approach bn performance in most visual recognition applications. Wu and He recently proposed gn, which is able to match bn performance on some computer vision tasks when the batch size is small [38]. All of these approaches perform global normalization, which might wipe out local context. Our proposed lcn takes advantage of both the local context around the features and the improved convergence of global normalization methods.

Contrast Enhancement.

In general, contrast varies widely across a typical image. Contrast enhancement is used to boost contrast in the regions where it is low or moderate, while leaving it unchanged where it is high. This requires that the contrast enhancement be adapted to the local image content. Contrast normalization is inspired by computational neuroscience models [13, 20] and reflects certain aspects of human visual perception. This inspired early normalization schemes for neural networks, but contrast enhancement has not been incorporated into recent normalization methods. Perin et al. showed evidence for synaptic clustering, where small groups of neurons (a few dozen) form small-world networks without hubs [24]. For example, in each group there is an increased probability of connection to other members of the group, not just to a small number of central neurons, facilitating inhibition or excitation within a whole group. Furthermore, these cell assemblies are interlaced so that together they form overlapping groups. Such groups could in fact implement lcn. Local contrast enhancement has been applied in computer vision to pre-process input images [25, 32], ensuring that contrast is normalized across a very small window. Local contrast normalization was essential for the performance of the popular Histogram of Oriented Gradients (HOG) feature descriptors [6]. In this work, we propose applying a similar normalization not only at the input layer, but in all layers of a neural network, to groups of neurons.

3 Local Context Normalization

3.1 Formulation

Local Normalization

In the lrn scheme proposed by [13], every feature $x_{i,h,w}$ – where $i$ refers to channel $i$ and $h, w$ refer to the spatial position of the feature – is normalized by equation 1, where $w_{p,q}$ is a Gaussian weighting window, $\sum_{p,q} w_{p,q} = 1$, $c$ is set from the statistics of the whole feature map, and $\sigma_{h,w}$ is the weighted standard deviation of all features over a small spatial neighborhood. $h$ and $w$ are spatial coordinates, and $i$ is the feature index.

$$ y_{i,h,w} = \frac{x_{i,h,w}}{\max\!\big(c,\ \sigma_{h,w}\big)}, \qquad \sigma_{h,w} = \Big(\sum_{j}\sum_{p,q} w_{p,q}\, x_{j,\,h+p,\,w+q}^{2}\Big)^{1/2} \qquad (1) $$
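For intuition, the following sketch is our own illustration of this kind of divisive local normalization, not the exact lrn layer of [13]: it divides each value of a single-channel image by the weighted local standard deviation, floored by a constant derived from the whole image. The window size of 9, sigma of 2.0, and the mean-of-sigma floor are assumptions made for the example.

    import torch
    import torch.nn.functional as F

    def gaussian_window(size=9, sigma=2.0):
        # Normalized 2D Gaussian weighting window (sums to 1), shaped for conv2d.
        coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
        g1d = torch.exp(-coords ** 2 / (2 * sigma ** 2))
        g2d = torch.outer(g1d, g1d)
        return (g2d / g2d.sum()).view(1, 1, size, size)

    def lrn_like(x, size=9, sigma=2.0):
        # x: [B, 1, H, W]. Divide each value by the weighted local standard deviation,
        # floored by the mean of that deviation over the image so flat regions are not blown up.
        w = gaussian_window(size, sigma)
        pad = size // 2
        local_mean = F.conv2d(F.pad(x, [pad] * 4, mode='reflect'), w)
        local_sq_mean = F.conv2d(F.pad(x * x, [pad] * 4, mode='reflect'), w)
        sigma_hw = (local_sq_mean - local_mean ** 2).clamp(min=0).sqrt()   # weighted local std
        c = sigma_hw.mean(dim=(2, 3), keepdim=True)                        # floor constant
        return x / torch.maximum(c, sigma_hw)

    x = torch.rand(1, 1, 64, 64)
    print(lrn_like(x).shape)  # torch.Size([1, 1, 64, 64])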

Global Normalization

Most recent normalization techniques, including bn, ln, in, and gn, apply global normalization. In these techniques, features are normalized following equation 2:

$$ \hat{x}_i = \frac{1}{\sigma_i}\,(x_i - \mu_i) \qquad (2) $$

For a 2D image, $i = (i_N, i_C, i_H, i_W)$ is a 4D vector indexing the features in $(N, C, H, W)$ order, where $N$ is the batch axis, $C$ is the channel axis, and $H$ and $W$ are the spatial height and width axes. $\mu_i$ and $\sigma_i$ are computed as:

$$ \mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i} (x_k - \mu_i)^2 + \epsilon} \qquad (3) $$

with $\epsilon$ as a small constant. $S_i$ is the set of pixels over which the mean and standard deviation are computed, and $m$ is the size of this set. As shown by [38], most recent feature normalization methods mainly differ in how the set $S_i$ is defined. Figure 1 graphically shows the corresponding set $S_i$ for different normalization layers. For bn, statistics are computed along $(N, H, W)$; $S_i$ is defined as:

$$ S_i = \{\, k \mid k_C = i_C \,\} \qquad (4) $$

For ln, normalization is performed per-sample, within each layer. $\mu_i$ and $\sigma_i$ are computed along $(C, H, W)$; $S_i$ is defined as:

$$ S_i = \{\, k \mid k_N = i_N \,\} \qquad (5) $$

For in, normalization is performed per-sample, per-channel. $\mu_i$ and $\sigma_i$ are computed along $(H, W)$; $S_i$ is defined as:

$$ S_i = \{\, k \mid k_N = i_N,\ k_C = i_C \,\} \qquad (6) $$

For gn, normalization is performed per-sample, within groups of size $C/G$ along the channel axis, as shown in equation 7:

$$ S_i = \Big\{\, k \ \Big|\ k_N = i_N,\ \Big\lfloor \tfrac{k_C}{C/G} \Big\rfloor = \Big\lfloor \tfrac{i_C}{C/G} \Big\rfloor \,\Big\} \qquad (7) $$

All global normalization schemes (gn, bn, ln, in) learn a per-channel linear transformation to compensate for the change in feature amplitude:

$$ y_i = \gamma\, \hat{x}_i + \beta \qquad (8) $$

where $\gamma$ and $\beta$ are learned during training.
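To make the role of the set $S_i$ concrete, the following sketch (our own illustration, not library code) normalizes a feature tensor by computing the mean and standard deviation over different axes, reproducing the per-element statistics used by bn, ln, in, and gn.

    import torch

    def normalize_over(x, dims, eps=1e-5):
        # Subtract the mean and divide by the standard deviation computed over `dims`.
        mu = x.mean(dim=dims, keepdim=True)
        var = x.var(dim=dims, unbiased=False, keepdim=True)
        return (x - mu) / (var + eps).sqrt()

    B, C, H, W = 2, 8, 4, 4
    x = torch.randn(B, C, H, W)

    bn_like = normalize_over(x, dims=(0, 2, 3))   # BN: statistics over (N, H, W) per channel
    ln_like = normalize_over(x, dims=(1, 2, 3))   # LN: statistics over (C, H, W) per sample
    in_like = normalize_over(x, dims=(2, 3))      # IN: statistics over (H, W) per sample, per channel

    G = 4                                         # GN: statistics over (C/G, H, W) per sample, per group
    gn_like = normalize_over(x.view(B, G, C // G, H, W), dims=(2, 3, 4)).view(B, C, H, W)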

Local Context Normalization

In lcn, the normalization statistics $\mu_i$ and $\sigma_i$ are computed following equation 2, using the set $S_i$ defined by equation 9. We propose performing the normalization per-sample, within a window of size $p \times q$, for groups of filters whose size is predefined by the number of channels per group (c_group) along the channel axis, as shown in equation 9. We consider windows much bigger than the ones used in lrn and can compute $\mu_i$ and $\sigma_i$ in a computationally efficient manner. The sizes $p$ and $q$ should be adjusted according to the input size and resolution and can differ across layers of the network.

$$ S_i = \Big\{\, k \ \Big|\ k_N = i_N,\ \Big\lfloor \tfrac{k_C}{c\_group} \Big\rfloor = \Big\lfloor \tfrac{i_C}{c\_group} \Big\rfloor,\ |k_H - i_H| \le \tfrac{p}{2},\ |k_W - i_W| \le \tfrac{q}{2} \,\Big\} \qquad (9) $$
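A direct (and deliberately slow) reading of this definition is sketched below: every position is normalized by the statistics of the features inside its own spatial window and channel group. This is our own reference illustration, assuming a centered $p \times q$ window clamped at the image border; the efficient implementation follows in Section 3.2.

    import torch

    def lcn_reference(x, c_group, p, q, eps=1e-5):
        # Naive O(H*W*p*q) reference for Local Context Normalization (illustration only).
        B, C, H, W = x.shape
        y = torch.empty_like(x)
        for b in range(B):
            for c in range(C):
                g0 = (c // c_group) * c_group                      # first channel of this group
                for h in range(H):
                    h0, h1 = max(0, h - p // 2), min(H, h + p // 2 + 1)
                    for w in range(W):
                        w0, w1 = max(0, w - q // 2), min(W, w + q // 2 + 1)
                        window = x[b, g0:g0 + c_group, h0:h1, w0:w1]
                        y[b, c, h, w] = (x[b, c, h, w] - window.mean()) / \
                                        (window.var(unbiased=False) + eps).sqrt()
        return y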

Relation to Previous Normalization Schemes

lcn allows an efficient generalization of most previously proposed mini-batch-independent normalization layers. Like gn, we perform per-group normalization. If the chosen $p$ is greater than or equal to $H$ and the chosen $q$ is greater than or equal to $W$, lcn behaves exactly as gn, except that it keeps the number of channels per group fixed throughout the network instead of the number of groups. If, in that scenario, the number of channels per group is chosen as the total number of channels (c_group = C), lcn becomes ln. If the number of channels per group is chosen as 1 (c_group = 1), lcn becomes in.

3.2 Implementation

lcn can be implemented easily in any framework with support for automatic differentiation, such as PyTorch [23] and TensorFlow [1]. For an efficient calculation of the mean and variance, we use the summed-area-table algorithm, also known in computer vision as the integral image trick [37], together with dilated convolutions [40, 2]. Algorithm 1 shows the pseudo-code for the implementation of lcn. We first create two integral images, one from the input features and one from their squares. We then apply a dilated convolution to both integral images with the appropriate dilation (which depends on c_group, p, and q), a fixed kernel of ±1 entries, and a stride of one. This yields the sum and sum-of-squares tensors for each feature within its corresponding window and group, from which we obtain the mean and variance tensors needed to normalize the input features. Note that the running time is constant with respect to the window size, making lcn efficient for arbitrarily large windows. Listing 1 shows a Python implementation of the proposed lcn layer using PyTorch.

Listing 1: Python implementation of LCN using PyTorch

    import math
    import torch
    import torch.nn.functional as F

    def localContextNorm(x, gamma, beta, c_group, window_size, eps=1e-5):
        # x: input features, shape [B, C, H, W]
        # gamma, beta: scale and shift parameters, shape [1, C, 1, 1]
        # c_group: number of channels for each LCN group
        # window_size: spatial window size for LCN, tuple (p, q)
        B, C, H, W = x.size()
        p, q = window_size
        x_sq = x * x
        # Build integral images (summed area tables) of the features and their squares
        integral_img = x.cumsum(dim=1).cumsum(dim=2).cumsum(dim=3)
        integral_img = torch.unsqueeze(integral_img, dim=1)
        integral_img_sq = x_sq.cumsum(dim=1).cumsum(dim=2).cumsum(dim=3)
        integral_img_sq = torch.unsqueeze(integral_img_sq, dim=1)
        # A dilated 3D convolution with a +/-1 kernel turns the integral images
        # into per-position window sums over (c_group, p, q)
        d = (c_group, p, q)
        kernel = torch.tensor([[[[[-1., 1.], [1., -1.]],
                                 [[1., -1.], [-1., 1.]]]]], dtype=x.dtype, device=x.device)
        with torch.no_grad():
            sums = F.conv3d(integral_img, kernel, stride=[1, 1, 1], dilation=d)
            squares = F.conv3d(integral_img_sq, kernel, stride=[1, 1, 1], dilation=d)
        n = p * q * c_group
        means = sums / n
        var = 1.0 / n * (squares - sums * sums / n)
        # Pad the mean and variance tensors back to the input size
        _, _, c, h, w = means.size()
        pad3d = (int(math.floor((W - w) / 2)), int(math.ceil((W - w) / 2)),
                 int(math.floor((H - h) / 2)), int(math.ceil((H - h) / 2)),
                 int(math.floor((C - c) / 2)), int(math.ceil((C - c) / 2)))
        padded_means = F.pad(means, pad3d, 'replicate')
        padded_vars = F.pad(var, pad3d, 'replicate')
        # Normalize and apply the learned affine transform
        x = (x - torch.squeeze(padded_means, dim=1)) / \
            (torch.squeeze(padded_vars, dim=1) + eps).sqrt()
        return x * gamma + beta

Input: X: input features of shape [B, C, H, W];
       c_group: number of channels per group;
       (p, q): spatial window size;
       γ, β: scale and shift parameters to be learned
Output: Y: normalized features of shape [B, C, H, W]
   /* dilation d is (c_group, p, q); kernel k is a fixed tensor of −1 and 1 entries that adds and subtracts the corners of each window */
   I ← integral image of X; I_sq ← integral image of X²
   S ← DilatedConv(I, k, d)             // window sums
   S_sq ← DilatedConv(I_sq, k, d)       // window sums of squares
   μ ← S / (p · q · c_group)            // compute mean
   σ² ← S_sq / (p · q · c_group) − μ²   // compute variance
   X̂ ← (X − μ) / √(σ² + ε)              // normalize activations
   Y ← γ · X̂ + β                        // apply affine transform
Algorithm 1 LCN pseudo-code
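As a quick sanity check of Listing 1, the snippet below (our own usage sketch; the batch shape, group size, and window size are arbitrary choices) runs the layer on a random feature map and confirms that the output shape matches the input.

    import torch

    # Assumes localContextNorm from Listing 1 (and its imports) is in scope.
    B, C, H, W = 2, 8, 32, 32
    x = torch.randn(B, C, H, W)
    gamma = torch.ones(1, C, 1, 1)
    beta = torch.zeros(1, C, 1, 1)

    y = localContextNorm(x, gamma, beta, c_group=2, window_size=(9, 9))
    print(y.shape)  # torch.Size([2, 8, 32, 32])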

4 Experimental Results

In this section we evaluate our proposed normalization layer for the tasks of object detection, semantic segmentation, and instance segmentation on several benchmark datasets, and we compare its performance to the best previously known normalization schemes.

4.1 Semantic Segmentation on Cityscapes

Semantic segmentation consists of assigning a class label to every pixel in an image. Each pixel is typically labeled with the class of an enclosing object or region. We test for semantic segmentation on the Cityscapes dataset [5] which contains 5,000 finely-annotated images. The images are divided into 2,975 training, 500 validation, and 1,525 testing images. There are 30 classes, 19 of which are used for evaluation.

Method Normalization mIoU Class (%) Pixel Acc. (%) Mean Acc. (%)
HRNetV2 W48 BN 76.22 96.39 83.73
HRNetV2 W48 GN 75.08 95.84 82.70
HRNetV2 W48 LCN (ours) 77.49 96.14 84.60
HRNetV2 W18 Small v1 BN 71.27 95.36 79.49
HRNetV2 W18 Small v1 IN 69.74 94.92 77.77
HRNetV2 W18 Small v1 LN 66.81 94.51 75.46
HRNetV2 W18 Small v1 GN 70.31 95.03 78.99
HRNetV2 W18 Small v1 LCN (ours) 71.77 95.26 79.72
Improvement of LCN over GN +1.46 +0.23 +0.73
Table 1: Cityscapes Semantic Segmentation Performance

Implementation Details.

We train the state-of-the-art HRNetV2 [34] and HRNetV2-W18-Small-v1 networks as baselines (we use the official implementation code from https://github.com/leoxiaobin/deep-high-resolution-net.pytorch) and follow the same training protocol as [34]. The data is augmented by random cropping (from 1024×2048 to 512×1024), random scaling in the range [0.5, 2], and random horizontal flipping. We use the Stochastic Gradient Descent (SGD) optimizer with a base learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005. The poly learning rate policy with power 0.9 is used to reduce the learning rate, as done in [34]. All models are trained for 484 epochs. We train HRNetV2 using four GPUs and a batch size of two per GPU. We then substitute the sync-batch normalization layers with bn, gn, or lcn and compare results. We do exhaustive comparisons using HRNetV2-W18-Small-v1, a smaller version of HRNetV2; all training details are kept the same except for the batch size, which is increased to four images per GPU for faster training.
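For reference, the poly learning-rate policy mentioned above can be written as a one-line rule; the sketch below is our own illustration using the base learning rate and power quoted in this section (the iteration counts are placeholders).

    def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
        # "Poly" policy: decay from base_lr towards 0 over max_iter iterations.
        return base_lr * (1 - cur_iter / max_iter) ** power

    # Example with base LR 0.01 and power 0.9.
    for it in [0, 1000, 5000, 10000]:
        print(it, poly_lr(0.01, it, max_iter=10000))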

Quantitative Results.

Table 1 shows the performance of the different normalization layers on the Cityscapes validation set. In addition to the mean of class-wise intersection over union (mIoU), we also report pixel-wise accuracy (Pixel Acc.) and mean of class-wise pixel accuracy (Mean Acc.).

Method Number of Groups mIoU Class (%) Pixel Acc. (%) Mean Acc. (%)
HRNetV2 W18 Small v1 1 (=LN) 66.81 94.51 75.46
HRNetV2 W18 Small v1 2 69.28 94.78 77.39
HRNetV2 W18 Small v1 4 67.00 94.50 76.13
HRNetV2 W18 Small v1 8 67.67 94.76 75.81
HRNetV2 W18 Small v1 16 70.31 95.03 78.99
HRNetV2 W18 Small v1 C (=IN) 69.74 94.92 77.77
Table 2: GN Performance for Different Numbers of Groups

We observe that our proposed normalization layer outperforms all other normalization techniques, including bn. lcn is almost 1.5% better than the best gn configuration in terms of mIoU. For lcn, c_group was chosen as 2, with a window size of 227×227 (p = q = 227) for HRNetV2-W18-Small-v1 and 255×255 for HRNetV2 W48. For gn, we tested different numbers of groups, as shown in Table 2, and report the best configuration (16 groups) for comparison with other approaches in Table 1. Table 2 shows that gn is somewhat sensitive to the number of groups, ranging from 67% to 70.3% mIoU. Table 2 also shows results for in and ln, both of which perform worse than the best gn configuration. These results were obtained using the HRNetV2-W18-Small-v1 architecture. It is important to mention that we used the same learning rate values to train all models, which implies that lcn still benefits from the same fast convergence as other global normalization techniques; this is not true for local normalization schemes such as lrn, which tend to require lower learning rates for convergence.

Figure 2: Qualitative results on Cityscapes. Going from left to right, this figure shows: Input, Ground Truth, Group Norm predictions, and Local Context Norm predictions. The second and fourth rows are obtained by magnifying the orange area from the images above them. We observe how lcn allows the model to detect small objects missed by gn and offers sharper and more accurate predictions.

Sensitivity to Number of Channels per Group.

We tested the sensitivity of lcn to the number of channels per group (c_group) by training models with different values of c_group while keeping the window size fixed at 227×227 (p = q = 227). Table 3 shows the performance of lcn for the different numbers of channels per group; performance is fairly stable across all configurations.

Method Channels per Group mIoU Class (%) Pixel Acc. (%) Mean Acc. (%)
HRNetV2 W18 Small v1 2 71.77 95.26 79.72
HRNetV2 W18 Small v1 4 70.26 95.07 78.49
HRNetV2 W18 Small v1 8 70.14 94.97 78.11
HRNetV2 W18 Small v1 16 70.11 94.78 79.10
Table 3: LCN sensitivity to number of channels per group for a fixed window size (227, 227)

Sensitivity to Window Size.

We also tested how lcn performance varies with the window size while keeping the number of channels per group fixed. The results are shown in Table 4. The bigger the window, the closer lcn gets to gn; when the window size (p, q) equals the entire spatial dimensions, lcn becomes gn. From Table 4 we see that performance decreases as the window size approaches the gn equivalent.

Method Window Size mIoU Class (%) Pixel Acc. (%) Mean Acc. (%)
HRNetV2 Small v1 199 71.55 95.18 79.89
HRNetV2 Small v1 227 71.77 95.26 79.72
HRNetV2 Small v1 255 71.80 95.18 79.26
HRNetV2 Small v1 383 70.09 95.06 77.64
HRNetV2 Small v1 511 70.03 95.09 77.94
HRNetV2 Small v1 all/GN 70.30 95.04 78.97
Table 4: LCN sensitivity to Window Size

Qualitative Results

Figure 2 shows two randomly selected examples of the semantic segmentation results obtained from HRNetV2-W18-Small-v1 using gn and lcn as the normalization layers. The second and fourth rows are obtained by magnifying the orange area from the images above them. Zooming in on the details of the segmentation results, we see that lcn yields sharper and more accurate predictions. Looking carefully at the second row, we can observe that HRNet with gn misses pedestrians that are recognized when using lcn. From the last row, we see that lcn results in sharper and less discontinuous predictions: lcn allows HRNet to distinguish between the bike and the legs of the cyclist, while gn cannot, and lcn also provides more precise boundaries for the cars in the background.

4.2 Object Detection and Instance Segmentation on Microsoft COCO Dataset

Method AP^bbox (%) AP^bbox_50 (%) AP^bbox_75 (%) AP^mask (%) AP^mask_50 (%) AP^mask_75 (%)
R50 BN 37.47 59.15 40.76 34.06 55.74 36.04
R50 GN 37.34 59.65 40.34 34.33 56.53 36.31
R50 LCN (Ours) 37.90 59.82 41.16 34.50 56.81 36.43
Table 5: Detection and Instance Segmentation Performance on the Microsoft COCO Dataset

We evaluate lcn against previously-proposed normalization schemes for object detection and instance segmentation. Object detection involves detecting instances of objects from a particular class in an image; instance segmentation involves detecting and segmenting each object instance. The Microsoft COCO dataset [19] is a high-quality dataset which provides labels appropriate for both detection and instance segmentation and is the standard benchmark for both tasks. The annotations include both pixel-level segmentation masks and bounding boxes for objects belonging to 80 categories. These computer vision tasks generally benefit from higher-resolution input. We experiment with the Mask R-CNN baselines [9], implemented in the publicly available Detectron codebase. We replace bn and/or gn with lcn during fine-tuning, using the model pre-trained on ImageNet with gn. We fine-tune with a batch size of one image per GPU and train the model using four GPUs. The models are trained on the COCO [19] train2017 set and evaluated on the COCO val2017 set (a.k.a. minival). We report the standard COCO metrics of Average Precision (AP), AP_50, and AP_75, for both bounding box detection (AP^bbox) and instance segmentation (AP^mask). Table 5 shows the performance of the different normalization techniques (our results differ slightly from those reported in the original paper, but this should not affect the comparison across normalization schemes). lcn outperforms both gn and bn by a substantial margin in all experiments, even using hyper-parameters tuned for the other schemes.
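The substitution of normalization layers described above can be done generically by walking the module tree. The sketch below is a hedged illustration: the LocalContextNorm module name and constructor are hypothetical stand-ins (not part of Detectron or torchvision), and only the standard torch.nn modules are assumed.

    import torch.nn as nn

    def replace_norm(module, make_norm):
        # Recursively replace BatchNorm2d / GroupNorm layers with a new normalization layer.
        for name, child in module.named_children():
            if isinstance(child, (nn.BatchNorm2d, nn.GroupNorm)):
                channels = child.num_features if isinstance(child, nn.BatchNorm2d) else child.num_channels
                setattr(module, name, make_norm(channels))
            else:
                replace_norm(child, make_norm)

    # Hypothetical usage: LocalContextNorm would be a module wrapping the LCN function of Section 3.2.
    # replace_norm(backbone, lambda c: LocalContextNorm(c, c_group=2, window_size=(227, 227)))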

Network Architecture Normalization Top 1 Err. (%) Top 5 Err. (%)
ResNet-50 BN 23.59 6.82
ResNet-50 GN 24.24 7.35
ResNet-50 LCN 24.23 7.22
Table 6: Image Classification Error on ImageNet

4.3 Image Classification on ImageNet

We also experiment with image classification on the ImageNet dataset [7]. In this experiment, images must be classified into one of 1000 classes. We train on all training images and evaluate on the 50,000 validation images, using the ResNet models [11].

Implementation Details.

As in most reported results, we use eight GPUs to train all models, and the batch mean and variance of bn are computed within each GPU. We use He initialization [10] to initialize convolution weights. We train all models for 100 epochs and decrease the learning rate at 30, 60, and 90 epochs. During training, we adopt the data augmentation of Szegedy et al. [35] as used in [38]. We evaluate the top-1 classification error on center crops of the validation images. To reduce random variation, we report the median error rate of the final five epochs [8]. As in [38], our baseline is the ResNet trained with bn [11]. To compare with gn and lcn, we replace bn with the specific variant. We use the same hyper-parameters for all models. We set the number of channels per group for lcn to 32 and use a fixed spatial window size. Table 6 shows that lcn offers performance similar to gn, but we do not see the same boost observed for object detection and segmentation. We hypothesize that this happens because image classification is a global task which might not benefit from local context.
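A milestone schedule of this kind is commonly expressed with a standard scheduler; the snippet below is a minimal sketch in which the model, base learning rate, and the decay factor of 10 are placeholder assumptions rather than values stated in this section.

    import torch

    model = torch.nn.Linear(10, 10)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

    for epoch in range(100):
        # ... one training epoch ...
        scheduler.step()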

4.4 Panoptic Segmentation on Cityscapes

Norm. Method PQ SQ RQ mIoU AP_50 (%)
BN 58.9 79.6 72.7 74.87 61.60
GN 55.2 78.7 68.5 72.18 61.0
LCN (ours)
Table 7: Panoptic Segmentation on Cityscapes

Panoptic Segmentation.

Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance) [14]. We replace bn with lcn and gn in the UPSNet network proposed by [39], a popular architecture for panoptic segmentation, following the same implementation details as the UPSNet authors (we use the official implementation code from https://github.com/uber-research/UPSNet). We evaluate the performance of the different normalization layers on the panoptic segmentation track of the Cityscapes dataset [5] and show the results in Table 7, where we observe that lcn outperforms the other normalization layers by a substantial margin.

Method Bellingham (IoU / Acc.) Bloomington (IoU / Acc.) Innsbruck (IoU / Acc.) San Francisco (IoU / Acc.) East Tyrol (IoU / Acc.) Overall (IoU / Acc.)
U-Net + BN 65.37 96.53 55.07 95.83 67.62 96.08 72.80 91.00 67.00 96.91 67.98 95.27
U-Net + GN 55.48 93.38 55.47 94.41 58.93 93.77 72.12 89.56 62.27 95.73 63.71 93.45
U-Net + LCN 63.61 96.26 60.47 96.22 68.99 96.28 75.01 91.46 68.90 97.19 69.90 95.48
Table 8: Performance in INRIA Aerial Image Labeling Dataset. lcn outperforms all the other normalization layers overall.

4.5 Systematic Generalization on INRIA Aerial Imagery Dataset

The INRIA Aerial Image Labeling Dataset was introduced to test the generalization of remote-sensing segmentation models [21]. It includes imagery from 10 dissimilar urban areas in North America and Europe. Instead of splitting adjacent portions of the same images into training and test sets, the split was done city-wise: all tiles of five cities are included in the training set, and those of the remaining five cities form the test set. The imagery is orthorectified [21] and has a spatial resolution of 0.3 m per pixel. The dataset covers 810 km² (405 km² for training and 405 km² for the test set). Images are labeled for the semantic classes of building and non-building.

Implementation Details.

We trained different versions of U-Net [29] in which only the normalization layer was changed. All models in this set of experiments were trained on patches randomly sampled from all training image tiles. We used the Adam optimizer with a batch size of 12. All networks were trained from scratch for 100 epochs with a starting learning rate of 0.001, kept constant for the first 60 epochs and decayed to 0.0001 over the next 40 epochs; in every epoch 8,000 patches are seen. Binary cross-entropy was used as the loss function. Table 8 summarizes the performance of the different normalization layers on the INRIA aerial image labeling dataset. Our proposed lcn outperforms all the other normalization layers, with an overall IoU almost 2% higher than the next-best normalization scheme and more than 6% higher than gn. lcn provides much better performance than the other methods in almost every test city. lcn was trained using four channels per group and a fixed window size.

4.6 Land Cover Mapping

Method mIoU (%) Pixel Acc. (%)
UNet + BN 76.69 87.15
UNet + GN 74.15 85.18
UNet + LCN 76.51 86.96
Table 9: Landcover Mapping Tested on Maryland 2013 Test Tiles

Finally, we evaluate lcn on a land cover mapping task previously studied in [27, 4]. Land cover mapping is a semantic segmentation task in which each pixel in an aerial or satellite image must be classified as belonging to one of a variety of land cover classes. This process of turning raw remotely sensed imagery into a summarized data product is an important first step in many downstream sustainability-related applications; for example, the Chesapeake Bay Conservancy uses land cover data in a variety of settings, including determining where to target the planting of riparian forest buffers [3]. The dataset can be found at [4] and contains 4-channel (red, green, blue, and near-infrared), 1 m resolution imagery from the National Agricultural Imagery Program (NAIP) and dense pixel labels from the Chesapeake Conservancy's land cover mapping program, covering over 100,000 square miles intersecting six states in the northeastern US. We use the Maryland 2013 subset, training on the 50,000 multi-spectral image patches from the train split and testing on the 20 test tiles. Each pixel must be classified as water, tree canopy / forest, low vegetation / field, or impervious surface.

Implementation Details

We trained different versions of the U-Net architecture used in [27] with different normalization layers, without any data augmentation, and compared results. We used the Adam optimizer with a batch size of 96. All networks were trained from scratch for 100 epochs with a starting learning rate of 0.001, decayed to 0.0001 after 60 epochs. The multi-class cross-entropy loss was used as the criterion. The best gn results are obtained using 8 groups; lcn results are obtained using 4 channels per group and a fixed window size. Table 9 shows the mean IoU and pixel accuracy of the different normalization layers for land cover mapping. lcn outperforms gn for this task, with performance slightly below bn. We notice that lcn benefits from larger input images; when input images are small, as in this setting, the performance boost from lcn becomes smaller.

5 Discussion and Conclusion

We proposed Local Context Normalization (lcn), a normalization layer in which every feature is normalized based on a window around it and the filters in its group. We empirically showed that lcn outperforms previously-proposed normalization layers for object detection, semantic segmentation, and instance segmentation across a variety of datasets. The performance of lcn is invariant to batch size, and it is well-suited for transfer learning and interactive systems. We note that we used hyper-parameters which were already highly optimized for bn and/or gn without further tuning, so it is likely that better results could be obtained with lcn simply by searching for better hyper-parameters. In our experiments we also did not vary the window size across layers of the network, but this is a direction worth exploring: adjusting the window size during training via gradient descent may further improve performance.

Acknowledgement

This work was partially supported by the Army Research Office under award W911NF-17-1-0370. We gratefully acknowledge the support of NVIDIA Corporation with the donation of two of the Titan Xp GPUs used for this research. The authors thank Lucas Joppa and the Microsoft AI for Earth initiative for their support.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  • [2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062.
  • [3] Chesapeake Conservancy (2016) Land cover data project 2013/2014. https://chesapeakeconservancy.org/conservation-innovation-center/high-resolution-data/land-cover-data-project/
  • [4] Chesapeake land cover (2019). Maryland split.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
  • [6] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In 2005 IEEE Conference on Computer Vision and Pattern Recognition.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR 2009.
  • [8] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [12] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML).
  • [13] K. Jarrett, K. Kavukcuoglu, Y. LeCun, et al. (2009) What is the best multi-stage architecture for object recognition?. In 2009 IEEE 12th International Conference on Computer Vision, pp. 2146–2153.
  • [14] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár (2019) Panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9404–9413.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [16] Y. A. LeCun, L. Bottou, G. B. Orr, and K. Müller (1998) Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–48.
  • [17] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • [18] J. Lei Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
  • [19] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • [20] S. Lyu and E. P. Simoncelli (2008) Nonlinear image representation using divisive normalization. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
  • [21] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez (2017) Can semantic labeling methods generalize to any city? The INRIA aerial image labeling benchmark. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS).
  • [22] A. Ortiz, A. Granados, O. Fuentes, C. Kiekintveld, D. Rosario, and Z. Bell (2018) Integrated learning and feature selection for deep neural networks in multispectral images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1196–1205.
  • [23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS-W.
  • [24] R. Perin, T. K. Berger, and H. Markram (2011) A synaptic organizing principle for cortical neuronal groups. Proceedings of the National Academy of Sciences 108 (13), pp. 5419–5424.
  • [25] N. Pinto, D. D. Cox, and J. J. DiCarlo (2008) Why is real-world visual object recognition hard?. PLoS Computational Biology 4 (1), pp. e27.
  • [26] S. Rebuffi, H. Bilen, and A. Vedaldi (2017) Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pp. 506–516.
  • [27] C. Robinson, L. Hou, K. Malkin, R. Soobitsky, J. Czawlytko, B. Dilkina, and N. Jojic (2019) Large scale high-resolution land cover mapping with multi-resolution data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [28] C. Robinson, A. Ortiz, K. Malkin, B. Elias, A. Peng, D. Morris, B. Dilkina, and N. Jojic (2020) Human-machine collaboration for fast land cover mapping. In AAAI Conference on Artificial Intelligence (AAAI 2020).
  • [29] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
  • [30] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909.
  • [31] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization?. In Advances in Neural Information Processing Systems, pp. 2483–2493.
  • [32] P. Sermanet, S. Chintala, and Y. LeCun (2012) Convolutional neural networks applied to house numbers digit classification. In 2012 21st International Conference on Pattern Recognition (ICPR 2012), pp. 3288–3291.
  • [33] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun (2013) OverFeat: integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
  • [34] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
  • [36] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
  • [37] P. Viola, M. Jones, et al. (2001) Rapid object detection using a boosted cascade of simple features. In 2001 IEEE Conference on Computer Vision and Pattern Recognition.
  • [38] Y. Wu and K. He (2018) Group normalization. In European Conference on Computer Vision, pp. 3–19.
  • [39] Y. Xiong (2019) UPSNet: a unified panoptic segmentation network. In CVPR.
  • [40] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
  • [41] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.