Learning Inward Scaled Hypersphere Embedding: Exploring Projections in Higher Dimensions

10/16/2018
by Muhammad Kamran Janjua, et al.

Majority of the current dimensionality reduction or retrieval techniques rely on embedding the learned feature representations onto a computable metric space. Once the learned features are mapped, a distance metric aids the bridging of gaps between similar instances. Since the scaled projection is not exploited in these methods, discriminative embedding onto a hyperspace becomes a challenge. In this paper, we propose to inwardly scale feature representations in proportional to projecting them onto a hypersphere manifold for discriminative analysis. We further propose a novel, yet simpler, convolutional neural network based architecture and extensively evaluate the proposed methodology in the context of classification and retrieval tasks obtaining results comparable to state-of-the-art techniques.


1 Introduction

In the last few years, mainly due to advances in convolutional neural networks, performance on tasks such as image classification (Szegedy et al., 2015), cross-modal and uni-modal retrieval (Wang et al., 2016; Park & Im, 2016), and face recognition and verification (Calefati et al., 2018; Wen et al., 2016; Deng et al., 2017) has increased drastically. It has been observed that deeper architectures tend to provide better capabilities in terms of approximating any learnable function. A common observation is that deeper architectures (with a large number of parameters) can "learn" features at various levels of abstraction. However, it is a well explored problem that deeper architectures are more prone to overfitting than their shallower counterparts, thus hampering their generalization ability; furthermore, they are computationally expensive. The majority of convolutional neural network (CNN) based pipelines follow the same structure, i.e., alternating convolution and max-pool layers, fully connected layers along with activation functions, and dropout for regularization (Jarrett et al., 2009; Szegedy et al., 2017; Simonyan & Zisserman, 2014). Recently, the work in (Springenberg et al., 2015) proposed an all-convolutional neural network, an architecture based on convolutional layers only.

Another major reason for this drastic growth is discriminative learning techniques (Sun et al., 2014; Schroff et al., 2015; Wen et al., 2016) that aim at embedding the learned feature representations onto a hyperspace, linear or quadratic in most cases. Studies in the literature (Aggarwal et al., 2001; Beyer et al., 1999) argue that in higher dimensions, when the data is projected onto an input space, there is little divergence in the ratio of the distances from a given target to its nearest and farthest neighbors; this ratio tends to 1 as the dimensionality grows. As a consequence, the relative contrast of distances to an input point cannot be discriminated effectively. It is important to note that since retrieval and search tasks tend to operate in higher dimensions, this phenomenon applies to these problems as well. The works by (Nawaz et al., 2018; Park & Im, 2016) on cross-modal retrieval assert that Recall@K (a metric depending on the Euclidean distance for similarity computation between feature representations) is not a competitive metric to evaluate retrieval systems. The Euclidean distance can be formulated as $d(x, y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2}$, where $x$ and $y$ are two points in the input space. Surprisingly, (Aggarwal et al., 2001) argues that for $L_k$-norms, meaningfulness in high dimensionality is not independent of the value of $k$, with lower-valued norms performing better than their higher-valued counterparts, i.e., $L_1$ is preferable to $L_2$. The general $L_k$-norm can be set up as $\|x\|_k = \left(\sum_{i=1}^{d} |x_i|^k\right)^{1/k}$ for $k \geq 1$; the same relation with $0 < k < 1$ gives the so-called fractional norms. Although fractional norms do not necessarily satisfy the triangle inequality on the input space $X$, they tend to provide better contrast than their integral counterparts in terms of relative distances between query points and targets.
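The loss of contrast described above is straightforward to reproduce. The following sketch (an illustrative experiment, not taken from the paper; the uniform data and point counts are arbitrary choices) samples random points in increasing dimensions and reports the ratio of the farthest to the nearest $L_k$ distance from a query; the ratio shrinking toward 1 for larger dimensions and larger $k$ mirrors the observations of (Beyer et al., 1999; Aggarwal et al., 2001).

```python
import numpy as np

def relative_contrast(dim, n_points=1000, k=2.0, seed=0):
    """Ratio of farthest to nearest L_k distance from a random query point."""
    rng = np.random.default_rng(seed)
    data = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.sum(np.abs(data - query) ** k, axis=1) ** (1.0 / k)
    return dists.max() / dists.min()

for dim in (2, 10, 100, 1000):
    ratios = {k: relative_contrast(dim, k=k) for k in (0.5, 1.0, 2.0)}
    print(dim, {k: round(v, 2) for k, v in ratios.items()})
```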

In this paper, we explore projections of feature representations onto different hyperspaces and propose that the hypersphere projection has superior performance to a linear hyperspace, where discriminative analysis and disintegration of multiple classes becomes challenging for networks (Figure 1). We propose that inward scaling applied to the hypersphere projection enhances network performance in terms of classification and retrieval. Furthermore, we introduce a simpler CNN-based architecture for classification and retrieval tasks and show that non-linear activations (ReLU) and techniques like dropout are not necessary for CNN-based networks to generalize. We evaluate the proposed network along with the inward scaling layer on a number of benchmark datasets for classification and retrieval. We employ MNIST, FashionMNIST (Xiao et al., 2017), CIFAR10 (Krizhevsky, 2009), URDU-Characters (S. Nawaz, 2018) and SVHN (Netzer et al., 2011) for classification, while we employ FashionMNIST for retrieval. Note that the inward scaling layer is not dependent on the proposed network: it can be applied to different architectures, e.g., VGG, Inception-ResNet-v1, GoogLeNet (Szegedy et al., 2015), and can be trained end-to-end with the underlying network. The main contributions of this work are listed as follows.

  • We propose the inward scaling layer which can be applied along with the projection layer to ensure maximum separability between divergent classes. We show that the layer enhances the network performance on multiple datasets.

  • We propose a simpler architecture without dropout (Srivastava et al., 2014) and batch normalization (Ioffe & Szegedy, 2015) layers and experimentally validate that the network achieves results comparable to deeper and wider networks.

  • We explore the effect of the inward scaling layer with different loss functions such as center loss, contrastive loss and softmax.

The rest of the paper is structured as follows: we explore related literature in Section 2, followed by the inward scale layer and the architecture in Section 3. We review the datasets employed and the experimental results in Section 4. We finish with the conclusion and future work in Section 5.

Figure 1: A toy figure representing how the projection takes place. The manifold is transformed into a hypersphere during training. The small black and white dots represent different classes. The figure is drawn with the MNIST dataset in mind. The perfect alignment of all classes on the circumference of the hypersphere is an idealized condition, assuming that there exists no intra-class variation. (best viewed in color)

2 Related Work

2.1 Metric Learning

Metric learning aims at learning a similarity function, i.e., a distance metric. Traditionally, metric learning approaches (Weinberger & Saul, 2009; Ying & Li, 2012; Koestinger et al., 2012) focused on learning a similarity matrix $M$, which is used to measure the similarity between two vectors. Consider feature vectors $x_i, x_j \in \mathbb{R}^d$, where each vector corresponds to the relevant features. Then the distance under the learned matrix can be computed as $d_M(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} M (x_i - x_j)}$, where $x_i$ and $x_j$ are the given features. However, in recent metric learning methodologies (Hu et al., 2014; Oh Song et al., 2016; Lu et al., 2015; Hadsell et al., 2006; S. Nawaz, 2018), neural networks are employed to learn discriminative features, followed by a distance metric $d$, e.g., the Euclidean or Manhattan distance. Contrastive loss (Chopra et al., 2005; Hadsell et al., 2006) and triplet loss (Hoffer & Ailon, 2015; Wang et al., 2014; Schroff et al., 2015) are commonly used metric learning techniques. The contrastive loss is a pairwise loss that reduces the distance $d(q, t)$ between a matching query $q$ and target $t$ while pushing non-matching pairs at least a margin apart. The triplet loss instead leverages triplets $(a, p, n)$, which must be carefully selected to benefit from the function $\max(0,\, d(a, p) - d(a, n) + m)$, where $d(a, p)$ and $d(a, n)$ are the distances between the query (anchor) and the positive and negative samples respectively, and $m$ is a margin. Note that pair and triplet selection is an expensive process, and the space of candidate pairs and triplets grows combinatorially with the dataset size.
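As a concrete reference for the two loss functions above, a minimal sketch is given below (illustrative only; the margin values and the 128-dimensional features are assumptions, not settings from this work).

```python
import numpy as np

def contrastive_loss(x1, x2, same, margin=1.0):
    """Pairwise loss: pull matching pairs together, push others beyond a margin."""
    d = np.linalg.norm(x1 - x2)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Require the anchor-positive distance to beat anchor-negative by a margin."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(3, 128))
print(contrastive_loss(a, p, same=True), triplet_loss(a, p, n))
```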

2.2 Normalization Techniques

To accelerate the training of neural networks, normalization was introduced and is still a common operation in modern neural network models. Batch normalization (Ioffe & Szegedy, 2015) was proposed to speed up training by reducing the internal covariate shift of intermediate features. Scaling and shifting the normalized values becomes necessary to avoid limiting the representation. The normalization of a layer along the $k$-th dimension can be defined as $\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$, where $x$ represents the input, $\mathrm{E}[x^{(k)}]$ the mean of the activations, and $\mathrm{Var}[x^{(k)}]$ their variance. The work by (LeCun et al., 2012) shows that such normalization aids the convergence of the network. Recently, the weight normalization technique (Salimans & Kingma, 2016) was introduced to normalize the weights of convolution layers and speed up the convergence rate.
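A minimal sketch of the normalize-then-scale-and-shift operation described above is given below (illustrative; the values of $\gamma$, $\beta$ and $\epsilon$ are the usual defaults rather than settings from this work).

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch per feature, then restore capacity with gamma and beta."""
    mu = x.mean(axis=0)        # per-dimension mean over the batch
    var = x.var(axis=0)        # per-dimension variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(64, 16))
out = batch_norm(batch)
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # roughly 0 and 1
```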

2.3 Hypersphere Embedding Techniques

Different works in the literature have explored different hyperspaces for projecting the learned features, in order to find a manifold with maximum separability between the deep features. Hypersphere embedding is one such technique, where the learned features are projected onto a hypersphere with an $L_2$-normalize layer, i.e., $\tilde{x} = x / \|x\|_2$. Works in the literature have employed hypersphere embedding for different face recognition and verification tasks (Ranjan et al., 2017; Wang et al., 2017; Liu et al., 2017). These techniques function by imposing discriminative constraints on a hypersphere manifold. As (Ioffe & Szegedy, 2015) explains, scale and shift are necessary to avoid limiting the representation and are introduced as $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$, where $\gamma$ and $\beta$ are learnable parameters. Inspired by this, techniques such as (Ranjan et al., 2017) explore an $L_2$-normalize layer followed by a scaling layer that scales the projected features by a factor $\alpha$, i.e., $y = \alpha\, x / \|x\|_2$, where $\alpha$ is the radius of the hypersphere and can be either learnable or predefined; larger values of $\alpha$ yield improved results. However, in (Ranjan et al., 2017) the scale $\alpha$ is restricted to the radius of the hypersphere and only the features are normalized. Furthermore, (Liu et al., 2017) normalizes the weights of the last inner-product layer only and does not explore the scaling factor. The work presented in Wang et al. (2017) optimizes both weights and features, and defines the normalization layer as $\tilde{x} = x / \|x\|_2$ without exploring the scaling factor.
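The normalize-and-scale scheme of (Ranjan et al., 2017) described above can be sketched as follows (illustrative only; the value of $\alpha$ is an arbitrary placeholder, and this is the prior-work product formulation, not the proposed inward scale layer).

```python
import torch

def l2_normalize_and_scale(x, alpha=16.0, eps=1e-12):
    """Project features onto a hypersphere of radius alpha (prior-work style)."""
    norm = x.norm(p=2, dim=1, keepdim=True)   # per-sample L2 norm
    return alpha * x / (norm + eps)

feats = torch.randn(8, 128)
proj = l2_normalize_and_scale(feats)
print(proj.norm(dim=1))  # all close to alpha
```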

2.4 Revisiting Softmax-based Techniques

A generic pipeline for classification tasks consists of a CNN learning features of the input, coupled with softmax as a supervision signal. We revisit the softmax function by looking at its definition, $p_i = \frac{\exp(W_i^{\top} x + b_i)}{\sum_j \exp(W_j^{\top} x + b_j)}$, where $x$ is the learned feature, $W_i$ denotes the weights in the last fully connected layer, and $b_i$ is the bias term corresponding to class $i$. By examination, it is clear that the inner product $W_i^{\top} x + b_i$ is responsible for the class decision, which builds the intuition for why a fully connected layer after normalization is necessary. (Liu et al., 2017) reformulates softmax by introducing an angular margin $m$ and modifies the decision boundary of softmax to $\|x\|(\cos(m\theta_1) - \cos(\theta_2)) = 0$ for class 1 and $\|x\|(\cos(\theta_1) - \cos(m\theta_2)) = 0$ for class 2. This differs from standard softmax in the sense that (Liu et al., 2017) requires $\cos(m\theta_1) > \cos(\theta_2)$ for the learned feature $x$ to be correctly classified as class 1. This reformulation results in a hypersphere embedding due to the subtended angle. Similarly, (Ranjan et al., 2017) constrains the softmax by adding a normalization layer.

Figure 2: Plots on the test set of MNIST during different epochs (panels (a)-(c) show successively later epochs). Early in training, the projection of data points onto the hypersphere embedding space is in its initial stages with little to no inward scaling; at the intermediate epoch the effects of inward scaling become visible, and at the final epoch the projection is maximally scaled. (best viewed in color)

3 Proposed Method

In this section, we explore the intuition behind the inward scale layer and explain why normalization along with a fully connected layer is necessary before the softmax. We term a normalization layer coupled with the inward scaling factor the inward scale layer. The reason behind this terminology is that normalization without the inward scaling acts as a constraint imposer on the feature space and hampers the discriminative ability of the network. Furthermore, the network struggles to converge if any of these components is removed, i.e., the normalization, the inward scale factor or the fully connected layer. We set some terminology before proceeding with the explanation.

Terminology Explanation
$\mathcal{M}$ Input manifold
$\mathcal{S}$ Projected hypersphere manifold
$x_i$ Learned features of class $i$
$W_i$ Weight of class $i$
$b_i$ Bias of class $i$
$s$ Inward scale factor
$IS(x, s)$ Inward scale layer with feature $x$ and scale factor $s$
$FC(W, x, b)$ Fully connected layer with weight $W$, feature $x$ and bias $b$
Table 1: Some important terminology used throughout this manuscript.
(a) Plot of the MNIST test set reduced to 2-dimensional features with softmax as the supervision signal, without the IS layer.
(b) Plot of the MNIST test set reduced to 2-dimensional features with softmax as the supervision signal, with the IS layer.
Figure 3: Comparison of employing softmax without (a) and with (b) the inward scale layer. Softmax alone tends to produce a radial distribution, whereas with the IS layer the distribution changes to a hypersphere. Note that plot (b) still shows some radial variation between the features; this is due to the tendency of softmax. Figure (b) is also slightly off from the ideal hypersphere embedding because the features are extracted from a half-trained network in order to establish the analogy with softmax. (best viewed in color)

The work in (Wang et al., 2017) establishes that the softmax function always encourages well-separated features to have larger magnitudes, resulting in a radial distribution, Figure 3(a). However, the effect is minimized in Figure 3(b) because of the IS layer.

3.1 Inward Scale Layer

In this paper, we define the inward scale layer as a normalization layer coupled with the inward scale factor $s$. The normalization layer can be defined as in Equation 1:

$\tilde{x} = \dfrac{x}{\|x\|_2 + \epsilon}$   (1)

where $\epsilon$ is a small factor to avoid division by zero. Note that it is unlikely that the norm $\|x\|_2 = 0$, but to avoid the risk we introduce the $\epsilon$ factor. Inspired by the works in the literature (Ranjan et al., 2017; Salimans & Kingma, 2016), we further introduce a scale factor $s$. Unlike employing it in a product fashion as in (Ranjan et al., 2017), we couple $s$ with the norm in an inverse fashion to ensure the scaling of the features as they are projected onto the manifold $\mathcal{S}$. In other words, we couple the factor $s$ with $\|x\|_2$ to enhance the norm of the features instead of bounding the entire layer. Equation 1 is thus modified by replacing $\|x\|_2$ with $\|x\|_2 / s$, where the $L_2$-norm can be written as $\|x\|_2 = \sqrt{\sum_i x_i^2}$. Thus, $IS(x, s)$ can be formulated as follows:

$IS(x, s) = \dfrac{x}{\|x\|_2 / s + \epsilon}$   (2)

where $x$ is the feature from the previous layer. Note that the factor $s$ is not trainable. We experiment with different values of $s$ and find the value that yields maximum separability; see Appendix A for experiments with different values of $s$.
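A PyTorch sketch of the layer, following the reconstruction in Equations 1-2, is given below (this is an interpretation rather than the reference implementation; the default value of $s$ is a placeholder, and the placement of $\epsilon$ follows Equation 2).

```python
import torch
import torch.nn as nn

class InwardScale(nn.Module):
    """Inward scale layer as sketched from Eqs. (1)-(2): divide by ||x||_2 / s."""

    def __init__(self, s=10.0, eps=1e-8):  # s is a fixed, non-trainable factor
        super().__init__()
        self.s = s
        self.eps = eps

    def forward(self, x):
        norm = x.norm(p=2, dim=1, keepdim=True)   # per-sample L2 norm
        return x / (norm / self.s + self.eps)     # features land near radius s

layer = InwardScale(s=10.0)
feats = torch.randn(4, 64)
print(layer(feats).norm(dim=1))  # norms approach s
```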

The CNN layers are responsible for providing a meaningful feature space; without the FC layer, learning non-linear combinations of these features would not be possible. Simply put, the features are classified into different classes by the FC layer followed by a softmax layer. Figure 3(b) in (Ranjan et al., 2017) visually illustrates the effect of the $L_2$-constrained softmax; comparing it with our Figure 2(c), we can visually see the effects of the inward scale layer. It is important to note that we do not modify the softmax and employ it as it is together with the IS layer, which in turn benefits the network with faster convergence, and the learned features are discriminative enough for efficient classification and retrieval without the need for any metric learning. As the module is fully differentiable and employed in an end-to-end fashion, the gradient of the loss with respect to the input feature $x$ can be obtained with the chain rule; see Appendix B for the proof and Appendix C for learning curves of the IS layer.

3.2 SimpleNet

Here we explain the proposed network, referred to as SimpleNet. Figure 4 represents the architecture visually. Due to the inclusion of the IS layer, normalizing features or weights during training becomes redundant and adds no performance benefit to the pipeline, so we do not use any batch or weight normalization layer. Furthermore, it is proposed by (Liu; Liu et al., 2017) to remove the ReLU nonlinearity from the networks. We reinforce the idea that the ReLU nonlinearity restricts the feature space to the non-negative range, i.e., $[0, +\infty)$. To avoid this restriction of the feature space, we do not employ the ReLU nonlinearity between the CNN and MaxPool blocks in the network. However, a PReLU layer is added before the last FC layer, which helps in approximation; it is interesting to note that this does not restrict the feature space to $[0, +\infty)$.
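A sketch of such an architecture in PyTorch is given below (an interpretation of Figure 4 rather than the reference implementation; the channel widths, kernel sizes and input resolution are assumptions, while the block structure, the absence of ReLU inside the blocks, the IS layer, and the single PReLU before the last FC follow the description above).

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """2xCNN block from Figure 4: two conv layers, no ReLU, then max pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
        nn.MaxPool2d(2),
    )

class SimpleNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=10, s=10.0, eps=1e-8):
        super().__init__()
        # Channel widths below are assumptions; the paper does not list them.
        self.features = nn.Sequential(
            conv_block(in_channels, 32),
            conv_block(32, 64),
            conv_block(64, 128),
        )
        self.s, self.eps = s, eps
        self.prelu = nn.PReLU()               # single PReLU before the last FC
        self.fc = nn.LazyLinear(num_classes)  # FC; softmax applied via the loss

    def forward(self, x):
        x = self.features(x).flatten(1)
        x = x / (x.norm(p=2, dim=1, keepdim=True) / self.s + self.eps)  # IS layer
        return self.fc(self.prelu(x))

logits = SimpleNet()(torch.randn(2, 1, 28, 28))
print(logits.shape)  # torch.Size([2, 10])
```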

3.3 Activation Maps of SimpleNet and the IS Layer

It is intuitive to analyze how the network behaves with the IS unit in terms of approximating the function. We visualize the activation maps of the convolutional layers in SimpleNet followed by the IS unit. Figure 5 is a visual illustration of activation maps extracted from the trained SimpleNet (Figure 4). Since the scale factor $s$ is fixed, the change of the standard deviation and the mean (Figure 6) follows that factor. The standard deviation is given by $\sigma(x) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu(x))^2}$. With the introduction of the IS unit, which rescales each feature vector by a scalar factor, the standard deviation and the mean of the data are rescaled by the same factor.
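For completeness, the effect of such a per-vector rescaling on these statistics can be stated explicitly (a standard identity, written here under the reconstructed form of Equation 2 with $\epsilon$ neglected):

```latex
\[
  IS(x, s) = c\,x, \qquad
  c = \frac{1}{\tfrac{\|x\|_2}{s} + \epsilon} \approx \frac{s}{\|x\|_2},
\]
\[
  \mu\bigl(IS(x, s)\bigr) = c\,\mu(x), \qquad
  \sigma\bigl(IS(x, s)\bigr) = |c|\,\sigma(x),
  \quad\text{with}\quad
  \sigma(x) = \sqrt{\tfrac{1}{N}\textstyle\sum_{i=1}^{N}\bigl(x_i - \mu(x)\bigr)^2}.
\]
```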

Figure 4: The SimpleNet architecture used for the experiments. 2xCNN is a block of 2D convolutional layers containing two CNN layers followed by a MaxPool2D layer, IS is the inward scale layer, and FC stands for the fully connected layer, with the softmax loss at the end. A single PReLU is used to add non-linearity before the last FC layer. It is worth noting that no batch normalization or data preprocessing is employed anywhere in the network. (best viewed in color)
Figure 5: Feature maps extracted from the second convolutional layer of each 2xCNN block (Conv1_1, Conv2_2, Conv3_3), followed by the activation map of the IS layer, for an input image from the URDU dataset.

4 Experimental Results

In order to quantify the effects of the IS layer and the simplified architecture, in this section we report results of the IS layer with and without SimpleNet as the underlying network on multiple datasets.

4.1 Experimental Setup

We perform a series of experiments for each dataset. First we report results of different works available in the literature, followed by the results of the IS layer with SimpleNet as the baseline network, and lastly we report results of SimpleNet without the IS layer. Note that, in order to demonstrate the modular nature of the IS layer, we also perform experiments with different baseline networks containing the proposed layer. SimpleNet can be trained with standard gradient descent algorithms. In all of the following experiments we employ the Adam (Kingma & Ba, 2015) optimizer with a fixed initial learning rate and a weight decay strategy to prevent the weight norms from growing indefinitely, since $\|W\|$ otherwise keeps increasing after each update.
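A minimal sketch of this training setup is given below (the learning rate and weight-decay values are placeholders, since the exact values are not reproduced here; SimpleNet refers to the sketch in Section 3.2).

```python
import torch

# Hypothetical training setup; the paper does not report the exact learning
# rate or weight-decay value, so the numbers below are placeholders.
model = SimpleNet()                      # from the sketch in Section 3.2
criterion = torch.nn.CrossEntropyLoss()  # softmax supervision signal
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```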

4.2 Classification Results

4.2.1 MNIST and FashionMNIST

For the basic experiment to quantify the results of the proposed layer and architecture, we perform the test on the MNIST and FashionMNIST datasets, which are well-known benchmarks for neural networks. FashionMNIST is a drop-in replacement for the original MNIST dataset. Table 2 demonstrates the results of the IS layer with SimpleNet and compares them with available works in the literature.

Methods Dataset Accuracy (%)
Softmax Loss MNIST 98.64
Ours (without IS) MNIST 98.40
Ours (with IS) MNIST 99.33
Ours (without IS) FashionMNIST 89.64
Ours (with IS) FashionMNIST 93.00
Ranjan et al. (2017) MNIST 99.05
Zhong et al. (2017) FashionMNIST 96.35
Table 2: Accuracy on the MNIST and FashionMNIST test sets (%).
Figure 6: Mean and standard deviation of the data extracted from the third convolutional block (Conv_3), the max-pool layer (Pool_3), the flatten layer, and the IS unit of SimpleNet. The input image is from the URDU dataset.

4.2.2 CIFAR10 and SVHN

For the next experiment, we perform the test on the CIFAR10 and SVHN datasets. Since MNIST and FashionMNIST are low-resolution, grayscale and synthetic datasets, we test the IS layer on datasets of increasing complexity. Table 3 demonstrates the results of the IS layer with SimpleNet.

Methods Dataset Accuracy (%)
Ours (without IS) CIFAR10 58.2
Ours (with IS) CIFAR10 64.0
Ours (without IS) SVHN 93.20
Ours (with IS) SVHN 95.05
Zagoruyko & Komodakis (2016) CIFAR10 96.11
Zagoruyko & Komodakis (2016) SVHN 98.46
Table 3: Accuracy on the CIFAR10 and SVHN test sets (%). SimpleNet is the baseline network for both experiments.

4.2.3 URDU Dataset

As an experiment on a non-standard dataset, we perform classification on the URDU dataset introduced by (S. Nawaz, 2018). Since MNIST and FashionMNIST consist of small grayscale images, we employ the format of the URDU dataset to validate the IS layer on larger image dimensions and a larger number of channels. Table 4 demonstrates the results of the IS layer. Furthermore, this experiment confirms that the IS layer is not SimpleNet-dependent, since the URDU dataset is trained using LeNet with softmax as the supervision signal. For comparison, we also experiment with SimpleNet. It is important to note that with networks like GoogLeNet, accuracy on the URDU dataset improves further when coupled with the IS layer. The aim of the experiments is not to demonstrate the superiority of SimpleNet, but to demonstrate the increase in accuracy when a network is coupled with the IS layer.

Methods Dataset Network Accuracy (%)
Ours (without IS) URDU LeNet 70.02
Ours (with IS) URDU LeNet 71.54
Ours (without IS) URDU SimpleNet 74.03
Ours (with IS) URDU SimpleNet 77.76
Table 4: Accuracy on the URDU test set (%).
Figure 7: Plots of the training loss on the CIFAR100 dataset with (a) and without (b) the proposed IS layer, using GoogLeNet as the baseline architecture with no pre- or post-processing. The corresponding classification accuracies are reported in Table 5.

4.3 CIFAR100

We perform an additional experiment on the CIFAR100 dataset to confirm the efficacy of the proposed layer. This experiment is particularly interesting because it supports an important claim behind the IS layer. We employ GoogLeNet for this experiment for two reasons: (i) to verify that the introduced layer can be coupled with GoogLeNet, and (ii) CIFAR100 is a larger dataset compared to those previously employed, and thus the accuracy with networks like LeNet is not satisfactory. Table 5 demonstrates the results of GoogLeNet on CIFAR100 with and without the IS layer. Figure 7 visualizes the training graphs with and without the proposed unit; it is interesting to note the difference between the two. Recall that projection and scaling happen during each forward pass, almost simultaneously, because the scaling is applied just before the projection; this is the major reason why the loss fluctuates at the start of training. It should be noted that this does not mean the network struggles to converge.

Methods Dataset Accuracy (%)
Ours (without IS) CIFAR100 59.23
Ours (with IS) CIFAR100 60.44
(Cireşan et al., 2011) CIFAR100 64.32
(Goodfellow et al., 2013) CIFAR100 65.46
(Springenberg et al., 2015) CIFAR100 66.29
Table 5: Accuracy on the CIFAR100 test set (%). We employ GoogLeNet for this experiment.

4.4 Large Scale Classification

Training state-of-the-art models on the ImageNet dataset can take several weeks of computation time. We do not aim for the best performance here; rather, we perform a proof-of-concept experiment. It is necessary to test whether an architecture coupled with the IS layer that performs well on smaller datasets like CIFAR10 and FashionMNIST also applies to larger datasets. We employ the ILSVRC-2012 (Russakovsky et al., 2015) subset of the ImageNet dataset to train GoogLeNet with and without the IS unit.

4.5 Retrieval Results

In this section we report the retrieval results on the FashionMNIST dataset. Most retrieval systems employ Recall@K as the evaluation metric. R@K is the percentage of queries for which a ground-truth item appears among the first K retrieved results. To retrieve results, we take a query image, compute the nearest neighbors (Euclidean distance) over all gallery images, and sort the results by distance; the first five retrieved items are used for Recall@5, and so on. We report results for $K \in \{1, 5, 10\}$. Since this is unimodal retrieval, images appear at both the query and retrieval ends. Because Recall@K increases even if only one true positive appears in the top K, the scores with and without the IS layer are almost similar. For a more valid quantitative analysis, we also report the average occurrence of true positives (TP) in the top K. For retrieval, distance minimization is the major objective, which softmax alone cannot handle efficiently; we therefore employ the contrastive loss introduced by (Hadsell et al., 2006) along with softmax for the retrieval problem, which also shows that the proposed layer can function regardless of the architecture and loss function.
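The evaluation protocol just described can be sketched as follows (illustrative only; the feature dimensionality, label handling and random data are assumptions, and queries are drawn from the gallery for simplicity).

```python
import numpy as np

def retrieval_metrics(query_feats, query_labels, gallery_feats, gallery_labels, k=5):
    """Recall@K and average true positives in the top K, using Euclidean distance."""
    hits, tp_counts = 0, []
    for q, y in zip(query_feats, query_labels):
        dists = np.linalg.norm(gallery_feats - q, axis=1)
        top_k = gallery_labels[np.argsort(dists)[:k]]
        tp = int(np.sum(top_k == y))
        hits += tp > 0           # Recall@K counts any hit in the top K
        tp_counts.append(tp)     # average TP counts how many hits
    n = len(query_labels)
    return 100.0 * hits / n, float(np.mean(tp_counts))

rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 64))
labels = rng.integers(0, 10, size=1000)
r_at_5, avg_tp = retrieval_metrics(gallery[:50], labels[:50], gallery, labels, k=5)
print(r_at_5, avg_tp)
```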

Without With TP with TP without
R@1 88.75 89.74 86.70
R@5 95.88 89.90 86.60
R@10 97.33 90.00 85.60
Table 6: Recall@K and average occurrence of true positives (TP) in the top K for the FashionMNIST test set, with and without the IS layer. SimpleNet is the baseline architecture.

4.6 Result Discussion

We explore classification and retrieval tasks with and without the IS layer. The reported results indicate the superior performance of the architecture with the IS layer. It is important to note that each experiment is run 5 times and a k-fold validation methodology is employed. The architecture with the IS layer maintains an upper bound over its counterpart without the layer. In Table 6, the TP columns indicate the average occurrence of true positives in the top-K retrieved results. The reason we employ this metric is that Recall@K is incremented even if a single true positive is encountered in the top K, and thus the Recall@K results with and without the IS layer are almost similar. With the average true positive metric, we instead count the actual number of true positives in the top K and report the average, giving a more discriminative comparison between the two. Furthermore, we compare the network with state-of-the-art approaches. Note that most works obtaining state-of-the-art results perform data preprocessing, while we do not employ any pre- or post-processing technique for any experiment and use a single optimization policy without fine-tuning hyperparameters for any specific task.

5 Conclusion and Future Work

In this paper, we proposed a novel layer for embedding the learned deep features onto a hypersphere. We propose that hypersphere embedding is important for discriminative analysis of the features, and we verify the claim with extensive evaluation on multiple classification and retrieval tasks. We also propose a simpler architecture for the said tasks and demonstrate that simpler networks achieve results comparable to deeper networks when coupled with the IS layer. Furthermore, the IS module is fully differentiable, can be added to any network, and can be trained end-to-end with the underlying network.

In the future, we would like to explore different hyperspaces for discriminatively embedding feature representations. Furthermore, we would like to explore constraint-enforced hyperspaces, where the network learns a mapping function under certain constraints, thus resulting in a desired embedding.

References

Appendix A Exploring Different Values of $s$

In this appendix we explore different values of the scale factor $s$ employed in the IS layer. During training, the hyperspace adapts according to the value of $s$. With lower values, stretching is minimal, intra-class dispersion is low, and the result is a hypersphere. With greater values of $s$, the parallel stretch of the hyperspace is maximal, with features compacted in one direction. We visually illustrate the effects on the MNIST test set with increasing values of $s$ (Figure 8). The inward scaling effect is still visible in the figure, but the hypersphere embedding is distorted once training completes, with a large false-positive rate and low classification accuracy. Note that only values of $s$ within a limited range yield satisfactory results.

Figure 8: Plots on the MNIST test set for four increasing values of $s$ (panels (a)-(d)). Values outside the suitable range do not yield satisfactory results due to high inter-class compactness along one direction. Note that in (c) and (d), concentration at the corners appears because of the increased inward scaling and the stretching needed to match the changing scale. It is interesting to note how the scale changes with the value of $s$. (Best viewed when zoomed in.)

Appendix B Proving the Gradient of the IS Layer

In this appendix we derive the gradient of the loss with respect to the input feature $x$, i.e., $\frac{\partial L}{\partial x}$. We adopt the strategy presented by (Wang et al., 2017; Ranjan et al., 2017). Since $s$ is not a learnable parameter, we can ignore it during the gradient computation, and we drop the $\epsilon$ term because our main objective is to demonstrate the gradient with respect to the introduced layer. From Equation 1 we have $\tilde{x} = x / \|x\|_2$ with $\|x\|_2 = \sqrt{\sum_i x_i^2}$, so the output of the layer is $y = IS(x, s) = s\,\tilde{x}$. Denoting the gradient of the loss with respect to the output as $\frac{\partial L}{\partial y}$, the chain rule gives

$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}, \qquad \frac{\partial y_j}{\partial x_i} = s\left(\frac{\delta_{ij}}{\|x\|_2} - \frac{x_i x_j}{\|x\|_2^3}\right),$

where $\delta_{ij}$ equals 1 if $i = j$ and 0 otherwise.
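In vector form, the same result reads (a standard identity, stated with the $\epsilon$ term neglected):

```latex
\[
  \frac{\partial\, IS(x, s)}{\partial x}
  = \frac{s}{\|x\|_2}\left( I - \frac{x\,x^{\top}}{\|x\|_2^{2}} \right),
  \qquad
  \frac{\partial L}{\partial x}
  = \frac{s}{\|x\|_2}\left( \frac{\partial L}{\partial y}
      - \Bigl\langle \frac{\partial L}{\partial y},\, \tilde{x} \Bigr\rangle \tilde{x} \right),
  \quad \tilde{x} = \frac{x}{\|x\|_2}.
\]
```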

Appendix C Learning Curves of the IS Layer

For a more intuitive understanding of the proposed layer, it is important to visualize the training graph plotted as the loss decreases. It is interesting to note the difference between the graphs with and without the IS layer. For the sake of employing a different network with the IS layer, this plot uses LeNet.

Figure 9: Plots of the training loss on the URDU dataset with (a) and without (b) the proposed IS layer, using LeNet as the baseline architecture. The corresponding classification accuracies are reported in Table 4.