1 Introduction
In the last few years, mainly due to advances in convolutional neural networks, the performance on tasks such as image classification (Szegedy et al., 2015), cross-modal and unimodal retrieval (Wang et al., 2016; Park & Im, 2016), and face recognition and verification (Calefati et al., 2018; Wen et al., 2016; Deng et al., 2017) has increased drastically. It has been observed that deeper architectures tend to provide better capabilities in terms of approximating any learnable function. A common observation is that deeper architectures (with a large number of parameters) can "learn" features at various levels of abstraction. However, it is a well-explored problem that deeper architectures are more prone to overfitting than their shallower counterparts, hampering their generalization ability; furthermore, they are computationally expensive. The majority of convolutional neural network (CNN) based pipelines follow the same structure, i.e. alternating convolution and max-pool layers, fully connected layers along with activation functions, and dropout for regularization
(Jarrett et al., 2009; Szegedy et al., 2017; Simonyan & Zisserman, 2014). Recently, the work in (Springenberg et al., 2015) proposed an all-convolutional neural network, an architecture based on CNN layers alone. Another major reason for this drastic growth is discriminative learning techniques (Sun et al., 2014; Schroff et al., 2015; Wen et al., 2016) aiming at embedding the learned feature representations onto a hyperspace, linear or quadratic in most cases. There are studies in the literature (Aggarwal et al., 2001; Beyer et al., 1999) arguing that in higher dimensions, when the data is projected onto an input space, there is not much divergence in terms of the distance ratio of the nearest and farthest neighbors to a given target, and this ratio tends to 1. Due to this, the relative contrast of the distances to an input point cannot be discriminated effectively. It is important to note that since retrieval and search tasks tend to operate in higher dimensions, this phenomenon is valid for these problems as well. The works by (Nawaz et al., 2018; Park & Im, 2016) for cross-modal retrieval assert that Recall@K (a metric depending on Euclidean distance for similarity computation between feature representations) is not a competitive metric to evaluate retrieval systems. Euclidean distance can be formulated as $d(x, y) = \sqrt{\sum_{i}(x_i - y_i)^2}$, where $x$ and $y$ are two points in the input space. Surprisingly enough, (Aggarwal et al., 2001) argues that for $L_p$ norms, meaningfulness in high dimensionality is not independent of the value of $p$, with lower values of $p$ performing better than their greater-value counterparts, i.e. $L_1$ over $L_2$. The general formula of the $L_p$ norm can be set up as $\|x\|_p = \left(\sum_{i} |x_i|^p\right)^{1/p}$ for $p \geq 1$. The relation also extends to norms with $0 < p < 1$, referred to as fractional norms. Although fractional norms do not necessarily follow the triangle inequality $d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z \in X$, where $X$ is the input space, they tend to provide better contrast than their integral counterparts in terms of relative distances between query points and target.
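The contrast behavior described above can be illustrated empirically. The following is a small numpy sketch, using synthetic uniform data, that compares the relative contrast $(D_{max} - D_{min})/D_{min}$ of distances to a query under different $L_p$ distances; the dimensionality and sample sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 200, 500
data = rng.random((n, dim))      # points distributed uniformly in the unit cube
query = rng.random(dim)

def relative_contrast(p):
    # Minkowski (L_p) distance of every point to the query; p < 1 gives a
    # fractional "norm" that no longer satisfies the triangle inequality.
    d = np.sum(np.abs(data - query) ** p, axis=1) ** (1.0 / p)
    return (d.max() - d.min()) / d.min()

for p in (0.5, 1.0, 2.0):
    print(f"p = {p}: relative contrast = {relative_contrast(p):.3f}")
```

On high-dimensional data of this kind, the fractional distance ($p = 0.5$) typically yields a noticeably larger relative contrast than the Euclidean distance ($p = 2$), in line with the observation of (Aggarwal et al., 2001).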
In this paper, we explore projections of feature representations onto different hyperspaces and propose that hypersphere projection has superior performance to a linear hyperspace, where discriminative analysis and disintegration of multiple classes becomes challenging for networks (Figure 1). We propose that inward scaling applied to the hypersphere projection enhances network performance in terms of classification and retrieval. Furthermore, we introduce a simpler CNN-based architecture for classification and retrieval tasks and show that nonlinear activations (ReLU) and techniques like dropout are not necessary for CNN-based networks to generalize. We evaluate the proposed network along with the inward scaling layer on a number of benchmark datasets for classification and retrieval. We employ the MNIST, FashionMNIST (Xiao et al., 2017), CIFAR10 (Krizhevsky, 2009), URDU-Characters (S. Nawaz, 2018) and SVHN (Netzer et al., 2011) datasets for classification, while we employ FashionMNIST for retrieval. Note that the inward scaling layer is not dependent on the proposed network, i.e. it can be applied to different types of networks such as VGG, Inception-ResNet-V1, GoogleNet (Szegedy et al., 2015), etc., and can be trained end-to-end with the underlying network. The main contributions of this work are listed as follows.
We propose the inward scaling layer, which can be applied along with the projection layer to ensure maximum separability between divergent classes. We show that the layer enhances network performance on multiple datasets.

We propose a simpler architecture without dropout (Srivastava et al., 2014) and batch normalization (Ioffe & Szegedy, 2015) layers and experimentally validate that the network achieves results comparable to deeper and wider networks.
We explore the effect of the inward scaling layer with different loss functions such as center loss, contrastive loss and softmax.
The rest of the paper is structured as follows: we explore related literature in Section 2, followed by the inward scale layer and architecture in Section 3. We review the datasets employed and experimental results in Section 4. We finish with conclusions and future work in Section 5.
2 Related Work
2.1 Metric Learning
Metric learning aims at learning a similarity function, i.e. a distance metric. Traditionally, metric learning approaches (Weinberger & Saul, 2009; Ying & Li, 2012; Koestinger et al., 2012) focused on learning a similarity matrix $M$, which is used to measure the similarity between two vectors. Consider feature vectors $x_i, x_j \in \mathbb{R}^n$, where each vector corresponds to the relevant features. Then the distance for the learned similarity matrix can be computed as $d_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}$, where $x_i$ and $x_j$ are given features. However, in recent metric learning methodologies (Hu et al., 2014; Oh Song et al., 2016; Lu et al., 2015; Hadsell et al., 2006; S. Nawaz, 2018), neural networks are employed to learn discriminative features, followed by a distance metric $d$, i.e. Euclidean or Manhattan distance. Contrastive loss (Chopra et al., 2005; Hadsell et al., 2006) and triplet loss (Hoffer & Ailon, 2015; Wang et al., 2014; Schroff et al., 2015) are commonly used metric learning techniques. Contrastive loss is a pairwise loss function, i.e. it reduces the distance $d(q, t)$ between a query $q$ and a similar target $t$ while separating dissimilar pairs, where $d$ is the distance metric. However, triplet loss leverages triplets $(a, p, n)$, which should be carefully selected to utilize the benefit of the constraint $d(a, p) + m < d(a, n)$, where $d(a, p)$ and $d(a, n)$ are the distances between query and positive pair and query and negative pair respectively, and $m$ is a margin. Note that triplet and pair selection is an expensive process and the space complexity becomes exponential.
2.2 Normalization Techniques
To accelerate the training of neural networks, normalization was introduced and is still a common operation in modern neural network models. Batch normalization (Ioffe & Szegedy, 2015) was proposed to speed up training by reducing the internal covariate shift of intermediate features. Scaling and shifting the normalized values becomes necessary to avoid limiting the representation. The normalization of a layer can be defined as $\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$, where the layer is normalized along the $k$-th dimension, $x$ represents the input, $\mathrm{E}[x^{(k)}]$ represents the mean of the activation, and $\mathrm{Var}[x^{(k)}]$ represents the variance. The work by (LeCun et al., 2012) shows that such normalization aids convergence of the network. Recently, the weight normalization (Salimans & Kingma, 2016) technique was introduced to normalize the weights of convolution layers to speed up the convergence rate.
2.3 Hypersphere Embedding Techniques
Different works in the literature have explored different hyperspaces for projecting learned features, in order to find the manifold with maximum separability between the deep features. Hypersphere embedding is one such technique, where the learned features are projected onto a hypersphere with an L2-normalize layer, i.e. $\hat{x} = \frac{x}{\|x\|_2}$. Works in the literature have employed hypersphere embedding for different face recognition and verification tasks (Ranjan et al., 2017; Wang et al., 2017; Liu et al., 2017). These techniques function by imposing discriminative constraints on a hypersphere manifold. As (Ioffe & Szegedy, 2015) explains, scale and shift are necessary to avoid limiting the representation and are introduced as $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$, where $\gamma$ and $\beta$ are learnable parameters. Inspired by this work, techniques such as (Ranjan et al., 2017) explore a normalize layer followed by a scaling layer which scales the projected features by a factor $\alpha$, i.e. $y = \alpha \frac{x}{\|x\|_2}$, where $\alpha$ is the radius of the hypersphere and can be either learnable or predefined; larger values of $\alpha$ result in improved results. However, in (Ranjan et al., 2017) the factor $\alpha$ is restricted to the radius of the hypersphere, and only the features are normalized. Furthermore, (Liu et al., 2017) normalizes the weights of the last inner-product layer only and does not explore the scaling factor. The work presented in Wang et al. (2017) optimizes both weights and features with a normalization layer, without exploring the scaling factor.
2.4 Revisiting Softmax-based Techniques
A generic pipeline for classification tasks consists of a CNN learning the features of the input, coupled with softmax as a supervision signal. We revisit the softmax function by looking at its definition $p_j = \frac{e^{W_j^T x + b_j}}{\sum_{k} e^{W_k^T x + b_k}}$, where $x$ is the learned feature, $W_j$ denotes the weights in the last fully connected layer, and $b_j$ is the bias term corresponding to class $j$. By examining it, it is clear that the inner product $W_j^T x + b_j$ is responsible for the class decision, which forms the intuition for the necessity of the fully connected layer after normalization. (Liu et al., 2017) reformulates softmax, introduces an angular margin $m$, and modifies the decision boundary of softmax to $\cos(m\theta_1) > \cos(\theta_2)$ for class 1 and $\cos(m\theta_2) > \cos(\theta_1)$ for class 2. This differs from standard softmax in the sense that (Liu et al., 2017) requires $\theta_1 < \frac{\theta_2}{m}$ for the learned feature $x$ to be correctly classified as class 1. This reformulation results in a hypersphere embedding due to the subtended angle. Similarly, (Ranjan et al., 2017) constrains the softmax by adding a normalization layer.
3 Proposed Method
In this section, we explore the intuition behind the inward scale layer and explain why normalization along with a fully connected layer is necessary before the softmax. We term a normalization layer together with the inward scaling factor the inward scale layer. The reason behind this terminology is that normalization without inward scaling acts as a constraint imposer on the feature space and hampers the discriminative ability of the network. Furthermore, the network struggles to converge if any of the components is removed, i.e. normalization, inward scale factor, or fully connected layer. We set some terminology before proceeding with the explanation.
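A minimal numpy sketch of the pipeline discussed above, i.e. normalization followed by a fully connected layer and softmax; the feature vector, weight matrix, and class count are illustrative, chosen only to show that the inner product $W_j^T x + b_j$ drives the class decision:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Project a feature vector onto the unit hypersphere.
    return x / (np.linalg.norm(x) + eps)

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 4-d feature and a 3-class fully connected layer (W, b).
x = np.array([3.0, -1.0, 0.5, 2.0])
W = np.array([[1.0, 0.0, 0.0, 0.5],
              [0.0, 1.0, 0.5, 0.0],
              [0.2, 0.2, 0.2, 0.2]])
b = np.zeros(3)

logits = W @ l2_normalize(x) + b  # the inner product W_j^T x decides the class
probs = softmax(logits)
print("predicted class:", int(np.argmax(probs)))
```

Normalization alone produces a point on the hypersphere; it is the fully connected layer afterwards that turns that point into per-class scores.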
Terminology | Explanation
$\mathcal{M}$ | Input manifold
$\mathcal{S}$ | Projected hypersphere manifold
$x_i$ | Learned features of class $i$
$W_i$ | Weight of class $i$
$b_i$ | Bias of class $i$
$\alpha$ | Inward scale factor
$IS(x, \alpha)$ | Inward scale layer with feature $x$ and scale factor $\alpha$
$FC(W, x, b)$ | Fully connected layer with weight $W$, feature $x$ and bias $b$
The work in (Wang et al., 2017) establishes that the softmax function always encourages well-separated features to have larger magnitudes, resulting in a radial distribution (Figure 3(a)). However, the effect is minimized in Figure 3(b) because of the inward scale layer.
3.1 Inward Scale Layer
In this paper, we define the inward scale layer as the normalization layer along with the inward scale factor $\alpha$. The normalization layer can be defined as in Equation 1.
$\hat{x} = \frac{x}{\|x\|_2 + \epsilon} \quad (1)$
where $\epsilon$ is a small factor to avoid division by zero. Note that it is unlikely that the norm $\|x\|_2 = 0$, but to avoid the risk, we introduce the $\epsilon$ factor. Inspired by works in the literature (Ranjan et al., 2017; Salimans & Kingma, 2016), we further introduce a scale factor $\alpha$. Unlike employing it in product fashion as in (Ranjan et al., 2017), we couple $\alpha$ with the norm in an inverse fashion to ensure the scaling of the features as they are projected onto the manifold. In other words, we couple the factor $\alpha$ with $\|x\|_2$ to enhance the norm of the features instead of bounding the entire layer. That is, the L2 norm in Equation 1 is rewritten as $\frac{\|x\|_2}{\alpha}$, and the inward scale layer can be formulated as follows.
$IS(x, \alpha) = \frac{x}{\frac{\|x\|_2}{\alpha} + \epsilon} \quad (2)$
where $x$ is the feature from the previous layer. Note that the factor $\alpha$ is not trainable. We experiment with different values of $\alpha$ and identify the value yielding maximum separability; see Appendix A for experiments with different values of $\alpha$.
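A minimal numpy sketch of the inward scale layer, assuming the formulation in which each feature is divided by its L2 norm scaled down by $\alpha$ (so the output norm is approximately $\alpha$); the value of $\alpha$ here is illustrative:

```python
import numpy as np

def inward_scale(x, alpha, eps=1e-8):
    """Inward scale layer sketch: L2 normalization with the scale factor
    coupled to the norm in inverse fashion (cf. Equation 2)."""
    return x / (np.linalg.norm(x) / alpha + eps)

x = np.array([3.0, 4.0])          # ||x||_2 = 5
y = inward_scale(x, alpha=2.0)    # output norm is approximately alpha
print(np.linalg.norm(y))          # ≈ 2.0
```

Because $\alpha$ is fixed and the operation is a smooth rescaling, the layer is fully differentiable and can sit anywhere before the final fully connected layer.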
The CNN layers are responsible for providing a meaningful feature space; without the fully connected layer, learning nonlinear combinations of these features would not be possible. Simply put, the features are classified into different classes by fully connected layers followed by a softmax layer. Figure 3(b) in (Ranjan et al., 2017) visually illustrates the effect of L2-constrained softmax. Comparing it with our Figure 2(c), we visually see the effects of the inward scale layer. It is necessary to note that we do not modify the softmax and employ it as-is with the inward scale layer, which in turn benefits the network with faster convergence; the learned features are discriminative enough for efficient classification and retrieval without the need for any metric learning. As the module is fully differentiable and is employed in end-to-end fashion, the gradient with respect to the input can be obtained using the chain rule; see Appendix B for the proof and Appendix C for the learning curves of the inward scale layer.
3.2 SimpleNet
Here we explain the proposed network, referred to as SimpleNet. Figure 4 represents the architecture visually. Due to the inclusion of the inward scale layer, normalizing features or weights during training becomes redundant and adds no performance benefit to the pipeline. To overcome this redundancy, we do not use any batch or weight normalization layer. Furthermore, it is proposed by (Liu; Liu et al., 2017) to remove the ReLU nonlinearity from networks. We reinforce the idea that the ReLU nonlinearity restricts the feature space to the non-negative range, i.e. $[0, +\infty)$. To keep the feature space from this restriction, we do not employ the ReLU nonlinearity between the CNN and max-pool blocks in the network. However, a PReLU layer is added before the last fully connected layer, which helps in approximation. It is interesting to note that this does not restrict the feature space to $[0, +\infty)$.
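The restriction discussed above can be illustrated directly: ReLU clips features to the non-negative range, while PReLU preserves scaled negative responses. A minimal sketch, with an illustrative slope a = 0.25 (in the network the PReLU slope is learned):

```python
import numpy as np

def relu(x):
    # Clips every negative response to zero: range restricted to [0, inf).
    return np.maximum(0.0, x)

def prelu(x, a=0.25):
    # Keeps a scaled copy of negative inputs instead of zeroing them.
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))    # negative part clipped
print(prelu(x))   # negative part preserved (scaled by a)
```

With ReLU, all information carried by the sign of a negative activation is lost; PReLU keeps the feature space two-sided while retaining a nonlinearity.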
3.3 Activation Maps of SimpleNet and the Inward Scale Layer
It is intuitive to analyze how the network behaves with the inward scale unit in terms of approximating the function. We visualize the activation maps of the convolutional layers in SimpleNet followed by the inward scale unit. Figure 5 is a visual illustration of activation maps extracted from the trained SimpleNet (Figure 4). Since the scale factor is set to a fixed value, the change of the standard deviation and mean (Figure 6) follows the factor $\alpha$. The standard deviation is given by $\sigma = \sqrt{\mathrm{E}[(x - \mu)^2]}$. With the introduction of the inward scale unit, the features are rescaled, and the standard deviation and mean of the data are scaled accordingly.
(Figure 5 columns: Input Image, Conv1_1, Conv2_2, Conv3_3, inward scale layer.)

4 Experimental Results
In order to quantify the effects of the inward scale layer and the simplified architecture, in this section we report results of the layer with and without SimpleNet on multiple datasets.
4.1 Experimental Setup
We perform a series of experiments for each dataset. First we report results of different works available in the literature, followed by the results of the inward scale layer with SimpleNet as the baseline network; lastly, we report results of SimpleNet without the layer. Note that in order to demonstrate the modular nature of the layer, we perform experiments with different baseline networks containing the proposed layer. SimpleNet can be trained with standard gradient descent algorithms. In all of the following experiments we employ the Adam (Kingma & Ba, 2015) optimizer with a fixed initial learning rate and employ a weight decay strategy to prevent the weights from growing indefinitely.
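For reference, a compact numpy sketch of a single Adam update step with plain L2 weight decay folded into the gradient; the hyperparameter values shown are Adam's common defaults, not the settings used in the paper:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
              eps=1e-8, weight_decay=1e-4):
    grad = grad + weight_decay * w          # L2 weight decay term
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = ||w||^2 for a few steps as a toy check (gradient is 2w).
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(np.linalg.norm(w))
```

The decay term keeps the weight norm from growing unboundedly, which matters here because the inward scale layer interacts with the feature norm rather than the weight norm.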
4.2 Classification Results
4.2.1 MNIST and FashionMNIST
As a basic experiment to quantify the results of the proposed layer and architecture, we run tests on the MNIST and FashionMNIST datasets, which are well-known benchmark datasets for neural networks. FashionMNIST is a drop-in replacement for the original MNIST dataset. Table 2 demonstrates the results of the inward scale layer and SimpleNet and compares them with available works in the literature.
Methods | Dataset | Accuracy (%)
Softmax Loss | MNIST | 98.64
Ours (without inward scale layer) | MNIST | 98.40
Ours (with inward scale layer) | MNIST | 99.33
Ours (without inward scale layer) | FashionMNIST | 89.64
Ours (with inward scale layer) | FashionMNIST | 93.00
Ranjan et al. (2017) | MNIST | 99.05
Zhong et al. (2017) | FashionMNIST | 96.35
(Activation-map figure columns: Input Image, Conv_3, Pool_3, Flatten, inward scale layer.)

4.2.2 CIFAR10 and SVHN
For the next experiment, we run tests on the CIFAR10 and SVHN datasets. Since MNIST and FashionMNIST are low-resolution, grayscale and synthetic datasets, we test the inward scale layer on datasets of increasing complexity. Table 3 demonstrates the results of the layer and SimpleNet.
Methods | Dataset | Accuracy (%)
Ours (without inward scale layer) | CIFAR10 | 58.2
Ours (with inward scale layer) | CIFAR10 | 64.0
Ours (without inward scale layer) | SVHN | 93.20
Ours (with inward scale layer) | SVHN | 95.05
Zagoruyko & Komodakis (2016) | CIFAR10 | 96.11
Zagoruyko & Komodakis (2016) | SVHN | 98.46
4.2.3 URDU Dataset
As an experiment on a non-standard dataset, we perform classification on the URDU dataset introduced by (S. Nawaz, 2018). Since MNIST and FashionMNIST are small grayscale images, we employ the URDU dataset to validate the layer on larger image dimensions and a higher number of channels. Table 4 demonstrates the results of the layer. Furthermore, this experiment confirms that the layer is not SimpleNet-dependent, since the URDU dataset is trained using LeNet with softmax as a supervision signal. For comparison, we also experiment with SimpleNet. It is important to note that with networks like GoogleNet, accuracy on the URDU dataset increases further when coupled with the layer. The aim of these experiments is not to demonstrate the superiority of SimpleNet, but to demonstrate the increase in accuracy when a network is coupled with the inward scale layer.
Methods | Dataset | Network | Accuracy (%)
Ours (without inward scale layer) | URDU | LeNet | 70.02
Ours (with inward scale layer) | URDU | LeNet | 71.54
Ours (without inward scale layer) | URDU | SimpleNet | 74.03
Ours (with inward scale layer) | URDU | SimpleNet | 77.76
4.3 CIFAR100
We perform an additional experiment on the CIFAR100 dataset to confirm the efficacy of the proposed layer. This experiment is particularly interesting because it supports an important claim behind the layer. We employ GoogleNet for this experiment for two reasons: (i) to verify that the introduced layer can be coupled with GoogleNet, and (ii) CIFAR100 is a large dataset compared to the datasets previously employed, and thus the accuracy with networks like LeNet is not satisfactory. Table 5 demonstrates the results of GoogleNet on CIFAR100 with and without the layer. Figure 7 visualizes the training graph with and without the proposed unit. It is interesting to note the difference between the two graphs. Since the scaling is applied just before the projection, projection and scaling happen during each pass, almost simultaneously. This is the major reason why the loss fluctuates at the start. It should be noted that this does not mean the network struggles to converge.
Methods | Dataset | Accuracy (%)
Ours (without inward scale layer) | CIFAR100 | 59.23
Ours (with inward scale layer) | CIFAR100 | 60.44
(Cireşan et al., 2011) | CIFAR100 | 64.32
(Goodfellow et al., 2013) | CIFAR100 | 65.46
(Springenberg et al., 2015) | CIFAR100 | 66.29
4.4 Large Scale Classification
Training state-of-the-art models on the ImageNet dataset can take several weeks of computation time. We did not aim for the best performance; rather, we perform a proof-of-concept experiment. It is necessary to test whether an architecture coupled with the inward scale layer that performs well on smaller datasets like CIFAR10 and FashionMNIST also applies to larger datasets. We employ the ILSVRC-2012 (Russakovsky et al., 2015) subset of the ImageNet dataset to train GoogleNet with and without the unit.
4.5 Retrieval Results
In this section we report retrieval results on the FashionMNIST dataset. Most retrieval systems employ Recall@K as a metric to compute the scores. R@K is the percentage of queries for which the ground truth is among the first K retrieved results. To retrieve results, we take a query image, compute the nearest neighbors (by Euclidean distance) over all images, and sort the results by distance. The first five distances correspond to the Recall@5 results, and so on. We report results for K = 1, 5, 10. Since this is unimodal retrieval, images are at both the input and the retrieval end. Recall@K is counted even if only one true positive out of the top K is encountered, so the results are almost similar. For a more valid quantitative analysis, we also present the average occurrence of true positives (TP) in the top K. For retrieval, distance minimization is the major objective, which softmax alone cannot handle efficiently; thus we employ the contrastive loss introduced by (Hadsell et al., 2006) along with softmax for the retrieval problem, which shows that the proposed layer can function regardless of the architecture and loss function.
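The retrieval procedure described above (rank by Euclidean distance, count a hit if any of the top-K results shares the query's label) can be sketched as follows; the embeddings and labels here are synthetic stand-ins for the learned features:

```python
import numpy as np

def recall_at_k(query, query_label, gallery, gallery_labels, k):
    # Rank gallery items by Euclidean distance to the query; Recall@K counts
    # a hit if any of the top-K retrieved items shares the query's label.
    dists = np.linalg.norm(gallery - query, axis=1)
    topk = np.argsort(dists)[:k]
    return int(np.any(gallery_labels[topk] == query_label))

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 8))               # illustrative 8-d embeddings
labels = rng.integers(0, 10, size=100)
query, query_label = gallery[0] + 0.01, labels[0]  # near-duplicate query

print(recall_at_k(query, query_label, gallery, labels, k=5))
```

Averaging this indicator over all queries gives the reported R@K; counting all true positives in the top K instead of stopping at the first gives the average-TP metric used below.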
Without  With  TP with  TP without  

R@1  88.75  89.74  86.70  
R@5  95.88  89.90  86.60  
R@10  97.33  90.00  85.60 
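The contrastive loss of (Hadsell et al., 2006) used alongside softmax for retrieval can be sketched as follows; the margin value is illustrative:

```python
import numpy as np

def contrastive_loss(x1, x2, same, margin=1.0):
    """Hadsell et al. (2006): pull similar pairs together, push
    dissimilar pairs at least `margin` apart."""
    d = np.linalg.norm(x1 - x2)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

a = np.array([0.0, 0.0])
b = np.array([0.6, 0.8])      # Euclidean distance 1.0
print(contrastive_loss(a, b, same=True))    # ≈ 0.5
print(contrastive_loss(a, b, same=False))   # ≈ 0.0, margin already satisfied
```

Similar pairs are penalized by their squared distance, while dissimilar pairs incur loss only when they fall inside the margin; this is the distance-minimization signal softmax alone does not provide.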
4.6 Result Discussion
We explore classification and retrieval tasks with and without the inward scale layer. The reported results indicate the superior performance of the architecture with the layer. It is important to note that each experiment is run 5 times and a k-fold validation methodology is employed. The architecture with the layer maintains an upper bound over its counterpart without the layer. In Table 6, the TP columns indicate the average occurrence of true positives in the top-K retrieved results. The reason we employ this metric is that Recall@K is incremented even if a single true positive is encountered out of the top K, and thus the results with and without the layer are almost similar. With average true positives, however, we count the actual number of true positives out of the top K and compute the average, reporting a more discriminative comparison between the two. Furthermore, we compare the network with state-of-the-art approaches. Note that most of the works obtaining state-of-the-art results perform data preprocessing, while we do not employ any pre- or post-processing technique for any experiment and use a single optimization policy without fine-tuning hyperparameters for any specific task.
5 Conclusion and Future Work
In this paper, we proposed a novel layer for embedding learned deep features onto a hypersphere. We propose that hypersphere embedding is important for discriminative analysis of the features, and we verify the claim with extensive evaluation on multiple classification and retrieval tasks. We propose a simpler architecture for the said tasks and demonstrate that simpler networks achieve results comparable to deeper networks when coupled with the inward scale layer. Furthermore, the layer is fully differentiable, can be added to any network, and can be trained end-to-end with the underlying network.
In the future, we would like to explore different hyperspaces for discriminatively embedding feature representations. Furthermore, we would like to explore constraint-enforced hyperspaces, where networks learn a mapping function under certain constraints, thus resulting in a desired embedding.
References
 Aggarwal et al. (2001) Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. On the surprising behavior of distance metrics in high dimensional space. In International conference on database theory, pp. 420–434. Springer, 2001.
 Beyer et al. (1999) Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest neighbor” meaningful? In International conference on database theory, pp. 217–235. Springer, 1999.
 Calefati et al. (2018) Alessandro Calefati, Muhammad Kamran Janjua, Shah Nawaz, and Ignazio Gallo. Git loss for deep face recognition. arXiv preprint arXiv:1807.08512, 2018.
 Chopra et al. (2005) Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pp. 539–546. IEEE, 2005.
 Cireşan et al. (2011) Dan C Cireşan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and Jürgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
 Deng et al. (2017) Jiankang Deng, Yuxiang Zhou, and Stefanos Zafeiriou. Marginal loss for deep face recognition. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPRW), Faces "in-the-wild" Workshop/Challenge, volume 4, 2017.
 Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
 Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), pp. 1735–1742. IEEE, 2006.
 Hoffer & Ailon (2015) Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.
 Hu et al. (2014) Junlin Hu, Jiwen Lu, and YapPeng Tan. Discriminative deep metric learning for face verification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875–1882, 2014.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Jarrett et al. (2009) Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multistage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pp. 2146–2153. IEEE, 2009.
 Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference of Learning Representations, 2015.
 Koestinger et al. (2012) Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2288–2295. IEEE, 2012.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 LeCun et al. (2012) Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.
 Liu (2016) Weiyang Liu. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning, 2016.
 Liu et al. (2017) Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 Lu et al. (2015) Jiwen Lu, Gang Wang, Weihong Deng, Pierre Moulin, and Jie Zhou. Multimanifold deep metric learning for image set classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1137–1145, 2015.
 Nawaz et al. (2018) Shah Nawaz, Muhammad Kamran Janjua, Alessandro Calefati, and Ignazio Gallo. Revisiting cross modal retrieval. arXiv preprint arXiv:1807.07364, 2018.
 Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 Oh Song et al. (2016) Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012, 2016.
 Park & Im (2016) Gwangbeen Park and Woobin Im. Imagetext multimodal representation learning by adversarial backpropagation. arXiv preprint arXiv:1612.08354, 2016.
 Ranjan et al. (2017) Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s112630150816y.
 S. Nawaz (2018) S. Nawaz, A. Calefati, N. Ahmed, and I. Gallo. Hand written characters recognition via deep metric learning. In 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 417–422, 2018. doi: 10.1109/DAS.2018.18.
 Salimans & Kingma (2016) Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
 Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Springenberg et al. (2015) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. International Conference on Learning Representations, 2015.

 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Sun et al. (2014) Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pp. 1988–1996, 2014.
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

 Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. 2017.
 Wang et al. (2017) Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 2017 ACM on Multimedia Conference, pp. 1041–1049. ACM, 2017.

 Wang et al. (2014) Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393, 2014.
 Wang et al. (2016) Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013, 2016.
 Weinberger & Saul (2009) Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
 Wen et al. (2016) Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515. Springer, 2016.
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

 Ying & Li (2012) Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. Journal of Machine Learning Research, 13(Jan):1–26, 2012.
 Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 Zhong et al. (2017) Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
Appendix A Exploring Different Values of $\alpha$
In this appendix, we explore different values of the scale factor $\alpha$ employed in the inward scale layer. During training, the hyperspace adapts according to the value of $\alpha$. With lower values, stretching is minimal, with low intra-class dispersion, and the result is a hypersphere. With greater values of $\alpha$, the parallel stretch of the hyperspace is maximal, with compactness of features in one direction. We visually illustrate the effects on the MNIST test set with increasing values of $\alpha$. The inward scaling effect is still visible in the figure, but the hypersphere embedding is distorted once training completes, with a large false positive rate and low classification accuracy. Values of $\alpha$ outside this range yield unsatisfactory results.
Appendix B Proving the Gradient of the Inward Scale Layer
In this appendix we prove the gradient of the inward scale layer with respect to its input $x$. We adopt the strategy presented by (Wang et al., 2017; Ranjan et al., 2017). Since $\alpha$ is not a learnable parameter, we can ignore it during gradient computation and focus on the normalization $\hat{x} = \frac{x}{\|x\|_2}$ from Equation 1, ignoring $\epsilon$ as well, since our main objective is to demonstrate the gradient with respect to the introduced module. Using the chain rule, for components $i$ and $j$ we have
$\frac{\partial \hat{x}_i}{\partial x_j} = \frac{\delta_{ij}}{\|x\|_2} - \frac{x_i x_j}{\|x\|_2^3}$,
so for a loss $L$ the gradient propagated through the layer is
$\frac{\partial L}{\partial x} = \frac{1}{\|x\|_2}\left(\frac{\partial L}{\partial \hat{x}} - \hat{x}\left\langle \frac{\partial L}{\partial \hat{x}}, \hat{x} \right\rangle\right)$,
which shows the module is fully differentiable and can be trained end-to-end.
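The gradient of L2 normalization can also be verified numerically. A minimal sketch, assuming the standard Jacobian $(I - \hat{x}\hat{x}^T)/\|x\|_2$ and checking it against central finite differences:

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x)

def jacobian_analytic(x):
    # d(x/||x||)/dx = (I - x_hat x_hat^T) / ||x||
    n = np.linalg.norm(x)
    xh = x / n
    return (np.eye(x.size) - np.outer(xh, xh)) / n

def jacobian_numeric(x, h=1e-6):
    # Central finite differences, column by column.
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (normalize(x + e) - normalize(x - e)) / (2 * h)
    return J

x = np.array([3.0, -1.0, 2.0])
print(np.max(np.abs(jacobian_analytic(x) - jacobian_numeric(x))))  # close to zero
```

Agreement between the two Jacobians confirms the closed-form gradient used in the proof above.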
Appendix C Learning Curves of the Inward Scale Layer
For a more intuitive understanding of the proposed layer, it is important to visualize the training graph plotted as the loss decreases. It is interesting to note the difference between the graphs with and without the layer. For the sake of employing different networks with the layer, this plot uses LeNet.