Deep Learning of Constrained Autoencoders for Enhanced Understanding of Data

by Babajide O. Ayinde, et al.
University of Louisville

Unsupervised feature extractors are known to produce efficient and discriminative representations of data. Insight into the mappings they perform, and the human ability to understand them, however, remain very limited. This is especially prominent when multilayer deep learning architectures are used. This paper demonstrates how to remove these bottlenecks within the architecture of the Nonnegativity Constrained Autoencoder (NCSAE). It is shown that by using both L1 and L2 regularization that induce nonnegativity of weights, most of the weights in the network become constrained to be nonnegative, resulting in a more understandable structure with only minute deterioration in classification accuracy. The proposed approach also extracts features that are more sparse and produces additional output layer sparsification. The method is analyzed for accuracy and feature interpretation on the MNIST data, the NORB normalized-uniform object data, and the Reuters text categorization dataset.




I Introduction

Deep learning (DL) networks take the form of heuristic and rich architectures that develop unique intermediate data representations. The complexity of architectures is reflected by both the sizes of layers and, for a large number of data sets reported in the literature, also by the processing. In fact, the architectural complexity and the excessive number of weights and units are often built into the DL data representation by design [1, 2, 3, 4, 5]. Although deep architectures are capable of learning highly complex mappings, they are difficult to train, and it is usually hard to interpret what each layer has learnt. Moreover, gradient-based optimization with random initialization used in training is susceptible to converging to local minima [6, 7].
In addition, it is generally believed that humans analyze complex interactions by breaking them into isolated and understandable hierarchical concepts. The emergence of part-based representation in human cognition can be conceptually tied to nonnegativity constraints [8]. One way to make concepts in neural networks easier for humans to understand is to constrain the network's weights to be nonnegative. Note that such a representation through nonnegative weights of a multilayer perceptron network can implement any shattering of points, provided suitable negative bias values are used.

Drawing inspiration from the idea of Nonnegative Matrix Factorization (NMF) and sparse coding [10, 8], the hidden structure of data can be unfolded by learning features that are capable of modeling the data in parts. NMF enforces the encoding of both the data and features to be nonnegative, thereby resulting in an additive data representation. However, incorporating sparse coding within NMF for the purpose of encoding data is computationally expensive, whereas with autoencoders (AEs) this incorporation is learning-based and fast. In addition, the performance of a deep network can be enhanced using the Nonnegativity Constrained Sparse Autoencoder (NCAE) with part-based data representation capability [11, 12].
Weight regularization is a concept that has been employed in both the understandability and the generalization context. It is used to suppress the magnitudes of the weights by reducing the sum of their squares. Sparsity can also be enhanced by penalizing the sum of absolute values of the weights rather than the sum of their squares [13, 14, 15, 16, 17]. In this paper, the work proposed in [11] is extended by modifying the cost function to extract more sparse features, encourage nonnegativity of the network weights, and enhance the understandability of the data. Another related model is the Nonnegative Sparse Autoencoder (NNSAE), trained with an online algorithm with tied weights and a linear output activation function to mitigate the training hassle [18]. While [18] uses a piecewise linear decay function to enforce nonnegativity and focuses on shallow architectures, the proposed approach uses a composite norm with a focus on deep architectures. Dropout is another recently introduced and widely used heuristic to sparsify AEs and prevent overfitting by randomly dropping units and their connections from the neural network during training [19, 20].
More recently, a different paradigm of AEs that constrain the output of the encoder to follow a chosen prior distribution has been proposed [21, 22, 23]. In variational autoencoding, the decoder is trained to reconstruct the input from samples that follow a chosen prior using variational inference [21]. Realistic data points can be reconstructed in the original data space by feeding the decoder with samples from the chosen prior distribution. On the other hand, the adversarial AE matches the encoder's output distribution to an arbitrary prior distribution using adversarial training with a discriminator and a generator [22]. Upon adversarial training, the encoder learns to map the data distribution to the prior distribution.
The problem addressed here is three-fold: (i) The interpretability of AE-based deep layer architectures is fostered by enforcing a high degree of weight nonnegativity in the network. This improves on NCAEs, which show negative weights despite imposing nonnegativity constraints on the network's weights [11]. (ii) It is demonstrated how the proposed architecture can be utilized to extract meaningful representations that unearth the hidden structure of high-dimensional data. (iii) It is shown that the resulting nonnegative AEs do not deteriorate in classification performance. This paper considerably expands the scope of the AE model first introduced in [24] by: (i) introducing a smoothing function for L1 regularization for numerical stability, (ii) illustrating the connection between the proposed regularization and the weights' nonnegativity, (iii) drawing more insight from a variety of datasets, (iv) comparing the proposed approach with recent AE architectures, and lastly (v) supporting the interpretability claim with new experiments on text categorization data. The paper is structured as follows: Section II introduces the network configuration and the notation for nonnegative sparse feature extraction. Section III discusses the experimental designs and Section IV presents the results. Finally, conclusions are drawn in Section V.

II Nonnegative Sparse Feature Extraction Using Constrained Autoencoders

As shown in [8], one way of representing data is by shattering it into various distinct pieces in a manner that additive merging of these pieces can reconstruct the original data. Mapping this intuition to AEs, the idea is to sparsely disintegrate data into parts in the encoding layer and subsequently additively process the parts to recombine the original data in the decoding layer. This disintegration can be achieved by imposing nonnegativity constraint on the network’s weights [25, 26, 11].

Fig. 1: (a) Symmetric and skewed weight distributions. (b)-(d) Decay function with three settings of $\alpha_1$ and $\alpha_2$ for each of the three weight distributions in (a).

II-A L1/L2-Nonnegativity Constrained Sparse Autoencoder (L1/L2-NCSAE)

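In order to encourage a higher degree of nonnegativity in the network's weights, a composite penalty term is added to the objective function, resulting in the cost function for L1/L2-NCSAE:

$$J_{L_1/L_2\text{-NCSAE}}(W, b) = J_{AE} + \beta \sum_{r=1}^{n'} D_{KL}\left(p \,\middle\|\, \frac{1}{m}\sum_{k=1}^{m} h_r\left(x^{(k)}\right)\right) + \sum_{l=1}^{2}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l} f_{L_1/L_2}\left(w_{ij}^{(l)}\right) \qquad (1)$$

where $W = \{W^{(1)}, W^{(2)}\}$ and $b = \{b^{(1)}, b^{(2)}\}$ represent the weights and biases of the encoding and decoding layers respectively; $s_l$ is the number of neurons in layer $l$; $w_{ij}^{(l)}$ represents the connection between the $j$th neuron in layer $l$ and the $i$th neuron in layer $l+1$; and for a given input $x$,

$$J_{AE} = \frac{1}{m}\sum_{k=1}^{m} \left\| \sigma\left(W^{(2)}\sigma\left(W^{(1)}x^{(k)} + b^{(1)}\right) + b^{(2)}\right) - x^{(k)} \right\|_2^2 \qquad (2)$$

where $m$ is the number of training examples and $\|\cdot\|_2$ is the Euclidean norm. $D_{KL}(\cdot)$ is the Kullback-Leibler (KL) divergence for sparsity control [27], with $p$ denoting the desired activation and $\frac{1}{m}\sum_{k} h_r(x^{(k)})$ the average activation of hidden unit $r$; $n'$ is the number of hidden units; $h_r(x^{(k)})$ denotes the activation of hidden unit $r$ due to input $x^{(k)}$; $\sigma(\cdot)$ is the element-wise application of the logistic sigmoid, $\sigma(z) = 1/(1 + e^{-z})$; $\beta$ controls the sparsity penalty term; and

$$f_{L_1/L_2}(w_{ij}) = \begin{cases} \alpha_1 |w_{ij}| + \dfrac{\alpha_2}{2} w_{ij}^2, & w_{ij} < 0 \\ 0, & w_{ij} \geq 0 \end{cases} \qquad (3)$$

where $\alpha_1$ and $\alpha_2$ are the L1 and L2 nonnegativity-constraint weight penalty factors, respectively. The hyperparameters $p$, $\beta$, $\alpha_1$, and $\alpha_2$ are set experimentally, using randomly sampled images from the training set as a held-out validation set for hyperparameter tuning, after which the network is retrained on the entire dataset. The weights and biases are updated as below using error backpropagation:

$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \xi \frac{\partial J_{L_1/L_2\text{-NCSAE}}}{\partial w_{ij}^{(l)}} \qquad (4)$$

$$b_{i}^{(l)} \leftarrow b_{i}^{(l)} - \xi \frac{\partial J_{L_1/L_2\text{-NCSAE}}}{\partial b_{i}^{(l)}} \qquad (5)$$

where $\xi > 0$ is the learning rate and the gradient of the L1/L2-NCSAE loss function is computed as in (6):

$$\frac{\partial J_{L_1/L_2\text{-NCSAE}}}{\partial w_{ij}^{(l)}} = \frac{\partial J_{AE}}{\partial w_{ij}^{(l)}} + \beta \frac{\partial}{\partial w_{ij}^{(l)}} \sum_{r=1}^{n'} D_{KL}\left(p \,\middle\|\, \frac{1}{m}\sum_{k=1}^{m} h_r\left(x^{(k)}\right)\right) + g\left(w_{ij}^{(l)}\right) \qquad (6)$$

where $g(w_{ij})$ is a composite function denoting the derivative of (3) with respect to $w_{ij}$, as in (7):

$$g(w_{ij}) = \begin{cases} \alpha_1 \nabla_{w} |w_{ij}| + \alpha_2 w_{ij}, & w_{ij} < 0 \\ 0, & w_{ij} \geq 0 \end{cases} \qquad (7)$$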
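The cost described above combines three terms: an average reconstruction error, a KL-divergence sparsity term on the mean hidden activations, and a penalty that is active only on negative weights. The following is a minimal sketch of how such a cost could be evaluated for a single-hidden-layer autoencoder. It is not the authors' implementation: the function names, the argument names (`p`, `beta`, `alpha1`, `alpha2`), and the list-of-rows weight layout are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine_sigmoid(W, b, x):
    """Return sigmoid(W x + b), with W given as a list of rows."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def nonneg_penalty(w, alpha1, alpha2):
    """Composite L1/L2 penalty; zero for nonnegative weights."""
    return alpha1 * abs(w) + 0.5 * alpha2 * w * w if w < 0 else 0.0

def ncsae_cost(X, W1, b1, W2, b2, p, beta, alpha1, alpha2):
    """Reconstruction error + beta * KL sparsity + nonnegativity penalty."""
    m = len(X)
    hidden = [affine_sigmoid(W1, b1, x) for x in X]
    recon = [affine_sigmoid(W2, b2, h) for h in hidden]
    # Average squared Euclidean reconstruction error over the m examples.
    j_ae = sum(sum((r - xi) ** 2 for r, xi in zip(rx, x))
               for rx, x in zip(recon, X)) / m
    # Mean activation of each hidden unit over the training examples.
    p_hat = [sum(h[r] for h in hidden) / m for r in range(len(W1))]
    j_kl = sum(kl_bernoulli(p, q) for q in p_hat)
    # Penalty summed over every encoder and decoder weight.
    j_pen = sum(nonneg_penalty(w, alpha1, alpha2)
                for W in (W1, W2) for row in W for w in row)
    return j_ae + beta * j_kl + j_pen
```

Setting `alpha1` to zero in this sketch leaves only the squared penalty on negative weights, which corresponds to the NCAE special case mentioned in the text.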


Although the penalty function in (1) is an extension of NCAE (obtained by setting $\alpha_1$ to zero), a close scrutiny of the weight distribution of both the encoding and decoding layers in NCAE reveals that many weights are still negative despite the nonnegativity constraints. The reason is that the original L2 norm used in NCAE penalizes negative weights with large magnitudes more strongly than those with smaller magnitudes, so a good number of the weights settle at small negative values. This paper uses an additional L1 term to even out this occurrence; that is, the L1 penalty forces most of the remaining negative weights to become nonnegative.
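The argument above can be illustrated numerically. The sketch below applies gradient descent to the penalty term alone, ignoring the reconstruction and sparsity terms, which is an assumption made purely to isolate the effect. With only the squared penalty, the push on a small negative weight shrinks along with the weight, so the weight stays negative; adding the absolute-value term gives a constant push that carries it across zero. The names `alpha1`, `alpha2`, `lr`, and `shrink` are hypothetical.

```python
def nonneg_penalty_grad(w, alpha1, alpha2):
    """Derivative of alpha1*|w| + (alpha2/2)*w^2 on w < 0; zero otherwise."""
    if w >= 0:
        return 0.0
    return -alpha1 + alpha2 * w

def shrink(w, alpha1, alpha2, lr, steps):
    """Apply `steps` gradient-descent updates using the penalty gradient only."""
    for _ in range(steps):
        w -= lr * nonneg_penalty_grad(w, alpha1, alpha2)
    return w

# L2-only penalty (alpha1 = 0): the weight decays geometrically but stays negative.
w_l2 = shrink(-0.01, alpha1=0.0, alpha2=1.0, lr=0.1, steps=100)

# Composite L1/L2 penalty: the constant L1 push moves the weight out of the
# negative range, after which the penalty gradient vanishes.
w_l1l2 = shrink(-0.01, alpha1=0.1, alpha2=1.0, lr=0.1, steps=100)
```

In a full training run the reconstruction and sparsity gradients also act on each weight, so this toy only shows the direction and relative strength of the two penalties near zero.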

II-B Implication of imposing nonnegative parameters with composite decay function

The graphical illustration of the relation between the weight distribution and the composite decay function is shown in Fig. 1. Ideally, adding the Frobenius norm of the weight matrix to the reconstruction error in (2) imposes a Gaussian prior on the weight distribution, as shown by the symmetric curve in Fig. 1a. However, using the composite function in (3) results in the imposition of a positively skewed, deformed Gaussian distribution, as in the skewed curves of Fig. 1a. The degree of nonnegativity can be adjusted using the parameters $\alpha_1$ and $\alpha_2$. Both parameters have to be carefully chosen to enforce nonnegativity while simultaneously ensuring good supervised learning outcomes. The effect of the L1 ($\alpha_2 = 0$), L2 ($\alpha_1 = 0$), and L1/L2 ($\alpha_1 \neq 0$ and $\alpha_2 \neq 0$) nonnegativity penalty terms on weight updates for the three weight distributions is shown in Fig. 1b-d. It can be observed for all three distributions that L1/L2 regularization enforces stronger weight decay than L1 or L2 regularization individually. Another observation from Fig. 1 is that the more positively skewed the weight distribution becomes, the smaller the weight decay.
The consequences of minimizing (1) are that: (i) the average reconstruction error is reduced, (ii) the sparsity of the hidden layer activations is increased, because more negative weights are forced to zero, and (iii) the number of nonnegative weights is also increased. The resultant effect of penalizing the weights simultaneously with the L1 and L2 norms is that large positive connections are preserved while their magnitudes are shrunk. However, the L1 norm in (3) is non-differentiable at the origin, and this can lead to numerical instability during simulations. To circumvent this drawback, one of the well-known smoothing functions that approximates the L1 norm in (3) is utilized. Given any finite-dimensional vector $z$ and positive constant $\kappa$, the following smoothing function approximates the L1 norm:

$$\Gamma(z, \kappa) = \sum_{i} \left(z_i^2 + \kappa\right)^{1/2} \approx \|z\|_1 \qquad (8)$$

with gradient

$$\frac{\partial \Gamma(z, \kappa)}{\partial z_i} = \frac{z_i}{\left(z_i^2 + \kappa\right)^{1/2}} \qquad (9)$$

For convenience, we adopt (8) to smooth the L1 penalty function, with $\kappa$ set experimentally.
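A smoothed absolute value of this kind is straightforward to write down. The sketch below uses the common square-root form, in which each |z_i| is replaced by sqrt(z_i^2 + kappa); the function names and the sample value of `kappa` are assumptions for illustration, not values taken from the paper.

```python
import math

def smooth_l1(z, kappa):
    """Smooth approximation of the L1 norm: sum of sqrt(z_i^2 + kappa)."""
    return sum(math.sqrt(zi * zi + kappa) for zi in z)

def smooth_l1_grad(z, kappa):
    """Gradient of smooth_l1; finite everywhere, including at z_i = 0."""
    return [zi / math.sqrt(zi * zi + kappa) for zi in z]

z = [3.0, -4.0, 0.0]
approx = smooth_l1(z, kappa=1e-8)       # close to |3| + |-4| + |0| = 7
grad = smooth_l1_grad(z, kappa=1e-8)    # close to [1, -1, 0]
```

A smaller `kappa` tracks the true L1 norm more closely but makes the gradient change more abruptly near zero, so the value is a trade-off between fidelity and numerical smoothness.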

III Experiments

Fig. 2: Filtering the signal through the L1/L2-NCSAE trained using the reduced MNIST data set with three class labels. The test image is a 28×28 pixel image unrolled into a vector of 784 values. Both the input test sample and the receptive fields of the first autoencoding layer are presented as images. The weights of the output layer are plotted as a diagram with one row for each output neuron and one column for every hidden neuron in the preceding layer. The architecture is 784-10-10-3. The range of weights is scaled to [-1, 1] and mapped to the gray colormap: -1 is assigned to black, 0 to grey, and 1 to white. That is, black pixels indicate negative weights, grey pixels indicate zero-valued weights, and white pixels indicate positive weights.

In the experiments, three data sets are used, namely: MNIST [28], NORB normalized-uniform [29], and the Reuters-21578 text categorization dataset. The Reuters-21578 dataset comprises documents that featured in the 1987 Reuters newswire. The ModApte split was employed to limit the dataset to the 10 most frequent categories. The bag-of-words format, stemmed and with stop words removed, was used. Two techniques were used to reduce the dimensionality of each document so as to preserve the most informative and least correlated words [30]. First, words were sorted based on their frequency of occurrence in the dataset, and words with frequency below 4, or above an upper cutoff, were eliminated. Second, the most informative words that do not occur in every topic were selected based on information gain with the class attribute. The remaining words (or features) in the dataset were sorted using this method, and the less important features were removed based on the desired dimension of documents. In this paper, the length of the feature vector for each of the documents was reduced to 200.
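The two-stage word selection described above can be sketched as a frequency filter followed by an information-gain ranking. This is a minimal illustration under stated assumptions: documents are represented as `{word: count}` dictionaries with positive counts, information gain is computed on binary word presence, and the names `select_words`, `min_freq`, and `max_freq` are hypothetical. The paper does not specify its exact procedure beyond the description above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(present, labels):
    """Information gain of a binary word-presence feature w.r.t. the labels."""
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in zip(present, labels) if f == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

def select_words(docs, labels, min_freq, max_freq, k):
    """Keep words within the frequency band, then return the k with highest gain."""
    totals = Counter()
    for d in docs:
        totals.update(d)  # adds the per-document counts
    kept = [w for w, c in totals.items() if min_freq <= c <= max_freq]
    scored = [(information_gain([w in d for d in docs], labels), w) for w in kept]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest gain first, ties by word
    return [w for _, w in scored[:k]]

docs = [{"wheat": 2, "the": 5}, {"wheat": 1, "the": 4},
        {"oil": 3, "the": 6}, {"oil": 1, "the": 5}]
labels = ["grain", "grain", "crude", "crude"]
selected = select_words(docs, labels, min_freq=2, max_freq=10, k=2)
```

In this toy corpus the word "the" occurs in every document and exceeds the upper frequency cutoff, so it is dropped before the information-gain ranking is applied.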
In the preliminary experiment, a subset of three digit classes from the MNIST handwritten digits was extracted for the purpose of understanding how the deep network constructed using L1/L2-NCSAE processes and classifies its input. For easy interpretation, a small deep network was constructed and trained by stacking two AEs with 10 hidden neurons each and 3 softmax neurons. The number of hidden neurons was chosen to obtain reasonably good classification accuracy while keeping the network reasonably small. The network is intentionally kept small because the full MNIST data would require a larger hidden layer size, and this may limit network interpretability. A test digit image is then filtered through the network, and it can be observed in Fig. 2 that sparsification of the weights in all the layers is one of the consequences of the nonnegativity constraints imposed on the network. Another observation is that most of the weights in the network have been confined to the nonnegative domain, which removes the opaqueness of the deep learning process. It can be seen that the fourth and seventh receptive fields of the first AE layer have dominant activations and capture most of the information about the test input. They are also able to filter distinct parts of the input digit. The outputs of the first-layer sigmoid constitute higher-level features extracted from the test image, with emphasis on the fourth and seventh features. Subsequently, in the second layer, the second, sixth, eighth, and tenth neurons have dominant activations because they have stronger connections with the dominant neurons in the first layer than the rest. Lastly, in the softmax layer, the second neuron was activated because it has the strongest connections with the dominant neurons in the second layer, thereby classifying the test image as "2".
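The layer-by-layer reading of the network described above amounts to a forward pass that records, at each hidden layer, which units respond most strongly. The following sketch shows one way to do that for a stack of sigmoid layers followed by a softmax layer. The helper names (`trace`, `dominant_units`) and the tiny weights in the usage example are illustrative assumptions, not the trained 784-10-10-3 network.

```python
import math

def dense(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid_layer(W, b, x):
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(W, b, x)]

def softmax_layer(W, b, x):
    scores = dense(W, b, x)
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dominant_units(activations, k):
    """Indices (0-based, ascending) of the k most active units."""
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    return sorted(order[:k])

def trace(x, hidden_layers, output_layer, k=2):
    """Forward pass that records the dominant units of each hidden layer."""
    report = []
    a = x
    for W, b in hidden_layers:
        a = sigmoid_layer(W, b, a)
        report.append(dominant_units(a, k))
    W_out, b_out = output_layer
    probs = softmax_layer(W_out, b_out, a)
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    return report, probs, predicted

# Toy 2-2-2 network with nonnegative weights and negative biases.
hidden = [([[2.0, 0.0], [0.0, 2.0]], [-1.0, -1.0])]
output = ([[3.0, 0.0], [0.0, 3.0]], [0.0, 0.0])
report, probs, predicted = trace([1.0, 0.0], hidden, output, k=1)
```

With nonnegative weights, a unit can only be driven up by its active inputs, which is what makes this kind of trace easy to read: the predicted class follows the chain of strongest connections back to the input.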

Fig. 3: The weights were trained using two stacked L1/L2-NCSAEs. RFs learned from the reduced NORB dataset are plotted as images at the bottom part of (a). The intensity of each pixel is proportional to the magnitude of the weight connected to that pixel in the input image, with negative values shown in black, positive values in white, and the value 0 in gray. The biases are not shown. The activations of the first-layer hidden units for the NORB objects presented in (b) are depicted on the bar chart on top of the RFs. The weights of the second-layer AE are plotted as a diagram at the topmost part of (a). Each row of the plot corresponds to the weights of one hidden unit of the second AE, and each column to one hidden unit of the first-layer AE. The magnitude of the weight corresponds to the area of each square; white indicates positive, grey indicates zero, and black indicates negative sign. The activations of the second-layer hidden units are shown as a bar chart on the right-hand side of the second-layer weight diagram. Each column shows the activations of each hidden unit for five color-coded examples of the same object. The outputs of the Softmax layer are shown for color-coded test objects with class labels (c) "four-legged animals" tagged as class 1, (d) "human figures" as class 2, and (e) "airplanes" as class 3.

The fostering of interpretability is also demonstrated using a subset of the NORB normalized-uniform dataset [29] with class labels "four-legged animals", "human figures", and "airplanes". The network was trained on the subset of the NORB data using two stacked L1/L2-NCSAEs and a Softmax layer. Fig. 3b shows the randomly sampled test patterns, and the weights and activations of the first and second AE layers are shown in Fig. 3a. The bar charts indicate the activations of hidden units for the sample input patterns. The features learned by units in each layer are localized and sparse, and allow easy interpretation of isolated data parts. The features mostly show nonnegative weights, making it easier to visualize which input object patterns they respond to. It can be seen that units in the network discriminate among objects in the images and react differently to input patterns. The third, sixth, eighth, and ninth hidden units of layer 1 capture features that are common to objects in class "2" and react mainly to them, as shown in the first-layer activations. Also, the features captured by the second-layer activations reveal that the second and fifth hidden units are mainly stimulated by objects in class "2".
The outputs of the Softmax layer represent the a posteriori class probabilities for a given sample and are denoted as Softmax scores. An important observation from Fig. 3a, b, and c is that hidden units in both layers did not capture significant representative features for the class "1" white color-coded test sample. This is one of the reasons why it is misclassified into class "3" with a probability of 0.57. The same argument applies to the class "1" dark-grey color-coded test sample, misclassified into class "3" with a probability of 0.60. In contrast, hidden units in both layers capture significant representative features for class "2" test samples of all color codes. This is why all class "2" test samples are classified correctly with high probabilities, as shown in Fig. 3d. Lastly, the network contains a good number of representative features for class "3" test samples and was able to classify 4 out of 5 correctly, as given in Fig. 3e.

IV Results and Discussion

IV-A Unsupervised Feature Learning of Image Data

In the first set of experiments, three-layer L1/L2-NCSAE, NCAE [11], DpAE [19], and conventional SAE networks with 196 hidden neurons were trained using the MNIST dataset of handwritten digits, and their ability to discover patterns in high-dimensional data is compared. These experiments were run once and recorded. The encoding weights $W^{(1)}$, also known as receptive fields or filters in the case of image data, are reshaped, scaled, centered in a 28×28 pixel box, and visualized. The filters learned by L1/L2-NCSAE are compared with those learned by its counterparts NCAE, DpAE, and SAE. It can be easily observed from the results in Fig. 4 that L1/L2-NCSAE learned receptive fields that are more sparse and localized than those of SAE, DpAE, and NCAE. The black pixels in both SAE and DpAE features result from negative weights, whose values and numbers are reduced in NCAE by the nonnegativity constraints and further reduced by the additional L1 penalty term in L1/L2-NCSAE, as shown in the histograms located on the right side of the figure. In the case of L1/L2-NCSAE, tiny strokes and dots, which constitute the basic parts of handwritten digits, are unearthed, in contrast to SAE, DpAE, and NCAE. Most of the features learned by SAE are major parts of the digits or blurred versions of the digits, which are clearly not as sparse as those learned by L1/L2-NCSAE. Also, the features learned by DpAE are fuzzy compared to those of L1/L2-NCSAE, which are sparse and distinct. Therefore, the achieved sparsity in the encoding can be traced to the ability of L1 and L2 regularization to enforce a high degree of weight nonnegativity in the network.

Fig. 4: 196 receptive fields ($W^{(1)}$) with weight histograms learned from the MNIST digit data set using (a) SAE, (b) DpAE, (c) NCAE, and (d) L1/L2-NCSAE. Black pixels indicate negative weights and white pixels indicate positive weights. The range of weights is scaled to [-1, 1] and mapped to the gray colormap: -1 is assigned to black, 0 to grey, and 1 to white.
Fig. 5: (a) Reconstruction error and (b) sparsity of hidden units measured by KL-divergence using the MNIST training dataset with $p$ = 0.05.
Fig. 6: t-SNE projection [31] of 196D representations of MNIST handwritten digits using (a) DpAE, (b) NCAE, and (c) L1/L2-NCSAE.
Fig. 7: Weights of 90 randomly selected receptive filters out of 200 for (a) SAE, (b) DpAE, (c) NCAE, and (d) L1/L2-NCSAE using the NORB dataset. The range of weights is scaled to [-1, 1] and mapped to the gray colormap: -1 is assigned to black, 0 to grey, and 1 to white.
Fig. 8: The distribution of the weights of 200 encoding ($W^{(1)}$) and decoding ($W^{(2)}$) filters learned from the NORB dataset using (a) DpAE, (b) NCAE, and (c) L1/L2-NCSAE.
Fig. 9: Visualizing 20D representations of a subset of the Reuters documents data using (a) DpAE, (b) NCAE, and (c) L1/L2-NCSAE.

Likewise, in Fig. 5a, L1/L2-NCSAE is compared with the other AEs in terms of reconstruction error while varying the number of hidden nodes. It can be observed that L1/L2-NCSAE yields a reasonably lower reconstruction error on the MNIST training set compared to SAE, DpAE, and NCAE. However, close scrutiny of the result also reveals that the reconstruction error of L1/L2-NCSAE deteriorates compared to NCAE once the hidden layer grows beyond a certain size. On average, nevertheless, L1/L2-NCSAE reconstructs better than the other AEs considered. It can also be observed that DpAE with 50% dropout has a high reconstruction error when the hidden layer size is relatively small (100 or less). This is because the few neurons left are unable to capture the dynamics in the data, which subsequently results in underfitting. However, the reconstruction error improves as the hidden layer size is increased. The lower reconstruction error in the case of L1/L2-NCSAE and NCAE is an indication that the nonnegativity constraint facilitates the learning of parts of digits that are essential for reconstructing the digits. In addition, the KL-divergence sparsity measure reveals that L1/L2-NCSAE has more sparse hidden activations than SAE, DpAE, and NCAE for different hidden layer sizes, as shown in Fig. 5b. Again, averaging over all the training examples, L1/L2-NCSAE yields fewer activated hidden neurons than its counterparts.

Fig. 10: Deep network trained on Reuters-21578 data using (a) DpAE and (b) L1/L2-NCSAE. The area of each square is proportional to the weight's magnitude. The range of weights is scaled to [-1, 1] and mapped to the gray colormap: -1 is assigned to black, 0 to grey, and 1 to white.

Also, using t-distributed stochastic neighbor embedding (t-SNE) to project the 196-D representation of MNIST handwritten digits to 2D space, the distributions of features encoded by the encoding filters of DpAE, NCAE, and L1/L2-NCSAE are visualized in Figs. 6a, b, and c, respectively. A careful look at Fig. 6a reveals that two of the digits overlap in DpAE, and this will inevitably increase the chance of misclassifying these two digits. It can also be observed in Fig. 6b, corresponding to NCAE, that one of the digits is projected with two different landmarks. In sum, the manifolds of digits with L1/L2-NCSAE are more separable than those of its counterparts, as shown in Fig. 6c, helping the classifier to map out the separating boundaries among the digits more easily.
In the second experiment, SAE, NCAE, L1/L2-NCSAE, and DpAE with 200 hidden nodes were trained using the NORB normalized-uniform dataset. This dataset contains training and test images of toys from five generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The training and testing sets contain the same number of instances of each category. Each image consists of two channels. The inner pixels of one of the channels were cropped out and resized using bicubic interpolation to form a vector with 1024 entries as the input. The weights of 90 randomly selected neurons out of 200 are plotted in Fig. 7. It can be seen that L1/L2-NCSAE learned more sparse features than all the other AEs considered. The receptive fields learned by L1/L2-NCSAE captured the actual edges of the toys, while the edges captured by NCAE are fuzzy, and those learned by DpAE and SAE are holistic. As shown in the weight distribution depicted in Fig. 8, L1/L2-NCSAE has both its encoding and decoding weights centered around zero with most of its weights positive, whereas DpAE and NCAE have weights distributed almost evenly on both sides of the origin.

                                        Before fine-tuning            After fine-tuning
Dataset  Method                         Mean (± SD)        p-value    Mean (± SD)        p-value    # Epochs
MNIST    SAE                            0.735 (± 0.015)    <0.001     0.977 (± 0.0007)   <0.001     400
         NCAE                           0.844 (± 0.0085)   0.0018     0.974 (± 0.0012)   0.812      126
         NNSAE                          0.702 (± 0.027)    <0.0001    0.970 (± 0.001)    <0.0001    400
         L1/L2-NCSAE                    0.847 (± 0.0077)   -          0.974 (± 0.0087)   -          84
         DAE (50% input dropout)        0.551 (± 0.011)    <0.0001    0.972 (± 0.0021)   0.034      400
         DpAE (50% hidden dropout)      0.172 (± 0.0021)   <0.0001    0.964 (± 0.0017)   <0.0001    400
         AAE                            -                  -          0.912 (± 0.0016)   <0.0001    1000
NORB     SAE                            0.562 (± 0.0245)   <0.0001    0.814 (± 0.0099)   0.041      400
         NCAE                           0.696 (± 0.021)    0.406      0.817 (± 0.0095)   0.001      305
         NNSAE                          0.208 (± 0.025)    <0.0001    0.738 (± 0.012)    <0.001     400
         L1/L2-NCSAE                    0.695 (± 0.0084)   -          0.812 (± 0.0001)   -          196
         DAE (50% input dropout)        0.461 (± 0.0019)   <0.0001    0.807 (± 0.0015)   0.0103     400
         DpAE (50% hidden dropout)      0.491 (± 0.0013)   <0.0001    0.815 (± 0.0038)   <0.0001    400
         AAE                            -                  -          0.791 (± 0.041)    <0.0001    1000
TABLE I: Classification accuracy on the MNIST and NORB datasets

IV-B Unsupervised Semantic Feature Learning from Text Data

In this experiment, DpAE, NCAE, and L1/L2-NCSAE are evaluated and compared based on their ability to extract semantic features from text data and to discover the underlying structure in text data. For this purpose, the Reuters-21578 text categorization dataset with 200 features is utilized to train all three types of AEs with 20 hidden nodes. A subset of examples belonging to the categories "grain", "crude", and "money-fx" was extracted from the test set. The experiments were run three times, averaged, and recorded. In Fig. 9, the 20-dimensional representations of the Reuters data subset using DpAE, NCAE, and L1/L2-NCSAE are visualized. It can be observed that L1/L2-NCSAE is able to disentangle the documents into three distinct categories with more linear manifolds than NCAE. In addition, L1/L2-NCSAE groups documents that are close in the semantic space into the same categories better than DpAE, which finds it difficult to group the documents into distinct categories with little overlap.

IV-C Supervised Learning

In the last set of experiments, a deep network was constructed using two stacked L1/L2-NCSAEs and a softmax layer for classification, to test whether the enhanced ability of the network to shatter data into parts leads to improved classification. The entire deep network is then fine-tuned to improve the accuracy of the classification. In this set of experiments, the performance of pre-training a deep network with L1/L2-NCSAE is compared with that of networks pre-trained with recent AE architectures. The MNIST and NORB data sets were utilized, and every experiment is repeated ten times and averaged to combat the effect of random initialization. The classification accuracies of the deep networks pre-trained with NNSAE [18], DpAE [19], DAE [32], AAE [22], NCAE, and L1/L2-NCSAE using MNIST and NORB data are detailed in Table I. The network architectures are 784-196-20-10 and 1024-200-20-5 for the MNIST and NORB datasets, respectively. An AAE with two layers of 196 hidden units in each of the encoder, decoder, and discriminator, and other hyperparameters tuned as described in [22], was also trained; the AAE reported in Table I instead used an encoder, decoder, and discriminator each with two layers of 1000 hidden units, trained for 1000 epochs. The classification accuracy and speed of convergence are the figures of merit used to benchmark L1/L2-NCSAE against other AEs.
It is observed from the results that the L1/L2-NCSAE-based deep network gives an improved accuracy before fine-tuning compared to methods such as NNSAE, DpAE, and NCAE. The performance in terms of classification accuracy after fine-tuning, however, is very competitive across methods. In fact, it can be inferred from the p-values of the experiments conducted on MNIST and NORB in Table I that there is no significant difference in the accuracy after fine-tuning between NCAE and L1/L2-NCSAE, even though most of the weights in L1/L2-NCSAE are constrained to be nonnegative. Therefore, even though the interpretability of the deep network has been fostered by constraining most of the weights to be nonnegative and sparse, nothing significant has been lost in terms of accuracy. In addition, the network trained with L1/L2-NCSAE was also observed to converge faster than its counterparts. On the other hand, NNSAE also has nonnegative weights but with a deterioration in accuracy, which is especially conspicuous before the fine-tuning stage. The improved accuracy before fine-tuning in the L1/L2-NCSAE-based network can be traced to its ability to decompose data into more distinguishable parts. Although the performance of L1/L2-NCSAE after fine-tuning is similar to that of DAE and NCAE and better than that of NNSAE, DpAE, and AAE, L1/L2-NCSAE constrains most of the weights to be nonnegative and sparse, fostering more transparency than the other AEs. However, DpAE and NCAE performed slightly more accurately than L1/L2-NCSAE on NORB after network fine-tuning.
With the aim of constructing an interpretable deep network, an L1/L2-NCSAE pre-trained deep network with two stacked AE layers and 10 output neurons (one for each category) in the softmax layer was constructed. It was trained on the Reuters data and compared with a network pre-trained using DpAE. The interpretation of the encoding layer of the first AE is provided by listing the words associated with the strongest weights, and the interpretation of the encoding layer of the second AE is portrayed as images characterized by both the magnitude and sign of the weights. Compared to the AE with weights of both signs shown in Fig. 10a, Fig. 10b allows for much better insight into the categorization of the topics.
In the output weight matrix, the topic earn resonates most with the 5th hidden neuron, less with the 3rd, and somewhat with the 4th. This resonance can occur only when the 5th hidden neuron responds to input words from columns 1 and 4 and, to a lesser degree, when the 3rd hidden neuron responds to input words from the 3rd column. Taken together, the dominant columns 1 and 4, followed by column 3, are the sets of words that trigger the category earn.
Analysis of the words for the topic acq leads to a similar conclusion. This topic also resonates with the two dominant hidden neurons 5 and 3, and somewhat with neuron 2. Neurons 5 and 3 are again driven by the word columns 1, 4, and 3. The difference between the two categories is that acq is also influenced, to a lesser degree, by the 6th column of words. An interesting point is the contribution of the 3rd column of words. This column connects only to the 4th hidden neuron, but the weights from this neuron in the output layer are smaller, and hence less significant, than those of any other of the five neurons (or rows) of the output weight matrix. Hence this column is the least relevant to the topical categorization.
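The tracing described above, from a category to its strongest hidden neurons and from those neurons to their strongest words, can be sketched as follows. This is a simplified illustration that assumes a single hidden layer with nonnegative weights; the helper names `top_indices` and `explain_category`, the vocabulary, and all weight values are hypothetical and are not taken from Fig. 10.

```python
def top_indices(weights, k):
    """Return the indices of the k largest weights, strongest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


def explain_category(out_row, hidden_rows, vocab, n_neurons=2, n_words=2):
    """Map one output neuron to its strongest hidden neurons and their words."""
    explanation = {}
    for h in top_indices(out_row, n_neurons):
        # With nonnegative weights, a large weight means a word excites neuron h.
        explanation[h] = [vocab[j] for j in top_indices(hidden_rows[h], n_words)]
    return explanation


# Toy nonnegative weights; every number and word below is hypothetical.
vocab = ["profit", "dividend", "merger", "stake", "oil"]
hidden_rows = [
    [0.9, 0.7, 0.0, 0.1, 0.0],  # input weights of hidden neuron 0
    [0.0, 0.1, 0.8, 0.6, 0.0],  # input weights of hidden neuron 1
    [0.0, 0.0, 0.1, 0.0, 0.9],  # input weights of hidden neuron 2
]
out_row = [0.8, 0.2, 0.0]  # output-layer weights of one category

explanation = explain_category(out_row, hidden_rows, vocab)
print(explanation)
```

Because the weights are nonnegative, ranking by magnitude is enough to read off which words support a category; with weights of both signs, excitatory and inhibitory paths would have to be untangled first.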

IV-D Experiment Running Times

The training times of networks with and without the nonnegativity constraints were compared. The constrained network converges faster and requires fewer training epochs. In addition, the unconstrained network requires more time per epoch than the constrained one. The running-time experiments were performed on the full MNIST benchmark dataset using an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz with 64 GB of RAM, running 64-bit Windows 10 Enterprise. The software was implemented in MATLAB 2015b with the batch gradient descent method, and L-BFGS in minFunc [33] was used to minimize the objective function. The usage times of the constrained and unconstrained networks were also compared, where usage time is defined as the time in milliseconds (ms) that a fully trained deep network requires to classify all the test samples. In the training phase, the unconstrained network took 48 ms per epoch, while its constrained counterpart took 46 ms. Likewise, the unconstrained network required 59.9 ms of usage time, whereas the network with nonnegative weights took 55 ms. These observations suggest that the nonnegativity constraint simplifies the resulting network.
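The usage-time measurement defined above can be illustrated with a short sketch. The experiments were run in MATLAB; the Python version below only shows the idea of timing one pass over all test samples, and `usage_time_ms`, `toy_classify`, and the sample data are hypothetical stand-ins for a trained network and its test set.

```python
import time


def usage_time_ms(classify, samples):
    """Elapsed wall-clock time, in ms, to classify every sample once."""
    start = time.perf_counter()
    predictions = [classify(x) for x in samples]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, predictions


def toy_classify(x):
    """Hypothetical stand-in classifier: index of the largest input value."""
    return max(range(len(x)), key=lambda i: x[i])


test_samples = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.2, 0.2, 0.6]]
elapsed, preds = usage_time_ms(toy_classify, test_samples)
print(f"{elapsed:.3f} ms for {len(preds)} samples")
```

In practice such a measurement would be repeated several times and averaged, since a single wall-clock reading is sensitive to other activity on the machine.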

V Conclusion

This paper addresses the concept and properties of a special regularization of DL AEs that takes advantage of nonnegative encodings. It has been shown that by using both L1 and L2 regularization to penalize the negative weights, most of them are forced to be nonnegative and sparse, and hence the network interpretability is enhanced. It is also observed that most of the weights in the softmax layer become nonnegative and sparse. In sum, encouraging nonnegativity in an NCAE-based deep architecture forces the layers to learn part-based representations of their input, and leads to comparable classification accuracy before fine-tuning the entire deep network and to only insignificant accuracy deterioration after fine-tuning. It has also been shown on select examples that concurrent L1 and L2 regularization improves the network interpretability. The performance of the proposed method was compared with the conventional SAE and NCAE in terms of sparsity, reconstruction error, and classification accuracy, using the MNIST handwritten digits, Reuters documents, and the NORB dataset to illustrate the proposed concepts.
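As a rough illustration of how a combined L1 and L2 penalty on negative weights pushes them toward zero, consider the sketch below. It assumes the simplest possible form, a penalty of alpha1*|w| + (alpha2/2)*w^2 for w < 0 and zero otherwise; the names `alpha1` and `alpha2`, the function names, and the numeric settings are assumptions made for illustration, and the regularizer actually used in the method may differ in detail.

```python
def nonneg_penalty(w, alpha1, alpha2):
    """L1 + L2 penalty applied only to negative weights (assumed form)."""
    if w >= 0:
        return 0.0
    return alpha1 * abs(w) + 0.5 * alpha2 * w * w


def nonneg_penalty_grad(w, alpha1, alpha2):
    """Derivative of the penalty with respect to w (zero for w >= 0)."""
    if w >= 0:
        return 0.0
    return -alpha1 + alpha2 * w


# Gradient descent on the penalty alone, starting from a negative weight.
# The learning rate and coefficients are hypothetical.
w = -0.5
for _ in range(100):
    w -= 0.1 * nonneg_penalty_grad(w, alpha1=0.05, alpha2=0.5)
print(w)
```

The L2 term pulls strongly on large negative weights but its pull vanishes near zero, while the L1 term contributes a constant push that carries the weight across zero in a finite number of steps. Once the weight is nonnegative the penalty and its gradient are zero, so the weight is left to the remaining terms of the objective.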


  • [1] Y. Bengio and Y. LeCun, “Scaling learning algorithms towards AI,” Large-Scale Kernel Machines, vol. 34, no. 1, pp. 1–41, 2007.
  • [2] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
  • [3] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [4] L. Deng, “A tutorial survey of architectures, algorithms, and applications for deep learning,” APSIPA Transactions on Signal and Information Processing, vol. 3, p. e2, 2014.
  • [5] S. Bengio, L. Deng, H. Larochelle, H. Lee, and R. Salakhutdinov, “Guest editors introduction: Special section on learning deep architectures,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1795–1797, 2013.
  • [6] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” Advances in Neural Information Processing Systems, vol. 19, p. 153, 2007.
  • [7] B. Ayinde and J. Zurada, “Clustering of receptive fields in autoencoders,” in Neural Networks (IJCNN), 2016 International Joint Conference on.   IEEE, 2016, pp. 1310–1317.
  • [8] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
  • [9] J. Chorowski and J. M. Zurada, “Learning understandable neural networks with nonnegative weight constraints,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 26, no. 1, pp. 62–69, 2015.
  • [10] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996.
  • [11] E. Hosseini-Asl, J. M. Zurada, and O. Nasraoui, “Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 27, no. 12, pp. 2486–2498, 2016.
  • [12] M. Ranzato, Y. Boureau, and Y. LeCun, “Sparse feature learning for deep belief networks,” Advances in Neural Information Processing Systems, vol. 20, pp. 1185–1192, 2007.
  • [13] M. Ishikawa, “Structural learning with forgetting,” Neural Networks, vol. 9, no. 3, pp. 509–521, 1996.
  • [14] P. L. Bartlett, “The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network,” Information Theory, IEEE Transactions on, vol. 44, no. 2, pp. 525–536, 1998.
  • [15] G. Gnecco and M. Sanguineti, “Regularization techniques and suboptimal solutions to optimization problems in learning from data,” Neural Computation, vol. 22, no. 3, pp. 793–829, 2010.
  • [16] J. Moody, S. Hanson, A. Krogh, and J. A. Hertz, “A simple weight decay can improve generalization,” Advances in Neural Information Processing Systems, vol. 4, pp. 950–957, 1995.
  • [17] O. E. Ogundijo, A. Elmas, and X. Wang, “Reverse engineering gene regulatory networks from measurement with missing values,” EURASIP Journal on Bioinformatics and Systems Biology, vol. 2017, no. 1, p. 2, 2017.
  • [18] A. Lemme, R. Reinhart, and J. Steil, “Online learning and generalization of parts-based image representations by non-negative sparse autoencoders,” Neural Networks, vol. 33, pp. 194–203, 2012.
  • [19] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
  • [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [21] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [22] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
  • [23] Y. Burda, R. Grosse, and R. Salakhutdinov, “Importance weighted autoencoders,” arXiv preprint arXiv:1509.00519, 2015.
  • [24] B. O. Ayinde, E. Hosseini-Asl, and J. M. Zurada, “Visualizing and understanding nonnegativity constrained sparse autoencoder in deep learning,” in Artificial Intelligence and Soft Computing (ICAISC 2016), L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh, and J. Zurada, Eds., Lecture Notes in Computer Science, vol. 9692.   Springer, 2016, pp. 3–14.
  • [25] S. J. Wright and J. Nocedal, Numerical optimization.   Springer New York, 1999, vol. 2.
  • [26] T. D. Nguyen, T. Tran, D. Phung, and S. Venkatesh, “Learning parts-based representations with nonnegative restricted Boltzmann machine,” in Asian Conference on Machine Learning, 2013, pp. 133–148.
  • [27] A. Ng, “Sparse autoencoder,” in CS294A Lecture notes.   Stanford University, 2011.
  • [28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [29] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2.   IEEE, 2004, pp. II–97.
  • [30] P.-N. Tan, M. Steinbach, V. Kumar et al., Introduction to data mining.   Pearson Addison Wesley Boston, 2006, vol. 1.
  • [31] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  • [32] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in 25th International Conference on Machine Learning.   ACM, 2008, pp. 1096–1103.
  • [33] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM Journal on Scientific Computing, vol. 16, no. 5, pp. 1190–1208, 1995.