1 Introduction
Mathematical morphology is strongly related to the problem of data representation. Applying a morphological filter can be seen as a test of how well the analyzed element is represented by the set of invariants of the filter. For example, applying an opening by a structuring element $B$ tells how well a shape can be represented by the supremum of translations of $B$. The morphological skeleton [14, 17] is a typical example of a description of shapes by a family of building blocks, classically homothetic spheres. It provides a disjunctive decomposition where components (for example, the spheres) can only contribute positively, as they are combined by supremum. A natural question is the optimality of this additive decomposition according to a given criterion, for example its sparsity, that is, the number of components needed to represent an object. Finding a sparse disjunctive (or part-based) representation has at least two important features: first, it saves resources such as memory and computation time in the processing of the represented object; secondly, it provides a better understanding of this object, as it reveals its most elementary components, hence operating a dimensionality reduction that can alleviate the issue of model overfitting. Such representations are also believed to be the ones at stake in human object recognition [18].
Similarly, the question of finding a sparse disjunctive representation of a whole database is also of great interest and will be the main focus of the present paper. More precisely, we will approximate such a representation by a nonnegative, sparse linear combination of nonnegative components, and we will call this representation additive. Given a large set of images, our concern is then to find a smaller set of nonnegative image components, called a dictionary, such that any image of the database can be expressed as an additive combination of the dictionary components. As we will review in the next section, this question lies at the crossroads of two broader topics known as sparse coding and dictionary learning [13].
Besides a better understanding of the data structure, our approach is also more specifically linked to mathematical morphology applications. Inspired by recent work [1, 20], we look for image representations that can be used to efficiently calculate approximations to morphological operators. The main goal is to be able to apply morphological operators to massive sets of images by applying them only to the reduced set of dictionary images. This is especially relevant in the analysis of remote sensing hyperspectral images, where different kinds of morphological decomposition, such as morphological profiles [15], are widely used. For reasons that will be explained later, sparsity and nonnegativity are sound requirements to achieve this goal. What is more, whereas the representation process can be learned offline on a training dataset, we need to compute the decomposition of any new sample online. Hence, we take advantage of the recent advances in deep, sparse and nonnegative autoencoders to design a new framework able to learn part-based representations of an image database, compatible with morphological processing.
The existing work on nonnegative sparse representations of images is reviewed in Section 2, which stands as a baseline and motivation for the present study. We then present our method in Section 3, before showing results on two image datasets (MNIST [9] and Fashion-MNIST [21]) in Section 4, where we also show how it compares to other deep part-based representations. We finally draw conclusions and suggest several tracks for future work in Section 5. The code for reproducing our experiments is available online at https://gitlab.telecomparistech.fr/imagespublic/asymae_morpho.
2 Related work
2.1 Nonnegative sparse mathematical morphology
The present work finds its original motivation in [20], where the authors pose the problem of learning a representation of a large image dataset in order to quickly compute approximations of morphological operators on the images. They find a good representation in the sparse variant of Nonnegative Matrix Factorization (sparse NMF) [7], which we present hereafter.
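Consider a family of $M$ images (binary or grayscale) $x^{(1)}$, $x^{(2)}$, …, $x^{(M)}$ of $N$ pixels each, aggregated into an $M \times N$ data matrix $X = (x^{(1)}, \dots, x^{(M)})^T$ (the $i$-th row of $X$ is the transpose of $x^{(i)}$ seen as a vector). Given a feature dimension $k \in \mathbb{N}^*$ and two numbers $s_H$ and $s_W$ in $[0, 1]$, a sparse NMF of $X$ with dimension $k$, as defined in [7], is any solution of the problem

$$
(H, W) = \arg\min \|X - HW\|_F^2 \quad \text{s.t.} \quad
\begin{cases}
H \in \mathbb{R}^{M \times k},\ W \in \mathbb{R}^{k \times N} \\
H \geq 0,\ W \geq 0 \\
\sigma(H_{:,j}) = s_H,\ \sigma(W_{j,:}) = s_W,\quad 1 \leq j \leq k
\end{cases}
\tag{1}
$$

where the second constraint means that both $H$ and $W$ have nonnegative coefficients, and the third constraint imposes the degree of sparsity of the columns of $H$ and lines of $W$ respectively, with $\sigma$ the function defined by

$$
\forall v \in \mathbb{R}^p \setminus \{0\}, \quad \sigma(v) = \frac{\sqrt{p} - \|v\|_1 / \|v\|_2}{\sqrt{p} - 1}. \tag{2}
$$

Note that $\sigma$ takes values in $[0, 1]$. The value $\sigma(v) = 1$ characterizes vectors having a unique nonzero coefficient, therefore the sparsest ones, and $\sigma(v) = 0$ the vectors whose coefficients all have the same absolute value. Hoyer [7] designed an algorithm to find at least a local minimizer for the problem (1), and it was shown that under fairly general conditions (and provided the norms of $H$ and $W$ are fixed) the solution is unique [19].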
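As a quick numerical illustration of the sparsity measure of [7], here is a minimal Python sketch. It assumes the usual definition $\sigma(v) = (\sqrt{p} - \|v\|_1 / \|v\|_2) / (\sqrt{p} - 1)$ for a nonzero vector $v$ with $p \geq 2$ coefficients. The function name `hoyer_sparsity` is hypothetical and is not taken from the released code.

```python
import math


def hoyer_sparsity(v):
    """Sparsity measure of a nonzero vector v with at least two coefficients.

    Returns 1 when a single coefficient is nonzero and 0 when all
    coefficients share the same absolute value.
    """
    p = len(v)
    if p < 2:
        raise ValueError("the measure needs at least two coefficients")
    l1 = sum(abs(c) for c in v)
    l2 = math.sqrt(sum(c * c for c in v))
    if l2 == 0.0:
        raise ValueError("the measure is undefined for the zero vector")
    return (math.sqrt(p) - l1 / l2) / (math.sqrt(p) - 1.0)


one_hot = [0.0, 0.0, 3.0, 0.0]    # a single nonzero coefficient
uniform = [1.0, -1.0, 1.0, -1.0]  # all coefficients of equal magnitude
sparsity_one_hot = hoyer_sparsity(one_hot)
sparsity_uniform = hoyer_sparsity(uniform)
```

The two test vectors correspond to the extreme cases discussed in the text: the one-hot vector reaches the maximal value and the constant-magnitude vector the minimal one.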
In the terminology of representation learning, each row $h^{(i)} = (h_{i,1}, \dots, h_{i,k})$ of $H$ contains the encoding, or latent features, of the input image $x^{(i)}$, and $W$ holds in its rows $w_1, \dots, w_k$ a set of $k$ images called the dictionary. In the following, we will use the term atom images or atoms to refer to the images of the dictionary. As stated by Equation (1), the atoms are combined to approximate each image of the dataset. This combination also writes as follows:

$$
x^{(i)} \approx \hat{x}^{(i)} = h^{(i)} W = \sum_{j=1}^{k} h_{i,j}\, w_j. \tag{3}
$$

The assumption behind this decomposition is that the more similar the images of the set, the smaller the dimension $k$ required to approximate it accurately. Note that only $k(N + M)$ values need to be stored or handled when using the previous approximation to represent the data, against the $NM$ values composing the original data.

By choosing the sparse NMF representation, the authors of [20] aim at approximating a morphological operator $\phi$ on the data $X$ by applying it to the atom images only, before projecting back into the input image space. That is, they want $\phi(x^{(i)}) \approx \Phi(x^{(i)})$, with $\Phi$ defined by

$$
\Phi(x^{(i)}) := \sum_{j=1}^{k} h_{i,j}\, \phi(w_j). \tag{4}
$$
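The following NumPy sketch illustrates the two ideas of this paragraph: reconstructing images from encodings and atoms, and approximating an operator by applying it to the atoms only. It assumes that the encodings are the rows of a matrix `H` and the atoms the rows of a matrix `W`. The function names and the toy operator (a cyclic 1D dilation by three samples, used as a stand-in for a morphological operator) are hypothetical, and the matrices are random rather than the result of an actual sparse NMF.

```python
import numpy as np


def reconstruct(H, W):
    """Approximate every image as a combination of atoms: X_hat = H W."""
    return H @ W


def part_based_approximation(H, W, operator):
    """Apply `operator` to each atom (row of W), then recombine with H."""
    transformed_atoms = np.stack([operator(w) for w in W])
    return H @ transformed_atoms


def cyclic_dilation(w):
    """Toy operator: maximum over each sample and its two cyclic neighbours."""
    return np.maximum(w, np.maximum(np.roll(w, 1), np.roll(w, -1)))


rng = np.random.default_rng(0)
M, N, k = 6, 16, 3       # number of images, pixels per image, atoms
H = rng.random((M, k))   # nonnegative encodings (random stand-in)
W = rng.random((k, N))   # nonnegative atoms (random stand-in)

X_hat = reconstruct(H, W)
stored_values = H.size + W.size  # k * (N + M) values instead of N * M
approx = part_based_approximation(H, W, cyclic_dilation)
```

Only the `k` atoms go through the operator, whatever the number `M` of images, which is where the computational saving comes from.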
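The operator $\Phi$ in Equation (4) is called a part-based approximation to $\phi$. To understand why nonnegativity and sparsity allow hoping for this approximation to be a good one, we can point out a few key arguments. First, sparsity favors atom images whose supports have little pairwise overlap. Secondly, a sum of images with disjoint supports is equal to their (pixelwise) supremum. Finally, dilations commute with the supremum and, under certain conditions that are favored by sparsity, so do erosions. To make this precise, let us consider a flat, extensive dilation $\delta_B$ and its adjoint anti-extensive erosion $\varepsilon_B$, $B$ being a flat structuring element. Assume furthermore that for any $i \in [1, M]$ and any $j \neq l$ in $[1, k]$ with $h_{i,j} h_{i,l} \neq 0$, $\delta_B(w_j) \wedge \delta_B(w_l) = 0$. Then on the dataset $X$, and up to the reconstruction error of Equation (3), $\delta_B$ and $\varepsilon_B$ are equal to their approximations as defined by Equation (4), that is to say:

$$
\delta_B\Big(\sum_{j=1}^{k} h_{i,j} w_j\Big) = \delta_B\Big(\bigvee_{j=1}^{k} h_{i,j} w_j\Big) = \bigvee_{j=1}^{k} h_{i,j}\, \delta_B(w_j) = \sum_{j=1}^{k} h_{i,j}\, \delta_B(w_j),
$$

and similarly, since $\delta_B(x) \wedge \delta_B(y) = 0 \Rightarrow x \wedge y = 0$ for $\delta_B$ extensive, we also get

$$
\varepsilon_B\Big(\sum_{j=1}^{k} h_{i,j} w_j\Big) = \sum_{j=1}^{k} h_{i,j}\, \varepsilon_B(w_j).
$$

It follows that the same holds for the opening $\delta_B \varepsilon_B$. The assumption we just made is obviously too strong and unlikely to be verified, but this example helps realize that the sparser the nonnegative decomposition, the more disjoint the supports of the atom images and the better the approximation of a flat morphological operator.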
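As a particular case, in this paper we will focus on part-based approximations of the dilation $\delta_B$ by a structuring element $B$, expressed as:

$$
D_B(x^{(i)}) := \sum_{j=1}^{k} h_{i,j}\, \delta_B(w_j), \tag{5}
$$

where the $h_{i,j}$ are the coefficients of the encoding of $x^{(i)}$ and the $w_j$ are the atoms. We will compare this approximation with the actual dilation of our input images to evaluate our model, as shown in Figure 1.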
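The role of disjoint supports can be checked numerically with a small NumPy sketch. It assumes a flat dilation by the 3x3 cross (the origin and its four neighbours) as structuring element, which is one possible reading of a small disk, and nonnegative images so that zero padding is harmless. All function names are hypothetical. The part-based dilation sums the dilated atoms weighted by the encoding, and is compared with the dilation of the combined image.

```python
import numpy as np


def dilate_cross(img):
    """Flat dilation of a nonnegative 2D image by the 3x3 cross."""
    padded = np.pad(img, 1, mode="constant", constant_values=0.0)
    neighbours = np.stack([
        padded[1:-1, 1:-1],  # origin
        padded[:-2, 1:-1],   # pixel above
        padded[2:, 1:-1],    # pixel below
        padded[1:-1, :-2],   # pixel to the left
        padded[1:-1, 2:],    # pixel to the right
    ])
    return neighbours.max(axis=0)


def part_based_dilation(h, atoms):
    """Sum over atoms of h_j times the dilated atom; atoms has shape (k, rows, cols)."""
    dilated_atoms = np.stack([dilate_cross(w) for w in atoms])
    return np.tensordot(h, dilated_atoms, axes=1)


def single_pixel(shape, position):
    img = np.zeros(shape)
    img[position] = 1.0
    return img


def approximation_gap(h, atoms):
    """Largest pixelwise difference between exact and part-based dilation."""
    x = np.tensordot(h, atoms, axes=1)  # image built from the atoms
    return np.abs(dilate_cross(x) - part_based_dilation(h, atoms)).max()


h = np.array([0.5, 2.0])
# Dilated supports are disjoint: the approximation is exact.
far_atoms = np.stack([single_pixel((7, 7), (1, 1)), single_pixel((7, 7), (5, 5))])
# Dilated supports overlap: the sum overshoots the supremum.
near_atoms = np.stack([single_pixel((7, 7), (3, 3)), single_pixel((7, 7), (3, 4))])

gap_far = approximation_gap(h, far_atoms)
gap_near = approximation_gap(h, near_atoms)
```

With well-separated atoms the gap vanishes, while adjacent atoms give a strictly positive gap, because a sum of nonnegative overlapping images exceeds their supremum.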
2.2 Deep autoencoder approaches
The main drawback of the NMF algorithm is that it is an offline process: the encoding of any new sample with regard to the previously learned basis requires either solving a computationally expensive constrained optimization problem, or relaxing the nonnegativity constraint by using the pseudoinverse $W^{+}$ of the basis. The various approaches proposed to overcome this shortcoming rely on deep learning, and especially on deep autoencoders, which are widely used in the representation learning field and offer an online representation process.

An autoencoder, as represented in Figure 2, is a model composed of two stacked neural networks, an encoder and a decoder, whose parameters are trained by minimizing a loss function. A common example of loss function is the mean squared error (MSE) between the input images $x^{(i)}$ and their reconstructions $\hat{x}^{(i)}$ by the decoder:

$$
L_{AE} = \frac{1}{M} \sum_{i=1}^{M} \big\| \hat{x}^{(i)} - x^{(i)} \big\|_2^2. \tag{6}
$$
In this framework, and when the decoder is composed of a single linear layer (possibly followed by a nonlinear activation), the model approximates the input images as:

$$
\hat{x}^{(i)} = f\Big(b + \sum_{j=1}^{k} h^{(i)}_j w_j\Big) = f\big(b + W h^{(i)}\big), \tag{7}
$$

where $h^{(i)}$ is the encoding of the input image $x^{(i)}$ by the encoder network, $b$ and $W$ are respectively the bias and weights of the linear layer of the decoder, and $f$ is the (possibly nonlinear) activation function, which is applied pixelwise to the output of the linear layer. The output $\hat{x}^{(i)}$ is called the reconstruction of the input image $x^{(i)}$ by the autoencoder. It can be considered as a linear combination of atom images, up to the addition of an offset image $b$ and to the application of the activation function $f$. The images of our learned dictionary are hence the columns $w_1, \dots, w_k$ of the weight matrix $W$ of the decoder. We can extend the definition of part-based approximation, described in Section 2.1, to our deep-learning architectures, by applying the morphological operator to these atoms $w_1$, …, $w_k$, as pictured by the “dilated decoder” in Figure 2. Note that a central question lies in how to set the size $k$ of the latent space. This question is beyond the scope of this study and the value of $k$ will be arbitrarily fixed (we take $k = 100$) in the following.

The NNSAE architecture, from Lemme et al. [11], proposes a very simple and shallow architecture for online part-based representation, using a linear encoder and decoder with tied weights (the weight matrix of the decoder is the transpose of the weight matrix of the encoder). Both the NCAE architecture, from Hosseini-Asl et al. [6], and the work of Ayinde et al. [2] that aims at extending it, drop this transpose relationship between the weights of the encoder and of the decoder, increasing the capacity of the model. Those three networks enforce the nonnegativity of the elements of the representation, as well as the sparsity of the image encodings, using various techniques.
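To make the single-layer decoder of this section concrete, here is a minimal NumPy sketch of the reconstruction $f(b + W h)$, where the atoms are the columns of $W$. The leaky ReLU slope `alpha=0.1` and the function names are assumptions chosen for illustration. The weights are random stand-ins, not trained values.

```python
import numpy as np


def leaky_relu(z, alpha=0.1):
    """Identity on nonnegative entries, slope `alpha` on negative ones (alpha is an assumed value)."""
    return np.where(z >= 0, z, alpha * z)


def decode(h, W, b, activation=leaky_relu):
    """Single linear layer followed by a pixelwise activation.

    W has shape (N, k): its k columns are the atom images, flattened.
    """
    return activation(b + W @ h)


rng = np.random.default_rng(0)
N, k = 16, 4
W = rng.random((N, k))  # nonnegative decoder weights (random stand-in)
b = np.zeros(N)         # offset image

# A one-hot encoding selects a single atom when the offset is zero.
h = np.zeros(k)
h[2] = 1.0
x_hat = decode(h, W, b)
```

With a zero offset and a one-hot encoding, the reconstruction is exactly one column of `W`, which is why these columns can be read as the dictionary images.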
2.2.1 Enforcing sparsity of the encoding
The most prevalent idea to enforce sparsity of the encoding in a neural network can be traced back to the work of H. Lee et al. [10]. This variant penalizes, through the loss function, a deviation of the expected activation of each hidden unit (i.e. the output units of the encoder) from a low fixed level $p$. Intuitively, this should ensure that each of the units of the encoding is activated only for a limited number of images. The resulting loss function of the sparse autoencoder is then:

$$
L_{AE} = \frac{1}{M} \sum_{i=1}^{M} \big\| \hat{x}^{(i)} - x^{(i)} \big\|_2^2 + \beta \sum_{j=1}^{k} S(p, t_j), \qquad t_j = \frac{1}{M} \sum_{i=1}^{M} h^{(i)}_j, \tag{8}
$$

where $t_j$ is the expected (mean) activation of the $j$-th hidden unit, the parameter $p$ sets the expected activation objective of each of the hidden neurons, and the parameter $\beta$ controls the strength of the regularization. The function $S$ can be of various forms, which were empirically surveyed in [22]. The approach adopted by the NCAE [6] and its extension [2] relies on a penalty function based on the KL divergence between two Bernoulli distributions, whose parameters are the target activation $p$ and the expected activation $t_j$ respectively, as used in [6]:

$$
S(p, t_j) = KL(p \,\|\, t_j) = p \log \frac{p}{t_j} + (1 - p) \log \frac{1 - p}{1 - t_j}. \tag{9}
$$
The NNSAE architecture [11] introduces a slightly different way of enforcing the sparsity of the encoding, based on a parametric logistic activation function at the output of the encoder, whose parameters are trained along with the other parameters of the network.
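The KL-based penalty used by the NCAE can be sketched in a few lines of NumPy. This is a minimal illustration, assuming encodings in $(0, 1)$ stored as the rows of an array and the standard form $p \log(p / t_j) + (1 - p) \log((1 - p) / (1 - t_j))$, with $t_j$ the mean activation of hidden unit $j$. The function name and the clipping constant `eps` are hypothetical.

```python
import numpy as np


def kl_sparsity_penalty(encodings, p, eps=1e-8):
    """Sum over hidden units of KL(Bernoulli(p) || Bernoulli(t_j)).

    encodings has shape (M, k); t_j is the mean activation of unit j
    over the M samples, clipped away from 0 and 1 for numerical safety.
    """
    t = np.clip(encodings.mean(axis=0), eps, 1.0 - eps)
    kl = p * np.log(p / t) + (1.0 - p) * np.log((1.0 - p) / (1.0 - t))
    return float(kl.sum())


p = 0.05
on_target = np.full((10, 4), p)  # every unit has mean activation p
too_active = np.full((10, 4), 0.5)  # every unit is active half of the time

penalty_on_target = kl_sparsity_penalty(on_target, p)
penalty_too_active = kl_sparsity_penalty(too_active, p)
```

The penalty vanishes when every unit meets the target activation and grows as the mean activations move away from it. In a training loss it would be multiplied by the regularization weight and added to the reconstruction term.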
2.2.2 Enforcing nonnegativity of the decoder weights
For the NMF (Section 2.1) and for the decoder, nonnegativity results in a part-based representation of the input images. In the case of neural networks, enforcing the nonnegativity of the weights of a layer eliminates cancellations of input signals. In all the aforementioned works, the encoding is nonnegative since the activation function at the output of the encoder is a sigmoid. In the literature, various approaches have been designed to enforce weight nonnegativity. A popular approach is to use an asymmetric weight decay, added to the loss function of the network, to enact more decay on the negative weights than on the positive ones. However, this approach, used in both the NNSAE [11] and NCAE [6] architectures, does not ensure that all weights will be nonnegative. This issue motivated the variant of the NCAE architecture [2, 11], which uses either the $L_1$ rather than the $L_2$ norm, or a smoothed version of the decay using both the $L_1$ and the $L_2$ norms. The source code of that method being unavailable, we did not use this more recent version as a baseline for our study.
3 Proposed model
We propose an online part-based representation learning model, using an asymmetric autoencoder with sparsity and nonnegativity constraints. As pictured in Figure 3, our architecture is composed of two networks: a deep encoder and a shallow decoder (hence the asymmetry and the name AsymAE we chose for our architecture). The encoder network is based on the discriminator of the infoGAN architecture introduced in [4], which was chosen for its average depth and its use of widely adopted deep learning components such as batch normalization [8], 2D convolutional layers [5] and the leaky ReLU activation function [12]. It has been designed specifically to perform interpretable representation learning on datasets such as MNIST and Fashion-MNIST. The network can be adapted to fit larger images. The decoder network is similar to the one presented in Figure 2. A leaky ReLU activation has been chosen after the linear layer. Its behavior is the same as the identity for positive entries, while it multiplies the negative ones by a fixed coefficient. This activation function has shown better performance in similar architectures [12]. The sparsity of the encoding is achieved using the same approach as in [2, 6], which consists in adding to the previous loss function the regularization term described in Equations (8) and (9).

We only enforced the nonnegativity of the weights of the decoder, as they define the dictionary of images of our learned representation, and as enforcing the nonnegativity of the encoder weights would bring nothing but more constraints to the network and lower its capacity. We enforced this nonnegativity constraint explicitly by projecting our weights onto the nearest point of the nonnegative orthant after each update of the optimization algorithm (such as stochastic gradient descent). The main asset of this method, which does not use any additional penalty function and is quite similar to the way the NMF enforces nonnegativity, is that it ensures nonnegativity of all weights without the cumbersome search for good values of the parameters of the various regularization terms in the loss function.
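The projection step described above amounts to clamping negative weights to zero after each update. The NumPy sketch below illustrates it on a plain gradient step. It is a minimal illustration with hypothetical names and hand-picked numbers, not the training loop of the released code.

```python
import numpy as np


def projected_gradient_step(weights, grad, learning_rate):
    """Plain gradient step, then projection onto the nonnegative orthant.

    The nearest nonnegative point (in Euclidean distance) is obtained by
    setting every negative coefficient to zero.
    """
    updated = weights - learning_rate * grad
    return np.maximum(updated, 0.0)


decoder_weights = np.array([[0.5, 0.1],
                            [0.0, 0.3]])
gradient = np.array([[1.0, 1.0],
                     [-1.0, 2.0]])

new_weights = projected_gradient_step(decoder_weights, gradient, learning_rate=0.2)
```

Weights that would have become negative are set to zero, and the others follow the usual update, so the constraint holds exactly after every step without any extra term in the loss.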
4 Experiments
To demonstrate the strengths and drawbacks of our method, we have conducted experiments on two well-known datasets, MNIST [9] and Fashion-MNIST [21]. These two datasets share common features, such as the size of the images ($28 \times 28$), the number of classes represented ($10$), and the total number of images ($70000$), divided into a training set of $60000$ images and a test set of $10000$ images. We compared our method to three baselines: the sparse NMF [7], the NNSAE [11] and the NCAE [6]. The three deep-learning models (AsymAE (ours), NNSAE and NCAE) were trained until convergence on the training set, and evaluated on the test set. The sparse NMF algorithm was run and evaluated on the test set. Note that all models but the NCAE may produce reconstructions that do not fully belong to the interval $[0, 1]$. In order to compare the reconstructions and the part-based approximations produced by the various algorithms, their outputs will be clipped between 0 and 1. There is no need to apply this operation to the output of the NCAE, as a sigmoid activation enforces the output of its decoder to belong to $[0, 1]$. We used three measures to conduct this comparison:

- the reconstruction error, that is the pixelwise mean squared error between the input images $x^{(i)}$ of the test dataset ($M$ images of $N$ pixels each) and their reconstruction/approximation $\hat{x}^{(i)}$: $\frac{1}{MN} \sum_{i=1}^{M} \| x^{(i)} - \hat{x}^{(i)} \|_2^2$;
- the sparsity of the encoding, measured using the mean over all test images of the sparsity measure $\sigma$ (Equation 2): $\frac{1}{M} \sum_{i=1}^{M} \sigma(h^{(i)})$;
- the approximation error to dilation by a disk of radius 1, obtained by computing the pixelwise mean squared error between the dilation $\delta_B$ by a disk of radius 1 of the original image and the part-based approximation $D_B$ to the same dilation, using the learned representation: $\frac{1}{MN} \sum_{i=1}^{M} \| D_B(x^{(i)}) - \delta_B(x^{(i)}) \|_2^2$.
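The measures above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: images are stored as rows with values in $[0, 1]$, model outputs are clipped to $[0, 1]$ before the error is computed, and the sparsity measure is the one of [7]. The function names are hypothetical. The same `clipped_mse` function serves for the reconstruction error and for the dilation approximation error, depending on which pair of arrays is passed.

```python
import numpy as np


def clipped_mse(targets, outputs):
    """Pixelwise mean squared error after clipping the outputs to [0, 1]."""
    clipped = np.clip(outputs, 0.0, 1.0)
    return float(np.mean((targets - clipped) ** 2))


def mean_encoding_sparsity(H):
    """Mean over the rows of H of the sparsity measure of [7] (rows must be nonzero)."""
    k = H.shape[1]
    l1 = np.abs(H).sum(axis=1)
    l2 = np.sqrt((H ** 2).sum(axis=1))
    return float(np.mean((np.sqrt(k) - l1 / l2) / (np.sqrt(k) - 1.0)))


# Toy check values (not experimental results).
targets = np.zeros((2, 4))
outputs = np.full((2, 4), 2.0)  # clipped to 1 before comparison
error = clipped_mse(targets, outputs)

encodings = np.array([[0.0, 0.0, 3.0, 0.0],   # sparsest possible row
                      [1.0, 1.0, 1.0, 1.0]])  # least sparse row
sparsity = mean_encoding_sparsity(encodings)
```

The toy arrays only exercise the extreme cases and say nothing about the experimental values reported in the paper.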
The parameter settings used for NCAE and the NNSAE algorithms are the ones provided in [6, 11]. For the sparseNMF, a sparsity constraint of was applied to the encodings and no sparsity constraint was applied on the atoms of the representation. For our AsymAE algorithm, was fixed for the sparsity objective of the regularizer of Equation (9), and the weight of the sparsity regularizer in the loss function in Equation (8) was set to for MNIST and for FashionMNIST. Various other values have been tested for each algorithm, but the improvement of one of the evaluation measures usually came at the expense of the two others. Quantitative results are summarized in Table 1. Reconstructions by the various approaches of some sample images from both datasets are shown in Figure 4.
Both the quantitative results and the reconstructed images attest to the capacity of our model to reach a better tradeoff between the accuracy of the reconstruction and the sparsity of the encoding (which usually comes at the expense of the former criterion) than the other neural architectures. Indeed, in all conducted experiments, varying the parameters of the NCAE and the NNSAE in an attempt to increase the sparsity of the encoding came with a dramatic increase of the reconstruction error of the model. We failed, however, to reach a tradeoff as good as that of the sparse NMF algorithm, which manages to match a high sparsity of the encoding with a low reconstruction error, especially on the Fashion-MNIST dataset. The major difference between the algorithms can be seen in Figure 5, which pictures 16 of the 100 atoms of each of the four learned representations. While sparse NMF manages, for both datasets, to build highly interpretable and clean part-based representations, the two deep baselines build representations that picture either too local shapes, in the case of the NNSAE, or too global ones, in the case of the NCAE. Our method suffers from much the same issues as the NCAE, as almost full shapes are recognizable in the atoms. We noticed through experiments that increasing the sparsity of the encoding leads to fewer and fewer local features in the atoms. It has to be noted that the asymmetric weight decay regularization used by the NCAE and NNSAE models allows for a certain proportion of negative weights. As an example, up to of the pixels of the atoms of the NCAE model trained on the Fashion-MNIST dataset are negative, although their amplitude is lower than the average amplitude of the positive weights. The amount of negative weights can be reduced by increasing the corresponding regularization, which comes at the price of an increased reconstruction error and less sparse encodings.
Finally, Figure 6 pictures the part-based approximation to dilation by a structuring element of size one, computed using the four different approaches on ten images from the test set. Although the quantitative results state otherwise, we can note that our approach yields a quite interesting part-based approximation, thanks to a good balance between a low overlap of atoms (and dilated atoms) and a good reconstruction capability.
5 Conclusions and future work
We have presented an online method to learn a part-based dictionary representation of an image dataset, designed for accurate and efficient approximations of morphological operators. This method relies on autoencoder networks, with a deep encoder for a higher reconstruction capability and a shallow linear decoder for a better interpretation of the representation. Among the online part-based methods using autoencoders, it achieves the state-of-the-art tradeoff between the accuracy of reconstructions and the sparsity of image encodings. Moreover, it ensures a strict (that is, non-approximated) nonnegativity of the learned representation. These results would need to be confirmed on larger and more complex images (e.g. color images), as the proposed model is scalable. We especially evaluated the learned representation on an additional criterion, namely the commutation of the representation with a morphological dilation, and noted that all online methods perform worse than the offline sparse NMF algorithm. A possible improvement would be to impose a stronger sparsity on the dictionary images through an appropriate regularization. Additionally, using a morphological layer [3, 16] as a decoder may be more consistent with our definition of part-based approximation, since a representation in the $(\max, +)$ algebra would commute with the morphological dilation by essence.
Acknowledgments:
This work was partially funded by a grant from Institut Mines-Telecom and MINES ParisTech.
References
[1] Angulo, J., Velasco-Forero, S.: Sparse mathematical morphology using nonnegative matrix factorization. In: Soille, P., Pesaresi, M., Ouzounis, G.K. (eds.) 10th International Symposium on Mathematical Morphology and Its Application to Signal and Image Processing (ISMM). vol. LNCS 6671, pp. 1–12 (2011)
[2] Ayinde, B.O., Zurada, J.M.: Deep learning of constrained autoencoders for enhanced understanding of data. CoRR abs/1802.00003 (2018)
[3] Charisopoulos, V., Maragos, P.: Morphological perceptrons: Geometry and training algorithms. pp. 3–15 (04 2017). https://doi.org/10.1007/9783319572406_1
[4] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: Infogan: Interpretable representation learning by information maximizing generative adversarial nets. CoRR abs/1606.03657 (2016)
[5] Dosovitskiy, A., Springenberg, J.T., Brox, T.: Learning to generate chairs with convolutional neural networks. CoRR abs/1411.5928 (2014)
[6] Hosseini-Asl, E., Zurada, J.M., Nasraoui, O.: Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints. IEEE Transactions on Neural Networks and Learning Systems 27(12), 2486–2498 (2016)
[7] Hoyer, P.O.: Nonnegative matrix factorization with sparseness constraints. CoRR cs.LG/0408058 (2004)
[8] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167 (2015)
[9] LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010), http://yann.lecun.com/exdb/mnist/
[10] Lee, H., Ekanadham, C., Ng, A.Y.: Sparse deep belief net model for visual area V2. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (eds.) Advances in Neural Information Processing Systems 20, pp. 873–880 (2008)
[11] Lemme, A., Reinhart, R.F., Steil, J.J.: Online learning and generalization of parts-based image representations by nonnegative sparse autoencoders. Neural Networks 33, 194–203 (2012)
[12] Maas, A.L.: Rectifier nonlinearities improve neural network acoustic models. In: International Conference on Machine Learning (2013)
[13] Mairal, J., Bach, F.R., Ponce, J.: Sparse modeling for image and vision processing. CoRR abs/1411.3230 (2014)
[14] Maragos, P., Schafer, R.: Morphological skeleton representation and coding of binary images. IEEE Transactions on Acoustics, Speech, and Signal Processing 34(5), 1228–1244 (1986)
[15] Pesaresi, M., Benediktsson, J.A.: A new approach for the morphological segmentation of high-resolution satellite imagery. IEEE Transactions on Geoscience and Remote Sensing 39(2), 309–320 (2001)
[16] Ritter, G., Sussner, P.: An introduction to morphological neural networks. vol. 4, pp. 709–717 (09 1996). https://doi.org/10.1109/ICPR.1996.547657
[17] Soille, P.: Morphological image analysis: principles and applications. Springer Science & Business Media (2013)
[18] Tanaka, K.: Columns for complex visual object features in the inferotemporal cortex: clustering of cells with similar but slightly different stimulus selectivities. Cerebral Cortex 13(1), 90–9 (2003)
[19] Theis, F.J., Stadlthanner, K., Tanaka, T.: First results on uniqueness of sparse nonnegative matrix factorization. In: 13th IEEE European Signal Processing Conference. pp. 1–4 (2005)
[20] Velasco-Forero, S., Angulo, J.: Non-Negative Sparse Mathematical Morphology, vol. 202, chap. 1. Elsevier Inc. Academic Press (2017)
[21] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 (2017)
[22] Zhang, L., Lu, Y.: Comparison of autoencoders with different sparsity regularizers. In: International Joint Conference on Neural Networks (IJCNN). pp. 1–5 (2015)