1 Introduction
Artificial neural networks were introduced as mathematical models for biological neural networks [24]. The basic component is a linear perceptron, a linear combination of weights with biases followed by a nonlinear function called the activation function. Such components (usually called layers) can then be concatenated, eventually leading to very complex functions named deep artificial neural networks (DNNs) [7]. The activation function can also be seen as a function attached between two layers of a neural network. In order for a DNN to learn, one needs to update the weights and biases of the neurons on the basis of the error at the output. This process involves two steps: backpropagation of the prediction error and gradient-descent optimization to update the parameters [7]. The most famous activation function is the Rectified Linear Unit (ReLU) proposed by [27], which is simply defined as $\mathrm{ReLU}(x) = \max(0, x)$. A clear benefit of ReLU is that both the function itself and its derivative are easy to implement and computationally inexpensive. However, ReLU has a potential drawback during optimization because the gradient is zero when the unit is not active. This can lead to cases where a gradient-based optimization algorithm will never adjust the weights of a unit that was not initially activated. A purely computationally motivated approach to alleviate the potential problems caused by the hard zero activation of ReLU is the leaky ReLU [18]: $\mathrm{LeakyReLU}(x) = \max(\alpha x, x)$, with a small fixed slope $\alpha$ (e.g., $\alpha = 0.01$). A simple generalisation is the Parametric ReLU proposed by [11], defined as $\mathrm{PReLU}(x) = \max(a x, x)$, where $a$ is a learnable parameter. In general, the use of piecewise-linear functions as activation functions was initially motivated by neurobiological observations; for instance, the inhibiting effect of the activity of a visual-receptor unit on the activity of neighbouring units can be modelled by a line with two segments [10]. On the other hand, for the particular case of structured data such as images, a translation-invariant DNN called the Deep Convolutional Neural Network (DCNN) is the most used architecture. The conventional DCNN framework intersperses convolutional layers and pooling layers to summarise information in a hierarchical structure. The common choice is pooling by a maximum operator, called max-pooling, which is particularly well suited to the separation of features that are very sparse [3]. As far as we know, morphological operators have been used in the context of DCNNs following the paradigm of replacing linear convolutions by nonlinear morphological filters [31, 34, 15, 26, 13], or by hybrid variants between linear and morphological layers [30, 32, 14, 33]. Our contribution is closer in spirit to [5], where the authors show favourable quantitative results in some applications when viewing the max-pooling operator as a dilation layer.
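As a toy illustration (our own sketch, not code from the paper), the three activations above can be written directly as maxima of linear terms; the function names are ours:

```python
# Sketch (ours): the ReLU family as simple piecewise-linear maps.
def relu(x):
    # ReLU(x) = max(0, x) [27]
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # LeakyReLU(x) = max(alpha * x, x), small fixed slope [18]
    return max(alpha * x, x)

def prelu(x, a):
    # Parametric ReLU: the slope a is a learnable parameter [11]
    return max(a * x, x)

print(relu(-2.0), leaky_relu(-2.0), prelu(-2.0, 0.25))
```

For negative inputs the three variants differ only in the slope of the negative branch, which is exactly the one-parameter family discussed later in Remark 2.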
However, we go further and study both nonlinear activations and max-pooling operators in the context of the morphological representation theory of nonlinear operators. Finally, in the experimental section, we compare the different propositions in a practical case: training multilayer CNNs for image classification on several databases.
2 ReLU activation and max-pooling are morphological dilations
2.1 Dilation and Erosion
Let us consider a complete lattice $(\mathcal{L}, \leq)$, where $\vee$ and $\wedge$ are respectively its supremum and infimum. A lattice operator $\psi: \mathcal{L} \to \mathcal{L}$ is called an increasing operator (or isotone) if it is order-preserving, i.e., $f \leq g \implies \psi(f) \leq \psi(g)$, $\forall f, g \in \mathcal{L}$. Dilation $\delta$ and erosion $\varepsilon$ are lattice operators which are increasing and satisfy

$\delta\left(\bigvee_{i} f_i\right) = \bigvee_{i} \delta(f_i), \qquad \varepsilon\left(\bigwedge_{i} f_i\right) = \bigwedge_{i} \varepsilon(f_i).$
Dilation and erosion can then be composed to obtain other operators [12]. In this paper, we also use morphological operators on the lattice of functions $f: E \to \overline{\mathbb{R}}$ with the standard partial order $f \leq g \iff f(x) \leq g(x),\ \forall x \in E$. The sup-convolution and inf-convolution of a function $f$ by a structuring function $b$ are given by

$(f \oplus b)(x) = \sup_{y \in E} \left\{ f(y) + b(x - y) \right\},$  (1)

$(f \ominus b)(x) = \inf_{y \in E} \left\{ f(y) - b(y - x) \right\}.$  (2)
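Eqs. (1)-(2) can be sketched for finite 1-D signals as follows (our own illustration; the signal and structuring-function values are arbitrary, and values outside the support are treated as $\mp\infty$ so they never win the sup/inf):

```python
# Sketch (ours): 1-D sup-convolution (dilation) and inf-convolution (erosion)
# on finite supports; the structuring function b is supported on {0,...,len(b)-1}.
NEG_INF = float("-inf")

def dilation(f, b):
    # (f (+) b)(x) = sup_y { f(y) + b(x - y) }
    n, m = len(f), len(b)
    out = []
    for x in range(n):
        vals = [f[y] + b[x - y] for y in range(n) if 0 <= x - y < m]
        out.append(max(vals) if vals else NEG_INF)
    return out

def erosion(f, b):
    # (f (-) b)(x) = inf_y { f(y) - b(y - x) }
    n, m = len(f), len(b)
    out = []
    for x in range(n):
        vals = [f[y] - b[y - x] for y in range(n) if 0 <= y - x < m]
        out.append(min(vals) if vals else float("inf"))
    return out

f = [0, 3, 1, 5, 2]
flat = [0, 0, 0]           # flat structuring function of size 3
print(dilation(f, flat))   # running local maxima
print(erosion(f, flat))    # running local minima
```

With a flat structuring function ($b \equiv 0$ on its support) the sup-convolution reduces to a sliding-window maximum, which is the bridge to max-pooling used below.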
2.2 ReLU and max-pooling
Let us now consider the standard framework of one-dimensional signals in DCNNs, where any operator is applied to signals supported on a discrete grid, a subset of $\mathbb{Z}$.^{1} The ReLU activation function [27] applied at every pixel of an image $f$ is defined as

$\mathrm{ReLU}(f)(x) = \max\left(0, f(x)\right).$  (3)

^{1} The extension to $d$-dimensional functions is straightforward.
The max-pooling operator of pooling size $k$ and stride $s$ maps an image of $n$ pixels onto an image of $\lceil n/s \rceil$ pixels by taking the local maxima in a neighbourhood of size $k$, moving the window $s$ elements at a time and skipping the intermediate locations:

$\mathrm{MaxPool}_{k,s}(f)(x) = \max_{y \in E} \left\{ f(y) + w_k(s x - y) \right\},$  (4)

where $w_k(z) = 0$ if $z$ belongs to the neighbourhood of size $k$ centred at the origin and $w_k(z) = -\infty$ otherwise. There are other operations in DCNNs which use the maximum operation as their main ingredient, namely the Maxout layer [8] and the Max-plus layer (morphological perceptron) [4, 37].
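A minimal sketch (ours) of this windowed-maximum reading of max-pooling in 1-D, with the flat structuring function replaced by explicit slicing:

```python
# Sketch (ours): max-pooling of size k and stride s as a windowed maximum,
# the flat structuring-function form of Eq. (4) written directly on a list.
def maxpool(f, k=2, s=2):
    # window of size k starting at s*x, moved s elements at a time
    return [max(f[i:i + k]) for i in range(0, len(f) - k + 1, s)]

f = [1, 4, 2, 8, 5, 7]
print(maxpool(f, k=2, s=2))
```

With $s = 1$ this is exactly a flat dilation; the stride only subsamples the dilated signal.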
From the definition of the operators, it is straightforward to prove the following proposition.
Proposition 1
The ReLU activation function and max-pooling are dilation operators on the lattice of functions.
Proof
Using the standard partial ordering $f \leq g \iff f(x) \leq g(x),\ \forall x$, we note that both ReLU and max-pooling are increasing:

$f \leq g \implies \mathrm{ReLU}(f) \leq \mathrm{ReLU}(g) \ \text{ and } \ \mathrm{MaxPool}_{k,s}(f) \leq \mathrm{MaxPool}_{k,s}(g).$

They commute with the supremum operation:

$\mathrm{ReLU}\left(\bigvee_i f_i\right) = \bigvee_i \mathrm{ReLU}(f_i), \qquad \mathrm{MaxPool}_{k,s}\left(\bigvee_i f_i\right) = \bigvee_i \mathrm{MaxPool}_{k,s}(f_i).$

These two operators are also both extensive, i.e., $\psi(f) \geq f$. ReLU is also idempotent, i.e., $\mathrm{ReLU}(\mathrm{ReLU}(f)) = \mathrm{ReLU}(f)$. Then ReLU is both a dilation and a closing.
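The four properties in the proof can be checked numerically on a toy signal (our own sketch, with ReLU applied pointwise to a list):

```python
# Numerical check (ours) of Proposition 1's properties for pointwise ReLU:
# increasing, commutes with supremum, extensive, idempotent.
def relu(f):
    return [max(0.0, v) for v in f]

f = [-2.0, 1.0, -0.5, 3.0]
g = [-1.0, 2.0, 0.5, 4.0]          # f <= g pointwise

# increasing
assert all(a <= b for a, b in zip(relu(f), relu(g)))
# commutes with the (pointwise) supremum
sup_fg = [max(a, b) for a, b in zip(f, g)]
assert relu(sup_fg) == [max(a, b) for a, b in zip(relu(f), relu(g))]
# extensive
assert all(a >= b for a, b in zip(relu(f), f))
# idempotent, hence ReLU is also a closing
assert relu(relu(f)) == relu(f)
print("all dilation/closing properties hold on this example")
```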
Remark 1: Factoring activation function and pooling. The composition of dilations in the same complete lattice can often be factorised into a single operation. One can for instance define a combined nonlinear activation and pooling dilation as

$\delta(f)(x) = \max\left(0,\ \max_{y \in W(x)} f(y)\right),$

where $W(x)$ denotes a local neighbourhood, usually a square of side $k$. Note that this analysis does not bring any new operator, just the interpretation of composed nonlinearities as a single dilation.
Remark 2: Positive and negative activation function, symmetric pooling. More general ReLU-like activation functions also keep a negative part. Let us consider two parameters $\lambda^{+}, \lambda^{-} \in \mathbb{R}$; we define the two-parameter ReLU as

$\mathrm{ReLU}_{\lambda^{+},\lambda^{-}}(f)(x) = \max\left(\lambda^{+} f(x),\ \lambda^{-} f(x)\right).$

In the case when $\lambda^{+} \geq \lambda^{-}$, one has

$\mathrm{ReLU}_{\lambda^{+},\lambda^{-}}(f)(x) = \lambda^{-} f(x) + (\lambda^{+} - \lambda^{-}) \max\left(f(x), 0\right).$  (5)

Note that the Leaky ReLU [18] corresponds to $\lambda^{+} = 1$ and $\lambda^{-} = 0.01$. The Parametric ReLU [11] takes $\lambda^{+} = 1$ and $\lambda^{-}$ learned along with the other neural-network parameters. More recently, in the ACtivateOrNot (ACON) activation function [17], both $\lambda^{+}$ and $\lambda^{-}$ are learned, and a softmax is used to approximate the maximum operator.
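The decomposition in Eq. (5) is easy to verify numerically (our sketch; `lp` and `lm` stand for the two slope parameters, here set to the Leaky ReLU values):

```python
# Quick check (ours) of the two-parameter ReLU decomposition, Eq. (5),
# valid whenever lp >= lm:  max(lp*x, lm*x) = lm*x + (lp - lm)*max(x, 0).
def relu_two(x, lp, lm):
    return max(lp * x, lm * x)

lp, lm = 1.0, 0.01               # Leaky ReLU setting [18]
for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    lhs = relu_two(x, lp, lm)
    rhs = lm * x + (lp - lm) * max(x, 0.0)
    assert abs(lhs - rhs) < 1e-12
print("Eq. (5) holds on the sampled points")
```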
Usually in CNNs the max-pooling operator is used after activation, i.e., $\mathrm{MaxPool}_{k,s}(\mathrm{ReLU}(f))$, which spatially enlarges the positive activations and removes the negative ones. This does not seem coherent with the goal of using the pooling to increase spatial equivariance and the hierarchical representation of information. It is easy to "fix" that issue by using a more symmetric pooling based on taking the positive and negative parts of a function. Given a function $f$, it can be expressed in terms of its positive part $f^{+}$ and negative part $f^{-}$, i.e., $f = f^{+} - f^{-}$, with $f^{+} = \max(f, 0)$ and $f^{-} = \max(-f, 0)$, where both $f^{+}$ and $f^{-}$ are non-negative functions. We can now define a positive and negative max-pooling. The principle is simply to apply a max-pooling to each part and recompose, i.e.,

$\mathrm{MaxPool}^{\pm}_{k,s}(f) = \mathrm{MaxPool}_{k,s}(f^{+}) - \mathrm{MaxPool}_{k,s}(f^{-}).$

We note that (2.2) is self-dual and related to the dilation on an inf-semilattice [16]. However, in the general case of (2.2), obtained by learning both $\lambda^{+}$ and $\lambda^{-}$,

$\mathrm{MaxPool}^{\lambda^{+},\lambda^{-}}_{k,s}(f) = \lambda^{+}\, \mathrm{MaxPool}_{k,s}(f^{+}) - \lambda^{-}\, \mathrm{MaxPool}_{k,s}(f^{-})$  (7)

is not always self-dual.
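A toy 1-D sketch (ours) of this symmetric pooling; note how it preserves strong negative responses that plain max-pooling would overwrite, and that negating the input negates the output (self-duality):

```python
# Sketch (ours): symmetric max-pooling on positive/negative parts,
# f = f+ - f-, each part pooled separately and recombined (Remark 2).
def maxpool(f, k=2, s=2):
    return [max(f[i:i + k]) for i in range(0, len(f) - k + 1, s)]

def sym_maxpool(f, k=2, s=2):
    fpos = [max(v, 0.0) for v in f]    # f+
    fneg = [max(-v, 0.0) for v in f]   # f-
    return [p - n for p, n in zip(maxpool(fpos, k, s), maxpool(fneg, k, s))]

f = [-3.0, 1.0, -2.0, -5.0]
print(maxpool(f))       # plain pooling keeps the largest (least negative) values
print(sym_maxpool(f))   # symmetric pooling keeps the strongest responses of each sign
```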
3 Algebraic theory of minimal representation for nonlinear operators and functions
In the following section, we present the main results of the representation theory of nonlinear operators due to Matheron [23], Maragos [20] and Banon-Barrera [1] (MMBB).
3.1 MMBB representation theory on nonlinear operators
Let us consider a translation-invariant (TI) increasing operator $\psi$. The domain of the functions considered here is $E$, either $\mathbb{R}^d$ or $\mathbb{Z}^d$, with the additional condition that we consider only closed subsets of $E$. We consider first the case of set operators applied on $\mathcal{P}(E)$, and then that of functions $f: E \to \overline{\mathbb{R}}$.

Kernel and basis representation of TI increasing set operators. The kernel of the TI operator $\psi$ is defined as the following collection of input sets [23]: $\mathrm{Ker}(\psi) = \left\{ A \in \mathcal{P}(E) : 0 \in \psi(A) \right\}$, where $0$ denotes the origin of $E$.
Theorem 3.1 (Matheron (1975) [23])
Consider set operators on $\mathcal{P}(E)$. Let $\psi$ be a TI increasing set operator. Then

$\psi(X) = \bigcup_{A \in \mathrm{Ker}(\psi)} X \ominus A = \bigcap_{B \in \mathrm{Ker}(\psi^{*})} X \oplus \check{B},$

where the dual set operator is $\psi^{*}(X) = \left(\psi(X^{c})\right)^{c}$ and $\check{B} = \{-b : b \in B\}$ is the transpose structuring element.
The kernel of $\psi$ is a partially ordered set under set inclusion which has an infinite number of elements. In practice, by the absorption property of erosion, the erosion by a kernel set $A$ contains the erosions by any other kernel set larger than $A$, so only $A$ is required when taking the supremum of erosions. The morphological basis of $\psi$ is defined as the collection of minimal kernel sets [20]:

$\mathfrak{B}(\psi) = \left\{ M \in \mathrm{Ker}(\psi) : A \in \mathrm{Ker}(\psi) \text{ and } A \subseteq M \implies A = M \right\}.$

A sufficient condition for the existence of $\mathfrak{B}(\psi)$ is for $\psi$ to be an upper semicontinuous operator. We also consider only closed sets on $\mathcal{P}(E)$.
Theorem 3.2 (Maragos (1989) [20])
Let $\psi$ be a TI, increasing and upper semicontinuous set operator.^{2} Then

$\psi(X) = \bigcup_{M \in \mathfrak{B}(\psi)} X \ominus M = \bigcap_{N \in \mathfrak{B}(\psi^{*})} X \oplus \check{N}.$

^{2} Upper semicontinuity is meant with respect to the hit-miss topology. Let $(X_n)$ be any decreasing sequence of sets that converges monotonically to a limit set $X$, i.e., $X_{n+1} \subseteq X_n$ and $X = \bigcap_{n} X_n$; this is denoted by $X_n \downarrow X$. An increasing set operator $\psi$ on $\mathcal{P}(E)$ is upper semicontinuous if and only if $X_n \downarrow X$ implies that $\psi(X_n) \downarrow \psi(X)$.
Kernel and basis representation of TI increasing operators on functions. The previous set theory was extended [20] to the case of mappings on functions, and is therefore useful for signal or grey-scale image operators. We focus on the case of closed functions $f$, i.e., functions whose epigraph is a closed set. In that case, the dual operator is $\Psi^{*}(f) = -\Psi(-f)$ and the transpose function is $\check{g}(x) = g(-x)$. Let

$\mathrm{Ker}(\Psi) = \left\{ f : \Psi(f)(0) \geq 0 \right\}$

be the kernel of the operator $\Psi$. As for the TI set operators, a basis can be obtained from the kernel functions as its minimal elements with respect to the partial order $\leq$, i.e.,

$\mathfrak{B}(\Psi) = \left\{ g \in \mathrm{Ker}(\Psi) : f \in \mathrm{Ker}(\Psi) \text{ and } f \leq g \implies f = g \right\}.$

This collection of functions can uniquely represent the operator.
Theorem 3.3 (Maragos (1989) [20])
Consider an upper semicontinuous operator $\Psi$ acting on upper semicontinuous functions.^{3} Let $\mathfrak{B}(\Psi)$ be its basis and $\mathfrak{B}(\Psi^{*})$ the basis of the dual operator. If $\Psi$ is a TI and increasing operator then it can be represented as

$\Psi(f) = \sup_{g \in \mathfrak{B}(\Psi)} \left( f \ominus g \right)$  (8)

$= \inf_{h \in \mathfrak{B}(\Psi^{*})} \left( f \oplus \check{h} \right).$  (9)

The converse is true: given a collection of functions $\mathfrak{B}$ such that all its elements are minimal in $\mathfrak{B}$, the operator $\Psi(f) = \sup_{g \in \mathfrak{B}} (f \ominus g)$ is a TI increasing operator whose basis is equal to $\mathfrak{B}$.

^{3} A function $f$ is upper semicontinuous (u.s.c.) (resp. lower semicontinuous (l.s.c.)) if and only if, for each $x_0$ and $t > f(x_0)$ (resp. $t < f(x_0)$), one has $f(x) < t$ (resp. $f(x) > t$) for all $x$ in some neighbourhood of $x_0$. Similarly, $f$ is u.s.c. (resp. l.s.c.) if and only if all its level sets are closed (resp. open) subsets of $E$. A function is continuous iff it is both u.s.c. and l.s.c.
For some operators the basis can be very large (potentially infinite), and even if the above theorem represents the operator exactly through a full expansion over all erosions, we can obtain an approximation based on smaller collections, or truncated bases, $\hat{\mathfrak{B}}(\Psi) \subseteq \mathfrak{B}(\Psi)$ and $\hat{\mathfrak{B}}(\Psi^{*}) \subseteq \mathfrak{B}(\Psi^{*})$. The corresponding truncated operators then bound the original one from below and above, i.e., $\sup_{g \in \hat{\mathfrak{B}}(\Psi)} (f \ominus g) \leq \Psi(f) \leq \inf_{h \in \hat{\mathfrak{B}}(\Psi^{*})} (f \oplus \check{h})$. Note also that in the case of a non-minimal representation by a subset of the kernel functions larger than the basis, one simply gets a redundant but still exact representation.
The extension to TI, not necessarily increasing, mappings was presented by Banon and Barrera in [1]; it involves a supremum of operators, each combining an erosion and an anti-dilation. This part of the Matheron-Maragos-Banon-Barrera (MMBB) theory is out of the scope of this paper.
3.2 Max-min representation for piecewise-linear functions
Let us also recall the fundamental results from the representation theory of Ovchinnikov [28, 29], which is rooted in a Boolean and lattice framework and is therefore related to the MMBB theorems. Just note that here we focus on a representation for functions, whereas previously it was a representation of operators on functions. Let $f$ be a smooth function on a closed domain $\Omega \subseteq \mathbb{R}^d$. We are going to represent it by a family of affine linear functions which are tangent hyperplanes to the graph of $f$. Namely, for a point $z \in \Omega$, one defines

$g_z(x) = f(z) + \langle \nabla f(z), x - z \rangle,$  (10)

where $\nabla f(z)$ is the gradient vector of $f$ at $z$. We have the following general result about the representation of piecewise-linear (PL) functions as max-min polynomials of their linear components.

Theorem 3.4 ([9][2][29])
Let $f$ be a PL function on a closed convex domain $\Omega$ and $g_1, \ldots, g_n$ be the linear components of $f$, with $i \in \{1, \ldots, n\}$. There is a family $\{S_j\}_{j \in J}$ of subsets of the set $\{1, \ldots, n\}$ such that

$f(x) = \max_{j \in J} \min_{i \in S_j} g_i(x).$  (11)

Conversely, for any family of distinct linear functions $\{g_i\}$, the above formula defines a PL function.
The expression (11) is called a max-min (or lattice) polynomial in the variables $g_i$. We note that a PL function $f$ on $\Omega$ is a "selector" of its components $g_i$, i.e., for every $x \in \Omega$ there is an $i$ such that $f(x) = g_i(x)$. The converse is also true, for functions $\{g_i\}$ linearly ordered over $\Omega$ [29].
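The selector property can be illustrated with a small max-min polynomial over three linear components (our own example, not from the paper):

```python
# Sketch (ours): a max-min (lattice) polynomial in the style of Eq. (11),
# f(x) = max_j min_{i in S_j} g_i(x), built from three linear components.
g = [lambda x: x,            # g_1
     lambda x: -x,           # g_2
     lambda x: 0.5 * x + 1]  # g_3

def f(x):
    # one choice of the family {S_j}: S_1 = {1}, S_2 = {2, 3}
    return max(g[0](x), min(g[1](x), g[2](x)))

# selector property: at every sampled x, f coincides with some component g_i
for x in [-4.0, -1.0, 0.0, 2.0, 5.0]:
    assert any(abs(f(x) - gi(x)) < 1e-12 for gi in g)
print("f is a selector of its linear components on the sampled points")
```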
Let us also mention that from this representation one can show that a PL function is representable as a difference of two concave (equivalently, convex) PL functions [29]. More precisely, let $f_j(x) = \min_{i \in S_j} g_i(x)$, each $f_j$ being a concave function. We recall that sums and minima of concave functions are concave. One has $\max(f_1, f_2) = (f_1 + f_2) - \min(f_1, f_2)$, a difference of two concave PL functions; iterating this identity over $f = \max_{j \in J} f_j$, therefore, shows that $f$ is a difference of two concave PL functions.
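The identity behind this argument can be checked numerically (our sketch, with two arbitrary concave PL components):

```python
# Check (ours): a maximum of two concave PL functions is a difference of two
# concave PL functions, via the identity max(u, v) = (u + v) - min(u, v).
def dc_parts(u, v):
    # returns the two concave parts whose difference equals max(u, v)
    return u + v, min(u, v)

for x in [-2.0, -0.3, 0.0, 1.7]:
    u = min(x, 2 - x)         # concave PL component f_1
    v = min(2 * x + 1, 3.0)   # concave PL component f_2
    s, m = dc_parts(u, v)
    assert abs(max(u, v) - (s - m)) < 1e-12
print("max(f_1, f_2) = (f_1 + f_2) - min(f_1, f_2) on the sampled points")
```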
4 Morphological universal activation functions
Using the previous results, we can state the following two results for the activation function and for pooling by increasing operators. Additionally, we formulate the layer that is used in the experimental section.
4.1 Universal representation for activation function and pooling
Proposition 2
Any piecewise-linear activation function $\sigma: \mathbb{R} \to \mathbb{R}$ can be universally expressed as

$\sigma(x) = \max_{1 \leq m \leq M} \left( a_m x + b_m \right) - \max_{1 \leq n \leq N} \left( c_n x + d_n \right),$  (12)

where each max-affine term is a PL convex function.
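One concrete way to instantiate such a difference-of-max-affine representation is the following (our sketch, not the paper's code; the parameter values are ours and chosen so the expression reduces exactly to ReLU):

```python
# Sketch (ours): a PL activation as a difference of two max-affine
# (PL convex) functions, in the spirit of Proposition 2.
def max_affine(x, params):
    # max_i (a_i * x + b_i) over a list of (a_i, b_i) pairs
    return max(a * x + b for a, b in params)

def universal_activation(x, g_params, h_params):
    return max_affine(x, g_params) - max_affine(x, h_params)

# ReLU = max(x, 0) - 0: first term has pieces x and 0, second term is constant 0
relu_g = [(1.0, 0.0), (0.0, 0.0)]
relu_h = [(0.0, 0.0)]
for x in [-2.0, 0.0, 3.0]:
    assert universal_activation(x, relu_g, relu_h) == max(0.0, x)
print("the difference of max-affine terms reproduces ReLU")
```

Changing the slope/offset pairs yields leaky, parametric or entirely learned PL activations with $M$ and $N$ pieces.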
Proposition 3 (Pooling)
Any increasing pooling operator $P$ can be universally expressed as

$P(f) = \bigwedge_{i \in I} \left( f \oplus \check{g}_i \right),$  (13)

where $\{g_i\}_{i \in I}$ is a family of structuring functions defining, by transposition, the basis of the dual operator to $P$.
In both cases, there is of course a dual representation using the maximum of erosions. The dilation operator of type $t \mapsto a t + b$, an affine dilation on grey levels, plays a fundamental role in multiplicative morphology [12].
Remark: Tropical polynomial interpretation. The max-affine function $\max_{1 \leq m \leq M}(a_m x + b_m)$ is a tropical^{4} polynomial such that, in that geometry, the degree of the polynomial corresponds to the number of pieces of the PL convex function. The set of such polynomials constitutes the semiring of tropical polynomials. Tropical geometry in the context of lattice theory and neural networks is an active area of research [22][21][25]; however, those previous works have not considered the use of minimal representations of tropical polynomials as generalised activation functions.

^{4} Tropical geometry is the study of polynomials and their geometric properties when addition is replaced with a minimum operator and multiplication is replaced with ordinary addition.
Remark: Relationships to other universal approximation theorems. These results on the universal representation of layers in DCNNs are related to the study of the capacity of neural networks to be universal approximators of smooth functions. For instance, both maxout networks [8] and max-plus networks [37] can approximate arbitrarily well any continuous function on a compact domain. The proofs are based on the fact [35] that continuous PL functions can be expressed as a difference of two convex PL functions, and that each convex PL function can be seen as a maximum of affine terms.

The tropical formulation of ReLU networks has shown that a deeper network is exponentially more expressive than a shallow one [36]. Exploring the expressiveness of networks equipped with our universal activation function and pooling layer with respect to the depth of the DCNN is therefore a fundamentally relevant topic for future research.
4.2 MorphoActivation Layer
We now have all the elements to justify, in terms of the universal representation theory of nonlinear operators, why ReLU and max-pooling can be replaced by a more general nonlinear operator defined by a morphological combination of activation function, dilations and downsampling, using a max-plus layer or its dual.
More precisely, we introduce two alternative architectures of the MorphoActivation layer (Activation and Pooling Morphological Operator), either by composition,

$\mathcal{A}^{\circ}(f) = \mathrm{MaxPool}_{k,s}\left( \max_{1 \leq m \leq M} (a_m f + b_m) - \max_{1 \leq n \leq N} (c_n f + d_n) \right),$  (14)

or by pooling each convex part separately,

$\mathcal{A}^{\pm}(f) = \mathrm{MaxPool}_{k,s}\left( \max_{1 \leq m \leq M} (a_m f + b_m) \right) - \mathrm{MaxPool}_{k,s}\left( \max_{1 \leq n \leq N} (c_n f + d_n) \right),$  (15)

where $\mathrm{MaxPool}_{k,s}$ is the max-pooling dilation of (4). In the context of an end-to-end learning DCNN, the parameters $a_m, b_m, c_n, d_n$ and the structuring functions $w$ are learnt by backpropagation [34]. The learnable structuring functions play the same role as the kernels in the convolutions. Note that one can have $s = 1$, in which case the pooling does not involve downsampling. We note that in a DCNN the output of each layer is composed of an affine function $x \mapsto W x + \beta$, where $W$ is the weight matrix (convolution weights in a CNN layer) and $\beta$ the bias, followed by the activation function $\sigma$ acting elementwise, i.e., $\sigma(W x + \beta)$. Using our general activation (12), we obtain

$\sigma(W x + \beta) = \max_{1 \leq m \leq M} \left( a_m W x + a_m \beta + b_m \right) - \max_{1 \leq n \leq N} \left( c_n W x + c_n \beta + d_n \right),$

and therefore the bias appears in two terms which are learnt. We propose therefore to set $\beta$ to zero in our experiments, since its role is replaced by learning the offsets $b_m$ and $d_n$.
5 Experimental Section
Firstly, to illustrate the kind of activation functions that our proposition can learn, we use the MNIST dataset as a ten-class supervised classification problem and an architecture composed of two convolutional layers and a dense layer reducing to the number of classes. The activation functions that we optimise by stochastic gradient descent have the general form $\sigma(x) = \max_{1 \leq m \leq M}(a_m x + b_m) - \max_{1 \leq n \leq N}(c_n x + d_n)$, which corresponds to (14) and (15) with $k = s = 1$, i.e., without pooling. We have initialised all the activation functions to be equal to , as illustrated in Fig. 1 (left). The accuracy of this network without any training is . Surprisingly, when one optimises^{5} only the parameters of the activation functions, the network accuracy increases to the acceptable performance of , and a large variability of activation functions is found, Fig. 1 (centre). This is a way to assess the expressive power^{6} of the parameters of the activation, as proposed in [6]. Additionally, an adequate separation among classes is observed by visualising the projection of the last layer to a two-dimensional space via the t-SNE algorithm [19]. Of course, a much better accuracy () and inter-class separation is obtained by optimising all the parameters of the network, Fig. 1 (right).

^{5} We use the ADAM optimizer with categorical cross-entropy as loss function, a batch size of 256 images and a learning rate of 0.001.
^{6} The expressive power describes a neural network's ability to approximate functions.

Secondly, we compare the performance of (2.2), (7), (14) and (15), following the common practice of training all the models on a training set and reporting the standard top-one error rate on a testing set. As architecture we use a classical two-layer CNN (without bias for (14) and (15)) with 128 filters of size () per layer, and a final dense layer with dropout. After each convolution, the different propositions are used both to produce a nonlinear mapping and to reduce the spatial dimension via a pooling stride of two. For comparison, we include the case of a simple ReLU activation followed by a MaxPool with stride two. The difference in top-one error rate on the testing set is reported in Table 1 for the CIFAR10, CIFAR100 and Fashion-MNIST databases. These quantitative results show that propositions (2.2) and (7) do not seem to improve the performance in the explored cases. Additionally, (15) performs better than (14), and it improves the accuracy with respect to our baseline on all the considered databases.
Fashion MNIST  CIFAR10  CIFAR100  
MaxPool(ReLU)  93.11  78.04  47.57  
Self-dual ReLU in (2.2)  2.11  20.12  31.14  
(7)  0.95  1.75  4.39  
MorphoActivation in (14)  N=2  N=3  N=4  N=2  N=3  N=4  N=2  N=3  N=4 
M=2  0.06  0.05  0.1  0.42  0.02  0.02  0.44  0.7  0.4 
M=3  0.14  0.14  0.06  0.57  0.4  0.35  0.56  0.49  0.61 
M=4  0.02  0.08  0.01  0.05  0.62  0.5  0.41  0.35  0.73 
MorphoActivation in (15)  N=2  N=3  N=4  N=2  N=3  N=4  N=2  N=3  N=4 
M=2  0.04  0.16  0.12  1.84  2.02  1.49  3.31  3.5  3.45 
M=3  0.08  0.09  0.12  2.39  1.96  1.82  3.48  3.55  3.86 
M=4  0.02  0.09  0.03  2.49  2.25  2.13  3.47  3.73  3.58 
6 Conclusions and Perspectives
To the best of our knowledge, this is the first work where nonlinear activation functions in deep learning are formulated and learnt as max-plus affine functions or tropical polynomials. We have also introduced an algebraic framework inspired by mathematical morphology which provides a general representation integrating the nonlinear activation and pooling functions.
Besides more extensive experiments on the performance of advanced DCNN networks, our next step will be to study the expressive power of networks based on our morphological activation functions. The universal approximation theorems for ReLU networks would then be just a particular case. We conjecture that the parameters we add in the morphological activation can provide more efficient approximations of any function at the same width and depth.
Acknowledgements
This work has been supported by Fondation Mathématique Jacques Hadamard (FMJH) under the PGMOIRSDI 2019 program. This work was granted access to the Jean Zay supercomputer under the allocation 2021AD011012212R1.
References
 [1] Banon, G.J.F., Barrera, J.: Minimal representations for translation-invariant set mappings by mathematical morphology. SIAM Journal on Applied Mathematics 51(6), 1782–1798 (1991)
 [2] Bartels, S.G., Kuntz, L., Scholtes, S.: Continuous selections of linear functions and nonsmooth critical point theory. Nonlinear Analysis: Theory, Methods & Applications 24(3), 385–407 (1995)
 [3] Boureau, Y.L., Ponce, J., LeCun, Y.: A theoretical analysis of feature pooling in visual recognition. In: ICML. pp. 111–118 (2010)
 [4] Charisopoulos, V., Maragos, P.: Morphological perceptrons: geometry and training algorithms. In: ISMM. pp. 3–15. Springer (2017)

 [5] Franchi, G., Fehri, A., Yao, A.: Deep morphological networks. Pattern Recognition 102, 107246 (2020)
 [6] Frankle, J., Schwab, D.J., Morcos, A.S.: Training batchnorm and only batchnorm: On the expressive power of random features in cnns. ICLR (2021)
 [7] Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT press (2016)
 [8] Goodfellow, I., WardeFarley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: ICML. pp. 1319–1327 (2013)
 [9] Gorokhovik, V.V., Zorko, O.I., Birkhoff, G.: Piecewise affine functions and polyhedral sets. Optimization 31(3), 209–221 (1994)
 [10] Hartline, H.K., Ratliff, F.: Inhibitory interaction of receptor units in the eye of limulus. The Journal of General Physiology 40(3), 357–376 (1957)

 [11] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: IEEE ICCV. pp. 1026–1034 (2015)
 [12] Heijmans, H.J.A.M.: Theoretical aspects of graylevel morphology. IEEE Transactions on Pattern Analysis & Machine Intelligence 13(06), 568–582 (1991)
 [13] Hermary, R., Tochon, G., Puybareau, É., Kirszenberg, A., Angulo, J.: Learning grayscale mathematical morphology with smooth morphological layers. Journal of Mathematical Imaging and Vision (2022)
 [14] Hernández, G., Zamora, E., Sossa, H., Téllez, G., Furlán, F.: Hybrid neural networks for big data classification. Neurocomputing 390, 327–340 (2020)
 [15] Islam, M.A., Murray, B., Buck, A., Anderson, D.T., Scott, G.J., Popescu, M., Keller, J.: Extending the morphological hitormiss transform to deep neural networks. IEEE Transactions on NNs and Learning Systems 32(11), 4826–4838 (2020)
 [16] Keshet, R.: Mathematical morphology on complete semilattices and its applications to image processing. Fundamenta Informaticae 41(12), 33–56 (2000)
 [17] Ma, N., Zhang, X., Liu, M., Sun, J.: Activate or not: Learning customized activation. In: IEEE CVPR. pp. 8032–8042 (2021)
 [18] Maas, A.L., Hannun, A.Y., Ng, A.Y., et al.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML. vol. 30, p. 3 (2013)

 [19] Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
 [20] Maragos, P.: A representation theory for morphological image and signal processing. IEEE Tran. on Pattern Analysis and Machine Intel. 11(6), 586–599 (1989)
 [21] Maragos, P., Charisopoulos, V., Theodosis, E.: Tropical geometry and machine learning. Proceedings of the IEEE 109(5), 728–755 (2021)
 [22] Maragos, P., Theodosis, E.: Multivariate tropical regression and piecewiselinear surface fitting. In: ICASSP. pp. 3822–3826. IEEE (2020)
 [23] Matheron, G.: Random sets and integral geometry. John Wiley & Sons (1975)
 [24] McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5(4), 115–133 (1943)
 [25] Misiakos, P., Smyrnis, G., Retsinas, G., Maragos, P.: Neural network approximation based on hausdorff distance of tropical zonotopes. In: ICLR. pp. 0–8 (2022)
 [26] Mondal, R., Dey, M.S., Chanda, B.: Image restoration by learning morphological opening-closing network. Mathematical Morphology - Theory and Applications 4(1), 87–107 (2020)

 [27] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: ICML (2010)
 [28] Ovchinnikov, S.: Boolean representation of manifolds functions. J. Math. Anal. Appl. 263, 294–300 (2001)
 [29] Ovchinnikov, S.: Max–min representations of piecewise linear functions. Beiträge Algebra Geom. 43, 297–302 (2002)
 [30] Pessoa, L.F., Maragos, P.: Neural networks with hybrid morphological/rank/linear nodes: a unifying framework with applications to handwritten character recognition. Pattern Recognition 33(6), 945–960 (2000)
 [31] Ritter, G.X., Sussner, P.: An introduction to morphological neural networks. In: 13th International Conf. on Pattern Recognition. vol. 4, pp. 709–717. IEEE (1996)
 [32] Sussner, P., Campiotti, I.: Extreme learning machine for a new hybrid morphological/linear perceptron. Neural Networks 123, 288–298 (2020)
 [33] Valle, M.E.: Reduced dilationerosion perceptron for binary classification. Mathematics 8(4), 512 (2020)
 [34] VelascoForero, S., Pagès, R., Angulo, J.: Learnable EMD based on mathematical morphology. SIAM Journal on Imaging Sciences 15(1), 23–44 (2022)
 [35] Wang, S.: General constructive representations for continuous piecewiselinear functions. IEEE Trans. on Circuits and Systems I 51(9), 1889–1896 (2004)
 [36] Zhang, L., Naitzat, G., Lim, L.H.: Tropical geometry of deep neural networks. In: International Conference on Machine Learning. pp. 5824–5832. PMLR (2018)
 [37] Zhang, Y., Blusseau, S., VelascoForero, S., Bloch, I., Angulo, J.: Maxplus operators applied to filter selection and model pruning in NNs. In: ISMM. pp. 310–322 (2019)