I Introduction
Deep learning techniques have renewed the perspectives of the research and industrial communities. This success stems from the fact that deep learning [1], a branch of machine learning that favors multilayered neural networks, can learn data-driven features and classifiers all at once. Specifically, a single deep network is capable of learning features and classifiers (at the same time and in different layers) and of adjusting this learning, at processing time, giving more importance to one layer than another depending on the problem. Since encoding features in an efficient and robust fashion is the key to generating discriminative models for any image-related application, this feature learning step (performed by optimizing the network parameters) is a great advantage over typical methods (such as low- and mid-level descriptors [2, 3]), given that it can learn adaptable and specific feature representations depending on the data.

Among all deep learning-based techniques, a specific approach originally proposed for images, called Convolutional Networks (ConvNets) [1], has achieved state-of-the-art results in several applications, including image classification [4, 5], object and scene recognition [6, 7, 8, 9, 10, 11, 12], and many others. This network is essentially composed of convolutional layers [1] that process the input using optimizable filters, which are responsible for extracting features from the data. Although ConvNets use nonlinear activation functions and pooling layers to bring some nonlinearity to the learning process, the convolutional layers themselves only perform linear operations over the input data, ignoring nonlinear processes and the relevant information captured by them. For instance, suppose one desires a neuron that always extracts the minimum of the neighborhood defined by its learnable filter. As presented in Figure 1, despite having a myriad of possible configurations, a ConvNet is not able to perform such a nonlinear operation and, therefore, cannot produce the expected output.

Concerning image characteristics, nonlinear operations are able to cope with some properties better than linear ones, being preferable in some applications. Precisely, in some scenarios (such as remote sensing), images do not have a clear concept of perspective (i.e., foreground and background), and all objects (or pixels) have equivalent importance. In these cases, borders and corners can be considered salient and fundamental features that should be preserved in order to help distinguish objects (mainly small ones). However, linear transformations (as performed by convolutional layers) weight the pixels with respect to the neighborhood, blurring edges and losing this notion of borders and corners. Hence, in these applications, edge-aware filters, such as nonlinear operations, can be considered a better option than linear ones [13], since they are able to preserve those relevant features.

Supported by this property, some nonlinear operations remain very popular and are considered state-of-the-art in some cases. Morphological operations [14] are one successful family of nonlinear filters. Such processes are considered effective tools to automatically extract features while preserving essential characteristics (such as corners and borders), and they are still widely employed and state-of-the-art in several applications in which such properties are considered fundamental [15, 16, 17, 18]. Although successful in some scenarios, morphological operations have a relevant drawback: a structuring element (the filter used to define the neighborhood that must be taken into account during the processing) must be defined and provided for the process. In typical scenarios, since different structuring elements may produce distinct results depending on the data, it is imperative to design and evaluate many structuring elements in order to find the most suitable ones for each application, an expensive process that does not guarantee a good descriptive representation.
Encouraged by the current scenario, in this paper we propose a novel method for deep feature learning, called Deep Morphological Network (DeepMorphNet), which is capable of performing nonlinear morphological operations while carrying out the feature learning step (by optimizing the structuring elements). This new network, strongly based on ConvNets (because of the similarity between the operations performed by morphological transformations and convolutional layers), aggregates the benefits of morphological operations while overcoming the aforementioned drawback by learning the structuring element during the training process. Particularly, the processing of each layer of the proposed technique can be divided into three steps: (i) the first one employs depthwise convolutions [19] to rearrange the input pixels according to the (max-binary) filters (that represent the structuring elements), (ii) the second one uses depthwise pooling to select the pixel (based on erosion and dilation operations) and generate an eroded or dilated outcome, and (iii) the last one employs pointwise convolutions [19] to combine the generated maps in order to produce one final morphological map (per neuron). This whole process resembles depthwise separable convolutions [19] but uses binary filters and one more step (depthwise pooling) between the convolutions.

The remainder of this paper is organized as follows. Section II presents related work while Section III introduces background concepts required for the understanding of this work. The proposed method is presented in Section IV. The experimental setup is presented in Section V while Section VI presents the obtained results. Finally, Section VII concludes the paper.
II Related Work
As introduced, nonlinear operations, such as the morphological ones [14], have the ability to preserve borders and corners, which may be essential properties in some scenarios. Supported by this, several image-based applications in which the aforementioned properties are essential have exploited the benefits of morphological operations, such as image analysis [20, 17, 21, 18], classification [22, 23, 24], segmentation [15, 16, 25, 26, 27, 28], and so on.
Some of these techniques [20, 21, 17, 27] are strongly based on mathematical morphology. These approaches process the input images using only morphological operations. The produced outcomes are then analyzed in order to extract high-level semantic information related to the input images, such as borders, area, geometry, volume, and more. Other works go further [15, 16, 29, 25, 30, 18, 24] and use the advantages of morphology to extract robust features that are employed as input to standard machine learning techniques (such as Support Vector Machines, decision trees, and extreme learning machines) in order to perform image analysis, classification, and segmentation. Usually, in these cases, the input images are processed using several different morphological transformations, each one employing a distinct structuring element, in order to improve the diversity of the extracted features. All these features are then concatenated and used as input for the machine learning techniques.
More recently, ConvNets [1] started achieving outstanding results, mainly in applications related to images. Therefore, it was natural for researchers to propose works combining the benefits of ConvNets and morphological operations. In fact, several works [22, 23, 26, 28] tried to combine these techniques in order to create a more robust model. Some works [26, 28] employed morphological operations as a preprocessing step in order to extract a first set of discriminative features. In these cases, the structuring elements of the morphological operations are not learned; predefined structuring elements are employed instead. Those techniques use such features as input for a ConvNet responsible for performing the classification. Based on the fact that morphology generates interesting features that are not captured by convolutional networks, such works achieved outstanding results on pixel-wise classification.
Other works [22, 23] introduced morphological operations into neural networks, creating a framework in which the structuring elements are optimized. Masci et al. [22] proposed a convolutional network that aggregates morphological operations, such as pseudo-erosion, pseudo-dilation, pseudo-opening, pseudo-closing, and pseudo-top-hats. Specifically, their proposed network uses the counter-harmonic mean [31], which allows the convolutional layer to perform either its traditional linear process or approximations of morphological operations. They show that the approach produces outcomes very similar to traditional (not approximate) morphological operations. Mellouli et al. [23] performed a more extensive validation of the previous method, proposing different deeper networks that are used to perform image classification. In their experiments, the proposed network achieved promising results on two digit recognition datasets.

In this work, we propose a new method for deep feature learning that is able to perform several morphological operations (including erosion, dilation, opening, closing, top-hats, and an approximation of geodesic reconstructions). Two main differences may be pointed out between the proposed approach and the aforementioned works: (i) differently from [22, 23], the technique carries out actual morphological operations without any approximation (except for the reconstruction), and (ii) to the best of our knowledge, this is the first approach to implement (approximate) morphological geodesic reconstruction within deep learning-based models.
III Mathematical Morphology
As aforementioned, morphological operations, commonly employed in the image processing area, are strongly based on mathematical morphology. Since their introduction to the image domain, these morphological operations have been generalized from the analysis of a single-band image to hyperspectral images made up of hundreds of spectral channels, and they have become one of the state-of-the-art techniques for a wide range of applications [14]. This study area includes several different operations (such as erosion, dilation, opening, closing, top-hats, and reconstruction), which can be applied to binary and grayscale images in any number of dimensions [14].
Formally, let us consider a grayscale 2D image $I(\cdot)$ as a mapping from the coordinates ($\mathbb{Z}^2$) to the pixel-value domain ($\mathbb{Z}$). Most morphological transformations process this input image $I$ using a structuring element (SE) $B(\cdot)$ (usually defined prior to the operation). An SE $B(\cdot)$ can be defined as a function that returns the set of neighbors of a pixel $(i,j)$. This neighborhood of pixels is taken into account during the morphological operation, i.e., while processing the image $I$. Normally, an SE is defined by two components: (i) its shape, which is usually a discrete representation of a continuous shape, such as a square or a circle, and (ii) its center, which identifies the pixel on which the SE is superposed when probing the image. Figure 2 presents some examples of common SEs employed in the literature. As introduced, the definition of the SE is of vital importance for the process to extract relevant features. However, in the literature [25, 30], this definition is performed experimentally, an expensive process that does not guarantee a good descriptive representation.
After its definition, the SE can then be employed in several morphological processes. Most of these operations are supported by two basic morphological transformations: erosion $\mathcal{E}(\cdot,\cdot)$ and dilation $\delta(\cdot,\cdot)$. Such operations receive basically the same input: an image $I$ and the structuring element $B$. While erosion transformations process each pixel $(i,j)$ using the infimum (the greatest lower bound, $\wedge$) function, as formally denoted in Equation 1, the dilation operations process the pixels using the supremum (the least upper bound, $\vee$) function, as presented in Equation 2. Intuitively, these two operations probe an input image using the SE, i.e., they position the structuring element at all possible locations in the image and analyze the neighborhood pixels. This process, very similar to the convolution procedure, outputs another image with regions compressed or expanded (depending on the operation). Some examples of erosion and dilation are presented in Figure 3, in which it is possible to notice the behavior of each operation. Precisely, erosion affects brighter structures while dilation influences darker ones (with respect to the neighborhood defined by the SE).
(1) $\mathcal{E}(B, I)_{(i,j)} = \bigwedge \{ I((i,j)') \mid (i,j)' \in B(i,j) \}$
(2) $\delta(B, I)_{(i,j)} = \bigvee \{ I((i,j)') \mid (i,j)' \in B(i,j) \}$
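If the pixel values form an ordered set, then the erosion and dilation operations can be simplified. This is because the infimum and the supremum are respectively equivalent to the minimum and maximum functions when dealing with (finite) ordered sets. In this case, the erosion $\mathcal{E}$ and the dilation $\delta$ of an image $I$ by an SE $B$ can be defined as presented in Equations 3 and 4, respectively.
(3) $\mathcal{E}(B, I)_{(i,j)} = \min_{(i,j)' \in B(i,j)} I((i,j)')$
(4) $\delta(B, I)_{(i,j)} = \max_{(i,j)' \in B(i,j)} I((i,j)')$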
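To make Equations 3 and 4 concrete, the following NumPy sketch computes erosion and dilation as a minimum and a maximum over the SE neighborhood. The helper names (`shifted_views`, `erode`, `dilate`), the edge-replicating border handling, and the toy image are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def shifted_views(image, se):
    """Stack the copies of `image` shifted to each active cell of `se` (edge-padded)."""
    k_h, k_w = se.shape
    p_h, p_w = k_h // 2, k_w // 2
    padded = np.pad(image, ((p_h, p_h), (p_w, p_w)), mode="edge")
    h, w = image.shape
    return np.stack([padded[m:m + h, n:n + w]
                     for m in range(k_h) for n in range(k_w) if se[m, n]])

def erode(image, se):
    # Equation 3: minimum over the neighborhood selected by the SE.
    return shifted_views(image, se).min(axis=0)

def dilate(image, se):
    # Equation 4: maximum over the neighborhood selected by the SE.
    return shifted_views(image, se).max(axis=0)

image = np.zeros((5, 5), dtype=int)
image[1:4, 1:4] = 9                      # bright 3x3 square on a dark background
cross = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]], dtype=bool)  # assumed cross-shaped SE

eroded = erode(image, cross)
dilated = dilate(image, cross)
print(eroded)
print(dilated)
```

In this toy case the cross-shaped SE erodes the bright square down to its central pixel, while dilation extends the square by one pixel along each side.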
Based on these two fundamental transformations (erosion $\mathcal{E}$ and dilation $\delta$), other more complex morphological operations may be computed. The morphological opening, denoted as $\gamma$ and defined in Equation 5, is simply an erosion operation followed by the dilation (using the same structuring element $B$) of the eroded output. In contrast, a morphological closing $\varphi$ of an image, formally defined in Equation 6, is a dilation followed by the erosion (using the same SE) of the dilated output. Intuitively, while an erosion would affect all brighter structures, an opening flattens bright objects that are smaller than the size of the SE and, because of the dilation, mostly preserves the large bright areas. A similar conclusion can be drawn for darker structures when closing is performed. Examples of this behavior can be seen in Figure 3. It is important to highlight that, by using erosion and dilation transformations, opening and closing perform geodesic reconstruction in the image. Operations based on this paradigm belong to the class of filters that operate only on connected components (flat regions) and cannot introduce any new edge to the image. Furthermore, if a segment (or component) in the image is larger than the SE, it will be unaffected; otherwise, it will be merged into a brighter or darker adjacent region depending on whether a closing or an opening is applied. This process is crucial because it avoids the generation of distorted structures, which is an undesirable effect.
(5) $\gamma(B, I) = \delta(B, \mathcal{E}(B, I))$
(6) $\varphi(B, I) = \mathcal{E}(B, \delta(B, I))$
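The behavior described above can be checked on a small example. The sketch below composes erosion and dilation into opening and closing (Equations 5 and 6); the helper names, the edge-replicating borders, and the toy images are assumptions made only for illustration.

```python
import numpy as np

def shifted_views(image, se):
    """Stack the copies of `image` shifted to each active cell of `se` (edge-padded)."""
    k_h, k_w = se.shape
    p_h, p_w = k_h // 2, k_w // 2
    padded = np.pad(image, ((p_h, p_h), (p_w, p_w)), mode="edge")
    h, w = image.shape
    return np.stack([padded[m:m + h, n:n + w]
                     for m in range(k_h) for n in range(k_w) if se[m, n]])

def erode(image, se):
    return shifted_views(image, se).min(axis=0)

def dilate(image, se):
    return shifted_views(image, se).max(axis=0)

def opening(image, se):
    # Equation 5: erosion followed by dilation with the same SE.
    return dilate(erode(image, se), se)

def closing(image, se):
    # Equation 6: dilation followed by erosion with the same SE.
    return erode(dilate(image, se), se)

square = np.ones((3, 3), dtype=bool)

bright = np.zeros((8, 8), dtype=int)
bright[1:5, 1:5] = 5      # 4x4 bright object, larger than the SE
bright[6, 6] = 5          # isolated bright pixel, smaller than the SE
opened = opening(bright, square)

dark = 5 - bright         # same layout with dark objects on a bright background
closed = closing(dark, square)
```

Here the opening removes the isolated bright pixel and leaves the 4×4 object untouched, and the closing does the same for the dark counterparts.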
Other important morphological operations are the top-hats. The top-hat transform is an operation that extracts small elements and details from given images. There are two types of top-hat transformations: (i) the white one $T^{w}$, defined in Equation 7, in which the difference between the input image $I$ and its opening $\gamma$ is calculated, and (ii) the black one, denoted as $T^{b}$ and defined in Equation 8, in which the difference between the closing $\varphi$ and the input image is computed. The white top-hat operation preserves elements of the input image that are brighter than their surroundings but smaller than the SE. On the other hand, the black top-hat preserves elements that are darker than their surroundings and smaller than the SE. Examples of these two operations can be seen in Figure 3.
(7) $T^{w}(B, I) = I - \gamma(B, I)$
(8) $T^{b}(B, I) = \varphi(B, I) - I$
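A minimal sketch of Equations 7 and 8 follows, assuming the same illustrative helpers and border handling as before; the image, with one small bright detail and one small dark detail, is a made-up example.

```python
import numpy as np

def shifted_views(image, se):
    """Stack the copies of `image` shifted to each active cell of `se` (edge-padded)."""
    k_h, k_w = se.shape
    p_h, p_w = k_h // 2, k_w // 2
    padded = np.pad(image, ((p_h, p_h), (p_w, p_w)), mode="edge")
    h, w = image.shape
    return np.stack([padded[m:m + h, n:n + w]
                     for m in range(k_h) for n in range(k_w) if se[m, n]])

def erode(image, se):
    return shifted_views(image, se).min(axis=0)

def dilate(image, se):
    return shifted_views(image, se).max(axis=0)

def white_tophat(image, se):
    # Equation 7: image minus its opening.
    return image - dilate(erode(image, se), se)

def black_tophat(image, se):
    # Equation 8: closing minus the image.
    return erode(dilate(image, se), se) - image

square = np.ones((3, 3), dtype=bool)
image = np.full((8, 8), 4, dtype=int)   # signed dtype avoids wrap-around on subtraction
image[2, 2] = 9                          # small bright detail
image[5, 5] = 0                          # small dark detail

white = white_tophat(image, square)
black = black_tophat(image, square)
```

In this example the white top-hat isolates the bright detail (with its contrast of 5 over the background) and the black top-hat isolates the dark one (with contrast 4).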
Another important morphological operation based on erosions and dilations is the geodesic reconstruction. There are two types of geodesic reconstruction: by erosion and by dilation. For simplicity, only the former is formally detailed here; the latter can be obtained, by duality, using the same reasoning. The geodesic reconstruction by erosion $\rho^{\mathcal{E}}$, mathematically defined in Equation 9, receives two parameters as input: an image $I$ and an SE $B$. The image $I$ (also referred to in this operation as the mask image) is dilated by the SE $B$ ($\delta(B, I)$), creating the marker image $Y$ ($Y = \delta(B, I)$), which is responsible for delimiting which objects will be reconstructed during the process. An SE $B'$ (usually with an elementary composition [14]) and the marker image $Y$ are provided to the reconstruction operation $REC^{\mathcal{E}}$. This transformation, defined in Equation 10, reconstructs the marker image $Y$ (with respect to the mask image $I$) by recursively employing geodesic erosion (with the elementary SE $B'$) until idempotence is reached (i.e., $GE^{(h)} = GE^{(h+1)}$). In this case, a geodesic erosion $GE^{(1)}$, defined in Equation 11, consists of a pixel-wise maximum operation between an eroded (with elementary SE $B'$) marker image $Y$ and the mask image $I$. As aforementioned, by duality, a geodesic reconstruction by dilation $\rho^{\delta}$ can be defined, as presented in Equation 12, where $REC^{\delta}$ recursively applies the geodesic dilation, i.e., the pixel-wise minimum between the dilated marker image and the mask image. These two crucial operations try to preserve all objects of the image that are larger than the SE while removing small bright and dark areas, such as noise. Some examples of these operations can be seen in Figure 3.
(9) $\rho^{\mathcal{E}}(B, I) = REC^{\mathcal{E}}(B', I, Y), \quad Y = \delta(B, I)$
(10) $REC^{\mathcal{E}}(B', I, Y) = GE^{(h)}(B', I, Y) = GE^{(1)}\big(B', I, GE^{(h-1)}(B', I, Y)\big)$
(11) $GE^{(1)}(B', I, Y) = \max\{\mathcal{E}(B', Y), I\}$
(12) $\rho^{\delta}(B, I) = REC^{\delta}(B', I, \mathcal{E}(B, I))$
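The iterative procedure can be sketched as follows: dilate the mask to obtain the marker, then repeat the geodesic erosion (erode, then take the pixel-wise maximum with the mask) until nothing changes. The function names, the `max_iter` safeguard, the choice of a cross as elementary SE, and the toy mask are assumptions for illustration only.

```python
import numpy as np

def shifted_views(image, se):
    """Stack the copies of `image` shifted to each active cell of `se` (edge-padded)."""
    k_h, k_w = se.shape
    p_h, p_w = k_h // 2, k_w // 2
    padded = np.pad(image, ((p_h, p_h), (p_w, p_w)), mode="edge")
    h, w = image.shape
    return np.stack([padded[m:m + h, n:n + w]
                     for m in range(k_h) for n in range(k_w) if se[m, n]])

def erode(image, se):
    return shifted_views(image, se).min(axis=0)

def dilate(image, se):
    return shifted_views(image, se).max(axis=0)

def reconstruct_by_erosion(mask, se, elementary_se, max_iter=1000):
    marker = dilate(mask, se)                      # marker image
    for _ in range(max_iter):
        # Geodesic erosion: eroded marker, bounded below by the mask.
        updated = np.maximum(erode(marker, elementary_se), mask)
        if np.array_equal(updated, marker):        # idempotence reached
            break
        marker = updated
    return marker

square = np.ones((3, 3), dtype=bool)
cross = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=bool)

mask = np.full((8, 8), 5, dtype=int)
mask[1:5, 1:5] = 0        # large dark object
mask[6, 6] = 0            # small dark spot (noise)

reconstructed = reconstruct_by_erosion(mask, square, cross)
```

In this example the small dark spot is removed, whereas the large dark object is recovered with its original extent.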
Note that geodesic reconstruction operations require an iterative process until convergence. This procedure can be expensive, mainly when working with a large number of images (a common scenario when training neural networks [1]). An approximation of such operations, presented in Equations 13 and 14, can be achieved by performing just one transformation over the marker image with a structuring element larger than the one used to create the marker image. In other words, suppose that $B$ is the SE used to create the marker image; then $B'$, the SE used in the reconstruction step, should be larger than $B$, i.e., $B \subset B'$. This process is faster since only one iteration is required, but it may lead to worse results, given that the use of a large filter can make the reconstruction join objects that are close in the scene (a phenomenon known as leaking [14]).
(13) $\tilde{\rho}^{\mathcal{E}}(B, I) = \mathcal{E}(B', \delta(B, I))$
(14) $\tilde{\rho}^{\delta}(B, I) = \delta(B', \mathcal{E}(B, I))$
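A single-pass approximation along these lines can be sketched as one dilation with a small SE followed by one erosion with a larger SE. This is only one reading of the description above; the function name `approx_reconstruct_by_erosion`, the SE sizes (3×3 and 5×5), and the toy mask are assumptions.

```python
import numpy as np

def shifted_views(image, se):
    """Stack the copies of `image` shifted to each active cell of `se` (edge-padded)."""
    k_h, k_w = se.shape
    p_h, p_w = k_h // 2, k_w // 2
    padded = np.pad(image, ((p_h, p_h), (p_w, p_w)), mode="edge")
    h, w = image.shape
    return np.stack([padded[m:m + h, n:n + w]
                     for m in range(k_h) for n in range(k_w) if se[m, n]])

def erode(image, se):
    return shifted_views(image, se).min(axis=0)

def dilate(image, se):
    return shifted_views(image, se).max(axis=0)

def approx_reconstruct_by_erosion(mask, se, larger_se):
    # One dilation creates the marker; one erosion with a larger SE replaces the iteration.
    return erode(dilate(mask, se), larger_se)

small = np.ones((3, 3), dtype=bool)    # SE used to create the marker
large = np.ones((5, 5), dtype=bool)    # larger SE used for the single reconstruction pass

mask = np.full((10, 10), 5, dtype=int)
mask[2:6, 2:6] = 0        # large dark object (16 pixels)
mask[8, 8] = 0            # small dark spot (noise)

approx = approx_reconstruct_by_erosion(mask, small, large)
print((mask == 0).sum(), (approx == 0).sum())
```

The small dark spot is removed in a single pass, but in this toy case the surviving dark object grows by one pixel in every direction, which hints at why a large SE can make nearby objects merge.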
Although all previously defined morphological operations used a grayscale image $I$, they could have employed a binary image or even an image with several channels. In the latter case, morphological operations would be applied to each input channel independently and separately, generating an outcome with the same number of channels as the input.
IV Deep Morphological Networks
In this section, we present the proposed network, called Deep Morphological Network (or simply DeepMorphNet), capable of performing morphological operations while optimizing the structuring elements. Technically, this new network is strongly based on ConvNets, mainly because of the similarity between morphological operations and convolutional layers, since both employ a very similar processing operation. Therefore, this new network seeks to efficiently combine morphological operations and deep learning, aggregating the ability of the former to capture certain important types of image properties (such as borders and corners) and the feature learning step of the latter. Such a combination would bring advantages that could assist several applications in which borders and shape are considered essential. However, there are several challenges in fully integrating morphological operations and deep learning-based methods, especially convolutional neural networks.
In this specific case, a first challenge is due to the convolutional layers and their operations. Precisely, such layers, the basis of ConvNets, extract features from the input data using an optimizable filter by performing only linear operations, without supporting nonlinear ones. Formally, let us assume a 3D input $X(\cdot)$ of a convolutional layer as a mapping from coordinates ($\mathbb{Z}^3$) to the pixel-value domain ($\mathbb{Z}$ or $\mathbb{R}$). Analogously, the trainable filter (or weight) $W(\cdot)$ of such a layer can be seen as a mapping from 3D coordinates ($\mathbb{Z}^3$) to the real-valued weights ($\mathbb{R}$). A standard convolutional layer performs a convolution (denoted here as $S$) of the filter $W$ over the input $X$, according to Equation 15. Note that the output of this operation is the summation of the linear combination between input and filter (across both space and depth). Also, observe the difference between this operation and the morphological ones stated in Section III. This shows that the integration between morphological operations and convolutional layers is not straightforward.
(15) $S(W, X)_{(i,j)} = \sum_{m} \sum_{n} \sum_{l} W(m, n, l)\, X(i + m, j + n, l)$
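The contrast between the two kinds of operation can be illustrated numerically. The sketch below implements the sum over space and depth of Equation 15 naively and checks that it is linear in the input, whereas a neighborhood minimum is not. The function names, the "valid" border handling, and the test arrays are illustrative assumptions.

```python
import numpy as np

def conv_layer(x, w):
    """x: (H, W, C), w: (k_h, k_w, C). Sum of products across space and depth."""
    k_h, k_w, _ = w.shape
    h_out = x.shape[0] - k_h + 1
    w_out = x.shape[1] - k_w + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = np.sum(w * x[i:i + k_h, j:j + k_w, :])
    return out

def local_min(x, k):
    """Minimum over every k x k window of a single-channel image."""
    h_out = x.shape[0] - k + 1
    w_out = x.shape[1] - k + 1
    return np.array([[x[i:i + k, j:j + k].min() for j in range(w_out)]
                     for i in range(h_out)])

rng = np.random.default_rng(0)
x1 = rng.normal(size=(5, 5, 3))
x2 = rng.normal(size=(5, 5, 3))
w = rng.normal(size=(3, 3, 3))

# Linearity: conv(2*x1 + 3*x2) equals 2*conv(x1) + 3*conv(x2).
lhs = conv_layer(2.0 * x1 + 3.0 * x2, w)
rhs = 2.0 * conv_layer(x1, w) + 3.0 * conv_layer(x2, w)
conv_is_linear = np.allclose(lhs, rhs)

# The neighborhood minimum does not distribute over sums.
a = np.array([[0, 1], [1, 1]])
b = np.array([[1, 0], [1, 1]])
min_of_sum = local_min(a + b, 2)
sum_of_mins = local_min(a, 2) + local_min(b, 2)
print(conv_is_linear, min_of_sum, sum_of_mins)
```

Since the minimum of a sum differs from the sum of the minima in this example, no fixed set of convolution weights can reproduce the minimum filter for all inputs.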
Another important challenge is due to the optimization of nonlinear operations by the network. Technically, in ConvNets, a loss function $\mathcal{L}$ is defined to allow the evaluation of the network's current state and its optimization towards a better state. The objective of any network is to minimize this loss function by adjusting the trainable parameters (or filters) $W$. Such optimization is traditionally based on the derivatives of the loss function $\mathcal{L}$ w.r.t. the weights $W$. For instance, suppose Stochastic Gradient Descent (SGD) [1] is used to optimize a ConvNet. As presented in Equation 16, the optimization of the filters (towards a better state) depends directly on the partial derivatives of the loss function $\mathcal{L}$ w.r.t. the weights $W$ (scaled by a predefined learning rate $\alpha$). Those partial derivatives are usually obtained using the backpropagation algorithm [1], which relies on the fact that all operations of the network are easily differentiable (w.r.t. the filters), including the convolution presented in Equation 15. However, this algorithm cannot be directly applied to nonlinear operations such as the presented morphological ones, because those operations do not have simple, well-behaved derivatives.
(16) $W \leftarrow W - \alpha \dfrac{\partial \mathcal{L}}{\partial W}$
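As a small worked instance of the gradient-based update described above, the sketch below applies the rule $W \leftarrow W - \alpha\, \partial \mathcal{L} / \partial W$ to a made-up quadratic loss. The loss, the target vector, the learning rate, and the number of steps are all assumptions chosen only to keep the example self-contained.

```python
import numpy as np

target = np.array([1.0, -2.0, 0.5])   # hypothetical optimum of the toy loss
weights = np.zeros(3)
learning_rate = 0.1

for _ in range(100):
    # Toy loss L(W) = ||W - target||^2, whose gradient is 2 * (W - target).
    gradient = 2.0 * (weights - target)
    weights = weights - learning_rate * gradient   # gradient-descent update

print(weights)
```

The update only works because the gradient is available in closed form at every step, which is exactly what has to be arranged for the morphological operations.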
To overcome such challenges, we propose a novel network, based on ConvNets, that employs depthwise and pointwise convolutions with depthwise pooling to recreate and optimize morphological operations, from basic to complex ones. First, Section IV-A introduces the basic concepts used as a foundation for the proposed Deep Morphological Network. Section IV-B presents the proposed neurons responsible for performing morphological operations. The proposed morphological layer, composed of the proposed neurons, is presented in Section IV-C. The optimization of the filters (also called structuring elements) of such layers is explained in Section IV-D. Finally, the proposed DeepMorphNet architectures are introduced in Section IV-E.
IV-A Basic Morphological Framework
The combination of morphological operations and deep learning is subject to an essential condition: the new technique should be capable of conserving the end-to-end learning strategy, i.e., it should integrate into the current training procedure. The reason for this condition is twofold: (i) to extract the benefits of the feature learning step (i.e., optimization of the filters) from deep learning, and (ii) to allow the combination of morphological operations with any other existing operation explored by deep learning-based approaches. Towards this objective, we propose a new framework, capable of performing morphological erosion and dilation, based on operations that meet this condition, i.e., neurons based on this framework can be easily integrated into the standard training process. The processing of this framework can be separated into two steps. The first one employs depthwise convolution [19] to perform a delimitation of features, based on the neighborhood (or filter). As defined in Equation 17 for a single 2D filter $W$ and a single input channel $X$, this type of convolution differs from standard ones since it handles the input depth independently, applying the same filter to every input channel. In other words, suppose that a layer performing depthwise convolution has $k$ filters and receives an input with $l$ channels; then the processed outcome would be an image of $k \times l$ channels, since each of the $k$ filters would be applied to each of the $l$ input channels. The use of depthwise convolution simplifies the introduction of morphological operations into the deep network since the linear combination performed by this convolution does not consider the depth (as in the standard convolutions presented in Equation 15). This process is fundamental for the recreation of morphological operations since such transformations can only process one single channel at a time (as mentioned in Section III).
(17) $S_d(W, X)_{(i,j)} = \sum_{m} \sum_{n} W(m, n)\, X(i + m, j + n)$
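The channel bookkeeping of this first step can be sketched as follows, assuming a channels-first layout and "valid" borders; `depthwise_conv` and the array shapes are illustrative names and choices, not the authors' implementation.

```python
import numpy as np

def depthwise_conv(x, filters):
    """x: (C, H, W), filters: (K, k, k). Every filter is applied to every channel,
    with no summation across depth, giving K * C output maps."""
    k = filters.shape[1]
    h_out = x.shape[1] - k + 1
    w_out = x.shape[2] - k + 1
    maps = []
    for filt in filters:
        for channel in x:
            out = np.empty((h_out, w_out))
            for i in range(h_out):
                for j in range(w_out):
                    out[i, j] = np.sum(filt * channel[i:i + k, j:j + k])
            maps.append(out)
    return np.stack(maps)

rng = np.random.default_rng(0)
x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)   # 2 input channels
filters = rng.normal(size=(3, 3, 3))                     # 3 filters of size 3x3

out = depthwise_conv(x, filters)
print(out.shape)   # 3 filters x 2 channels = 6 maps
```

With 3 filters and 2 input channels the output has 6 maps, which is the multiplicative growth in channels that the text refers to.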
However, just using this type of convolution does not allow the reproduction of morphological transformations, given that a spatial linear combination is still performed by this convolutional operation. To overcome this, all filters $W$ are first converted into binary ones and then used in the depthwise convolution operation. Precisely, this binarization process, referred to hereafter as max-binarize, activates only the highest value of the filter, i.e., only the highest value is considered active, while all others are deactivated. Formally, the max-binarize $f_b(\cdot)$ is a function that receives as input the real-valued weights $W$ and processes them according to Equation 18, where $\mathbb{1}\{condition\}$ is the indicator function, which returns 1 if the condition is true and 0 otherwise. This process outputs a binary version of the weights, denoted here as $W^b$, in which only the highest value (in $W$) is activated (in $W^b$). By using this binary filter $W^b$, the linear combination performed by the depthwise convolution can be seen as a simple operation that preserves the exact value of the single pixel activated by this binary filter.
(18) $W^b_{(i,j)} = f_b(W)_{(i,j)} = \mathbb{1}\{ W(i,j) = \max_{(m,n)} W(m,n) \}$
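The effect of the max-binarize step can be seen on a single filter and a single image patch. In the sketch below, the function name `max_binarize`, the weights, and the patch are illustrative; note that, as written here, tied maxima would all be activated, which is an assumption about a case the text does not discuss.

```python
import numpy as np

def max_binarize(weights):
    """Return a binary filter with 1 at the largest weight and 0 elsewhere."""
    return (weights == weights.max()).astype(weights.dtype)

weights = np.array([[0.1, -0.3, 0.2],
                    [0.9,  0.0, 0.4],
                    [-0.5, 0.3, 0.7]])
binary = max_binarize(weights)

# With a one-hot filter, the linear combination just copies one pixel of the patch.
patch = np.arange(9.0).reshape(3, 3)
picked = np.sum(binary * patch)
print(binary)
print(picked)
```

The largest weight sits at row 1, column 0, so the "convolution" with the binarized filter returns exactly the patch value at that position.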
However, only preserving one pixel w.r.t. the binary filter is not enough to reproduce the morphological operations, since they usually operate over a neighborhood (defined by the SE $B$). In order to reproduce this neighborhood concept in the depthwise convolution operation, we decompose each filter $W$ into several ones that, when superimposed, retrieve the final SE $B$. More specifically, suppose a filter $W$ with size $s \times s$. Since only one position can be activated at a time (because of the aforementioned binarization process), this filter has a total of $s^2$ possible activation variations. Suppose also an SE with size $s \times s$. As explained in Section III, such an SE defines the pixel neighborhood and can have any feasible configuration. Considering each position of this SE independently, each one can be considered activated (when that position of the neighborhood should be taken into account) or deactivated (when the neighboring position should not be taken into account). Therefore, an SE of size $s \times s$ has $s^2$ positions that can each be activated independently. Based on all this, a set of $s^2$ max-binary filters with size $s \times s$ is able to cover all possible configurations of an SE with the same size, i.e., with this set, it is possible to recreate any feasible configuration of an SE. Precisely, a set of $s^2$ filters with size $s \times s$ can be seen as a decomposed representation of the neighborhood concept (or of the SE), given that those filters (with only a single activated position each) can be superimposed in order to retrieve any possible neighborhood configuration defined by the SE. Supported by this idea, any $s \times s$ SE can be decomposed into $s^2$ filters, each one with size $s \times s$ and only one activated value. By doing this, the concept of neighborhood introduced by the SE can be exploited in the depthwise convolution. Particularly, a set of $s^2$ filters $W$ can be converted into binary weights $W^b$ (via the aforementioned max-binarize function $f_b$) and then used to process the input data. When exploited by Equation 17, each of these binary filters will preserve only one pixel, which is directly related to one specific position of the neighborhood. Thus, technically, this first step recreates, in depth, the neighborhood of a pixel delimited by an SE $B$, which is essentially represented by $s^2$ binary filters $W^b$ of size $s \times s$.
(19) $B_{(i,j)} = \max_{k} W^b_k(i, j)$
Since the SE $B$ was decomposed in depth, in order to retrieve it, a depthwise operation, presented in Equation 19, must be performed over the binary filters $W^b$. Analogously, a depthwise operation is also required to retrieve the final outcome, i.e., the eroded or dilated image. This is the second step of the proposed framework, which is responsible for extracting the relevant information based on the depthwise neighborhood. In this step, an operation called depthwise pooling $P(\cdot)$ performs a pixel- and depth-wise process over the $s^2$ outcomes $y_k$ (of the decomposed filters), producing the final morphological outcome. This pooling operation is able to actually output the morphological erosion and dilation by using pixel- and depth-wise minimum and maximum functions, as presented in Equations 20 and 21, respectively. Note that the reproduction of morphological operations using minimum and maximum functions is only possible because the set created with each pixel position along the channels can be considered an ordered set (similar to the definition presented in Section III). The outcome of this second step is the final (eroded or dilated) feature map that will be exploited by any subsequent process.
(20) $P^{\mathcal{E}}(y)_{(i,j)} = \min_{k} y_k(i, j)$
(21) $P^{\delta}(y)_{(i,j)} = \max_{k} y_k(i, j)$
Equations 22 and 23 compile the two steps performed by the proposed framework for morphological erosion and dilation, respectively. This operation, denoted here as $M(\cdot)$, performs a depthwise convolution $S_d$ (first step), which uses (binary) filters $f_b(W_k)$ that decompose the representation of the neighborhood concept introduced by SEs, followed by a pixel- and depth-wise pooling operation (second step), outputting the final morphological (eroded or dilated) feature maps. Note the similarity between these functions and Equations 3 and 4 presented in Section III. The main difference between these equations is in the neighborhood definition. While in standard morphology the neighborhood of a pixel is defined spatially (via the SE $B$), in the proposed framework the neighborhood is defined along the channels due to the decomposition of the SE into several filters and, therefore, the minimum and maximum operations also operate over the channels.
(22) $M^{\mathcal{E}}(W, I)_{(i,j)} = P^{\mathcal{E}}\big(S_d(f_b(W), I)\big)_{(i,j)} = \min_{k} S_d(f_b(W_k), I)_{(i,j)}$
(23) $M^{\delta}(W, I)_{(i,j)} = P^{\delta}\big(S_d(f_b(W), I)\big)_{(i,j)} = \max_{k} S_d(f_b(W_k), I)_{(i,j)}$
A visual example of the proposed framework being used for morphological erosion is presented in Figure 4. In this example, the depthwise convolution has 4 filters $W$ of equal size, which together represent a single SE. The filters $W$ are first converted into binary ones using the max-binarize function $f_b$, presented in Equation 18. Then, each binary filter $W^b$ is used to process (step 1, blue dashed rectangle) each input channel (which, for simplicity, is only one in this example) using Equation 17. In this process, each binary filter $W^b$ outputs an image in which each pixel has a direct connection to the one position activated in that filter (which, in turn, represents a neighborhood position activated in the SE $B$). The output is then processed (step 2, green dotted rectangle) via a pixel- and depth-wise min-pooling (according to Equation 20) to produce the final eroded output. Note that the binary filters $W^b$, when superimposed (using Equation 19), retrieve the final SE $B$. The dotted line shows that the processing of the input with the superimposed SE $B$ using the standard morphological erosion ($\mathcal{E}$, presented in Equation 3) results in the same eroded output image produced by the proposed morphological erosion.
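The equivalence illustrated above can also be checked numerically. The following sketch max-binarizes a few random real-valued filters, runs the two steps (depthwise convolution, then depth-wise min-pooling), and compares the result with a standard erosion that uses the superimposed SE. All names, the number and size of the filters, the edge-replicating borders, and the random data are assumptions made for illustration.

```python
import numpy as np

def max_binarize(weights):
    """Binary filter with 1 at the largest weight and 0 elsewhere."""
    return (weights == weights.max()).astype(float)

def depthwise_conv_same(image, filt):
    """Edge-padded cross-correlation of one channel with one filter (step 1)."""
    k = filt.shape[0]
    p = k // 2
    padded = np.pad(image, p, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w))
    for m in range(k):
        for n in range(k):
            out += filt[m, n] * padded[m:m + h, n:n + w]
    return out

def morph_erosion(image, filters):
    maps = np.stack([depthwise_conv_same(image, max_binarize(f)) for f in filters])
    return maps.min(axis=0)            # depth-wise min-pooling (step 2)

def standard_erosion(image, se):
    k = se.shape[0]
    p = k // 2
    padded = np.pad(image, p, mode="edge")
    h, w = image.shape
    views = [padded[m:m + h, n:n + w]
             for m in range(k) for n in range(k) if se[m, n]]
    return np.min(views, axis=0)

rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(6, 6)).astype(float)
filters = rng.normal(size=(4, 3, 3))   # 4 real-valued 3x3 filters (hypothetical)

binary_filters = np.stack([max_binarize(f) for f in filters])
se = binary_filters.max(axis=0)        # superimposed binary filters give the SE

framework_output = morph_erosion(image, filters)
reference_output = standard_erosion(image, se)
print(np.array_equal(framework_output, reference_output))
```

Replacing the minimum with a maximum in both functions gives the analogous check for dilation.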
IV-B Morphological Processing Units
The proposed framework is the foundation of all proposed morphological processing units (or neurons). However, although it is able to reproduce morphological erosion and dilation, it has an important drawback: since it employs depthwise convolution, the number of outcomes can grow exponentially, given that, as previously explained, each input channel is processed independently by each processing unit. To overcome this issue and make the proposed technique more scalable, we use a pointwise convolution [19] to force each processing unit to output only one image (or feature map). Specifically, every neuron proposed in this work has the same two-part design: (i) the core operation (fundamentally based on the proposed framework), in which the processing unit performs its morphological transformation and outputs multiple outcomes, and (ii) the pointwise convolution [19], which performs a pixelwise and depthwise (linear) combination of these outputs, producing only one outcome. Observe that, although the pointwise convolution combines the multiple outcomes along the depth, it does not learn any spatial feature, since it is a pixelwise (or pointwise) operation that handles each pixel separately. This design gives the morphological neuron exactly the same input and output as a standard processing unit, i.e., it receives as input an image with any number of bands and outputs a single new representation. It is interesting to notice that this design employs depthwise and pointwise convolutions [19], closely resembling depthwise separable convolutions [19], but with extra steps and binary decomposed filters. The next sections explain the core operation of each proposed morphological processing unit. Note that these neurons were conceived to be equivalent in terms of operations: all of them have exactly two operations (based on the proposed framework).
Also observe that, although not mentioned in the next sections, the pointwise convolution is present in all processing units proposed in this work.
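The role of the pointwise convolution in this two-part design can be illustrated with a minimal NumPy sketch. It assumes that the core operation has already produced one outcome per input channel; the function name `pointwise_conv`, the array shapes, and the weight values are hypothetical and chosen only for illustration.

```python
import numpy as np

def pointwise_conv(maps, weights, bias=0.0):
    """1x1 convolution: weighted sum over the channel axis, pixel by pixel.

    maps has shape (C, H, W) and weights has shape (C,); the result has
    shape (H, W), i.e., a single feature map.
    """
    return np.tensordot(weights, maps, axes=([0], [0])) + bias

# Three hypothetical outcomes of a core morphological operation (C = 3).
maps = np.stack([np.full((4, 4), v) for v in (1.0, 2.0, 3.0)])
weights = np.array([0.5, 0.25, 0.25])  # illustrative values, normally learned

single_map = pointwise_conv(maps, weights)
print(single_map.shape)  # one (H, W) map regardless of the number of channels
```

Because each output pixel depends only on the pixels at the same position across channels, this step mixes channels without looking at any spatial neighborhood.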
IV-B1 Composed Processing Units
The newly introduced framework allows a deep network to perform erosion and dilation, the two basic operations of morphology. However, instead of using such operations independently, the first proposed processing unit is based on both morphological transformations. The so-called composed processing units, which are entirely based on the proposed framework, have at their core a morphological erosion followed by a dilation (or vice versa), without any constraint on the weights (i.e., on the SE). The motivation behind the composed processing unit lies in the potential of the learned representation: while erosion and dilation alone can learn simple representations, their combination is able to capture more complex information. Formally, Equations 24 and 25 present the two possible configurations of the morphological composed neurons. It is important to notice that the weights of the two operations of this neuron are independent. A visual representation of one type of composed neuron can be seen in Figure (a).
(24) 
(25) 
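As an illustration of the two configurations, the following sketch composes a flat (binary-SE) erosion and dilation with two independent structuring elements. It uses the textbook min/max definition of the operators rather than the depthwise-convolution framework of this paper, and all function names, the test image, and the two SEs are assumptions made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _morph(img, se, fill, reduce_fn):
    # Slide the SE over the padded image and reduce over the pixels it selects.
    ph, pw = se.shape[0] // 2, se.shape[1] // 2
    padded = np.pad(np.asarray(img, dtype=float), ((ph, ph), (pw, pw)),
                    constant_values=fill)
    windows = sliding_window_view(padded, se.shape)  # (H, W, kh, kw), odd SEs
    return reduce_fn(np.where(se.astype(bool), windows, fill), axis=(-2, -1))

def erode(img, se):
    return _morph(img, se, np.inf, np.min)             # minimum over the SE

def dilate(img, se):
    return _morph(img, se[::-1, ::-1], -np.inf, np.max)  # maximum over reflected SE

def composed_erosion_dilation(img, se_first, se_second):
    return dilate(erode(img, se_first), se_second)

def composed_dilation_erosion(img, se_first, se_second):
    return erode(dilate(img, se_first), se_second)

img = np.zeros((9, 9))
img[1, 1] = 1.0          # isolated pixel
img[4:8, 4:8] = 1.0      # 4x4 square
square = np.ones((3, 3))
cross = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

out = composed_erosion_dilation(img, square, cross)
print(out[1, 1], out.sum())  # the isolated pixel is removed
```

Since the two SEs are unrelated, the result is in general neither an opening nor a closing, which is the extra freedom the composed units rely on.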
IV-B2 Opening and Closing Processing Units
Aside from implementing morphological erosion and dilation, the proposed framework also supports the implementation of other, more complex, morphological operations (or their approximations). The most intuitive and simple transformations to implement are the opening and closing. As stated in Section III, an opening is simply an erosion followed by a dilation of the eroded output (Equation 5), while closing is the reverse operation (Equation 6). In both cases, the two basic operations (erosion and dilation or vice versa) use the same SE. Based on this, the implementation of the opening and closing processing units using the proposed framework is straightforward. Precisely, the core of such neurons is very similar to that of the composed processing units, except that the filters of the two basic morphological operations are tied so that they use the same weights, i.e., the same SE. A visual representation of the proposed opening neuron is presented in Figure (b). Formally, Equations 26 and 27 define the opening and closing morphological neurons, respectively. Note the similarity between these functions and Equations 5 and 6.
(26) 
(27) 
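The effect of tying the two operations to the same SE can be checked numerically. The sketch below, again based on textbook flat morphology and not on the authors' implementation, builds opening and closing from a shared SE and verifies their classical properties on a random image; the helper names and the image are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _morph(img, se, fill, reduce_fn):
    ph, pw = se.shape[0] // 2, se.shape[1] // 2
    padded = np.pad(np.asarray(img, dtype=float), ((ph, ph), (pw, pw)),
                    constant_values=fill)
    windows = sliding_window_view(padded, se.shape)
    return reduce_fn(np.where(se.astype(bool), windows, fill), axis=(-2, -1))

def erode(img, se):
    return _morph(img, se, np.inf, np.min)

def dilate(img, se):
    return _morph(img, se[::-1, ::-1], -np.inf, np.max)

def opening(img, se):
    return dilate(erode(img, se), se)   # same SE in both steps

def closing(img, se):
    return erode(dilate(img, se), se)   # same SE in both steps

rng = np.random.default_rng(0)
img = rng.random((12, 12))
se = np.ones((3, 3))

opened, closed = opening(img, se), closing(img, se)
print(np.all(opened <= img), np.all(closed >= img))
print(np.array_equal(opening(opened, se), opened))  # idempotence
```

These properties (opening never brightens, closing never darkens, and both are idempotent) hold only because the SE is shared, which is what distinguishes these units from the composed ones.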
IV-B3 Top-hat Processing Units
The implementation of other, more complex, morphological operations is somewhat trickier. This is the case for the top-hat operations, which require both the input and the processed data to generate the final outcome. Therefore, for such operations, a skip connection [1, 32] (based on the identity mapping) is employed to forward the input data, allowing it to be further processed. The core of the top-hat processing units is composed of three parts: (i) an opening or closing morphological processing unit (depending on the type of top-hat), (ii) a skip connection, which forwards the input data, and (iii) a subtraction function that operates over the data of both previous parts, generating the final outcome. A visual representation of the white top-hat neuron is presented in Figure (c). This operation and its counterpart (the black top-hat) are formally defined in Equations 28 and 29, respectively.
(28) 
(29) 
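The three parts of a top-hat unit (opening or closing, skip connection, subtraction) can be sketched as follows. This is a minimal illustration with textbook flat morphology; the variable `skip` simply stands for the identity mapping of the input, and the names and the test image are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _morph(img, se, fill, reduce_fn):
    ph, pw = se.shape[0] // 2, se.shape[1] // 2
    padded = np.pad(np.asarray(img, dtype=float), ((ph, ph), (pw, pw)),
                    constant_values=fill)
    windows = sliding_window_view(padded, se.shape)
    return reduce_fn(np.where(se.astype(bool), windows, fill), axis=(-2, -1))

def erode(img, se):
    return _morph(img, se, np.inf, np.min)

def dilate(img, se):
    return _morph(img, se[::-1, ::-1], -np.inf, np.max)

def white_top_hat(img, se):
    skip = img                                   # skip connection (identity)
    return skip - dilate(erode(img, se), se)     # input minus its opening

def black_top_hat(img, se):
    skip = img                                   # skip connection (identity)
    return erode(dilate(img, se), se) - skip     # closing minus the input

img = np.zeros((7, 7))
img[3, 3] = 1.0                                  # a single bright pixel
se = np.ones((3, 3))

white, black = white_top_hat(img, se), black_top_hat(img, se)
print(white.sum(), black.sum())
```

Here the opening removes the bright pixel, so the white top-hat returns exactly that pixel, while the closing leaves the image unchanged and the black top-hat is empty.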
IV-B4 Geodesic Reconstruction Processing Units
Similarly to the previous processing units, the geodesic reconstruction also requires both the input and the processed data in order to produce the final outcome. Hence, the implementation of this important operation is also based on skip connections. In addition, as presented in Section III, reconstruction operations require an iterative process. Although this procedure is capable of producing better outcomes, its introduction in a deep network is not straightforward (given that each process can take a different number of iterations). For this reason, the reconstruction processing units proposed in this work are an approximation, in which just one transformation over the marker image is performed (instead of several iterations). Particularly, the input is processed by two basic morphological operations (without any iteration) and an elementwise max or min operation (depending on the reconstruction) is performed over the input and processed images. This concept is formally presented in Equations 30 and 31 for reconstruction by erosion and dilation, respectively. A visual representation of the processing unit for reconstruction by erosion is presented in Figure (d). Note that the SE used in the reconstruction of the marker image is a dilated version of the SE employed to create such image, i.e., the SE exploited in the second morphological operation is a dilated version of the SE employed in the first transformation.
(30) 
(31) 
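A single-step approximation of this kind can be sketched as below. The sketch follows the textbook pairing (an eroded marker, one dilation, then an elementwise min with the input for reconstruction by dilation, and the dual for reconstruction by erosion); whether this matches Equations 30 and 31 term by term is an assumption, as are the function names and the choice of a 5x5 SE as the "dilated version" of the 3x3 one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _morph(img, se, fill, reduce_fn):
    ph, pw = se.shape[0] // 2, se.shape[1] // 2
    padded = np.pad(np.asarray(img, dtype=float), ((ph, ph), (pw, pw)),
                    constant_values=fill)
    windows = sliding_window_view(padded, se.shape)
    return reduce_fn(np.where(se.astype(bool), windows, fill), axis=(-2, -1))

def erode(img, se):
    return _morph(img, se, np.inf, np.min)

def dilate(img, se):
    return _morph(img, se[::-1, ::-1], -np.inf, np.max)

def approx_reconstruction_by_dilation(img, se, se_large):
    marker = erode(img, se)                          # first operation
    return np.minimum(img, dilate(marker, se_large))  # one geodesic step

def approx_reconstruction_by_erosion(img, se, se_large):
    marker = dilate(img, se)                         # first operation
    return np.maximum(img, erode(marker, se_large))   # one geodesic step

img = np.zeros((9, 9))
img[1, 1] = 1.0              # isolated pixel
img[4:8, 4:8] = 1.0          # 4x4 square
se = np.ones((3, 3))
se_large = np.ones((5, 5))   # a 3x3 SE dilated by itself (assumed choice)

rec_dil = approx_reconstruction_by_dilation(img, se, se_large)
rec_ero = approx_reconstruction_by_erosion(img, se, se_large)
print(rec_dil[1, 1], rec_dil[4:8, 4:8].min())
```

In this example the isolated pixel is removed while the square is restored with its original borders, which is the edge-preserving behavior that motivates reconstruction over a plain opening.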
IV-C Morphological Layer
After defining the processing units, we are able to formally specify the morphological layers, which provide the essential tools for the creation of the DeepMorphNets. Similar to the standard convolutional layer, this one is composed of several processing units. However, the proposed morphological layer has two main differences when conceptually compared to the standard one. The first is related to the neurons that compose the layers. In convolutional layers, although the filter of each neuron can be different, the operation performed by every processing unit is a simple convolution. On the other hand, there are several types of morphological processing units, from opening and closing to geodesic reconstruction. Therefore, a single morphological layer can be composed of several neurons that may be performing different operations. This allows the layer to produce distinct (and possibly complementary) outputs, increasing the heterogeneity of the network and, consequently, its generalization capacity. The second difference is the absence of activation functions. More specifically, in modern architectures, convolutional layers are usually composed of a convolution operation followed by an activation function (such as ReLU [33]) that explicitly maps the data into a nonlinear space. In morphological layers, there are only processing units and no activation function is employed.
Figure 6 presents the concept of a single morphological layer. Observe that each neuron performs a specific operation and, supported by the pointwise convolution, outputs only one feature map, regardless of the number of input channels. Hence, the number of output maps is directly tied to the number of neurons in that layer: in Figure 6, the layer produces one feature map per neuron.
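The following sketch puts these pieces together for a toy layer with four heterogeneous neurons. It is a simplification under stated assumptions: textbook flat morphology instead of the proposed framework, the same SE for every channel and neuron, and fixed (not learned) pointwise weights; all names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _morph(x, se, fill, reduce_fn):
    # Apply a flat SE to every channel of x, with x of shape (C, H, W).
    ph, pw = se.shape[0] // 2, se.shape[1] // 2
    padded = np.pad(x, ((0, 0), (ph, ph), (pw, pw)), constant_values=fill)
    windows = sliding_window_view(padded, se.shape, axis=(1, 2))
    return reduce_fn(np.where(se.astype(bool), windows, fill), axis=(-2, -1))

def erode(x, se):
    return _morph(x, se, np.inf, np.min)

def dilate(x, se):
    return _morph(x, se[::-1, ::-1], -np.inf, np.max)

def neuron(x, core, pointwise_weights):
    # Core operation on each channel, then a 1x1 combination into one map.
    return np.tensordot(pointwise_weights, core(x), axes=([0], [0]))

se = np.ones((3, 3))
cores = [
    lambda x: dilate(erode(x, se), se),       # opening
    lambda x: erode(dilate(x, se), se),       # closing
    lambda x: x - dilate(erode(x, se), se),   # white top-hat
    lambda x: erode(dilate(x, se), se) - x,   # black top-hat
]

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))                     # input with 3 channels
w = np.full(3, 1.0 / 3.0)                     # illustrative pointwise weights

# No activation function is applied after the neurons.
layer_output = np.stack([neuron(x, core, w) for core in cores])
print(layer_output.shape)                     # one map per neuron
```

The output depth equals the number of neurons (four here), independently of the three input channels.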
IV-D Optimization
Aside from defining the morphological layer, as introduced, we must optimize its parameters, i.e., the filters. Since the proposed morphological layer uses common (differentiable) operations already employed in other existing deep learning-based methods, the optimization of the filters is straightforward. In fact, the same traditional techniques employed in the training of any deep learning-based approach, such as feedforward, backpropagation, and Stochastic Gradient Descent (SGD) [1], can also be used to optimize a network composed of morphological layers.
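As a rough illustration of how filters that are binarized in the forward pass can still be optimized with plain SGD, consider the toy sketch below. It is not the authors' algorithm: the thresholding rule in `binarize` is a hypothetical stand-in for the binarization defined earlier in the paper, the gradient is passed through the binarization unchanged (a straight-through assumption), and the data are artificial impulse inputs.

```python
import numpy as np

def binarize(w, threshold=0.5):
    # Hypothetical binarization rule, used only for this illustration.
    return (w > threshold).astype(float)

inputs = np.eye(4)                           # toy inputs: one impulse per weight
target_mask = np.array([0.0, 1.0, 1.0, 0.0])  # binary filter we hope to recover
targets = inputs @ target_mask

weights = np.array([0.9, 0.1, 0.8, 0.2])     # real-valued parameters
learning_rate = 0.2

for step in range(20):
    w_bin = binarize(weights)                # forward pass uses binary weights
    outputs = inputs @ w_bin
    error = outputs - targets
    # Backward pass with real-valued gradients; the binarization is treated
    # as the identity when differentiating (straight-through assumption).
    grad = 2.0 * inputs.T @ error / len(inputs)
    weights = weights - learning_rate * grad  # SGD update on real weights

print(binarize(weights))
```

The optimizer only ever updates the real-valued weights; the binary filter is re-derived from them at every forward pass, so a weight changes its binary value only once it has drifted across the threshold.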
The whole training procedure is detailed in Algorithm 1. Given the training data, the first step is the feedforward, comprised in the loop from line 2 to 12. In line 4, the weights of the first depthwise convolution are converted into binary (according to Equation 18). Then, in line 5, the first depthwise convolution is performed, while the first depthwise pooling is executed in line 6. The same operations are repeated in lines 8 to 10 for the second depthwise convolution and pooling. Finally, in line 11, the pointwise convolution is carried out. After the forward propagation, the total error of the network can be estimated. With this error, the gradients of the last layer can be directly estimated (line 14). These gradients can be used by the backpropagation algorithm to calculate the gradients of the inner layers. In fact, this is the process performed in the second training step, comprised in the loop from line 15 to 22. It is important to highlight that, during the backpropagation process, the gradients are calculated normally, using real-valued numbers (and not binary). Precisely, lines 16 and 17 are responsible for the optimization of the pointwise convolution: line 16 propagates the error of a specific pointwise convolution to the previous operation, while line 17 calculates the error of that specific pointwise convolution operation. The same process is repeated for the second and then for the first depthwise convolutions (lines 18-19 and 20-21, respectively). Note that, during the backpropagation, the depthwise pooling is not optimized, since this operation has no parameters and only passes the gradients to the previous layer (similar to the backpropagation employed in the max-pooling layers commonly explored in ConvNets). The third and last step of the training process is the update of the weights and optimization of the network. This process is comprised in the loop from line 24 to 28. Observe that, for simplicity, Algorithm 1 uses SGD to optimize the network; however, any other optimization algorithm could be exploited. For a specific layer, line 25 updates the weights of the pointwise convolution, while lines 26 and 27 update the parameters of the first and second depthwise convolutions, respectively.
IV-E DeepMorphNet Architecture
With all the fundamentals defined, we can finally specify the DeepMorphNet architectures exploited in this work. Particularly, three networks, composed of morphological and fully connected layers, were proposed for image classification. Although such architectures have distinct designs, the pointwise convolutions exploited in the morphological layers always have the same configuration: kernel of size 1×1, stride 1, and no padding. Furthermore, all networks receive as input images with pixels, use cross-entropy as the loss function, and Stochastic Gradient Descent as the optimization algorithm [1].
The first network, presented in Figure (a), is the simplest one, having a single layer composed of one morphological opening (with kernel size of , stride 1 and padding 5 for both depthwise convolutions). In fact, this architecture was only designed to be used with the proposed synthetic datasets (as introduced in Section V-A1). Because of this, such network is referenced hereafter as DeepMorphSynNet. Technically, this network was only conceived to validate the learning process of the proposed framework, as explained in Section VI-A.
The second proposed network, presented in Figure (b), is a morphological version of the well-known LeNet architecture [34] and is therefore called DeepMorphLeNet. Formally, this architecture is composed of two morphological and three fully connected layers. The first layer has 6 neurons equally divided into the two types of composed processing units. The processing units of this first layer have kernels of size , with stride and padding equal to 2 (for both depthwise convolutions). The second morphological layer has 16 neurons: 3 of the first type of composed processing units, 3 of the second type of composed neurons, 3 reconstruction by erosion and 2 by dilation, and 2 white and 3 black top-hats. This layer has kernels of size , with stride 1 and padding 2 for both depthwise convolutions. After the morphological layers, three fully connected layers are responsible for learning high-level features and performing the final classification. The first has 120 neurons while the second has 84. Both use ReLUs [33] as the activation function. Finally, the last fully connected layer has a number of neurons equal to the number of classes of the training dataset and uses softmax as the activation function.
To analyze the effectiveness of the technique in a more complex scenario, we proposed a larger network based on the well-known AlexNet [4] architecture. However, in order to have more control over the trainable parameters, the proposed morphological version of the AlexNet architecture [4], called DeepMorphAlexNet and presented in Figure (c), has the same number of layers but fewer neurons in each layer. The first morphological layer has 8 processing units: 4 of the first type of composed neurons and 4 of the second type. Furthermore, in this layer, the kernels have size of (for both depthwise convolutions), with stride of 1 and 5, and padding of 5 and 2, for the first and second depthwise convolutions, respectively. The second layer has 24 neurons, with 4 for each of the following operations: first and second types of composed processing units, reconstruction by erosion and by dilation, and white and black top-hats. In this layer, the kernels have size of , with stride 1 and padding 2 for both depthwise convolutions. The third morphological layer has 48 neurons equally divided into the two types of composed processing units. This layer has kernels of size (for both convolutions), with stride 1 and 3, and padding 1 and 0, for the first and second depthwise convolutions, respectively. The fourth layer has 32 neurons: 6 of each of the first and second types of composed processing units, 5 reconstruction by erosion and 5 by dilation, and 5 white and 5 black top-hats. Moreover, this layer has, for both convolutions, the following configuration: kernels of size , and stride and padding equal to 1. The fifth (and last) morphological layer has 32 neurons, also equally divided into the two types of composed processing units. This layer has kernels of size (for both convolutions), with stride of 1 and 2, and padding of 1 and 0, for the first and second depthwise convolutions, respectively.
In the end, three fully connected layers are responsible for learning high-level features and performing the final classification. Similarly to the DeepMorphLeNet, the first two layers have 512 neurons each (with ReLUs [33] as the activation function), while the last one has a number of neurons equal to the number of classes of the training dataset (with softmax as the activation function).
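The kernel, stride, and padding settings listed above determine the spatial size of each feature map through the usual convolution arithmetic. The helper below is a generic sketch of that formula; the function name and the numeric examples (including the 224-pixel input and the 11-wide kernel) are hypothetical and are not taken from the architectures described here.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Pointwise convolution (kernel 1, stride 1, no padding) keeps the size.
print(conv_output_size(224, kernel=1, stride=1, padding=0))

# Hypothetical example: an 11-wide kernel with stride 1 and padding 5
# also preserves the input size.
print(conv_output_size(224, kernel=11, stride=1, padding=5))

# Hypothetical example with stride 2: a 3-wide kernel and no padding.
print(conv_output_size(13, kernel=3, stride=2, padding=0))
```

Since every processing unit applies two depthwise convolutions, the formula is applied twice per layer, once for each convolution with its own stride and padding.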
V Experimental Setup
In this section, we present the experimental setup. Section V-A presents the datasets employed in this work. Baselines are described in Section V-B, while the experimental protocol is introduced in Section V-C.
V-A Datasets
Four datasets were employed to validate the proposed DeepMorphNets. Two synthetic ones were designed exclusively to check the feature learning of the proposed technique. Then, to verify the potential of DeepMorphNets in practice, two other image classification datasets were selected: UCMerced Land-use [35] and WHU-RS19 [36]. It is important to highlight that the images of these last two datasets are resized in order to fit the requirements of the proposed architectures.
V-A1 Synthetic Datasets
As introduced, two simple image classification datasets were designed in this work to validate the feature learning process of the proposed DeepMorphNets. To allow such validation, these datasets were created so that it is possible to define, a priori, the optimal structuring element (i.e., the SE that would produce the best results) for a classification scenario. Hence, the validation is performed by comparing the learned structuring element with the optimal one: if both SEs are similar, then the proposed technique is able to perform the feature learning step well.
Specifically, both datasets are composed of 1,000 grayscale images with a resolution of pixels (a common image size employed in well-known architectures such as AlexNet [4]) equally divided into two classes.
The first dataset has two classes of squares: (i) a small one, with pixels, and (ii) a large one, with pixels. In this case, a simple opening with a square structuring element larger than but smaller than should erode the small squares while keeping the larger ones, allowing the network to perfectly classify the dataset.
The second synthetic dataset is more difficult and has two classes of rectangles. The first class has shapes of pixels while the other one has rectangles of . This case is a little more complicated because the network should learn a structuring element based on the orientation of one of the rectangles. Particularly, it is possible to perfectly classify this dataset using a single opening operation with one of the following types of SEs: (i) a rectangle of at least 7 pixels of width and height larger than 3 but smaller than 7 pixels, which would erode the first class of rectangles and preserve the second one, or (ii) a rectangle with a width larger than 3 but smaller than 7 pixels and height larger than 7 pixels, which would erode the second class of rectangles while keeping the first one.
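The reasoning behind the square dataset can be reproduced with a small sketch: an opening with a square SE whose side lies between the two class sizes removes the small squares and leaves the large ones intact. The image size (32), the square sides (5 and 9), and the SE side (7) below are hypothetical values chosen for illustration, and the morphology is the textbook flat version rather than the learned one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _morph(img, se, fill, reduce_fn):
    ph, pw = se.shape[0] // 2, se.shape[1] // 2
    padded = np.pad(np.asarray(img, dtype=float), ((ph, ph), (pw, pw)),
                    constant_values=fill)
    windows = sliding_window_view(padded, se.shape)
    return reduce_fn(np.where(se.astype(bool), windows, fill), axis=(-2, -1))

def erode(img, se):
    return _morph(img, se, np.inf, np.min)

def dilate(img, se):
    return _morph(img, se[::-1, ::-1], -np.inf, np.max)

def opening(img, se):
    return dilate(erode(img, se), se)

def square_image(side, size=32, top_left=(10, 10)):
    # Grayscale image with a single white square on a black background.
    img = np.zeros((size, size))
    row, col = top_left
    img[row:row + side, col:col + side] = 1.0
    return img

small, large = square_image(5), square_image(9)  # hypothetical class sizes
se = np.ones((7, 7))                             # side between 5 and 9

opened_small, opened_large = opening(small, se), opening(large, se)
print(opened_small.sum(), np.array_equal(opened_large, large))
```

After the opening, a trivial rule such as "the output contains any non-zero pixel" separates the two classes, which is why a single opening neuron suffices for this dataset.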
V-A2 UCMerced Land-use Dataset
This publicly available dataset [35] is composed of 2,100 aerial scene images, each one with pixels and 0.3-meter resolution per pixel. These images, obtained from different US locations, were manually classified into 21 classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. As can be noticed, this dataset has highly overlapping classes, such as the dense, medium, and sparse residential classes, which mainly differ in the density of structures. Samples of these and other classes are shown in Figure 8.
V-A3 WHU-RS19 Dataset
This public dataset [36] contains 1,005 high-resolution images with pixels divided into 19 classes (approximately 50 images per class), including: airport, beach, bridge, river, forest, meadow, pond, parking, port, viaduct, residential area, industrial area, commercial area, desert, farmland, football field, mountain, park and railway station. Exported from Google Earth, which provides high-resolution satellite images of up to half a meter, this dataset has samples collected from different regions all around the world, which increases its diversity but creates challenges due to the changes in resolution, scale, orientation, and illumination of the images. Figure 9 presents examples of some classes.
V-B Baselines
For all datasets, two baselines were used. The first baseline is a standard convolutional version of the proposed DeepMorphNet. Precisely, this baseline recreates the exact morphological architecture using traditional convolutional layers (instead of depthwise and pointwise convolutions) while preserving all remaining configurations (such as filter sizes, padding, and stride). Furthermore, differently from the morphological networks, this baseline makes use of max-pooling layers between the convolutions, which makes it very similar to the traditional architectures of the literature [34, 4]. The second baseline, referenced hereafter with the prefix “Depth”, reproduces the same DeepMorphNet architecture using only depthwise and pointwise convolutions (i.e., depthwise separable convolutions [19]), without binary weights and depthwise pooling.
Differently from the morphological networks, both baselines use Rectified Linear Units (ReLUs) [33] as activation functions and batch normalization [37] (after each convolution). It is important to note that, despite the differences, both baselines have exactly the same number of layers and feature maps as the base DeepMorphNet. We believe that this allows a fair comparison between the models, given that the representation potential of the networks is roughly the same.
V-C Experimental Protocol
For synthetic datasets, a simple protocol was employed. Particularly, in this case, the whole dataset is randomly divided into three sets: training (composed of 60% of the instances), validation and test (each one composed of 20% of the dataset samples). Once determined, these sets are used throughout the experiments for all networks and baselines. Results of this protocol are reported in terms of the average accuracy of the test set.
For the other datasets, five-fold cross-validation was conducted to assess the accuracy of the proposed algorithm. In this protocol, the dataset is split into five folds of almost the same size, each one balanced according to the number of images per class, giving diversity to each set. Those folds are processed in five different runs where, at each run, three distinct folds are used as the training set, one as the validation set (used to tune the parameters of the network), and the remaining one as the test set. The final results are the mean of the average accuracy (for the test set) over the five runs, followed by its corresponding standard deviation.
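The fold rotation of this protocol can be sketched as follows. The code is an illustration only: the function names are hypothetical, the choice of which fold serves as validation in each run is an assumption (the text only fixes the 3/1/1 proportion), and the label vector assumes 21 classes with 100 images each.

```python
import numpy as np

def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds with balanced classes."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(labels):
        shuffled = rng.permutation(np.flatnonzero(labels == cls))
        for position, sample in enumerate(shuffled):
            folds[position % n_folds].append(int(sample))
    return [np.array(sorted(fold)) for fold in folds]

def fold_rotation(n_folds=5):
    """Yield (train folds, validation fold, test fold) for each run."""
    for run in range(n_folds):
        test = run
        validation = (run + 1) % n_folds   # assumed choice of validation fold
        train = [f for f in range(n_folds) if f not in (test, validation)]
        yield train, validation, test

labels = np.repeat(np.arange(21), 100)     # assumed: 21 classes, 100 images each
folds = stratified_folds(labels)
runs = list(fold_rotation())

for train, validation, test in runs:
    print("train folds:", train, "validation:", validation, "test:", test)
```

Each fold is used exactly once as the test set, so the reported mean and standard deviation are computed over five test accuracies.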
All networks proposed in this work were implemented using Torch (a scientific computing framework with wide support for machine learning algorithms, available at http://torch.ch/ as of May 2019). This framework was chosen for its support for parallel programming using CUDA, NVIDIA's parallel computing platform for Graphics Processing Units. All experiments were performed on a 64-bit Intel i7 5930K machine with a 3.5GHz clock, 64GB of RAM, and a GeForce GTX Titan X Pascal with 12GB of memory, under CUDA 9.0. Ubuntu 18.04.1 LTS was used as the operating system.
VI Results and Discussion
In this section, we present and discuss the obtained outcomes. Section VI-A presents the results for the synthetic datasets, while Section VI-B discusses the experiments over the image classification datasets.
VI-A Synthetic Datasets
As explained in Section V-A1, two synthetic datasets were proposed in this work to validate the feature learning of the deep morphological networks. Furthermore, as introduced, both datasets can be perfectly classified using one opening with specific structuring elements. Therefore, the basic DeepMorphSynNet (Figure (a)), composed of one opening neuron, can be used to validate the feature learning process of the proposed technique, given that this network has the capacity to perfectly classify the datasets as long as it successfully learns the SE.
Given the simplicity of these datasets, aside from the methods described in Section V-B, we also employed as a baseline a basic architecture composed solely of a classification layer. Specifically, this network has one layer that receives a linearized version of the input data and outputs the classification. The proposed morphological network, as well as the baselines, were tested on both synthetic datasets using the same configuration, i.e., learning rate, weight decay, momentum, and number of epochs of 0.01, 0.0005, 0.9, and 10, respectively.
Results for the synthetic square dataset are presented in Table I. Among the baselines, the worst results were generated by the ConvNets, while the best outcome was produced by the network composed of a single classification layer (86.50%). A reason for this is that the proposed dataset does not have much visual information to be extracted by the convolution layers. Hence, in this case, the raw pixels themselves are able to provide relevant information (such as the total amount of pixels of the square) for the classification. However, the result yielded by the best baseline (the network composed of the classification layer) was worse than the result generated by the proposed morphological network. Precisely, the DeepMorphSynNet yielded 100% average accuracy, perfectly classifying the whole test set of this synthetic dataset. As introduced in Section V-A1, in order to achieve this perfect classification, the opening would require a square structuring element larger than but smaller than pixels. As presented in Figure (a), this was exactly the structuring element learned by the network. Moreover, as introduced, with this SE, the opening would erode the small squares while keeping the larger ones. This was the exact outcome of the morphological network, as presented in Figure 11.
Table I: Results for the synthetic square dataset.
Method | Average Accuracy (%)
Classification Layer | 86.50
ConvNet | 58.00
DepthConvNet | 60.00
DeepMorphSynNet | 100.00
Results for the synthetic rectangle dataset are presented in Table II. Differently from the synthetic square dataset, the network composed of a single classification layer produced the worst outcome, while the convolutional architectures yielded perfect results (100%). This difference may be explained by the fact that this is a more complex dataset, with two classes of equal shapes that differ in other relevant properties (such as the orientation) that may be extracted by the convolution layers. Also different from the previous outcomes, in this case the proposed DeepMorphSynNet and the best baselines produced the same results (100% average accuracy), perfectly classifying this synthetic dataset. As introduced in Section V-A1, to perform this perfect classification, the opening operation (of the DeepMorphSynNet) would require a specific SE with the same orientation as one of the rectangles. As presented in Figure (b), this is the SE learned by the morphological network. With such a filter, the opening operation would erode one type of rectangle while keeping the other, which is the exact outcome presented in Figure 12.
Table II: Results for the synthetic rectangle dataset.
Method | Average Accuracy (%)
Classification Layer | 52.00
ConvNet | 100.00
DepthConvNet | 100.00
DeepMorphSynNet | 100.00
Results obtained with the synthetic datasets show that the proposed morphological networks are able to optimize and learn interesting structuring elements. Furthermore, in some scenarios, such as those simulated by the synthetic datasets, the DeepMorphNets have achieved promising results. A better analysis of the potential of the proposed technique is performed in the next section using real (not synthetic) datasets.
VI-B Image Classification Datasets
For the image classification datasets, aside from the methods described in Section V-B, we also employed as a baseline an approach, called hereafter Static SEs, that reproduces exactly the morphological architectures but with static (non-optimized) structuring elements. In this case, each neuron has a configuration based on the 5 most common structuring elements (presented in Figure 2). The features extracted by these static neurons are classified by a Support Vector Machine (SVM). The idea behind this baseline is to have a lower bound for the morphological network, given that the proposed approach should be able to learn better structuring elements and, consequently, produce superior results.
Aside from this, for both datasets, all networks were tested using essentially the same configuration, i.e., batch size, learning rate, weight decay, and momentum of 16, 0.01, 0.0005, and 0.9, respectively. The only difference in the experiments is the number of epochs. Precisely, the LeNet-based architectures were trained for 500 epochs, while the AlexNet-based networks were trained for 2,000 epochs.
VI-B1 UCMerced Land-use Dataset
Results for the UCMerced Land-use dataset are reported in Table III. In this case, all networks outperformed their respective lower bounds (generated by the Static SEs), an expected outcome given the feature learning step performed by the deep learning-based approaches. Considering the LeNet-based networks, the best result among the baselines was produced by the architecture based on depthwise separable convolutions [19], i.e., the DepthLeNet (54.81 ± 1.25% of average accuracy). The proposed DeepMorphLeNet produced similar results when compared to this baseline (56.52 ± 1.74%), which shows the potential of the proposed technique, which optimizes the morphological filters to extract salient and relevant features.
The exact same conclusions can be drawn from the AlexNet-based networks. Specifically, the best baseline was the DepthAlexNet-based network, which produced 73.14 ± 1.43% of average accuracy. However, again, the proposed DeepMorphAlexNet yielded competitive results when compared to this baseline (76.86 ± 1.97% of average accuracy). These results show that morphological operations are able to learn useful features. Some of the feature maps of the AlexNet-based architectures are presented in Figure 14. Note the difference between the characteristics learned by the distinct networks. In general, the morphological network is able to preserve different features when compared to the ConvNets.
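Table III: Results for the UCMerced Land-use dataset.
Method | Average Accuracy (%) | Number of Parameters |
Static SEs | 19.46 ± 2.06 | - | -
LeNet [34] | 53.29 ± 0.86 | 4.42 | 1.2
DepthLeNet | 54.81 ± 1.25 | 5.04 | 6.2
DeepMorphLeNet (ours) | 56.52 ± 1.74 | 6.04 | 7.8
Static SEs | 28.21 ± 2.64 | - | -
AlexNet-based [4] | 72.62 ± 1.05 | 6.50 | 8.5
DepthAlexNet-based | 73.14 ± 1.43 | 7.47 | 101.1
DeepMorphAlexNet (ours) | 76.86 ± 1.97 | 10.50 | 209.9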
In order to better evaluate the proposed morphological network, a convergence analysis for the UCMerced Land-use dataset is presented in Figure (a). Note that the DeepMorphNets are slower to converge compared to the other networks. A reason for that is the number of trainable parameters. As presented in Table III, the DeepMorphNets have more parameters and, therefore, are more complex to train. However, given enough training time, all networks converge very similarly, which indicates that the proposed DeepMorphNets are able to effectively learn interesting SEs and converge to a suitable solution.


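[Figure 14: feature maps of the AlexNet-based architectures for the UCMerced Land-use dataset. Columns: Input Image, Layer 1, Layer 2, Layer 3, Layer 4, Layer 5.]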
VI-B2 WHU-RS19 Dataset
Table IV presents the results for the WHU-RS19 dataset. Again, as expected, all architectures outperformed their respective lower bounds (generated by the Static SEs). Considering the LeNet-based networks, the best result among the baselines was produced by the LeNet architecture [34]. The proposed DeepMorphLeNet generated competitive results when compared to this baseline, corroborating the previous conclusions about the ability of the proposed technique to optimize the morphological filters and extract important features.
As for the UCMerced Land-use dataset, the exact same conclusions can be drawn from the AlexNet-based networks. In this case, the best baseline was the AlexNet-based network, which produced 64.38 ± 2.93% of average accuracy. However, the proposed DeepMorphAlexNet produced similar results when compared to this baseline (68.20 ± 2.75% of average accuracy). These results reaffirm the previous conclusions related to the ability of the morphological networks to capture interesting information. Figure 15 presents some feature maps learned by the AlexNet-based networks for the WHU-RS19 dataset. Again, the difference between the features extracted by the different architectures is noticeable. Overall, the DeepMorphNet is capable of learning specific and distinct features when compared to the ConvNets.
As for the previous dataset, a convergence analysis for the WHU-RS19 dataset is presented in Figure (b). Again, although the DeepMorphNets are slower to converge (mainly due to the number of trainable parameters, as presented in Table IV), they are able to achieve similar results if enough time is provided for training the model, corroborating the previous conclusions.
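Table IV: Results for the WHU-RS19 dataset.
Method | Average Accuracy (%) | Number of Parameters |
Static SEs | 15.17 ± 2.83 | - | -
LeNet [34] | 48.26 ± 2.01 | 4.42 | 0.6
DepthLeNet | 47.19 ± 2.43 | 5.04 | 3.1
DeepMorphLeNet (ours) | 52.91 ± 2.60 | 6.04 | 3.7
Static SEs | 25.33 ± 2.95 | - | -
AlexNet-based [4] | 64.38 ± 2.93 | 6.50 | 4.7
DepthAlexNet-based | 63.27 ± 2.14 | 7.47 | 44.7
DeepMorphAlexNet (ours) | 68.20 ± 2.75 | 10.50 | 99.8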
[Figure 15: feature maps per layer. Columns: Input Image, Layer 1, Layer 2, Layer 3, Layer 4, Layer 5]
VII Conclusion
In this paper, we proposed a novel method for deep feature learning, called Deep Morphological Network (DeepMorphNet), that performs morphological operations while optimizing their structuring elements toward a better solution. The proposed DeepMorphNet is composed of morphological layers built on a framework that essentially consists of depthwise convolutions (which process the input with decomposed binary weights) and pooling layers. This framework supports the creation of the basic morphological layers that perform erosion and dilation. These basic layers, in turn, allow the creation of more complex layers, which may perform opening, closing, (black and white) top-hat, and (an approximation of) geodesic reconstruction (by erosion or dilation). The proposed approach is trained end-to-end using standard algorithms employed in deep learning-based networks, including backpropagation and Stochastic Gradient Descent (SGD) [1].
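The way the compound operators are built from erosion and dilation can be illustrated with a minimal NumPy sketch. It uses classical flat grayscale morphology with a fixed binary structuring element; it is not the depthwise-convolution formulation of DeepMorphNet and it does not learn the structuring element. The function names, the edge padding, and the restriction to a square, odd-sized, symmetric structuring element are assumptions made for illustration.

```python
import numpy as np

def _stack_neighbors(img, se):
    """Stack the shifted copies of `img` selected by the binary mask `se`."""
    k = se.shape[0]              # assumes a square, odd-sized structuring element
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    return np.stack([
        padded[i:i + h, j:j + w]
        for i in range(k) for j in range(k) if se[i, j]
    ])

def erosion(img, se):
    # Minimum over the neighborhood defined by the structuring element.
    return _stack_neighbors(img, se).min(axis=0)

def dilation(img, se):
    # Maximum over the neighborhood (se assumed symmetric, so no reflection).
    return _stack_neighbors(img, se).max(axis=0)

def opening(img, se):
    return dilation(erosion(img, se), se)

def closing(img, se):
    return erosion(dilation(img, se), se)

def white_tophat(img, se):
    return img - opening(img, se)

def black_tophat(img, se):
    return closing(img, se) - img

# Toy input: one bright pixel on a dark background, full 3x3 structuring element.
img = np.zeros((5, 5))
img[2, 2] = 1.0
se = np.ones((3, 3), dtype=bool)

print(dilation(img, se))      # the bright pixel grows into a 3x3 block
print(white_tophat(img, se))  # opening removes the pixel, so the top-hat keeps it
```

In this toy case the bright pixel is smaller than the structuring element, so erosion and opening remove it while the white top-hat recovers it, which is the kind of small-object response that motivates these operators.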
Experiments were first conducted on two synthetic datasets in order to analyze the feature learning of the proposed technique as well as its efficiency. Results on these datasets have shown that the proposed DeepMorphNets are able to learn relevant structuring elements. In fact, the method could learn the expected (a priori) structuring element. Furthermore, the proposed approach achieved perfect classification on both datasets, matching or outperforming the ConvNets. This result shows the potential of DeepMorphNets, which are able to learn important filters that are also different from those learned by the ConvNets.
After this first analysis, the DeepMorphNet was analyzed using two remote sensing image classification datasets: (i) UCMerced Land-use Dataset [35], composed of aerial high-resolution scenes in the visible spectrum, and (ii) WHU-RS19 Dataset [36], composed of high-resolution, very distinct scenes. For both datasets, the DeepMorphNets produced competitive results when compared to the convolutional networks. This outcome corroborates the previous conclusion that the morphological networks are capable of learning relevant filters.
The presented conclusions open some opportunities towards a better use and optimization of morphological operations. In the future, we plan to test larger networks, better analyze the combination of DeepMorphNets and ConvNets, optimize the use of trainable parameters, and test DeepMorphNets in different applications.
References
 [1] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, Cambridge, 2016, vol. 1.
 [2] R. M. Kumar and K. Sreekumar, “A survey on image feature descriptors,” International Journal of Computer Science and Information Technologies, vol. 5, pp. 7668–7673, 2014.
 [3] J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in IEEE International Conference on Computer Vision. IEEE, 2003, pp. 1470–1477.

 [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1106–1114.
 [5] K. Nogueira, O. A. Penatti, and J. A. dos Santos, “Towards better exploiting convolutional neural networks for remote sensing scene classification,” Pattern Recognition, vol. 61, pp. 539–556, 2017.
 [6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE/CVF Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
 [7] K. Nogueira, M. Dalla Mura, J. Chanussot, W. R. Schwartz, and J. A. dos Santos, “Learning to semantically segment high-resolution remote sensing images,” in International Conference on Pattern Recognition. IEEE, 2016, pp. 3566–3571.
 [8] G. Li and Y. Yu, “Visual saliency detection based on multiscale deep cnn features,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5012–5024, 2016.
 [9] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
 [10] H. Lee and H. Kwon, “Going deeper with contextual cnn for hyperspectral image classification,” IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 4843–4855, 2017.
 [11] M. Zhang, W. Li, and Q. Du, “Diverse region-based cnn for hyperspectral image classification,” IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 2623–2634, 2018.
 [12] K. Nogueira, M. Dalla Mura, J. Chanussot, W. R. Schwartz, and J. A. dos Santos, “Dynamic multicontext segmentation of remote sensing images based on convolutional networks,” IEEE Transactions on Geoscience and Remote Sensing, 2019, (to appear).
 [13] H. Tanizaki, Nonlinear filters: estimation and applications. Springer Science & Business Media, 2013.
 [14] J. Serra and P. Soille, Mathematical morphology and its applications to image processing. Springer Science & Business Media, 2012, vol. 2.
 [15] M. Fauvel, J. Chanussot, J. A. Benediktsson, and J. R. Sveinsson, “Spectral and spatial classification of hyperspectral data using svms and morphological profiles,” in IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2007, pp. 4834–4837.
 [16] J. Xia, M. Dalla Mura, J. Chanussot, P. Du, and X. He, “Random subspace ensembles for hyperspectral image classification with extended morphological attribute profiles,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 9, pp. 4768–4786, 2015.
 [17] Y. Kimori, K. Hikino, M. Nishimura, and S. Mano, “Quantifying morphological features of actin cytoskeletal filaments in plant cells based on mathematical morphology,” Journal of theoretical biology, vol. 389, pp. 123–131, 2016.
 [18] Y. Seo, B. Park, S.-C. Yoon, K. C. Lawrence, and G. R. Gamble, “Morphological image analysis for foodborne bacteria classification,” Transactions of the American Society of Agricultural and Biological Engineers, vol. 61, pp. 5–13, 2018.
 [19] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” arXiv preprint arXiv:1610.02357, 2017.
 [20] Z. Yuqian, G. Weihua, C. Zhencheng, T. Jingtian, and L. LingYun, “Medical images edge detection based on mathematical morphology,” in IEEE International Conference of the Engineering in Medicine and Biology Society. IEEE, 2006, pp. 6492–6495.
 [21] M. S. Miri and A. Mahloojifar, “Retinal image analysis using curvelet transform and multistructure elements morphology by reconstruction,” IEEE Transactions on Biomedical Engineering, vol. 58, no. 5, pp. 1183–1192, 2011.
 [22] J. Masci, J. Angulo, and J. Schmidhuber, “A learning framework for morphological operators using counter–harmonic mean,” in International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing. Springer, 2013, pp. 329–340.
 [23] D. Mellouli, T. M. Hamdani, M. B. Ayed, and A. M. Alimi, “Morphcnn: A morphological convolutional neural network for image classification,” in International Conference on Neural Information Processing. Springer, 2017, pp. 110–117.
 [24] S. R. Borra, G. J. Reddy, and E. S. Reddy, “Classification of fingerprint images with the aid of morphological operation and agnn classifier,” Applied computing and informatics, vol. 14, no. 2, pp. 166–176, 2018.
 [25] Y. Gu, T. Liu, X. Jia, J. A. Benediktsson, and J. Chanussot, “Nonlinear multiple kernel learning with multiple-structure-element extended morphological profiles for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 6, pp. 3235–3247, 2016.
 [26] E. Aptoula, M. C. Ozdemir, and B. Yanikoglu, “Deep learning with attribute profiles for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 12, pp. 1970–1974, 2016.
 [27] P. Kalshetti, M. Bundele, P. Rahangdale, D. Jangra, C. Chattopadhyay, G. Harit, and A. Elhence, “An interactive medical image segmentation framework using iterative refinement,” Computers in biology and medicine, vol. 83, pp. 22–33, 2017.
 [28] A. Wang, X. He, P. Ghamisi, and Y. Chen, “Lidar data classification using morphological profiles and convolutional neural networks,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 774–778, 2018.
 [29] F. Mirzapour and H. Ghassemian, “Improving hyperspectral image classification by combining spectral, texture, and shape features,” International Journal of Remote Sensing, vol. 36, no. 4, pp. 1070–1096, 2015.
 [30] T. Liu, Y. Gu, J. Chanussot, and M. Dalla Mura, “Multimorphological superpixel model for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 12, pp. 6950–6963, 2017.
 [31] P. S. Bullen, Handbook of means and their inequalities. Springer Science & Business Media, 2013, vol. 560.
 [32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE/CVF Computer Vision and Pattern Recognition, 2016, pp. 770–778.

 [33] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML10), 2010, pp. 807–814.
 [34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [35] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS), 2010.
 [36] G.-S. Xia, W. Yang, J. Delon, Y. Gousseau, H. Sun, and H. Maître, “Structural high-resolution satellite image indexing,” in ISPRS TC VII Symposium - 100 Years ISPRS, vol. 38, 2010, pp. 298–303.
 [37] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.