During the previous era of high interest in artificial neural networks, in the late 1980s, morphological networks [davidson1990theory, wilson1989morphological] were introduced as an alternative to their linear counterparts. In a nutshell, they replace the elementary neural operation, usually a scalar product, with dilations, erosions and more general rank filters. Besides its practical achievements, which reached state-of-the-art results at the time [pessoa2000neural], this line of research questioned the main practices in the field of artificial neural networks on crucial topics such as architectures and optimization methods, thereby improving their understanding.
After the huge technological leap taken by deep neural networks in the last decade, understanding them remains challenging, and insights can still be gained by testing alternatives to the most popular practices. However, since the emergence of highly efficient deep learning frameworks (Caffe, Torch, TensorFlow, etc.), morphological neural networks have rarely been re-investigated [Maxplus, mondal2019dense]. Motivated by the promising results in [Maxplus], this paper proposes to extend and deepen the study of morphological neural networks with Max-plus layers. More precisely, we aim to validate and exploit a specific property, namely that a Max-plus layer (i.e. a dilation layer) following a conventional linear layer tends to select a reduced number of filters from the latter, making the others useless.
Our findings and contributions are three-fold. First, we propose an efficient training framework for morphological neural networks with Max-plus blocks, and theoretically demonstrate that they are universal function approximators under mild conditions. Secondly, we perform extensive experiments and visualization analysis to validate the robustness and effectiveness of the filter selection property provided by Max-plus blocks. Thirdly, we successfully apply the resulting Max-plus blocks to the task of model pruning on different neural network architectures. Related work is briefly summarized in Section 2. We then show how to introduce Max-plus operators in neural networks in Section 3. Experimental results are discussed in Section 4.
2 Related Work
Morphological neural networks were defined almost simultaneously in two different ways at the end of the 1980s [davidson1990theory, wilson1989morphological]. Davidson [davidson1990theory] introduced neural units that can be seen as pure dilations or erosions, whereas Wilson [wilson1989morphological] focused on a more general formulation based on rank filters, of which the $\max$ and $\min$ operators are two particular cases. Davidson’s definition interprets morphological neurons as bounding boxes in the feature space [Ritter96, Ritter03, Sussner11]. In the latter studies, networks were trained to perform perfectly on training sets after a few iterations, but little attention was paid to generalization. Only recently was a backpropagation-based algorithm adopted, which improved on constructive ones [zamora2017dendrite]. Still, the “bounding-box” approach does not seem to generalize well to test sets when faced with high-dimensional problems like image analysis.
Wilson’s idea, on the other hand, inspired hybrid linear/rank filter architectures, which were trained by gradient descent [Pessoa98] and backpropagation [pessoa2000neural]. In this case, the geometrical interpretation of decision surfaces becomes much richer, and the resulting framework was successfully applied to an image classification problem. The previously mentioned study [Maxplus] is one of the latest in this area, introducing a hybrid architecture that experimentally exhibits an interesting property for network pruning. In the classification experiment of [Maxplus], each Max-plus unit shows, after training, one or two large weight values compared to the others. At the inference stage, this non-uniform distribution of weight values induces a selection of important filters in the previous layer, whereas the other filters (the majority) are no longer used in the subsequent classification task. Therefore, after removal of the redundant filters, the network behaves exactly as it did before pruning. In this paper, we focus on the exploration of this property and show that it becomes more stable and effective provided that a proper regularization is applied.
Model pruning refers to a family of algorithms that exploit the redundancy in model parameters and remove unimportant ones without compromising performance. The rapidly increasing computational and storage requirements of deep neural networks in recent years have largely motivated research interest in this domain. Early work on model pruning dates back to [hassibi1993, lecun1990], which prune model parameters based on the Hessian of the loss function. More recently, the model pruning problem was addressed by first dropping the neuronal connections with insignificant weight values and then fine-tuning the resulting sparse network [han2015a]. This technique was later integrated into the Deep Compression framework [han2015b] to achieve an even higher compression rate. The HashNets model [chen2015] randomly groups parameters into hash buckets using a low-cost hash function and performs model compression via parameter sharing. While these methods achieved impressive compression rates, one notable disadvantage of such non-structured pruning algorithms lies in the sparsity of the resulting weight matrices, which cannot lead to speedup without dedicated hardware or libraries.
Various structured pruning methods were proposed to overcome this limitation in practical applications by pruning at the level of channels or even layers. Filters with smaller norms were pruned based on a predefined pruning ratio for each layer in [li2016]. Model pruning was formulated as an optimization problem in [luo2017], where the channels to remove were determined by minimizing the reconstruction error of the next layer. A branch of algorithms in this category employs regularization to induce shrinkage and sparsity in model parameters. Sparsity constraints were imposed in [liu2017] on channel-wise scaling factors and pruning was based on their magnitude, while in [wen2016] group sparsity was leveraged to learn compact CNNs via a combination of $\ell_1$ and $\ell_2$ regularization. One minor drawback of these regularization-based methods is that the training phase generally requires more iterations to converge. Our approach also falls into the structured pruning category and thus no dedicated hardware or libraries are required to achieve speedup, yet no regularization is imposed during model training. Moreover, in contrast to most existing pruning algorithms, our method does not need fine-tuning to regain performance.
3 Max-plus Operator as a Morphological Unit
3.1 Morphological Perceptron
In the traditional literature on machine learning and neural networks, a perceptron [MLP] is defined as a linear computational unit, possibly followed by a non-linear activation function. Among all popular choices of activation functions, such as the logistic function, the hyperbolic tangent function and the rectified linear unit (ReLU) function, ReLU [ReLU] generally achieves better performance due to its simple formulation and non-saturating property. Instead of multiplication and addition, the morphological perceptron employs addition and maximum, which results in a non-linear computational unit. A simplified version [Maxplus] of the initial formulation [davidson1990theory, Ritter96] is defined as follows.
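Given an input vector $\mathbf{x} \in \mathbb{R}_{\max}^n$ (with $\mathbb{R}_{\max} = \mathbb{R} \cup \{-\infty\}$), a weight vector $\mathbf{w} \in \mathbb{R}_{\max}^n$, and a bias $b \in \mathbb{R}_{\max}$, the morphological perceptron computes its activation as:
$$a(\mathbf{x}) = \max\left\{ b,\ \max_{i \in \{1,\dots,n\}} \{ x_i + w_i \} \right\} \qquad (1)$$
where $x_i$ (resp. $w_i$) denotes the $i$-th component of $\mathbf{x}$ (resp. $\mathbf{w}$).
This model may also be referred to as the $(\max,+)$ perceptron since it relies on the $(\max,+)$ semi-ring with underlying set $\mathbb{R}_{\max}$. It is a dilation on the complete lattice $(\mathbb{R}_{\max} \cup \{+\infty\})^n$ with the Pareto ordering.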
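As a minimal sketch of the unit defined above (the function name `morphological_perceptron` and the example values are illustrative assumptions, not taken from [Maxplus]), the activation can be computed in a few lines of plain Python. The last lines check, on one pair of inputs only, that the unit commutes with the component-wise maximum, which is the defining property of a dilation.

```python
import math

NEG_INF = -math.inf  # neutral element of max, i.e. "no contribution"


def morphological_perceptron(x, w, b=NEG_INF):
    """Return max(b, max_i(x[i] + w[i])) for equal-length sequences x and w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return max([b] + [xi + wi for xi, wi in zip(x, w)])


# Hypothetical example values, chosen so that all sums are exact in floating point.
x = [1.0, 4.0, -2.0]
w = [0.0, -3.0, 2.0]
a_no_bias = morphological_perceptron(x, w)            # maximum of the sums 1.0, 1.0, 0.0
a_with_bias = morphological_perceptron(x, w, b=2.5)   # the bias exceeds every sum

# Dilation check on one pair of inputs: the unit commutes with the component-wise max.
x2 = [0.0, 5.0, -1.0]
sup_x = [max(u, v) for u, v in zip(x, x2)]
lhs = morphological_perceptron(sup_x, w)
rhs = max(morphological_perceptron(x, w), morphological_perceptron(x2, w))
```

Using $-\infty$ as the default bias makes the bias term inactive, which matches the bias-free variant used later in the Max-plus block.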
3.2 Max-plus Block
Based on the formulation of the morphological perceptron, we define the Max-plus block as a standalone module that combines a fully-connected layer (or convolutional layer) with a Max-plus layer [Maxplus]. Let us denote the input vector of the fully-connected layer, and the input and output vectors of the Max-plus layer, by $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{z}$ respectively, whose components are indexed by $i \in \{1,\dots,I\}$, $j \in \{1,\dots,J\}$ and $k \in \{1,\dots,K\}$, respectively (this formulation can be easily generalized to the case of convolutional layers). The corresponding weight matrices are denoted by $\mathbf{w}^f \in \mathbb{R}^{I \times J}$ and $\mathbf{w}^m \in (\mathbb{R} \cup \{-\infty\})^{J \times K}$. Then the operation performed in this Max-plus block is (see Figure 1):
$$y_j = \sum_{i=1}^{I} x_i\, w^f_{ij}, \qquad z_k = \max_{j \in \{1,\dots,J\}} \{ y_j + w^m_{jk} \} \qquad (2)$$
Note that the bias vector of the fully-connected layer (or convolutional layer) is removed in our formulation, since its effect overlaps with that of the weight matrix of the Max-plus layer. In addition, the bias vector of the Max-plus layer was found to be ineffective in practice and is therefore not used here.
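The forward pass of such a block can be sketched as follows. This is an illustrative plain-Python version under the assumption that the fully-connected weights are stored as an I x J nested list and the Max-plus weights as a J x K nested list; the function name `maxplus_block` and the numeric values are hypothetical, and the experiments reported in this paper rely on TensorFlow rather than on this code.

```python
def maxplus_block(x, w_f, w_m):
    """Bias-free Max-plus block.

    x   : list of I input values
    w_f : I x J nested list (fully-connected weights)
    w_m : J x K nested list (Max-plus weights)
    Returns (y, z), the fully-connected activations and the block outputs.
    """
    num_inputs, num_filters, num_units = len(w_f), len(w_f[0]), len(w_m[0])
    # Linear layer without bias: y_j = sum_i x_i * w_f[i][j]
    y = [sum(x[i] * w_f[i][j] for i in range(num_inputs)) for j in range(num_filters)]
    # Max-plus layer without bias: z_k = max_j (y_j + w_m[j][k])
    z = [max(y[j] + w_m[j][k] for j in range(num_filters)) for k in range(num_units)]
    return y, z


# Hypothetical toy dimensions: I = 2 inputs, J = 3 linear filters, K = 2 outputs.
x = [1.0, 2.0]
w_f = [[1.0, 0.0, -1.0],
       [0.0, 1.0, 1.0]]
w_m = [[0.0, -1.0],
       [-2.0, 1.0],
       [1.0, -3.0]]
y, z = maxplus_block(x, w_f, w_m)
```

Each output unit reads all J activations but only the maximizing term determines its value, which is why a few large entries in a column of the Max-plus weights can dominate the output.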
3.3 Universal Function Approximator Property
The result presented here is very similar to the approximation theorem on Maxout networks [Maxout], based on Wang’s work [wang2004general]. (Note that the classical universal approximation theorems for neural networks, see for example [hornik1989multilayer], do not hold for networks containing max-plus units.) As shown in [Maxout], Maxout networks with enough affine components in each Maxout unit are universal function approximators. Recall that a model is called a universal function approximator if it can approximate any continuous function arbitrarily well, provided it has enough capacity. Similarly, provided that the input vector (or input feature maps) of the Max-plus layer may have arbitrarily many affine components (or affine feature maps), we show that a Max-plus model with just two output units in its Max-plus block can approximate arbitrarily well any continuous function of the input vector (or input feature maps) of the block on a compact domain. A diagram illustrating the basic idea of the proof is shown in Figure 1.
Theorem 1 (Universal function approximator). A Max-plus model with two output units in its Max-plus block can approximate arbitrarily well any continuous function of the input of the block on a compact domain.
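Proof. We provide here a sketch of the proof, which closely follows that of Theorem 4.3 in [Maxout]. By Proposition 4.2 in [Maxout], any continuous function $f$ defined on a compact domain $C \subset \mathbb{R}^n$ can be approximated arbitrarily well by a piecewise linear (PWL) continuous function $g$, composed of $k$ affine regions. By Proposition 4.1 in [Maxout], there exist two matrices $W_1, W_2 \in \mathbb{R}^{k \times n}$ and two vectors $b_1, b_2 \in \mathbb{R}^k$ such that
$$\forall \mathbf{x} \in C, \quad g(\mathbf{x}) = \max_{1 \le j \le k} \{ W_{1j}\,\mathbf{x} + b_{1j} \} - \max_{1 \le j \le k} \{ W_{2j}\,\mathbf{x} + b_{2j} \} \qquad (3)$$
where $W_{ij}$ is the $j$-th row of matrix $W_i$ and $b_{ij}$ the $j$-th coefficient of $b_i$, $i = 1, 2$. Now, Equation 3 is the output of the Max-plus block of Figure 1, provided $W_1$ and $W_2$ stacked together form the matrix of the fully connected layer, and $b_1$ and $b_2$ are the weight vectors of its two Max-plus output units. This concludes the proof.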
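To make the construction concrete, here is a small worked instance for the scalar "tent" function $g(x) = \min(x, 1 - x)$, which is piecewise linear but neither convex nor concave in general form. The sketch assumes one particular way of wiring the block: the four rows of $W_1$ and $W_2$ are stacked into a single linear layer, each Max-plus unit ignores the other unit's group through $-\infty$ weights, and the model output is the difference of the two units. These wiring details, the function names and the chosen coefficients are assumptions made for illustration and may differ from Figure 1.

```python
import math

NEG_INF = -math.inf


def maxplus_unit(y, w):
    """Single Max-plus output: max_j (y_j + w_j)."""
    return max(yj + wj for yj, wj in zip(y, w))


def tent_via_maxplus(x):
    """Evaluate g(x) = min(x, 1 - x) as a difference of two Max-plus units."""
    # Stacked slopes: rows of W_1 = (0, 0), then rows of W_2 = (-1, 1).
    w_fc = [0.0, 0.0, -1.0, 1.0]
    y = [slope * x for slope in w_fc]
    # Unit 1 carries b_1 = (0, 0) and ignores the second group.
    w_unit1 = [0.0, 0.0, NEG_INF, NEG_INF]
    # Unit 2 carries b_2 = (0, -1) and ignores the first group.
    w_unit2 = [NEG_INF, NEG_INF, 0.0, -1.0]
    # g(x) = max(0, 0) - max(-x, x - 1) = min(x, 1 - x)
    return maxplus_unit(y, w_unit1) - maxplus_unit(y, w_unit2)


samples = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [tent_via_maxplus(x) for x in samples]
```

The intercepts of the affine pieces live entirely in the Max-plus weights, which is consistent with dropping the bias of the linear layer in Section 3.2.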
4 Experiments
In this section, we present experimental results for our Max-plus blocks by integrating them into different types of neural networks. All neural network models shown in this section are implemented with the open-source machine learning library TensorFlow [TF] and trained on the benchmark datasets MNIST [CNN] or CIFAR-10 [CIFAR], depending on the model complexity and capability.
4.1 Filter Selection Property
In an attempt to reproduce and confirm the experimental results reported in [Maxplus], we first implemented a simple Max-plus model composed of a fully-connected layer with $J$ units followed by a Max-plus layer with ten units, namely a Max-plus block in our terminology, to perform image classification on the MNIST dataset. In contrast to the original formulation in [Maxplus], the two bias vectors are removed for the practical reasons explained in Section 3.2.
Table 1 summarizes the classification accuracy achieved by this simple model on the validation set of the MNIST dataset for different values of $J$ in columns 3 to 5. Note that all the experiments reported in these three columns were conducted under the same training setting (initial learning rate, learning rate decay steps, batch size, optimizer, etc.) except for parameter initialization (each column corresponds to a different random seed). The performance of the original model in [Maxplus] is included in column 2 for comparison. As shown in Table 1, given a proper initialization, our simple Max-plus model generally achieves improved performance compared to the original model. More interestingly, comparing across runs within a row, we find that the performance of this naive Max-plus model is highly sensitive to parameter initialization.
Table 1: Classification accuracy on the MNIST validation set. Column headers: units | Model [Maxplus] | Max-plus | Max-plus + dropout.
In order to gain more insight into this instability problem, we follow the approach of [Maxplus] and visualize the weight matrix $\mathbf{w}^m$ of the Max-plus layer by splitting and reshaping it into 10 gray-scale images, one per weight vector $\mathbf{w}^m_{\cdot k}$ (the weights feeding output unit $k$), each corresponding to a specific class (Figure 2, left). In addition, we also visualize, on the right-hand side of Figure 2, the ten linear filters of the fully-connected layer that correspond to the maximum value of each weight vector $\mathbf{w}^m_{\cdot k}$, whose index $j^\ast_k$ is defined as:
$$j^\ast_k = \arg\max_{j \in \{1,\dots,J\}} w^m_{jk}$$
Figure 2 shows specifically the weights of the Max-plus model reported in the fifth column and last row of Table 1. We notice that there exists a severe filter-collision problem in the Max-plus block, namely that different output units select the same linear filter in the fully-connected layer to compute their outputs. More specifically, two classes share the same index $j^\ast_k$ (highlighted by red circles) in their weight vectors and consequently select the same linear filter (the visualized filters for these two classes are the same). This collision between output units directly leads to classification confusion (because the Max-plus layer is the last layer in this simple model), since the classes that employ the same linear filter are completely indistinguishable. Furthermore, we only observed this filter-collision problem in the experiments that achieve relatively poor performance compared to the others (same model with different initialization). This observation is consistent with our hypothesis that filter collision is at the root of the lower performance.
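A collision of this kind can be detected directly from the Max-plus weights by grouping output units according to the index of their largest weight. The following is a minimal sketch under the assumption that the weights are stored as a J x K nested list (filters by output units); the helper names `selected_filters` and `find_collisions` and the toy matrix are hypothetical.

```python
from collections import defaultdict


def selected_filters(w_m):
    """For each output unit k, return the index j of its largest Max-plus weight."""
    num_filters, num_units = len(w_m), len(w_m[0])
    return [max(range(num_filters), key=lambda j: w_m[j][k]) for k in range(num_units)]


def find_collisions(w_m):
    """Map each linear filter index to the output units sharing it, if more than one."""
    units_by_filter = defaultdict(list)
    for k, j in enumerate(selected_filters(w_m)):
        units_by_filter[j].append(k)
    return {j: units for j, units in units_by_filter.items() if len(units) > 1}


# Hypothetical toy weights: 4 linear filters (rows), 3 output units (columns).
w_m = [[0.9, 0.1, 0.0],
       [0.2, 0.8, 0.7],
       [0.1, 0.3, 0.2],
       [0.0, 0.2, 0.1]]
winners = selected_filters(w_m)
collisions = find_collisions(w_m)
```

In this toy matrix the second and third output units both pick the filter with index 1, so the sketch reports them as colliding.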
In order to separate the output units that got stuck with the same linear filter, we applied dropout regularization [srivastava2014dropout] to randomly switch off the neuronal connections between the fully-connected layer and the Max-plus layer during training. Empirically, we found this approach highly effective, as the performance of the dropout-regularized Max-plus model becomes much less sensitive to parameter initialization. Table 1 shows that the Max-plus model with dropout regularization achieves both better performance and more stable results. We further validated the effect of dropout regularization through an additional experiment. For a fixed number of units, we carried out 25 runs with different initializations for each of several dropout ratios. As shown in Figure 3, dropout dramatically reduces the variability across experiments. Our interpretation of this improvement is that dropout forces each class to use more than one linear filter to represent its corresponding digit. This allows for a more general representation and limits the risk of collisions between classes. Interestingly, we observe a slight accuracy drop when the dropout ratio exceeds a certain level. This indicates that a trade-off between stability and performance needs to be found, although a large range of dropout values (between 25% and 75%) is robust to the random seed without penalizing classification accuracy.
Whereas in general (with or without dropout) linear filters that correspond to large values of $\mathbf{w}^m$ are similar to the images in Figure 2 (right), i.e. digit-like shapes with high contrast, those corresponding to smaller values in the weight matrix are noisy or low-contrast images showing the shape of a specific digit. This suggests that these filters were activated by few training examples. Figure 4 (left) shows several linear filters of this kind.
The visualization above suggests that large values in $\mathbf{w}^m$ correspond to linear filters of the fully-connected layer that contain rich information for the subsequent classification task, and hence respond strongly to the samples of a specific class. If that holds, then it is likely that only these filters achieve the maximum in (2) and contribute to the classification output. Therefore, we might be able to reduce the complexity of a model without degrading its performance by exploiting this filter selection property of Max-plus layers.
In order to further validate the stability of this connection before taking advantage of it, we also visualized the activations $\mathbf{y}$ of the fully-connected layer as gray-scale images for several training examples and compared them to the visualization of $\mathbf{w}^m$ (for example, if a training example has label $k$, then we compare its activation vector $\mathbf{y}$ with the weight vector $\mathbf{w}^m_{\cdot k}$). As shown in Figure 4 (right), there is a clear correspondence between the maximum activation value of a training example with label $k$ and the largest two or three values of the weight vector $\mathbf{w}^m_{\cdot k}$, which indicates that the linear filters corresponding to large values in the weight matrix are effectively used for the subsequent classification task.
4.2 Application to Model Pruning
Now that our approach to filter selection via Max-plus layers has been shown to be effective and stable, we formalize our model pruning strategy as follows: given a fixed threshold $s$, for each weight vector $\mathbf{w}^m_{\cdot k}$, we only keep the values that are larger than $s$ and the linear filters that correspond to these retained values. Therefore, if in total $J_s$ linear filters are kept in the pruned model, then the remaining parameters in the Max-plus layer no longer form a weight matrix but a weight vector of size $J_s$, where each entry corresponds to a linear filter in the fully-connected layer. The pruned fully-connected layer and the pruned Max-plus layer combined together perform a standard linear transformation followed by a maximum operation over uneven groups. Note that the pruning process is conducted independently for each output unit $z_k$ of the Max-plus block ($k \in \{1,\dots,K\}$), thus the number of retained linear filters in the pruned model may vary from one class to another. From now on, we shall call these selected linear filters active filters and the others non-active filters. Figure 5 shows a graphical illustration of the comparison between the original model and the pruned model.
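The pruning rule described above can be sketched in plain Python as follows. The sketch assumes that the threshold is compared against the raw weight values, that the Max-plus weights are stored as a J x K nested list, and that every output unit keeps at least one weight; the function names and toy values are hypothetical and do not reproduce the models of this paper.

```python
def prune_maxplus(w_m, s):
    """For each output unit k, keep the pairs (j, w_m[j][k]) with w_m[j][k] > s."""
    num_filters, num_units = len(w_m), len(w_m[0])
    return [
        [(j, w_m[j][k]) for j in range(num_filters) if w_m[j][k] > s]
        for k in range(num_units)
    ]


def pruned_forward(y, groups):
    """Maximum over uneven groups; assumes every group is non-empty."""
    # y is indexed by original filter index here; a deployed pruned model
    # would only compute the activations of the active filters.
    return [max(y[j] + w for j, w in group) for group in groups]


def full_forward(y, w_m):
    """Unpruned Max-plus layer, for comparison."""
    return [max(y[j] + w_m[j][k] for j in range(len(w_m))) for k in range(len(w_m[0]))]


# Hypothetical toy weights: 4 linear filters (rows), 2 output units (columns).
w_m = [[1.0, 0.25],
       [0.25, 1.0],
       [0.0, 0.75],
       [0.125, 0.25]]
groups = prune_maxplus(w_m, s=0.5)
active_filters = sorted({j for group in groups for j, _ in group})

y = [1.0, 0.5, 1.0, 0.25]          # activations of the fully-connected layer
z_pruned = pruned_forward(y, groups)
z_full = full_forward(y, w_m)      # equal to z_pruned for this particular input
```

Here the first unit keeps one filter and the second keeps two, so the groups are uneven, and the filter with index 3 is non-active. The two outputs coincide for this input because no dropped term attains the maximum; this is not guaranteed for arbitrary activations.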
We tried different pruning levels on this simple Max-plus model by varying the threshold and tested the pruned models on the validation set and test set of the MNIST dataset. We plotted the resulting classification accuracy as a function of the number of active filters in Figure 6. The performance of a single-layer softmax model and of a single-layer Maxout model (with two affine components in each Maxout unit) is also provided for comparison.
As can be seen in Figure 6, the performance of the pruned Max-plus model is well below that of a single-layer softmax model when only one active filter is allowed to be selected for each class, i.e. when the threshold is fixed to 1.0. However, as we relax the constraint on the total number of active filters by decreasing the threshold, the accuracy recovers rapidly and approaches that of the unpruned Max-plus model monotonically. With exactly 24 active filters retained in the pruned model, we achieve a full recovery of the original Max-plus model performance, which means that the other 120 linear filters do not contribute to the classification task. Moreover, we achieve performance comparable to that of the 2-degree Maxout model with roughly the same number of parameters, which again validates the effectiveness of our Max-plus models.
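The parameter budgets behind this comparison can be worked out with simple arithmetic. The sketch below rests on several assumptions that are not stated explicitly in the text: MNIST images are flattened to 28 x 28 = 784 inputs, the unpruned model has 24 + 120 = 144 linear filters, each active filter is used by a single output unit (so the pruned Max-plus layer holds 24 weights), and the Maxout baseline includes one bias per affine component.

```python
INPUT_DIM = 28 * 28          # flattened MNIST image (assumption)
NUM_CLASSES = 10
ACTIVE_FILTERS = 24          # retained filters reported in the text
TOTAL_FILTERS = 24 + 120     # active plus non-contributing filters

# Unpruned Max-plus block: bias-free linear layer plus a J x K Max-plus matrix.
unpruned_params = INPUT_DIM * TOTAL_FILTERS + TOTAL_FILTERS * NUM_CLASSES

# Pruned block: only active filters, plus one Max-plus weight per active filter.
pruned_params = INPUT_DIM * ACTIVE_FILTERS + ACTIVE_FILTERS

# Single-layer Maxout with two affine components per unit (weights and biases).
maxout_params = 2 * NUM_CLASSES * (INPUT_DIM + 1)

kept_fraction = pruned_params / unpruned_params
print(unpruned_params, pruned_params, maxout_params, round(kept_fraction, 3))
```

Under these assumptions the pruned model keeps 18,840 of 114,336 parameters, roughly one sixth, and sits in the same range as the 15,700 parameters of the Maxout baseline.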
With the same method, we successfully performed model pruning on a much more challenging CNN model by replacing the last fully-connected layer with a Max-plus layer. In order to facilitate the training of this deep Max-plus model, we resort to transfer learning by initializing the two convolutional layers with pre-trained weights. The pruned Max-plus model achieves slightly better performance than the CNN model while reducing the number of parameters of the second-to-last fully-connected layer and eliminating the last fully-connected layer of the CNN model. Note that we could achieve a full recovery of the unpruned Max-plus model performance with only ten active filters in this case, namely one linear filter for each output unit. Table 2 summarizes the architectures of the CNN model, the unpruned Max-plus model and the pruned Max-plus model, along with their classification accuracy on the test set of the CIFAR-10 dataset.
4.3 Comparison to Maxout Networks
It is worth noting that the pruned Max-plus network differs from Maxout networks only in the grouping strategy for the maximum operations. Maxout networks impose a rigid constraint, in that each Maxout unit has an equal number of affine components, while Max-plus networks are more flexible in this respect. Figure 7 shows a graphical illustration of this comparison.
This higher flexibility endows Max-plus blocks with the capability of adapting the number of active filters used for each output unit. If a latent concept (say digit 1, which can be easily confused with digit 7) is considerably harder to capture than some other concepts (say digits 3, 4 and 8, each of which has a relatively unique shape among the ten Arabic numerals), then the Max-plus layer will select more linear filters to represent it than for the others. For example, the 24 active filters in Section 4.2 are not evenly partitioned among the ten digit classes, which is consistent with our point. This adaptive behavior of the filter selection property of Max-plus blocks makes the pruned Max-plus network more computationally efficient (fewer model parameters, smaller run-time memory footprint and faster inference) and is highly desirable in real-life applications.
5 Conclusions and Future Work
In this work we went a step further on a very new and promising topic, namely the reduction of deep neural networks with Max-plus blocks. Our experiments show strong evidence that model pruning via this method is compatible with high performance when a proper dropout regularization is applied during training. This was tested on data and architectures of varying complexity. Just as interesting as the obtained results are the many questions raised by these new insights. In particular, training these architectures is a challenging task which requires a better understanding of them, both theoretical and practical. We observed that training a deep model containing a Max-plus block is not straightforward, as we needed to resort to transfer learning. New optimization tricks will be needed to train deep architectures with several Max-plus blocks. The extension to a convolutional version of Max-plus blocks is also an open question, which we hope can be addressed based on the elements provided by this work.
Acknowledgements. This work was partially funded by a grant from Institut Mines Telecom.