Approximation of dilation-based spatial relations to add structural constraints in neural networks

02/22/2021 ∙ by Mateus Riva, et al. ∙ Universidade de São Paulo, Télécom Paris, Sorbonne Université, Université Paris-Dauphine

Spatial relations between objects in an image have proved useful for structural object recognition. Structural constraints can act as regularization in neural network training, improving generalization capability with small datasets. Several relations can be modeled as a morphological dilation of a reference object with a structuring element representing the semantics of the relation, from which the degree of satisfaction of the relation between another object and the reference object can be derived. However, dilation is not differentiable, requiring an approximation to be used in the context of gradient-descent training of a network. We propose to approximate dilations using convolutions based on a kernel equal to the structuring element. We show that the proposed approximation, even if slightly less accurate than previous approximations, is definitely faster to compute and therefore more suitable for computationally intensive neural network applications.






1 Introduction

Recently, the computer vision and image processing literature has seen a veritable deluge of papers applying neural network-based techniques to a wide array of fields and problems. While these techniques often show remarkable accuracy and return high-quality results, they typically require large amounts of data to train the many parameters of a neural network. In domains where annotated data is limited, this “data hunger” may result in sub-optimal neural network performance unless it is mitigated by other techniques. Regularization is a useful tool for improving generalization when learning from limited datasets. In particular, regularization can be performed by introducing domain-specific constraints based on reliable expectations about the nature of the data, such as structural constraints.

In many computer vision domains, such as medical imaging, the semantic segmentation of objects in a scene may be guided by structural constraints based on prior knowledge (e.g. on the spatial organization of the objects). Imposing such constraints as a form of regularization may help a machine learning system leverage the structural nature of the images to improve segmentation and analysis quality, while requiring less data. To do so, it is necessary to have both a structural model that encodes the prior information and a method for usefully embedding this structural model into the machine learning process.

Many useful structural constraint models are based on mathematical morphology operators (in particular, dilation). These operators are, however, not differentiable over their entire domain and thus cannot be used directly in gradient-descent learning, the backbone of most neural network training methods. To leverage these models for neural network training, we propose, as a first step, a comparative study of convolution-based approximations of the morphological dilation.

In this work we present a comparative study of differentiable approximations of the dilation operator for spatial relationship encoding in the context of neural network training. We propose a simple, non-parametrized approximation through the use of the convolution, and compare it with known approximations such as mean operators.

2 Related Work

2.1 Fuzzy mathematical morphology for structural modeling

The main inspiration for this work was the review on fuzzy spatial relationships in [bloch_fuzzy_2005]. The representative power of fuzzy spatial relationships has great potential for application to modern deep learning spatial awareness techniques, such as multi-task spatial relationship learning [murugesan_psi-net:_2019] or the addition of spatial-based loss regularization [Simantiris2020-STSP]. Fuzzy set theory is an appropriate framework for representing and processing information while taking its intrinsic imprecision into account. In the present context of structural scene understanding, imprecision can pertain both to objects, which are then represented as fuzzy sets in the spatial domain $\mathcal{S}$, and to spatial relations, which then hold to some degree. Since the output of a typical semantic segmentation neural network is a set of per-pixel class probabilities, it takes only a small paradigm shift to see these as an ensemble of fuzzy sets with further constraints. Several spatial relations can be modeled using fuzzy mathematical morphology, in particular fuzzy dilation. The main idea is to represent the semantics of a given relation as a fuzzy structuring element $\nu$ in the spatial domain, and to dilate a reference object $\mu$ by this structuring element in order to define the region of space where the relation to this object is satisfied. The fuzzy dilation is defined as [IB:FSS-09]:

$$\delta_\nu(\mu)(x) = \sup_{y \in \mathcal{S}} t\big(\mu(y), \nu(x - y)\big),$$

where $t$ is a t-norm, and $\delta_\nu(\mu)(x)$ is the degree to which the relation to $\mu$ is satisfied at point $x$. Several relations can be defined according to this general principle. In this work we focus on closeness and directional relations.
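The fuzzy dilation described above can be illustrated on a small discrete grid. The following is a minimal brute-force sketch in NumPy, not the authors' implementation: the function name `fuzzy_dilation`, the default choice of the product as t-norm, the zero padding at the image border and the toy arrays are all assumptions made for this example.

```python
import numpy as np

def fuzzy_dilation(mu, nu, t_norm=np.multiply):
    """Brute-force fuzzy dilation: out(x) = max_y t(mu(y), nu(x - y)).

    mu: 2-D membership map in [0, 1] (reference object).
    nu: 2-D structuring element in [0, 1], odd-sized, origin at its centre.
    t_norm: element-wise t-norm (product by default; np.minimum also works).
    """
    h, w = mu.shape
    kh, kw = nu.shape
    ch, cw = kh // 2, kw // 2
    # Zero-pad so that points outside the image have membership 0 (assumption).
    padded = np.pad(mu, ((ch, ch), (cw, cw)))
    out = np.zeros((h, w), dtype=float)
    for di in range(kh):
        for dj in range(kw):
            # Offset d = x - y, measured from the kernel centre.
            oi, oj = di - ch, dj - cw
            # shifted[x] holds mu(x - d).
            shifted = padded[ch - oi: ch - oi + h, cw - oj: cw - oj + w]
            out = np.maximum(out, t_norm(shifted, nu[di, dj]))
    return out

# Toy example: a single-pixel object and a kernel that also reaches one pixel
# to the right with degree 0.5.
mu = np.zeros((5, 5))
mu[2, 2] = 1.0
nu = np.array([[0.0, 0.0, 0.0],
               [0.0, 1.0, 0.5],
               [0.0, 0.0, 0.0]])
dilated = fuzzy_dilation(mu, nu)
print(dilated[2])
```

The explicit loop over kernel offsets keeps the correspondence with the definition visible; it is not meant to be efficient.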

2.2 Differentiable approximations of mathematical morphology operators

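A differentiable approximation of $\delta_\nu(\mu)$ requires an approximation of the $\sup$ function. Following Dubois and Prade [Dubois1985-IS-aggregation, Sect. 1.2.1], the $\sup$ can be approximated as the limit case of the generalized mean (for any $a, b \geq 0$):

$$\lim_{\alpha \to +\infty} \left( \frac{a^{\alpha} + b^{\alpha}}{2} \right)^{1/\alpha} = \max(a, b). \tag{1}$$

Thus we can rewrite the dilation operation as:

$$\delta_\nu(\mu)(x) = \lim_{\alpha \to +\infty} \left( \int_{\mathcal{S}} t\big(\mu(y), \nu(x - y)\big)^{\alpha} \, dy \right)^{1/\alpha}, \tag{2}$$

where $t$ is a differentiable t-norm (such as the product), and we can approximate it using a (moderately) large positive value of $\alpha$ in Equation 2. A relatively low value of $\alpha$ already produces small errors in Equation 1 when $a$ and $b$ are scalars. Note that in practice, $\mathcal{S}$ is a bounded finite domain and $\nu$ has a finite support, hence the integral converges.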
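To make the limit behaviour of the generalized mean concrete, the short sketch below evaluates it for two scalars and increasing exponents. The helper name `generalized_mean`, the two sample values and the list of exponents are assumptions chosen for illustration only; they are not the values used in the paper.

```python
import numpy as np

def generalized_mean(values, alpha):
    """Generalized (power) mean of non-negative values with exponent alpha."""
    values = np.asarray(values, dtype=float)
    return np.mean(values ** alpha) ** (1.0 / alpha)

a, b = 0.3, 0.8  # illustrative membership values (assumption)
for alpha in (1, 5, 20, 100):
    approx = generalized_mean([a, b], alpha)
    # The error with respect to max(a, b) shrinks as alpha grows.
    print(f"alpha={alpha:>3}  mean={approx:.4f}  error={max(a, b) - approx:.4f}")
```

For memberships in [0, 1], very large exponents can underflow in floating point, which is one practical reason to prefer a moderate exponent.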

Another common approximation is to use the limit cases of the counter-harmonic mean (CHM) as approximations of the fundamental morphological erosion and dilation operators [Angulo2010-ACIVS-morphconv-chm, Masci2013-MMASIP-morphconv-chm]. Additionally, this approximation has already been applied in deep learning contexts [Mellouli2019-NNLS-morphconv-chm]. The CHM at the $P$-th power of an image $f$ with structuring element $\nu$ is defined as:

$$\kappa^{P}_{\nu}(f) = \frac{f^{P+1} * \nu}{f^{P} * \nu}, \tag{3}$$

where $f^{P}$ is the image with every pixel raised to the power $P$ and $*$ denotes the convolution operation. The morphological dilation $\delta_\nu(f)$ can thus be seen as the limit case of the CHM:

$$\lim_{P \to +\infty} \kappa^{P}_{\nu}(f)(x) = \delta_\nu(f)(x),$$

and can be approximated by taking a positive value of $P$ in Equation 3. A relatively low value of $P$ already produces small errors for the counter-harmonic mean of two scalars.
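The counter-harmonic mean can be sketched with two convolutions and a pixel-wise division. The code below is an illustrative NumPy-only version under several assumptions: the helper `conv2d_same` (a plain zero-padded convolution written for this example), the flat 3x3 kernel, the exponent `p=20` and the handling of zero denominators are not taken from the paper.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Same-size 2-D convolution with zero padding: out(x) = sum_y img(y) * kernel(x - y)."""
    h, w = img.shape
    kh, kw = kernel.shape
    ch, cw = kh // 2, kw // 2
    padded = np.pad(img, ((ch, ch), (cw, cw)))
    out = np.zeros((h, w), dtype=float)
    for di in range(kh):
        for dj in range(kw):
            oi, oj = di - ch, dj - cw  # kernel offset relative to its centre
            out += kernel[di, dj] * padded[ch - oi: ch - oi + h, cw - oj: cw - oj + w]
    return out

def chm(img, kernel, p):
    """Counter-harmonic mean of order p: (img^(p+1) * kernel) / (img^p * kernel)."""
    num = conv2d_same(img ** (p + 1), kernel)
    den = conv2d_same(img ** p, kernel)
    # Where the denominator is exactly zero, return 0 (assumption for this sketch).
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Toy image: uniform background with one bright pixel.
img = np.full((5, 5), 0.2)
img[2, 2] = 0.9
flat = np.ones((3, 3))
approx = chm(img, flat, p=20)
print(np.round(approx, 3))
```

With a flat kernel and a large enough exponent, the output approaches the maximum of the image over each 3x3 window, which is the flat dilation the CHM is meant to approximate.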

3 Methodology

As in [bloch_fuzzy_2005], we propose to encode expected a priori relative positioning as a set of fuzzy mathematical morphology structuring elements – each element representing the semantics of a relationship. Each relationship is defined between two specific objects, which we call source and target. The dilation of the source by the structuring element results in a fuzzy “map” that is high-valued in regions obeying the expected relative positioning, as can be seen in Figure 1. The intersection of this map with the target object allows us to compute the satisfaction degree of the relationship.

Figure 1: Examples of a source, a structuring element (here encoding “to the left of”) and the relational map produced by the dilation of the source object by this structuring element. Panels, from left to right: Source, Kernel, Relational Map.

3.1 Fuzzy Spatial Relationships

In this paper we focus, as examples, on two interesting relationships: closeness and directional relative position between two objects [bloch_fuzzy_2005].

3.1.1 Closeness.

One of the simplest types of relationship between objects is “closeness” or proximity, that is, how distant one object is from the other. While several measures and metrics address this problem, relational maps offer an interesting solution: the dilation of the source by a circular structuring element of radius $r$ results in a relational map with positive values up to a distance $r$ from the source, thus encoding the relationship “at most $r$ pixels from the source”. This extends directly to a non-crisp structuring element modeling the intrinsic imprecision of the concept “close to”, where the membership of a point is a decreasing function of its distance to the origin. The degree to which a target object is “close to” the source is then measured as a degree of intersection with this map. A ring-shaped structuring element can additionally enforce a minimal distance between objects, given by the inner radius of the structuring element.
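As an illustration of the structuring elements described for closeness, the sketch below builds a fuzzy “close to” kernel and a crisp ring-shaped kernel. The function names, the linear decrease of the membership with distance and the chosen sizes are assumptions for this example; the text only requires a decreasing function of the distance to the origin.

```python
import numpy as np

def distance_grid(size):
    """Euclidean distance of every kernel cell to the kernel centre (size must be odd)."""
    c = size // 2
    rows, cols = np.mgrid[-c:c + 1, -c:c + 1]
    return np.hypot(rows, cols)

def close_kernel(size, radius):
    """Fuzzy 'close to' kernel: 1 at the origin, decreasing linearly to 0 at `radius`."""
    return np.clip(1.0 - distance_grid(size) / radius, 0.0, 1.0)

def ring_kernel(size, r_inner, r_outer):
    """Crisp ring: 1 where r_inner <= distance <= r_outer, 0 elsewhere."""
    d = distance_grid(size)
    return ((d >= r_inner) & (d <= r_outer)).astype(float)

close = close_kernel(7, radius=3)
ring = ring_kernel(7, r_inner=2, r_outer=3)
print(np.round(close[3], 2))  # central row of the fuzzy kernel
print(ring[3])                # central row of the ring kernel
```

Because the ring kernel is zero at and around the origin, dilating a source with it leaves out the points that are closer than the inner radius.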

3.1.2 Directional Relative Position.

The directional relative position between objects allows us to encode an expected configuration of a scene, which is particularly useful in domains where scenes are highly constrained, such as anatomical images. Such a relationship can be encoded via the dilation of the source object with a structuring element defined at point $x$ as a decreasing function of the angle between the segment joining the origin and $x$ and the line in the desired direction (see Figure 1 for the left direction).
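A directional structuring element of this kind can be sketched as follows. The linear decrease of the membership with the angle (1 at angle 0, 0 from a right angle onwards), the value assigned to the origin and the name `directional_kernel` are assumptions for this example; the text only specifies a decreasing function of the angle.

```python
import numpy as np

def directional_kernel(size, direction):
    """Fuzzy structuring element for 'in direction `direction`' (size must be odd).

    direction: (row, col) vector of the desired direction, e.g. (0, -1) for 'left'
    in image coordinates. Membership decreases linearly with the angle to
    `direction` and is 0 for angles of pi/2 or more (assumed profile).
    """
    c = size // 2
    rows, cols = np.mgrid[-c:c + 1, -c:c + 1].astype(float)
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)
    norm = np.hypot(rows, cols)
    norm[c, c] = 1.0  # avoid a division by zero at the origin
    cos_theta = (rows * u[0] + cols * u[1]) / norm
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    kernel = np.maximum(0.0, 1.0 - 2.0 * theta / np.pi)
    kernel[c, c] = 1.0  # the origin is given full membership (assumption)
    return kernel

left = directional_kernel(5, (0, -1))
print(np.round(left, 2))
```

Dilating a source object with `left` gives high values at points whose offset from some source point is directed towards decreasing column indices, that is, to the left of the source.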

3.1.3 Defining Fuzzy Spatial Relationships.

Given a relationship $R = (A, B, K)$, where $A$ is an image containing the source object, $B$ is an image containing the target object, $K$ is a relationship structuring element (or kernel) that encodes the relationship of the source with a target object, and $\mathcal{S}$ is their domain space (to simplify calculations, we assume all domains are shared; this does not impact the generalizability of the proposed algorithm), we can build the dilation-based relational map of $R$, $M^{\delta}_R$, by dilating $A$ with $K$. The dilation-based intersected relational map $I^{\delta}_R$ is obtained by the element-wise multiplication, here denoted as $\odot$, of $M^{\delta}_R$ with $B$. Finally, the satisfaction degree of the relationship is given by the normalized relational score $N^{\delta}_R$, obtained by dividing the sum of all values in $I^{\delta}_R$ by the sum of all values in $B$:

$$M^{\delta}_R = \delta_K(A), \qquad I^{\delta}_R = M^{\delta}_R \odot B, \qquad N^{\delta}_R = \frac{\sum_{x \in \mathcal{S}} I^{\delta}_R(x)}{\sum_{x \in \mathcal{S}} B(x)}.$$
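The three steps above (relational map, intersection with the target, normalized score) can be sketched on a toy example. The code is an illustration under assumptions: the product t-norm, zero padding at the border, the function names and the small hand-made “left of” kernel are all chosen for this example and are not the authors' implementation.

```python
import numpy as np

def fuzzy_dilation(src, kernel):
    """Fuzzy dilation with the product t-norm: out(x) = max_y src(y) * kernel(x - y)."""
    h, w = src.shape
    ch, cw = kernel.shape[0] // 2, kernel.shape[1] // 2
    padded = np.pad(src, ((ch, ch), (cw, cw)))  # zero membership outside the image
    out = np.zeros((h, w), dtype=float)
    for di in range(kernel.shape[0]):
        for dj in range(kernel.shape[1]):
            oi, oj = di - ch, dj - cw  # kernel offset relative to its centre
            shifted = padded[ch - oi: ch - oi + h, cw - oj: cw - oj + w]
            out = np.maximum(out, shifted * kernel[di, dj])
    return out

def relational_score(rel_map, target):
    """Normalized relational score: sum(rel_map * target) / sum(target)."""
    return float((rel_map * target).sum() / target.sum())

# Source: one pixel. Kernel: full degree one pixel to the left, 0.5 at the origin.
source = np.zeros((7, 7))
source[3, 3] = 1.0
kernel = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.5, 0.0],
                   [0.0, 0.0, 0.0]])
rel_map = fuzzy_dilation(source, kernel)

target_left = np.zeros((7, 7))
target_left[3, 2] = 1.0
target_right = np.zeros((7, 7))
target_right[3, 4] = 1.0
print(relational_score(rel_map, target_left), relational_score(rel_map, target_right))
```

A target lying entirely in the high-valued part of the map gets a score close to 1, while a target on the wrong side of the source gets a score close to 0.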


3.2 Differentiable Approximation

In the context of neural network learning, these maps or scores may be used to compute an additional loss term to guide the training. However, as the loss needs to be differentiable with respect to the object memberships to be predicted, the morphological dilation must be approximated by a differentiable operator.

For the purpose of encoding the spatial relationship, the dilation can be approximated by a simple convolution, with the kernel being the structuring element flipped across all axes. For clarity, we will refer to both kernels as $K$. The convolution is a practical choice in the context of neural networks. Not only are convolutions highly optimized, owing to the popularity of convolutional neural networks (CNNs), which allows for faster calculation, but memory consumption also decreases compared to the mean approximations (which require power functions). Additionally, the convolutional approximation is hyperparameter-free.

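Given a relationship $R = (A, B, K)$ as defined above, we can build the convolution-approximated relational map of $R$, $M^{\mathrm{conv}}_R$, by convolving $A$ with $K$. The approximated intersected relational map $I^{\mathrm{conv}}_R$ and normalized relational score $N^{\mathrm{conv}}_R$ are then derived as before:

$$M^{\mathrm{conv}}_R = A * K, \qquad I^{\mathrm{conv}}_R = M^{\mathrm{conv}}_R \odot B, \qquad N^{\mathrm{conv}}_R = \frac{\sum_{x \in \mathcal{S}} I^{\mathrm{conv}}_R(x)}{\sum_{x \in \mathcal{S}} B(x)}.$$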
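The convolution-based variant can be sketched in the same way as the dilation-based score. In this illustration, `conv2d_same` is a hypothetical helper that implements the mathematical convolution directly with zero padding, so no explicit kernel flip is needed; the toy source, kernel and targets are assumptions, and the map is neither clipped nor rescaled in this sketch.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Same-size 2-D convolution with zero padding: out(x) = sum_y img(y) * kernel(x - y)."""
    h, w = img.shape
    ch, cw = kernel.shape[0] // 2, kernel.shape[1] // 2
    padded = np.pad(img, ((ch, ch), (cw, cw)))
    out = np.zeros((h, w), dtype=float)
    for di in range(kernel.shape[0]):
        for dj in range(kernel.shape[1]):
            oi, oj = di - ch, dj - cw  # kernel offset relative to its centre
            out += kernel[di, dj] * padded[ch - oi: ch - oi + h, cw - oj: cw - oj + w]
    return out

def conv_relational_score(source, target, kernel):
    """Convolution-approximated normalized relational score."""
    rel_map = conv2d_same(source, kernel)   # approximated relational map
    intersected = rel_map * target          # element-wise product with the target
    return float(intersected.sum() / target.sum())

# Two-pixel source and a small 'left of' kernel (full degree to the left, 0.5 at the origin).
source = np.zeros((7, 7))
source[3, 3] = source[3, 4] = 1.0
kernel = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.5, 0.0],
                   [0.0, 0.0, 0.0]])
conv_map = conv2d_same(source, kernel)
print(conv_map[3])

target = np.zeros((7, 7))
target[3, 2] = 1.0
print(conv_relational_score(source, target, kernel))
```

Unlike the dilation, the convolution sums the contributions of all source points, so the map can differ from the dilation-based one inside and near the source, as in the central row printed above.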


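Such scores, for several relationships $R$, are then included in the loss function, with the aim of maximizing them. This optimization is done by differentiating the loss function with respect to every object membership function. The proposed approximation is indeed differentiable, as a composition of differentiable functions, which allows its use in gradient descent-based learning algorithms. Only the derivatives of $N^{\mathrm{conv}}_R$ with respect to $A$ and $B$ can be non-zero if the relation $R$ involves only objects $A$ and $B$. Let us now detail the derivative calculations, writing the convolution as $M^{\mathrm{conv}}_R(x) = \sum_{y \in \mathcal{S}} A(y) K(x - y)$. The derivative of the approximated normalized relational score with respect to $O(z)$, the membership of an object $O$ at point $z$, is:

$$\frac{\partial N^{\mathrm{conv}}_R}{\partial O(z)} = \frac{\left( \sum_{x} \frac{\partial I^{\mathrm{conv}}_R(x)}{\partial O(z)} \right) \sum_{x} B(x) - \left( \sum_{x} I^{\mathrm{conv}}_R(x) \right) \sum_{x} \frac{\partial B(x)}{\partial O(z)}}{\left( \sum_{x} B(x) \right)^{2}}.$$

The derivative of the approximated intersected relation map is:

$$\frac{\partial I^{\mathrm{conv}}_R(x)}{\partial O(z)} = \frac{\partial M^{\mathrm{conv}}_R(x)}{\partial O(z)} B(x) + M^{\mathrm{conv}}_R(x) \frac{\partial B(x)}{\partial O(z)}. \tag{10}$$

Case $O \notin \{A, B\}$.

As mentioned above, the derivative is equal to 0.

Case $O = A$.

By noting that $\partial B(x) / \partial A(z) = 0$, Equation 10 becomes:

$$\frac{\partial I^{\mathrm{conv}}_R(x)}{\partial A(z)} = K(x - z) B(x),$$

and we get:

$$\frac{\partial N^{\mathrm{conv}}_R}{\partial A(z)} = \frac{\sum_{x} K(x - z) B(x)}{\sum_{x} B(x)}.$$

Case $O = B$.

The first term in Equation 10 is equal to 0, and we have:

$$\frac{\partial B(x)}{\partial B(z)} = \begin{cases} 1 & \text{if } x = z, \\ 0 & \text{otherwise.} \end{cases}$$

Equation 10 then becomes $\partial I^{\mathrm{conv}}_R(x) / \partial B(z) = M^{\mathrm{conv}}_R(x)$ if $x = z$ and $0$ otherwise, and finally we get:

$$\frac{\partial N^{\mathrm{conv}}_R}{\partial B(z)} = \frac{M^{\mathrm{conv}}_R(z) \sum_{x} B(x) - \sum_{x} I^{\mathrm{conv}}_R(x)}{\left( \sum_{x} B(x) \right)^{2}}.$$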
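The closed-form gradients of a convolution-based score of this kind can be checked numerically. The sketch below assumes the score is the sum of the element-wise product of the convolved source and the target, divided by the sum of the target, with a zero-padded convolution; under that assumption the gradient with respect to the source is a correlation of the target with the kernel, and the gradient with respect to the target follows from the quotient rule. All function names, the random test arrays and the finite-difference step are assumptions for this example.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Same-size 2-D convolution with zero padding: out(x) = sum_y img(y) * kernel(x - y)."""
    h, w = img.shape
    ch, cw = kernel.shape[0] // 2, kernel.shape[1] // 2
    padded = np.pad(img, ((ch, ch), (cw, cw)))
    out = np.zeros((h, w), dtype=float)
    for di in range(kernel.shape[0]):
        for dj in range(kernel.shape[1]):
            oi, oj = di - ch, dj - cw
            out += kernel[di, dj] * padded[ch - oi: ch - oi + h, cw - oj: cw - oj + w]
    return out

def score(a, b, k):
    """Normalized relational score with a convolution-based relational map."""
    return (conv2d_same(a, k) * b).sum() / b.sum()

def analytic_grads(a, b, k):
    """Closed-form gradients of the score with respect to source a and target b."""
    m = conv2d_same(a, k)
    sb = b.sum()
    # d score / d a(z) = sum_x k(x - z) b(x) / sum(b): convolution with the flipped kernel.
    grad_a = conv2d_same(b, k[::-1, ::-1]) / sb
    # d score / d b(z) = (m(z) * sum(b) - sum(m * b)) / sum(b)^2.
    grad_b = (m * sb - (m * b).sum()) / sb ** 2
    return grad_a, grad_b

def numeric_grad(f, x, h=1e-6):
    """Central finite differences of a scalar function f at array x."""
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        xp, xm = x.copy(), x.copy()
        xp[idx] += h
        xm[idx] -= h
        g[idx] = (f(xp) - f(xm)) / (2 * h)
    return g

rng = np.random.default_rng(0)
a, b, k = rng.random((6, 6)), rng.random((6, 6)), rng.random((3, 3))
grad_a, grad_b = analytic_grads(a, b, k)
num_a = numeric_grad(lambda x: score(x, b, k), a)
num_b = numeric_grad(lambda x: score(a, x, k), b)
print(np.max(np.abs(grad_a - num_a)), np.max(np.abs(grad_b - num_b)))
```

In practice an automatic differentiation framework would compute these gradients; the finite-difference comparison is only a sanity check of the formulas.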


4 Experiments

To illustrate that the proposed approximation behaves similarly to the original dilation-based spatial relation encoding in the context of machine learning, and particularly as a component of a loss function, we compare the behavior of the relational score obtained by the dilation-based approach, $N^{\delta}_R$, to the score obtained by the convolutional approximation, $N^{\mathrm{conv}}_R$, and, in an analogous fashion, to the score obtained by the CHM approximation, $N^{\mathrm{CHM}}_R$, and by the generalized mean approximation, $N^{\mathrm{GM}}_R$.

4.1 Comparison – Experimental Setup

Given a previously specified source $A$ and relationship kernel $K$, we compute the relational maps $M^{\delta}_R$ (dilation-based), $M^{\mathrm{conv}}_R$ (convolution approximation), $M^{\mathrm{CHM}}_R$ (CHM approximation) and $M^{\mathrm{GM}}_R$ (generalized mean approximation). Subsequently, the relational scores $N^{\delta}_R$, $N^{\mathrm{conv}}_R$, $N^{\mathrm{CHM}}_R$ and $N^{\mathrm{GM}}_R$ are computed for a target placed at every possible position $p \in \mathcal{S}$. Our expectation is that all functions will behave similarly in similar regions of $\mathcal{S}$, with the exception of the source region, which the dilations have a stronger tendency to include. We use the relation “to the right of” a disk of radius 5 pixels placed at the left of the central horizontal line, closeness with respect to a disk of radius 5 pixels placed at the image center, with a crown-shaped kernel, and an additional “insideness” relation to a square of side 50 pixels, using a simple dot kernel.

For all experiments, the sources and targets were placed inside an image of fixed size, and fixed parameter values were used for the CHM and the generalized mean. We obtained the dilation-based and approximation-based relational scores for a target at all positions $p \in \mathcal{S}$.

4.1.1 Experimental Results.

For each experiment conducted, we display, for all techniques, the relational maps (in Figure 2) and the heatmap of relational score values per target position (in Figure 3). Additionally, we directly compare the curves of all relational score functions at the cut obtained at the center of the X-axis and of the Y-axis (in Figure 4).

Figure 2: Comparison of the relational maps obtained using different techniques. See text for notations. The green outlines represent the source position in the relational map. The absolute difference to the dilation-based map is shown for each method. Kernels are rescaled for visualization purposes. Columns: source, kernel, then the relational map of each technique with its difference to the dilation-based map. Rows after the first: “Close to”, “Far from”, “Inside of”.
Figure 3: Comparison of the difference of the relational scores obtained using different techniques. The green outlines represent the source position in the relational map. Rows after the first: “Close to”, “Far from”, “Inside of”.
Figure 4: Comparison of the relational scores obtained using different techniques. The first column is the value of the relational score heatmap at the mid-X-axis cut for all techniques; the second column is for the mid-Y-axis cut. The gray zone represents overlap with the source region. Two of the score curves are visually indistinguishable in all plots. Rows after the first: “Close to”, “Far from”, “Inside of”.
Relative position.

Results for the relative position experiment are shown in the first row of Figures 2, 3 and 4. From both the heatmaps and the curves, it can be seen that the convolutional approximation produces a slight “feather” effect at the farthest points of the valid target region. Additionally, at almost all points the convolution-based score is attenuated w.r.t. the dilation-based score but displays the same behavior. The CHM approaches the dilation almost perfectly, with the generalized mean approximation following behind. Note that the region inside the source object is, as expected, strong for the dilation, the CHM and the generalized mean, but not particularly strong for the convolution. This is due to the dilation's implicit encoding of the “inside” relationship (identity in this case), which may not be desired if we want to penalize relationships inside the source.


Closeness.

Results for the closeness experiment are shown in the second row of Figures 2, 3 and 4. As for directional relations, the convolution-based score here is attenuated but behaves similarly to the dilation-based score, with the notable exception of the source region. In this experiment, even without extensiveness (as the kernel does not contain the origin), the effect of the implicit encoding of “inside” is highly noticeable for the dilation, the CHM and the generalized mean.


Farness.

Results for the farness experiment are shown in the third row of Figures 2, 3 and 4. As the kernel is too distant from the source for any implicit encoding of “inside” to happen, the behavior of the convolution-based score w.r.t. the dilation-based score is far more consistent. Interestingly, the most notable difference occurs when the target is close to the source, where the score quickly decreases to zero and plateaus there.


Insideness.

Results for the insideness experiment are shown in the fourth row of Figures 2, 3 and 4. Once the question of the implicit encoding of the “inside” relationship by the dilation-based approach is set aside (by explicitly encoding said relationship), the most noteworthy remaining observation is the feathering effect seen in the other experiments.

4.2 Time and Hyperparameter Comparison

To demonstrate the lighter computational load of the convolution operator compared to the mean approximations, as well as the advantages of its parameter-free implementation, we computed the mean squared error w.r.t. the dilation-based result for different values of the parameter of each mean approximation ($\alpha$ for the generalized mean, $P$ for the CHM), on images of two different sizes. We also measured the execution time of each method at different resolutions, with a fixed parameter value for the mean approximations. Both experiments used the “to the right of” relationship as an example. Figure 5 highlights the need to set an appropriate parameter value for the mean-based approximations: an improper choice may lead to sub-optimal results, a problem that the parameter-free convolutional approach does not have. Figure 6 shows that the convolutional approximation vastly outperforms the other approximations in terms of computation time. Additionally, as resolution increases, the gap in computation time increases exponentially. Given the need to execute these approximations many times over large amounts of data when training a neural network, and the modern tendency to use high-resolution input images, these gains in speed are crucial for completing training in feasible time.

Figure 5: Results for the hyperparameter comparison experiment, with one panel per image size. Dashed lines represent parameter-free methods (dilation and convolutional). For sufficiently large parameter values, the error plateaus close to zero.
Figure 6: Comparison of the running time of all methods, for different sizes of square images. For the mean approximations, the parameter value is fixed.

5 Conclusion

We have shown that dilation-based structural constraints can be approximated sufficiently well by a simple, hyperparameter-free convolution when used in the context of neural network learning. Additionally, a contextualized comparison with classic methods from the literature has been performed, showcasing their capabilities in this scenario and allowing for a wider variety of choices depending on computational requirements.

This is the first step towards building a neural network capable of taking advantage of known a priori structural relationships, encoded using relational maps, to improve training quality and speed. In the context of neural network training, all differentiability assumptions made are covered by modern automatic differentiation methods such as those built into TensorFlow or PyTorch. From the studied approximations, a working version of a structure-aware training technique for artificial neural networks can be implemented by adding extra terms to the loss function based on the relational scores between objects with known expected relationships.