Further Generalizations of the Jaccard Index

10/18/2021
by   Luciano da F. Costa, et al.
Universidade de São Paulo

Quantifying the similarity between two sets is a particularly interesting and useful operation in several theoretical and applied problems involving set theory. The Jaccard index has been extensively used for this purpose in the most diverse types of problems, also motivating respective generalizations. The present work addresses further generalizations of this index, including its modification into a coincidence index capable of accounting also for the level of interiority of the sets, an extension for sets in continuous vector spaces, the consideration of weights associated with the involved set elements, the generalization to multiset addition, densities and generic scalar fields, as well as a means to quantify the joint interdependence between random variables. The possibility of taking into account more than two sets is also addressed, including the description of an index capable of quantifying the level of chaining between three sets. Several of the described and suggested generalizations are illustrated with numeric case examples. It is also posited that these indices can play an important role while analyzing and integrating datasets in modeling approaches and pattern recognition activities.


1 Introduction

Despite its seeming simplicity, set theory underlies a substantial portion of the mathematical and physical sciences, while also being extensively used in virtually every area of human activity. In fact, set theory concepts are so ubiquitous as to be incorporated into language and daily conversations. When one says “I will buy bananas and potatoes and tomatoes,” it is actually the set operation of union that is meant. Interestingly, the tenuous border between set theory and propositional logic is often blurred by humans (see [1]).

Other concepts that are just as ubiquitous in human activity are those of similarity and distance between two entities. Mathematically, this can be related to quantifying in an objective manner several types of similarity between two or more mathematical structures such as scalars, sets, vectors, matrices, functions, densities, graphs, etc. This can be done in several manners, which frequently take into account the respective type of structure. For instance, vectors are often compared in terms of their inner product, and several similarity indices (e.g. [2]) have been suggested for comparing matrices with binary features.

One approach to the similarity between two sets that has attracted particular attention as a consequence of its interesting characteristics, and has therefore been employed extensively, is the Jaccard or Tanimoto index (e.g. [3, 4]). In addition to being constrained within the interval $[0, 1]$, the Jaccard index is also relatively simple and requires little computational expense. Besides its vast range of applications (e.g. [5, 3, 6, 7, 8]), the Jaccard index has also motivated some extensions and generalizations, including its adaptation to multisets (e.g. [9, 10]).

Given the popularity of the Jaccard index, as well as its appealing characteristics, it would be particularly useful if it could be adapted to as many other mathematical structures as possible. The present work aims at developing further possible generalizations of the Jaccard index.

We start by focusing on the relative limitation of this index in reflecting the extent to which one set is contained in the other. An adaptation of the Jaccard index is then proposed to address this limitation, involving another measurement between two sets, here called the interiority index. More specifically, we define the coincidence index between two sets as the product of the respective Jaccard and interiority indices.

We then approach the adaptation of the Jaccard index to take into account sets corresponding to regions in continuous vector spaces. It is shown that this can be immediately accommodated into the standard Jaccard (and also the coincidence) indices by using the areas of the regions in place of the sizes of the sets. In addition to allowing useful graphical characterizations of the Jaccard and coincidence indices, this extension to continuous sets also paves the way to dealing with densities and scalar fields. In particular, we develop a related graphical construct to illustrate the relationship between the Jaccard, interiority and coincidence indices.

Another interesting possibility covered in the present work concerns the characterization of the similarity of sets whose elements have been assigned weights expressing their respective relevance. This is achieved by a simple modification of the Jaccard and coincidence indices.

Next, we address the particularly interesting question of adapting the Jaccard index to become capable of comparing densities and functions in continuous spaces, which correspond to generic scalar fields on those domains. This is achieved by extending the multiset version of the Jaccard index to incorporate integrals of the minimum and maximum operations along the respective space. The potential of the approach, which is conceptually and computationally simple, is then illustrated by comparing probability density functions as well as more generic functions, namely two sinusoids and two real-world images.

Another issue of special relevance addressed here is the inherent, but not often considered, relationship between quantifying the similarity between two probability densities and the also ample subject of characterizing joint variations of two random variables. Two particular problems are addressed. First, we comment on the possibility of using joint variation measurements, especially the Pearson correlation coefficient, as a means of comparing two densities. Then, we show that the multiset Jaccard adaptation to densities and functions can be effectively applied to quantify the joint relationship between two random variables, be it in terms of discrete observations or while taking into account the probability densities describing standardized versions of the involved variables.

The last topic described in the present work concerns the possibility of generalizing the Jaccard index to deal with more than 2 sets. We argue that there are two main ways in which this problem can be addressed. First, it is possible to have either of the two sets involved in the Jaccard index correspond to generic combinations of any number of sets, obtained by using set operations. Alternatively, more than two sets can actually be considered as arguments of extended Jaccard indices. The latter possibility is illustrated through the development of a generalization of the Jaccard index capable of quantifying the degree of chaining between three sets, as intermediated by one of them.

The article concludes by discussing the particularly important role of indices such as those discussed and suggested here for the ubiquitous activities of model building and pattern recognition. Some prospects for future developments are also provided.

2 The Basic Jaccard Index

The basic Jaccard index can be simply expressed as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$

where $A$ and $B$ are any two sets to be compared.

It is interesting to keep in mind that, though not frequently specified, the universe set of $A$ and $B$ can be conveniently taken to be equal to $A \cup B$.

The Jaccard distance can be immediately derived from the Jaccard index by making:

$$D_J(A, B) = 1 - J(A, B) \tag{2}$$

It should be kept in mind that this approach can be immediately extended to any other similarity index bound between 0 and 1.
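As a minimal sketch, the standard definitions of the Jaccard index (size of the intersection divided by size of the union) and of the derived distance (one minus the index) can be written in Python as follows. The function names and the example sets are illustrative assumptions, not taken from the original work.

```python
def jaccard_index(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # The ratio is undefined when both sets are empty.
        raise ValueError("Jaccard index is undefined for two empty sets")
    return len(a & b) / len(union)


def jaccard_distance(a: set, b: set) -> float:
    """Distance derived from the index: one minus the similarity."""
    return 1.0 - jaccard_index(a, b)


# Hypothetical example sets: 2 shared elements out of 6 distinct ones.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
print(jaccard_index(A, B))     # 2 / 6
print(jaccard_distance(A, B))  # 1 - 2 / 6
```

The same "one minus the similarity" construction applies to any other index bound between 0 and 1.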

It is also interesting to observe that it is possible to modify the Jaccard index so as to reflect the effective size of the intersection of the two sets after penalization by the elements that are left out. This can be done as:

(3)

The Jaccard index can be immediately generalized to multisets or bags (e.g. [11, 12]), which are basically sets in which repeated elements are allowed. The multisets $A$ and $B$ can be represented as respective vectors $\vec{a} = [a_1, a_2, \ldots, a_N]$ and $\vec{b} = [b_1, b_2, \ldots, b_N]$, where $N$ is the total number of possible distinct elements in the universe defined by the union of the two multisets, and $a_i$ corresponds to the multiplicity of element $i$ in the multiset $A$. The Jaccard index for multisets then becomes:

$$J_M(A, B) = \frac{\sum_{i=1}^{N} \min(a_i, b_i)}{\sum_{i=1}^{N} \max(a_i, b_i)} \tag{4}$$

with $0 \leq J_M(A, B) \leq 1$.

As an example, let’s consider and . If we have the set of possible elements organized into the indexing vector , we will obtain and . Observe that the order of elements in is immaterial to our analysis. Then, we have:

(5)
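The multiset version can be sketched in Python with `collections.Counter`, assuming the usual form of the index: the sum of the element-wise minimum multiplicities divided by the sum of the element-wise maximum multiplicities. The two multisets below are hypothetical and are not the ones used in the example above, whose values are not reproduced here.

```python
from collections import Counter


def multiset_jaccard(a: Counter, b: Counter) -> float:
    """Sum of minimum multiplicities over sum of maximum multiplicities."""
    elements = set(a) | set(b)  # universe: union of the two multisets
    # A Counter returns 0 for elements it does not contain.
    numerator = sum(min(a[x], b[x]) for x in elements)
    denominator = sum(max(a[x], b[x]) for x in elements)
    if denominator == 0:
        raise ValueError("undefined for two empty multisets")
    return numerator / denominator


# Hypothetical multisets, written as strings of repeated symbols.
A = Counter("aabcc")  # a: 2, b: 1, c: 2
B = Counter("abbbd")  # a: 1, b: 3, d: 1
print(multiset_jaccard(A, B))  # (1 + 1 + 0 + 0) / (2 + 3 + 2 + 1)
```

Because the sums run over a set of distinct elements, the order in which the elements are indexed does not affect the result.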

As a consequence, this adaptation of the Jaccard index allows it to be applied also to vectors, matrices, and graphs. In the case of matrices, the Jaccard equation can be further modified as:

(6)

Observe that many other mathematical structures, such as matroids, tensors, etc., can be compared by further adapting the above equation.

3 Interiority and Coincidence Indices

As illustrated in the previous sections, and also by the relatively extensive related literature, the Jaccard index provides an intuitive and logical manner to quantify the similarity between two discrete or continuous sets. Yet, there is one particular situation, illustrated in Figure 1, which is not accounted for by this index.

Figure 1: Two distinct situations involving two sets and that yield the same Jaccard index value of . However, the two sets in (b) are much more compatible because is a subset of and therefore shares all its elements.

As it can be easily verified, both the situations depicted in Figure 1 lead to the same Jaccard index . However, the situation in (b) can be deemed to be quite distinct because, in this case, the set is completely contained in to the point of becoming a subset, i.e. . In other words, all elements of are shared with the set . This is not the case in the situation (a), for both sets and have elements that are not shared.

It therefore follows that it would be interesting to obtain a modification of the Jaccard index that could distinguish between these two situations. A possible approach is described as follows.

We start by considering an index capable of quantifying how much a set is interior to another. Let $A$ and $B$ be any two sets. The index henceforth called the interiority index can be written as:

$$I(A, B) = \frac{|A \cap B|}{\min\{|A|, |B|\}} \tag{7}$$

It can be verified that $0 \leq I(A, B) \leq 1$. Its minimum value is observed when $A$ is completely separated from $B$, i.e. $A \cap B = \emptyset$. The maximum value is reached when either of the sets is completely contained in the other. In other words, there is no need to specify which of the two sets is being considered as internal to the other.

By comparing Equations 1 and 7, it follows that:

(8)

and it can also be verified that:

(9)

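The verification of similarity accounted for by the Jaccard index can be conveniently combined with the interiority index simply by considering their respective product, i.e.:

$$C(A, B) = J(A, B) \, I(A, B) \tag{10}$$

which is the same as:

$$C(A, B) = \frac{|A \cap B|^2}{|A \cup B| \, \min\{|A|, |B|\}} \tag{11}$$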
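The following Python sketch illustrates how the coincidence index separates two situations that share the same Jaccard value. It assumes that the interiority index is the size of the intersection divided by the size of the smaller set, and that the coincidence index is the product of the Jaccard and interiority indices. The sets are hypothetical and are not the ones drawn in Figure 1.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


def interiority(a: set, b: set) -> float:
    """Intersection size relative to the smaller of the two sets (assumed form)."""
    return len(a & b) / min(len(a), len(b))


def coincidence(a: set, b: set) -> float:
    """Product of the Jaccard and interiority indices."""
    return jaccard(a, b) * interiority(a, b)


# Case (a): partial overlap, neither set contained in the other.
A1, B1 = {1, 2, 3, 4}, {3, 4, 5, 6}
# Case (b): the second set is a subset of the first one.
A2, B2 = {1, 2, 3, 4, 5, 6}, {1, 2}

for name, (a, b) in {"(a)": (A1, B1), "(b)": (A2, B2)}.items():
    print(name, jaccard(a, b), interiority(a, b), coincidence(a, b))
```

Both cases have a Jaccard index of 2/6, but only case (b) reaches the maximum interiority of 1, so its coincidence index is twice that of case (a). The functions assume non-empty sets.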

In some specific cases it is also possible to use these two indices separately, defining a corresponding tuple of the two values. It may also be interesting to take the square root of the coincidence index in order to obtain a more uniform distribution of values.

4 Continuous Sets

The henceforth described approach holds for $\mathbb{R}^N$, but we shall consider the plane vector space $\mathbb{R}^2$. It is possible to associate sets to the points of this space in any possible manner, such as , which is a discontinuous in , or , which defines a continuous region.

Though the Jaccard index can be immediately applied to any of these sets, it is of particular interest to our developments to consider set configurations corresponding to simple connected regions of $\mathbb{R}^2$ such as those illustrated in Figure 2.

Figure 2: The three most relevant situations to be considered when comparing two sets: (a) no intersection; (b) partial intersection; (c) complete intersection.

In this case, the sizes of the sets can be conveniently replaced by the respective areas of $A$, $B$, $A \cap B$, and $A \cup B$, which can be immediately used in Equation 1.

The three cases in Figure 2 also correspond to the most representative situations when comparing two sets. In Figure 2(a), we have two separated sets, which results in a null intersection, suggesting minimal similarity between the two sets. The situation depicted in (c) can be understood as leading to the maximum similarity that can be achieved with the sets $A$ and $B$. Figure 2(b) illustrates a frequently found situation in which there is some intersection between the sets. In this case, it would be expected that the similarity increases with the intersection area.
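The three situations can be reproduced numerically by replacing set sizes with areas. The sketch below is only an illustration for axis-aligned rectangles: the `Rect` type, its corner-based parametrization, and the example coordinates are assumptions introduced here, and they do not correspond to the specific regions shown in the figures.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle given by two opposite corners (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def intersection_area(r: Rect, s: Rect) -> float:
    width = min(r.x1, s.x1) - max(r.x0, s.x0)
    height = min(r.y1, s.y1) - max(r.y0, s.y0)
    return max(0.0, width) * max(0.0, height)  # zero when the regions are apart


def area_jaccard(r: Rect, s: Rect) -> float:
    """Jaccard index with areas in place of set sizes."""
    inter = intersection_area(r, s)
    union = r.area() + s.area() - inter
    if union == 0:
        raise ValueError("undefined for two regions of zero area")
    return inter / union


separated = area_jaccard(Rect(0, 0, 1, 1), Rect(2, 0, 3, 1))  # no intersection
partial = area_jaccard(Rect(0, 0, 2, 2), Rect(1, 0, 3, 2))    # partial intersection
contained = area_jaccard(Rect(0, 0, 2, 2), Rect(0, 0, 1, 1))  # one inside the other
print(separated, partial, contained)
```

In the contained case the index equals the ratio between the two areas, which is the largest value these two particular regions can reach.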

The situation represented in Figure 2(b) actually incorporates the two other situations as limit cases. Consider the diagram shown in Figure 3, involving two square regions and , with respective sides and , .

Figure 3: A construction representing all possible situations regarding the similarity of two sliding squares and with sides and , respectively. Without loss of generality, we assume that . Any of these situations can be specified by just two parameters: the relative position and the relative size . This construction allows us to better understand the behavior of the Jaccard and other similarity indices covered in this work.

The relative position, and also the similarity, of the two sets can be completely controlled in terms of the relative position parameter , with . As increases, the two squares progressively separate, therefore becoming less similar.

There is only one other parameter that needs to be specified in order to completely represent the situation in Figure 3, namely the relative sizes of the two regions , with .

The area of the intersection and union of the two sets can now be conveniently expressed in terms of and as:

(12)
(13)

We can now rewrite the Jaccard index as:

(14)

Figure 4 presents the Jaccard index for two rectangles, as developed above, in terms of several configurations of the parameters and .

Figure 4: The Jaccard (a), interiority (b), and coincidence (c) indices obtained for the geometrical construction illustrated in Figure 3. The heat map increases from yellow to brown. The incorporation of the interiority level into the Jaccard index leads to a more commensurate distribution of the level sets. The maximum values of the Jaccard and coincidence indices are found at the lower right-hand corner of the respective plots. The values of the interiority index increase linearly from the top to the bottom diagonal in (b).

5 Addition-Based Multiset Jaccard Index

The multiset Jaccard index can be further generalized by taking into account the sum of the two sets and instead of their respective union, which leads to:

(15)

with .

The interesting feature of this index is that it covers situations in which the multiple instances in the multisets need to be fully taken into account when combining the sets.

As an example, let’s consider that and . Then, we have that:

(16)

and:

(17)

Figure 5 depicts the results obtained for the additive multiset Jaccard index considering the same construction as described in Figure 3.

Figure 5: The scalar field obtained for the additive multiset Jaccard index applied to the situations defined in Fig. 3. The result is substantially similar to the respective basic Jaccard index shown in Fig. 4(a), except for a less steep increase toward the peak (equal to 1) at the lower righthand corner.

The additive multiset Jaccard index can be immediately combined with the respective interiority index to yield the additive multiset coincidence index.

6 Weighted Discrete Elements

The Jaccard and coincidence indices can be readily adapted to cope with cases in which the elements of sets $A$ and $B$ have been assigned respective weights corresponding to their relative importance in each specific problem. This situation can be approached by using ordered pairs to represent each of the elements associated with its respective weight. The Jaccard index then becomes:

(18)

with .

As an example, let and . It follows that .

(19)

Thus, although the intersection in this particular example is limited to a single one of the possible elements, the resulting Jaccard index is relatively high as a consequence of the large weight associated with that shared element.
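The effect of a heavily weighted shared element can be sketched in Python as below. This is a hedged illustration only: it assumes that the weighted index is the sum of the weights of the shared elements divided by the sum of the weights of all elements in the union, and that each element carries a single weight common to both sets. The sets and weights are hypothetical and are not the values of the example above.

```python
def weighted_jaccard(a: set, b: set, weight: dict) -> float:
    """Weights summed over the intersection, divided by weights summed over the union.

    Assumed form: each element has one weight, looked up in `weight`.
    """
    numerator = sum(weight[x] for x in a & b)
    denominator = sum(weight[x] for x in a | b)
    if denominator == 0:
        raise ValueError("undefined when the union has zero total weight")
    return numerator / denominator


# Hypothetical data: the only shared element, "r", has a large weight.
weights = {"p": 1.0, "q": 1.0, "r": 8.0, "s": 1.0, "t": 1.0}
A = {"p", "q", "r"}
B = {"r", "s", "t"}

unweighted = len(A & B) / len(A | B)     # 1 / 5
weighted = weighted_jaccard(A, B, weights)  # 8 / 12
print(unweighted, weighted)
```

With unit weights for every element, this function reduces to the basic Jaccard index.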

Observe that the weighted version of the Jaccard index is not the same as the Jaccard index adapted to multisets, as the latter case does not involve the sum of weights. However, it is possible to consider weighted multisets, in which case the Jaccard index becomes:

(20)

7 Continuous Densities and Scalar Fields

The weighted version of the Jaccard and coincidence indices discussed in the previous sections paves the way for considering also sets corresponding to densities, such as probability density functions, as well as completely generic functions and scalar fields. One of the main problems to be overcome here is that densities often have infinite support, meaning that they extend over infinite ranges in their respective space.

The problem of comparing two distributions is particularly important in many theoretical and applied areas, having motivated great interest and the proposal of several respective approaches (e.g. [13, 14]).

One interesting perspective that can be used to adapt the Jaccard and coincidence indices so as to allow comparison of densities is developed as follows.

We start by representing a generic continuous function in terms of a respective discretization with a given resolution, as illustrated in Figure 6.

Figure 6: A generic density function being discretized with resolution , so that it can be represented as a vector . The integral of the original and respective discretization are assumed to be 1, so that they are both normalized as densities.

The density becomes the vector . Now, in a vector the order of the elements is all important, but it is also possible to relax this constraint and represent the discretized function as the respective multiset:

(21)

where is the multiplicity of the element generalized to take real values. In addition, we have also assumed, for simplicity’s sake, that the discretization takes place on points, which are henceforth understood as the support of both the function and the multiset.

Now, let and correspond to the multiplicity of the two sets obtained by discretization of two density functions and assuming the same support. Though the order of the coordinates has been lost, the respective multiset Jaccard can be nevertheless obtained as:

(22)

This index is then capable of expressing the similarity between the two original densities up to the resolution . The above reasoning extends immediately to discrete probability densities, and the Dirac delta approach can be applied conveniently here.

By making , we then obtain:

(23)

where is the support of the two multisets and .

The above result extends immediately to density functions on higher dimensional domains as:

(24)

This provides a means to apply the multiset Jaccard index to continuous or discrete density functions for any number of random variables.
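For one-dimensional densities, the idea of integrating the point-wise minimum and maximum can be sketched numerically as follows. The sketch assumes that the continuous index is the integral of the minimum of the two functions divided by the integral of their maximum, approximated here by a midpoint rule on a finite interval. The Gaussian densities, the integration limits, and the function names are hypothetical choices made for illustration.

```python
import math


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def density_jaccard(f, g, lo: float, hi: float, n: int = 20000) -> float:
    """Midpoint-rule estimate of the integral of min(f, g) over the integral of max(f, g)."""
    dx = (hi - lo) / n
    inter = 0.0
    union = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        fx, gx = f(x), g(x)
        inter += min(fx, gx)
        union += max(fx, gx)
    # The common factor dx cancels in the ratio.
    return inter / union


def f(x):
    return gaussian_pdf(x, 0.0, 1.0)


def g(x):
    return gaussian_pdf(x, 1.0, 1.0)


same = density_jaccard(f, f, -8.0, 9.0)     # identical densities
shifted = density_jaccard(f, g, -8.0, 9.0)  # unit-variance Gaussians one unit apart
print(same, shifted)
```

Truncating the infinite support to a finite interval is acceptable here because both densities are negligible outside it. Densities with heavier tails would need wider limits.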

As an example, let’s consider the two density functions depicted in Figure 7(a).

Figure 7: Two probability density functions and (a), with respective intersection and union as shown in (b) and (c). This situation yields a Jaccard index equal to 0.09257. The maximum value 1 is obtained whenever the two densities are identical.

The respective intersection and union of these two densities, obtained by using the minimum and maximum operations on each pair of values, are presented in Figures 7(b) and (c), respectively. The Jaccard index, obtained by dividing the area of the intersection curve by the area of the union, yielded a value of 0.09257.

The Jaccard index can be also adapted to quantify the separation of two groups of points, or clusters. The basic idea here would be to represent each of the clusters in terms of the joint probability density and then apply the Jaccard index over them by considering the densities as the respective multiplicity of every element. This method can be applied to any number of involved features.

Though we have so far considered both densities to correspond to non-negative scalar fields with hypervolume 1, it is actually possible to employ the Jaccard and coincidence indices to quantify the similarity between any two scalar fields sharing the same domain. The same Equation 24 can be used in this case.

However, the so obtained index is no longer guaranteed to be non-negative. This property can be restored by first subtracting the smallest value of the two functions from both of them, assuming the comparison to be invariant to a joint translation of the two functions by a given offset. Another possibility, which is adopted in the following example, is to calculate all intersections (and unions), normalize all the minimum (maximum) values found, subtract the respective minimal value, and only then sum the results to get the intersection (and union) terms, which are divided one by the other so as to obtain the multiset Jaccard index.

For instance, let’s calculate the multiset Jaccard index for the functions $\cos(x)$ and $\sin(x)$ over a complete period, as illustrated in Figure 8.

Figure 8: The cosine and sine functions along a whole period can be compared by using the Jaccard or coincidence indices adapted to multisets. In this case, we get a Jaccard index equal to 0.476.

As a consequence of the symmetry of these two functions in the adopted domain, we have that the area of the intersection is identical to the area of the union multiplied by . Hence, we have a null Jaccard (and also coincidence) index value. It can be immediately realized that the comparison of two identical sines or cosines (and actually of any two identical functions) will yield 1 as result for both the Jaccard and coincidence indices.

A further example of the Jaccard index adapted to multidimensional scalar fields, namely a gray level image, also incorporating the respective scatterplot representation of the paired multiplicities is provided in Figure 9.

Figure 9: A gray level image of flowers (a) was mixed with random noise uniformly distributed between and , resulting in the noisy image shown in (b). The resulting scatterplot is depicted in (c), including the identity line defining the two regions for calculation of the scalar field intersection and union, from which a respective Jaccard index of was obtained, reflecting a relatively high similarity between the two scalar fields.

8 Joint Variations

Though the similarity between probability density functions and the joint variation (e.g. variance and Pearson correlation coefficient) between two random variables can be understood as being two related activities, they are typically understood in a somewhat separated manner.

The former approach is predominantly concerned with comparing the respective models of the associated random variables, as provided by their respective probability density functions. As implied by its own name, the concept of joint variation involves quantifying the joint tendency of two random variables to vary together, not focusing on the similarity between the two respective densities.

Yet, these two objectives are analogous, especially when joint variation is taken in a normalized manner, as when using the Pearson correlation coefficient. More specifically, this coefficient can be understood as corresponding to the covariance, provided the samples of the two sets have first been standardized. By standardization it is henceforth understood that, given a random variable $X$ with mean $\mu_X$ and standard deviation $\sigma_X$, we apply the following random variable transformation:

$$\tilde{X} = \frac{X - \mu_X}{\sigma_X} \tag{25}$$

The standardization has the effect of normalizing the dispersion of a random variable, so that its variance becomes 1 while its average becomes 0. It can also be verified that a standardized random variable will present most of its observations within the interval .

For our discussion, the interesting property allowed by the standardization of a random variable is that it becomes non-dimensional and, as such, can be directly compared to other random variables. It is this characteristic of standardized random variables, and their observations, that paves the way to quantify the similarity of probability density functions by using the Pearson correlation coefficient, which therefore becomes another interesting similarity index.

In the case of a set of observations of two standardized random variables, the Pearson correlation coefficient becomes:

(26)
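The relationship between standardization and the Pearson correlation coefficient can be checked with a short Python sketch. It assumes the population convention (division by the number of observations) for both the standard deviation and the average of products, which may differ from the normalization used in the equation above. The function names and sample values are hypothetical.

```python
import statistics


def standardize(xs: list) -> list:
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    if sigma == 0:
        raise ValueError("cannot standardize a constant sample")
    return [(x - mu) / sigma for x in xs]


def pearson_from_standardized(xs: list, ys: list) -> float:
    """Average product of the paired standardized observations."""
    zx, zy = standardize(xs), standardize(ys)
    return statistics.fmean([p * q for p, q in zip(zx, zy)])


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # increases linearly with x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # decreases linearly with x

z = standardize(x)
print(statistics.fmean(z), statistics.pvariance(z))  # close to 0 and 1
print(pearson_from_standardized(x, y_up))            # close to 1
print(pearson_from_standardized(x, y_down))          # close to -1
```

Because the standardized values are non-dimensional, the same function can compare variables measured in different units.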

When two standardized random variables are taken jointly, they define a scatterplot providing a useful illustration of the interrelationship between the two considered values. This scatterplot can be immediately understood as corresponding to the joint probability density of the two random variables, which may be kernel expanded to obtain an estimation of the respective counterpart.

Interestingly, the properties of the standardization of the sample values suggest that it may be possible to obtain an alternative joint variation index based on the Jaccard or coincidence indices. A possible method to do this involves the application of the Jaccard or coincidence indices adapted to multisets (Equation 5) on the standardized version of the observations (when dealing with samples) or densities (when considering the respective probability density models) represented as parametric Dirac delta curves.

In this case, the multiplicities correspond to the relative frequency of each joint observation within a common support. It is of particular interest to observe that the consideration of joint variation from the perspective of the Jaccard index leads to the insight that this index can be applied even when the two sets $A$ and $B$ are not directly accessible, as is the case with joint distributions, except for some specific situations such as statistical independence. All that is needed to apply the Jaccard index and all its extensions, including those proposed in this work, is a joint probability distribution on two variables. Consequently, it becomes a subject of particular importance to extend the Jaccard index to higher dimensions, and a possible approach is described in Section 9.

The resulting value should express how the two random variables tend to be related to one another. The maximum index value is obtained for cases in which the Pearson correlation coefficient equals one or minus one. The minimum value of the Jaccard or coincidence indices is bound by 0, indicating that the two random variables do not present any joint relationship.

A characteristic that differentiates the calculation of joint variation by using the above Jaccard index adaptation from the most frequently adopted Pearson correlation coefficient concerns the fact that the Jaccard index-based approach tends to be less sensitive to more distant outliers, which are known to strongly affect more traditional joint variation indices. However, if required, the Jaccard index-based approach can immediately be modified so that the Dirac deltas become weighted by their respective distance to the identity line, or in some other desired manner.

The above described method lends itself to the interesting graphical interpretation illustrated in Figure 10.

Figure 10: The two probability densities and in Figure 7 shown as a parametric curve in the respective scatterplot. In case of discrete densities, they can be represented in terms of discrete points (Dirac’s delta) related to the joint observations. Continuous densities can be expressed as respective parametric Dirac delta curves. It is also possible to assign weights to the mass distributions so obtained in the scatter plot, which may reflect the relative importance to each specific problem or the repetition of observations. The identity line, shown in salmon, partitions the scatterplot space into the two regions and . The Jaccard index corresponds to the integral of the coordinates of the points within the region or coordinates when the point is in , which corresponds to the intersection of the densities, divided by the integral of the coordinates of the points in region or coordinates of the points in region , which contains the union points. In the case of the present example, we obtain a Jaccard index equal to 0.09257.

The diagonal line corresponds to the identity function dividing the region or comprised between and into the two subregions and .

9 Multiple Sets

We have so far considered indices applied to two sets or entities. There are two basic ways in which more sets can be taken into account. The first one is simply to understand that each of the two sets $A$ and $B$ is obtained by combining several other sets through set operations. For instance, we may have and . We may write:

Observe that there is absolutely no restriction on these functions, except that they are not both empty sets.

The Jaccard index for the example above can be expressed as:

Therefore, a vast range of possible combinations of diverse sets become possible, but they will ultimately always lead to two resulting sets and to be compared by the Jaccard or coincidence indices.

There is another interesting possibility to take into account more than 2 sets, and this corresponds to extending the Jaccard index, for instance in the case involving 3 sets, as:

with . This concept can be immediately extended to any number of sets.
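One natural way to extend the index to several sets is sketched below in Python. It assumes the extended index is the size of the intersection of all sets divided by the size of the union of all sets, which may not match the exact expression intended above. The function name and the example sets are hypothetical.

```python
def jaccard_many(*collections) -> float:
    """Size of the common intersection over the size of the overall union (assumed form)."""
    sets = [set(c) for c in collections]
    if len(sets) < 2:
        raise ValueError("at least two sets are required")
    union = set.union(*sets)
    if not union:
        raise ValueError("undefined when all sets are empty")
    return len(set.intersection(*sets)) / len(union)


# Hypothetical sets sharing only the elements 3 and 4.
A = {1, 2, 3, 4}
B = {2, 3, 4, 5}
C = {3, 4, 5, 6}
print(jaccard_many(A, B, C))  # 2 shared elements out of 6 distinct ones
print(jaccard_many(A, B))     # reduces to the two-set index
```

With exactly two arguments the function coincides with the basic Jaccard index, so the two-set case is recovered as a special case.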

The extension of the interiority index becomes:

It can be verified that this extended interiority index now quantifies how much the smallest of the sets is contained in the overall intersection. However, it does not take into account how the intermediate size set relates to the mutual intersection. This can be accomplished by introducing a second interiority index as:

The two obtained interiority indices can then be combined into a single respective index as:

with .

We can now define the coincidence index extended to three sets as:

A similar development applies to more than 3 sets.

The consideration of more than 2 sets in similarity indices suggests other possible extensions of the Jaccard and coincidence indices. For instance, it becomes interesting not only to quantify the overall similarity between 3 sets, but also to develop indices capable of reflecting how these three sets are connected to one another. Consider the situation depicted in Figure 11.

Figure 11: Three sets , and characterized by sequential, or chained intersections. In the suggested approach, is taken as a candidate reference for intermediating the other two sets through a chaining relationship.

This situation suggests that the reference set intermediates the connection between the other two sets, therefore establishing a chaining relationship. The Jaccard index with 2 sets cannot cope directly with this situation.

A possible index involving three sets that can quantify the chaining between 3 sets is:

As an example, let’s consider:

It follows that:

So, we have that:

(27)

and:

(28)

From which we obtain the chaining index value of:

which provides an interesting indication of the chaining between the three sets. Observe that the above described approach assumes that one of the sets has been adopted as a reference for implementing the chaining between the other two. More generic situations can be addressed by considering successive pairwise combinations.

It should be observed that one of the intersections between the reference set and the other two sets may be large enough to bias the above index. In these situations, it is possible to incorporate an additional index specifying a minimum overlap between the reference set and each of the other two sets.

Several other analogous chaining indices involving 3 or more sets or other structures are possible, leading to complementary properties.
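The minimum-overlap check mentioned above can be illustrated with a short sketch. It does not reproduce the chaining index itself. It only assumes, hypothetically, that the overlap between the reference set and each neighbour is measured by the pairwise Jaccard index, and that the smaller of the two values serves as the additional indicator. All names and example sets are illustrative.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets that are not both empty."""
    union = a | b
    if not union:
        raise ValueError("at least one of the two sets must be non-empty")
    return len(a & b) / len(union)


def chaining_overlaps(first: set, reference: set, last: set) -> tuple[float, float, float]:
    """Pairwise overlaps of the reference set with each neighbour, and their minimum.

    Assumption: overlap is measured by the pairwise Jaccard index.
    """
    j_first = jaccard(first, reference)
    j_last = jaccard(reference, last)
    return j_first, j_last, min(j_first, j_last)


# Hypothetical unbalanced chain: strong overlap on one side, weak on the other.
first = {1, 2, 3, 4, 5}
reference = {2, 3, 4, 5, 6}
last = {6, 7, 8, 9}

j_first, j_last, weakest = chaining_overlaps(first, reference, last)
print(j_first, j_last, weakest)
```

A small value of the minimum signals that one link of the chain is weak even when the other overlap is large, which is the kind of bias the additional index is meant to expose.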

10 The Jaccard and Coincidence Indices and Modeling

By allowing several types of mathematical structures to have their relationships quantified in terms of respective indices, it becomes possible to objectively and quantitatively address a wide range of theoretical and practical problems, while also catering for the consideration of stochasticity.

In addition, the several indices discussed and suggested in this work represent a valuable resource while developing models (e.g. [15]) through the combination of datasets as described in [1].

There are therefore several possibilities for applying these indices. For instance, a new dataset can be compared to those already modeled by using the similarity indices. Also of particular interest is to identify which combinations, through set operations, between the existing datasets associated with models are more likely to account for other datasets of interest, therefore providing insights about how respective models can be identified, related, or developed.
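As a rough sketch of this idea, the code below ranks existing datasets, together with their pairwise unions and intersections, by their Jaccard similarity to a new dataset. The dataset names, their contents, and the restriction to pairwise unions and intersections are assumptions made only for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets that are not both empty."""
    union = a | b
    if not union:
        raise ValueError("at least one of the two sets must be non-empty")
    return len(a & b) / len(union)


def rank_candidates(target: set, datasets: dict[str, set]) -> list[tuple[float, str]]:
    """Score each dataset and each pairwise union/intersection against the target."""
    if not target:
        raise ValueError("the target dataset must be non-empty")
    candidates = dict(datasets)
    for (name_a, set_a), (name_b, set_b) in combinations(datasets.items(), 2):
        candidates[f"{name_a} | {name_b}"] = set_a | set_b
        candidates[f"{name_a} & {name_b}"] = set_a & set_b
    scored = [(jaccard(target, s), name) for name, s in candidates.items()]
    return sorted(scored, reverse=True)


# Hypothetical datasets already associated with models.
modeled = {"m1": {1, 2, 3}, "m2": {4, 5}, "m3": {2, 3, 9}}
new_dataset = {1, 2, 3, 4, 5}

ranking = rank_candidates(new_dataset, modeled)
print(ranking[0])  # best-matching single dataset or pairwise combination
```

In this toy case the union of two modeled datasets reproduces the new dataset exactly, suggesting that the new data could be accounted for by combining the two respective models. Richer searches (more operations, deeper combinations) follow the same pattern at a higher computational cost.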

The discussed indices are also interesting from the perspective of characterizing, developing, validating and applying pattern recognition and deep learning approaches [16, 17, 18].

11 Concluding Remarks

Relationships between the several important mathematical structures — including sets, functions, vectors, densities, and graphs — are critically important in virtually all areas where mathematics is employed. Given its interesting features, the Jaccard index has been extensively employed in a large range of scientific and technological situations. Also as a consequence of its potential, the Jaccard index has been generalized in a variety of manners.

The present work aimed at further generalizing the Jaccard index. One of the first discussed possibilities consisted in using the interiority index, capable of quantifying how much a set is contained in another, as a means to complement an identified limitation of the Jaccard index in taking into account the interiority of one set into the other. This index was then combined with the Jaccard index to yield the coincidence index, which is believed to provide a stricter quantification of the similarity between sets. The possibility to adopt the sum of multisets instead of the union was also addressed, with promising results for situations where the multiplicity of the elements has to be fully taken into account.

The possibility to apply the Jaccard and coincidence indices to continuous sets was then addressed by considering the areas of the involved regions in place of the number of elements in the involved sets. This adaptation of the Jaccard index allowed the consideration of density fields and functions, which was approached by using the Jaccard index for multisets. The potential of this generalization was then briefly illustrated with respect to probability density functions, in a comparison between the cosine and sine functions, which are not normalized and can take negative values, and in a comparison between a real-world image and a respective noisy version.
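For non-negative functions sampled on a common uniform grid, the multiset form of the Jaccard index can be sketched as the sum of pointwise minima divided by the sum of pointwise maxima. This sketch deliberately covers only the non-negative case; the treatment of functions that take negative values, mentioned above, is not reproduced. The Gaussian densities, the grid, and the function names are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def jaccard_densities(f_values: list[float], g_values: list[float]) -> float:
    """Sum of pointwise minima over sum of pointwise maxima (non-negative samples).

    On a uniform grid the sampling step cancels between numerator and denominator.
    """
    if len(f_values) != len(g_values):
        raise ValueError("both functions must be sampled on the same grid")
    if any(v < 0 for v in f_values) or any(v < 0 for v in g_values):
        raise ValueError("this sketch only handles non-negative values")
    denominator = sum(max(f, g) for f, g in zip(f_values, g_values))
    if denominator == 0:
        raise ValueError("the two functions must not both be identically zero")
    numerator = sum(min(f, g) for f, g in zip(f_values, g_values))
    return numerator / denominator


# Hypothetical example: two unit-variance normal densities with shifted means.
grid = [-6.0 + 0.01 * i for i in range(1201)]
f = [gaussian_pdf(x, 0.0, 1.0) for x in grid]
g = [gaussian_pdf(x, 1.0, 1.0) for x in grid]

similarity = jaccard_densities(f, g)
print(similarity)
```

The value approaches 1 as the two densities coincide and decreases toward 0 as their overlap vanishes.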

The intrinsic relationship between similarity indices and statistical quantifications of joint variation between random variables was approached subsequently. It has been argued not only that the Pearson correlation coefficient can be used to compare two density functions, but also that a respective adaptation of the Jaccard and coincidence indices can be used for that purpose. We also discussed the interesting possibility to visualize the action of the Jaccard and coincidence indices with respect to the division of the data into two regions defined by the identity line in the scatterplot distribution. It was shown that the Jaccard index can be intuitively understood as integrating the density mass within each of these regions, followed by the respective division of the obtained values.

The situation of similarity and other indices considering three or more sets was then discussed, identifying the possibility to consider the two sets involved in the basic Jaccard and coincidence indices as the result of set operation combinations between any number of other sets. Another important extension consisted in taking more than 2 sets as arguments of the similarity indices, which was illustrated in terms of a suggested index to quantify the chaining between three sets.

Several further works are motivated by the concepts and methods reported and suggested in this work, a complete list of which would be particularly extensive. Some of the possibilities include comparing the described indices with other indicators of similarity, identifying other types of relationships that can be quantified when considering 3 or more sets, developing analogous generalizations of other interesting indices, and extending the described indices to other mathematical structures. In addition, as observed in Section 10, similarity and other indices such as those addressed here provide valuable means for developing and evaluating models of data as well as for several pattern recognition and deep learning tasks.

Acknowledgments.

Luciano da F. Costa thanks CNPq (grant no. 307085/2018-0) and FAPESP (grant 15/22308-2).

References

  • [1] L. da F. Costa. An ample approach to modeling. Researchgate, 2019. https://www.researchgate.net/publication/355056285_An_Ample_Approach_to_Data_and_Modeling. [Online; accessed 10-Oct-2021].
  • [2] M. Brusco, J. D. Cradit, and D. Steinley. A comparison of 71 binary similarity coefficients: The effect of base rates. PLOS One, 16(4):e0247751, 2021.
  • [3] Wikipedia. Jaccard index. https://en.wikipedia.org/wiki/Jaccard_index. [Online; accessed 10-Oct-2021].
  • [4] P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société vaudoise des sciences naturelles, 37:547–549, 1901.
  • [5] Y. Yuan, M. Chao, and Y.-C. Lo. Automatic skin lesion segmentation using deep fully convolutional networks with Jaccard distance. IEEE Transactions on Medical Imaging, 36(9):1876–1886, 2017.
  • [6] L. Hamers, Y. Hemeryck, G. Herweyers, M. Janssen, H. Ketters, H. Rousseau, and A. Vanhoutte. Similarity measures in scientometric research: The Jaccard index versus Salton’s cosine formula. Information Processing and Management, 25(3):315–318, 1989.
  • [7] L. Leydesdorff. On the normalization and visualization of author co-citation data: Salton’s cosine versus the Jaccard index. Journal of the American Society for Information Science and Technology, 59(1):77–85, 2008.
  • [8] S. Park and D.-Y. Kim. Assessing language discrepancies between travelers and online travel recommendation systems: Application of the Jaccard distance score to web data mining. Technological Forecasting and Social Change, 123:381–388, 2017.
  • [9] B. K. Samanthula and W. Jiang. Secure multiset intersection cardinality and its application to Jaccard coefficient. IEEE Transactions on Dependable and Secure Computing, 13(5):591–604, 2016.
  • [10] D. Bacciu, A. Micheli, and A. Sperduti. Generative kernels for tree-structured data. IEEE Transactions on Neural Networks and Learning Systems, 29(10):4932–4946, 2018.
  • [11] J. Hein. Discrete Mathematics. Jones & Bartlett Pub., 2003.
  • [12] D. E. Knuth. The Art of Computer Programming. Addison Wesley, 1998.
  • [13] S.-H. Cha. Comprehensive survey on distance/similarity measures between probability density functions. Intl. J. Math. Models and Meths. in Appl. Sci., 1(4):300–307, 2007.
  • [14] J. D. Loudin and H. E Miettinen. A multivariate method for comparing n-dimensional distributions. In PHYSTAT2003, SLAC, 2003.
  • [15] L. da F. Costa. Modeling: The human approach to science. Researchgate, 2019. https://www.researchgate.net/publication/333389500_Modeling_The_Human_Approach_to_Science_CDT-8. [Online; accessed 1-Oct-2020].
  • [16] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
  • [17] J. Schmidhuber. Deep learning in neural networks: an overview. Neural Networks, 61:85–117, 2015.
  • [18] H. F. de Arruda, A. Benatti, C. H. Comin, and L. da F. Costa. Learning deep learning. Researchgate, 2019. https://www.researchgate.net/publication/335798012_Learning_Deep_Learning_CDT-15. [Online; accessed 22-Dec-2019].