Seed-Point Detection of Clumped Convex Objects by Short-Range Attractive Long-Range Repulsive Particle Clustering

Locating the center of convex objects is important in both image processing and unsupervised machine learning/data clustering fields. The automated analysis of biological images uses both of these fields for locating cell nuclei and for discovering new biological effects or cell phenotypes. In this work, we develop a novel clustering method for locating the centers of overlapping convex objects by modeling particles that interact by a short-range attractive and long-range repulsive potential and are confined to a potential well created from the data. We apply this method to locating the centers of clumped nuclei in cultured cells, where we show that it results in a significant improvement over existing methods (8.2 apply it to unsupervised learning on a difficult data set that has rare classes without local density maxima, and show it is able to well locate cluster centers when other clustering techniques fail.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 2

page 4

page 6

page 7

page 10

12/30/2017

Particle Clustering Machine: A Dynamical System Based Approach

Identification of the clusters from an unlabeled data set is one of the ...
12/30/2008

A New Clustering Algorithm Based Upon Flocking On Complex Network

We have proposed a model based upon flocking on a complex network, and t...
08/29/2006

Neural Network Clustering Based on Distances Between Objects

We present an algorithm of clustering of many-dimensional objects, where...
08/28/2018

Weighted total variation based convex clustering

Data clustering is a fundamental problem with a wide range of applicatio...
10/19/2016

Clustering by connection center evolution

The determination of cluster centers generally depends on the scale that...
06/18/2019

Multiple Testing Embedded in an Aggregation Tree to Identify where Two Distributions Differ

A key goal of flow cytometry data analysis is to identify the subpopulat...
03/05/2019

A Deep Learning based approach to VM behavior identification in cloud systems

Cloud computing data centers are growing in size and complexity to the p...

Code Repositories

SALR_Clustering

A method for locating the centers of partially overlapping convex objects and distributions.


view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Locating object and distribution centers is a fundamental step of both image processing and unsupervised machine learning or exploratory data mining. The automated analysis and classification of cells from microscopy images is an important field that includes both of these disciplines [1, 2, 3, 4, 5, 6]. In this field, the first step is often locating nuclei centers using image processing techniques, which allows for seed-point based segmentation algorithms[7, 8, 9, 10]

to determine the nuclei boundaries. Using the segmentation results, features of each nucleus or cell are measured and used to classify the cells into different groups (e.g. cancer/non-cancer, different phenotypes/cell-phase

[11, 12, 13, 5, 14, 15]). When searching for new groups (new phenotypes, new effects, …) or when it not possible for an expert to create a labeled data set, unsupervised machine learning, where groups are determined by looking for similarities in the data, must be used.

Locating nuclei centers and using unsupervised learning to locate cluster centers are, abstractly, very similar: they both try to locate the centers of partially-overlapping convex objects or distributions. However, the techniques for solving each of these problems are quite different and are not ideal for several reasons. The current methods for locating nuclei centers (radial voting [16, 10, 17, 18], gradient convergence and sliding band filters [19, 20], Laplacian of Gaussian filters [21, 22, 23, 24]) try to compute a surface, named the voting landscape, that has local extrema at the nuclei centers. The fundamental problem with these methods is that the extrema of the voting landscapes are not at the nuclei centers; further, there can be extra extrema, where false nuclei are detected, or missing extrema, where true nuclei are not detected. Since the segmentation methods will return one region per seed-point, it is critical that each nuclei have only one seed-point as close as possible to the nuclei center [22, 10]

. For unsupervised learning, there are many well known clustering techniques (K-means/fuzzy C-means

[25, 26, 27]

, expectation maximization/mixture of Gaussians

[28]

, hierarchical clustering

[29], mean-shift [30]), but a general drawback of most of these methods is that they produce cluster centers that are biased towards high density regions in the data, and many of them require the number of clusters (which is normally not known beforehand) as an input. Mean-shift clustering can be better in this regard; however, each cluster must have a local maximum in the point density.

In this work, we develop a new clustering method (SALR clustering) that can be used for locating both nuclei centers and the cluster centers in scatter point data, while at the same time addressing the above issues. The premise of our method, which is graphically described in Figure 1 and which takes cues from the fundamental physics of modeling classical Wigner crystals [31] and modeling the formation of clusters [32, 33], is that we can find the centers of overlapping convex regions by modeling the dynamics of particles trapped in an appropriate confining well.

Fig. 1: Premise of SALR clustering. ad, Intuition: schematic of the lowest energy configuration for a, 10 repulsive particles in a parabolic well (this is one of the two isomeric ground states[31]); b, three repulsive particles with three potential minima; c, seven repulsive particles with three potential minima; and d, seven SALR interacting particles with three potential minima. Contour levels represent potential energy. e, SALR interaction potential formed with a Gaussian attractive term and a repulsive term. The distance , named the attractive extent, is the distance where particles switch from being attractive to repulsive. f, Starting from a region of overlapping convex objects, a confining potential is created (third dimension corresponds to potential value); particles are randomly placed and the dynamics modeled; at the end the particles will form clusters located approximately at the center of each convex object.

Consider a set of repulsive particles confined to a region; these particles will try to move as far apart from each other as possible. Figure 1a shows this for 10 particles confined to a parabolic potential well. If the region the particles are confined to has several local minima, then the particles will preferentially locate as close as possible to these minima to lower the total energy. Assuming our goal is to have particles near each local minima and nowhere else, we must have the same number of particles as the number of potential minima, since two particles cannot be near each other, see Figure 1b,c. This is a problem as the number of potential minima is normally not known before hand, but it can be solved by modifying how the particles interact with each other such that particles near each other are attracted to each other instead of repulsed—this short-range attractive long-range repulsive (SALR) particle interaction potential can be seen in Figure 1e. Now, we can use (many) more particles than we expect there to be potential minima, and we can have a cluster of particles located at each potential minima, see Figure 1d.

Using this premise, we create a confining potential from the region of overlapping convex objects that has a higher energy at the region’s edges and a lower energy (valley) near the object centers, we randomly place many SALR particles in this region, and then we model the particle dynamics while slowly decreasing their speed; the final position of each cluster of particles will approximately give the center of each locally convex region, see Figure 1f and Supplemental Video 1.

We validate SALR clustering by applying it to the problem of locating nuclei centers, and show that it can significantly improve the location performance compared with previous methods and discuss how/why it is able to improve the performance. We also demonstrate the application of SALR clustering to unsupervised machine learning, and show that it can determine the correct number and position of clusters and better locate rare clusters that do not have local density maxima.

2 Modeling particle dynamics

The equations governing the particle dynamics are derived from the Hamiltonian of a set of interacting charged particles,

(1)

where , , , and represent the particle’s mass, momentum, position, and charge, respectfully, is a coupling constant, and is the distance between and . is the SALR particle interaction potential created with a Gaussian attractive term and a repulsive term

(2)

where is the distance between two of the particles and is a small constant value () to prevent the divergence at . An example of this potential is shown in Figure 1e. The values , , and in (2) are set by solving a least-squares problem so that the depth , the location of the potential minimum , and the distance at which the potential goes from being attractive to repulsive, referred to as the attractive extent, may be directly specified, see Figure 1e. For example, the correspondence between two sets of these parameters are and .

The particles’ equations of motion can be found from the Hamiltonian (1) to be

(3)
(4)

where

is the gradient operator acting on vector

.

Fig. 2: Locating nuclei centers method. 1) create a binary mask by filtering and thresholding the image. 2) create a confining potential (a 3D version of this potential is shown in Figure 1f). 3) initialize particle positions: create possible set of points (blue dots) based on the boundary curvature and choose one point per grid element of an overlaid grid (green points). 4) model the particle dynamics.

The system described so far conserves energy, which means that the particles/clusters will never stop moving. To find the lowest (or a low lying) energy state of the system, we slowly decrease the temperature of the system by damping the particles:

(5)

where is a time dependent damping constant. is not required to be time dependent, but faster convergence can be achieved by increasing the value of as the simulation goes on, e.g. .

Using (4) and (5) we can model the dynamics of the particles until the solution converges. We solve the set of differential equations using a Runge-Kutta (2,3) solver[34] (implemented with Matlab), which provides high enough tolerance to solve the problem and is faster than higher order methods. Convergence is determined by looking at the average particle speeds over a short time span (the last few time steps of the ODE solver), and if this average speed is below some threshold, then we stop the simulation. Averaging over a time span is used because the clusters of particles can have vibrational or rotational modes[35] that take a long time to die out (perhaps 30% more computation time), and using the average speed allows these modes to be effectively ignored.

In Supplemental Note 1, we make connections between our optimization method and simulated annealing and gradient descent that could lead to faster convergence.

3 Locating nuclei centers

3.1 Method

The process we developed for applying our method to locating nuclei centers is shown in Figure 2. We go through each step below, and recommend viewing Supplemental Video 1 for a visual description.

3.1.1 Filter & Threshold

After reading in the data, the (overlapping) foreground objects should be separated from the background with the goal of creating a binary mask such that for all in an object and for all not in an object. For the nuclei images used in this work, we filtered the images with a Gaussian blur () and then used an adaptive log-weighted Otsu threshold[36].

3.1.2 Construct confining potential

We use the binary mask to create a confining potential well, . Let the set of coordinates belonging to the object be denoted by . The confining potential is given by

(6)

where is the distance transform, and is the negation operator (). The distance transform takes a binary image and assigns to each pixel the Euclidean distance to the nearest non-zero pixel in the binary image. The first line of (6) states that the potential inside the object is given by 1 over the distance to the nearest object boundary pixel, and the second line states that the confining potential increases quadratically outside of the object. Note that the potential is inside the object and outside. After forming , it is smoothed with a Gaussian with pixel.

This confining potential is not scale invariant: as a object becomes larger, the potential will become more flat-bottomed, which will cause the particles to push each other closer to the object boundary, as is shown in Supplemental Figure 1a. Scale invariance can be added by implicitly scaling all objects so that their maximum distance transform values, , are equal to the same fixed value, (see Supplemental Figure 1b). This changes the confining potential on the interior of the object to be

(7)

Note that for a single ellipsoidal object, represents the semi-minor axis of the object; scaling the distance transform effectively means we are resizing each object to have the same semi-minor axis, .

3.1.3 Particle initialization

The next step is initializing the particles: determine the density/number of particles to use, set their mass and charge, and set their initial velocities and positions.

Number: The number of particles is determined by setting the particle density. The parameter we use to set the particle density is the Wigner-Seitz radius , which describes the amount of space taken up by each particle. The number of particles , in 2D, is then approximately

(8)

where is the area of the object (the number of elements in set ) and is the amount of space per particle.

Mass & charge: We set the mass of the particles to one and the particle charge to decrease with as . Larger values of result in particle clusters that interact with each other less and can move around more, smaller values of result in particle clusters that strongly interact and push each other closer to the object boundary.

Velocity: The particles are given random initial velocities according to , were is the initial speed and is a random unit vector.

Position: There are several possible methods of selecting the particles’ initial positions; we describe three methods which would each have their own benefit depending on the data.

i) Random: The most simple method, choose random points out of the set .

ii) Uniform random: To ensure we can find all of the seed-points, the particles’ initial locations should uniformly cover the domain of interest (). This can be accomplished by overlaying a lattice across the domain and placing one particle in each lattice unit cell. In order to have approximately particles, the lattice constant is chosen such that the lattice unit cell’s area is approximately the size of particle, in 2D. In 2D, we use a hexagonal lattice with lattice constant , and in higher dimensions we use a simple (hyper)cubic lattice. The location of the particle within each lattice cell is selected by choosing one random point from the points in that are also in each lattice cell, , where is the set of all coordinates in the ’th lattice cell.

iii) Convex hull transformed center of curvature (CvxHll CoC): Better results can be obtained if we can choose points that are approximately where the true seed-points are expected. For this, we compute a set of possible positions, , using the centers of curvature of each boundary vertex of the region , see Supplemental Note 2. For each lattice cell, we then choose one random point from the set, . Note that the centers of curvature can be far from the geometric center of elliptical regions since the long side of an ellipse has a smaller curvature (larger radius). This is addressed by modifying the boundary curvature to be the convex hull of each negative curvature region (See Supplemental Figure 2 for an example of this method.)

As a final note on position initialization, the confining potential can be used to tighten the constraint on where particles can be located. At the least, we require all initial positions to have a confining potential value less than 1 (the particles must start inside the object); however, we can further require that the confining potential value at all initial positions be in some range , which can be useful for both improving control over initial positions and reducing the number of particles used in the simulation.

3.1.4 Model particle dynamics

The particles are modeled using (4) and (5) until the solution converges. Box 4 in Figure 2a shows the particle positions after convergence; you can see clusters of particles are located at each of the nuclei centers. After the model has converged, we extract the center of each cluster by grouping together all particles within some distance of each other and then finding the mean location of each group. (This distance is empirically set and could be any distance and .) These cluster centers are the final seed-points returned by our method.

3.1.5 Reduction of work

It is not necessary to model the particle dynamics on all objects; if an object is convex, or it is smaller than a particle (), or only one particle would be modeled, then we do not run the simulation, but simply return the centroid of the object as the seed-point. We determine if an object is convex by testing that the all boundary curvature is less than some threshold, .

3.2 Results

Fig. 3: Example nuclei clumps. Examples of the nuclei clumps used for validation; the contrast for each object has been enhanced. The red dots represent the labeled nuclei centers, and the edge of the markers represent our expected confidence in their location ( pixel radius).
Fig. 4: Comparison metrics. a, Definition of true positive and false positive calculated seed-points and false negative seed-points. b, Example showing the calculation of the fractional distribution measure for three objects: one with the correct number (left), one with one too few (middle), and one with one too many (right) computed seed-points.

The data we use for validating our method is a set of 2,420 images of nuclei clumps with a total of 7,789 nuclei. The cells are epithelial cancer cells from the tongue (squamous cell carcinoma, SCC25) that we cultured, stained with 4’,6-diamidino-2-phenylindole (DAPI) to label their DNA, and then had imaged using a whole slide fluorescence reader, see Supplemental Note 3 for more details. The center of each of these 7,789 nuclei were manually labeled to create the truth data. Figure 3 shows an example of nine nuclei clumps with the labeled nuclei centers (more examples can be seen in Supplemental Figure 3).

We use two metrics to determine the performance of the computed nuclei center locations. The first is the

score, which ranges between 0 and 1 and is the harmonic mean of the precision and recall; a

value of 1 means that all seed-points (nuclei centers) were calculated perfectly, and a value of 0 means not a single seed-point was calculated correctly. It is computed as

(9)

where , , and are the number of true positives, false negatives, and false positives, respectively. A graphical definition of these terms is shown in Figure 4a: a calculated seed-point is if it is within a distance of to a truth point and it is the closest calculated seed-point to that truth point, otherwise it is ; a point is a truth point that does not have a calculated seed-point assigned to it. For example, the rightmost red X in Figure 4a is because there is another computed seed-point closer to the truth location than it.

The second metric we use measures the fractional distribution, , of objects (nuclei clumps) with the correct, too many, or too few calculated seed-points. For example, if , then 10% of all the objects have one less calculated seed-point than the true number. This metric is important as it directly measures the ability of a method to calculate the correct number of seed-points, even if the seed-points are not in the correct location. The fractional distribution of is given by

(10)

where is the number of objects (2,420), is the Kronecker delta, and and are the number of false positives and the number of false negatives for the ’th object. An example calculating is shown in Figure 4b. Depending on how the seed-points will be used, either or could be more important.

Parameter name Symbol Default value
Particle initialization CvxHll CoC
Wigner-Seitz radius 5
Confining potential depth 18
attractive extent 13
minimum location 2
depth -1
Max init. particle potential 1/5
Charge normalization 1/3
Damping rate
Initial speed 0.01
Mass m 1
Coupling constant k 1
TABLE I: Model parameter values used for our method in computing the results of Figure 5.

Using these two metrics, we compare the results of our method to seven other standard and leading methods for locating nuclei centers: multi-pass voting (MPV) by [16], two versions of single-pass voting by [10] (SPV) and by [18] (SPV), sliding-band filter (SBF) by [19], generalized Laplacian of Gaussian (gLoG) by [24], maximally stable extremal regions (MSER) by [37], and directly using the local maximum of the distance transform (Dist. Trans.). We optimized the parameters of each method to maximize the sum of and ; descriptions of the methods along with details on the parameters we optimized over and the final parameters used for each method can be seen in Supplemental Note 4. The parameters used for our method are shown in Table I. Many of these parameters are set empirically, and several were set by explicit optimization and are discussed in Supplemental Note 5.

Fig. 5: Methods comparison. a, versus for the seven methods. b, for for the seven methods.

We show the results of the comparison in Figure 5. Our method (SALR Clstr.) has both the best score (0.069 higher, 8.2%, than the next best method at and 0.028 higher, 3.0%, than the next best average score) and the best value (0.033 higher, 3.8%, than the next best method).

Fig. 6: Advanced filtering. a, Filter & threshold step of method: convolve a gLoG filter bank with an image to create a voting landscape, then threshold to create the binary mask. b, score using the previous gLoG method (shown in Fig. 5), using the entire gLoG filter bank (gLoG), and using only the symmetric gLoG filter (gLoG). c, results for the same three methods.

In particular, note that our method does much better than the distance transform; this is important because our confining potential is proportional to one over the distance transform. This means our method is not simply locating the local maxima of the distance transform, see Discussion.

Additional validation of our method’s scale invariance as well as the performance and computation time dependence on the initial particles starting locations and density are discussed/shown in Supplemental Note 6/Supplemental Figure 5 and Supplemental Note 7/Supplemental Figure 6.

3.3 Advanced filtering

The previous methods we compared to all use a voting landscape to determine the nuclei centers. We surmised that thresholding their voting landscape to create a binary mask and then using our method, instead of using the local maxima of the voting landscape, could improve the performance of the methods. We tried this using LoG filtering as it is likely the most easy to implement as well as the fastest.

In Figure 6a we show the Filter & Threshold step using gLoG. A gLoG filter bank is created that consists of a symmetric multiscale LoG filter (mLoG) and several asymmetric multiscale LoG filters. The voting landscape is formed by summing the convolution of each of these filters with the input image, and the binary mask is created with a simple threshold. In Figure 6b and 6c we show the and results of the original gLoG method (the same results as shown in Figure 5) and the results of using gLoG as the filter for our method (gLoG). You can see that using gLoG with our method results in a significant improvement in , by , and a large improvement in , by . Additionally, using only the symmetric mLoG as the filter for our method (mLoG) results in even better performance than the gLoG, with about a 2% increase in .

These results are important because much research does not use images where the nuclei can be easily segmented by thresholding (like the cultured cells in this work), but use colored histopathological (tissue) images, which are not easily thresholded. Our results show that the nuclei regions can be detected using gLoG filters and then accurate nuclei center locations and accurate nuclei count can be obtained by using our method.

4 Application to data clustering

In unsupervised machine learning and data mining, data often comes in the form of scatter point data where each point is an observation and each dimension is a different measured quantity. An example of 2D scatter point data is shown in the top of Figure 7a. The easiest way to apply our seed-point detection method to this type of data is by binning the data, where a grid is overlaid on the data and the number of points in each bin is counted. The result, shown in the bottom of Figure 7a, can be thought of as an image, and we can threshold it to create a binary mask and then proceed as before.

4.1 Simple 3D data

Fig. 7: SALR clustering for scatter point data. a, Example showing how scatter point data is binned. b, Isosurface plot of the point count in our 3D example; the axis planes show the projections of the isosurfaces to 2D. c, The confining potential created by thresholding the point count at 5 and then using (7). The black points are the results of running our seed-point calculation method 20 times, and the red points are the clusters formed from these results. The 2D view of the x-y plane shows the results of repeating this process 20 times. d, The red points show the result of running k-means clustering 20 times using 6, 7, and 8 clusters.

We first validate the use of our method for detecting the cluster centers of scatter point data by using a simple 3D distribution where we can use the well known k-means clustering. The point-count isosurfaces of the 3D distribution can be seen along with the projection of the isosurfaces to the 2D axis planes in Figure 7b. We first scale the z-axis so that the objects are approximately the same size along any direction (see Supplemental Note 8), we then threshold the point-count and create the confining potential, which is shown in Figure 7c.

We initialize particles for the simulation using a uniform random distribution with (resulting in 79 particles); all parameters other than particle density are the same as we used in the 2D images above (values shown in Table I). We simulated the particles 20 times (each time with new initial positions) and show the computed seed-points from each iteration as the black points in the 3D view of Figure 7c. The seed-points are well located in seven primary locations; the few points not at the primary locations can be reduced by tuning the particle damping rate (Supplemental Note 9); however, it is computationally faster to simply cluster the results of these 20 iterations and take all clusters with more than a few points as the final seed-points, which are shown as the large red dots in the 3D view. This process of clustering the results of several iterations is very robust; the 2D view in Figure 7c shows the results of repeating this process 20 times, and you can see the computed seed-points are all in the same location.

We check the validity of our seven seed-points by running k-means clustering with two replicates (as implemented by Matlab’s k-means++ routine) using 6, 7, and 8 clusters. We ran the clustering 20 times in each case, and show the results in Figure 7d. The stability of the results using 7 clusters (all 20 iterations produce the same seven positions), as compared with 6 or 8, imply that this 3D distribution does have seven clusters; and, the strong agreement in seed-point location between the k-means results and the results of our method imply that these are likely the correct locations.

These results indicate that our method can determine the correct number of seed-points and their correct location, and, in particular, could be a good choice for locating the centers of nuclei in 3D images, which this 3D distribution resembles.

Fig. 8: SALR clustering for real data. a, projections of the 5D point count down to dimensions 1, 3, and 4 and dimensions 1, 2, and 5. Black points are the results from each of the 5 iterations, red points are the final seed-points. Arrows label the three low density regions. b, results using k-means with 4, 5, and 6 clusters. The black arrow labels the one seed-point that approximately locates a low-density region; the circles show the two missing seed-points.

4.2 More complex 5D data

We next turn to a real data set with five dimensions and 2.7 million points; each point represents one cell’ nucleus (the same SCC25 cells as above) and the five features give a measure of the damage done to the nuclei. The data is scaled in each dimension by its standard deviation and then binned (about 33 bins in each dimension). Figure 

8a shows point-count isosurfaces for two (of the ten possible) 3D projections of the 5D data; the data has one very dense region in the center (which are undamaged nuclei) and maybe three long/thin low-density regions extending outwards from it. The long/thin regions mean that the distance transform cannot be used to create the confining potential (Supplemental Note 8). Instead, we use one over the point density for the confining potential; however, we must scale the magnitude of the potential gradient so that it is approximately the same order of magnitude as the particle repulsion (see Discussion below). In addition to this change, we introduce two generalizations to our method.

  • [leftmargin=*]

  • Distance metric: In higher dimensions, we found that we can get more stable results using the Minkowski distance metric where the distance between two points, and , is defined as

    (11)

    where is the dimension of the space and . If we set then we have standard Euclidean distance; and if we set , then , which, in our case, would mean that two particles are only attracted to each other if they are within in all dimensions. (Note that, in general, any distance metric can be used as long as the interaction potential is correctly defined.)

  • Isotropic scaling: We make a distinction between the data space (the N-D space the data is defined in) and the solver space (the N-D space we solve the equations of motion). We map between the two spaces with a simple scale factor defined to be , where is the attractive extent in data space and is the characteristic distance of the solver space, which, for this paper, can be directly interpreted as the attractive extent in the solver space.

    Introducing the solver space has two primary benefits: 1) If the attractive extent in data space should be , then it is not possible to accurately solve for the interaction potential parameters , , and (assuming and ). The solver space lets us scale up the problem to use an attractive extent of at which we can accurately solve for the parameters. 2) Changing the value of can precisely control the magnitude of the particle interaction (this could also be achieved using the coupling constant, , if one wanted).

    In Supplemental Figure 8, we show how the interaction potential and force in data space change as goes from 10 to 50. The interaction force is strongest at and decreases as increases.

We model particles five times (small black points in Figure 8a) and cluster the results (red points in Figure 8a). You can see that our method located the three low density extrusions (labeled by black arrows) very well, and our method even finds the very small, but distinct, cluster marked by the circle in right plot of Figure 8a. (Supplemental Video 2 shows an animation of the particle dynamics along with the comparison to k-means with 5 clusters.) The parameter values used in these simulations were as follows; uniform random distribution, , , , , , , , and the potential force was scaled so that the 99% value is 0.4. (The potential bound requirements helps to reduce the number of particles, which is 147, and to position the particles near the outskirts of the region, which is better, in general.)

As means of comparison, we used k-means clustering with two repetitions for 4, 5, and 6 clusters; we repeated this 20 times and show the results in Figure 8b

. The results using 4 or 5 clusters are stable (the clusters are in the same place in all 20 iterations), but are only able to approximately locate one of the three low density regions, while missing two of them (labeled by the circles). The reason k-means does not locate the low density regions well is that the cluster centers it finds are the locations which minimize the variance in each cluster, and the variance can be minimized by moving the cluster centers closer to the high density regions. (The same/similar reasoning also applies for fuzzy c-means or mixture of Gaussians.) This demonstrates an important feature of our method: our method can locate the center of low density regions (also called rare classes) even when the low density regions do not have local maximum.

Looking at the computation time, the k-means with 5 clusters, which was the fastest, takes sec per repetition; our method takes sec per iteration with sec overhead for binning, smoothing, and computing the confining potential gradient. (Note each repetition of k-means and each iteration of our method can run in parallel. Computer specifications are shown below.) Thus, our method can lead to an important speed increase for some data sets.

5 Discussion

It is interesting to consider how, and under what conditions, SALR clustering leads to these improvements. The answer comes by analyzing the force on the particles in the simulation: the confining potential exerts a force on the particles with a magnitude of (which is the slope of the confining potential in Figure 1f), and the particles repel each other with a force of about . If the force from the confining potential is much larger than the repulsive force, , then the particles will only find the local minimum of the confining potential—this would lead to the results labeled Dist. Trans. in Figure 5, and is the same thing that the previous methods do in finding the local extrema of the voting landscape or fining the local maxima of the point density (mean-shift). Alternatively, if the repulsive force is much larger than the confining force, , then the particles will ignore the confining potential minima and spread out—this would result in particle clusters separated by at least a distance that uniformly cover the region. In between these two regimes, when the two forces are approximately the same order of magnitude, , there is an interaction between the confining potential and the particle repulsion, and this is the regime we operate in and that leads to SALR clustering’s improved performance. (When using the distance transform to create the confining potential in our images, our confining force is , where is the distance to the nearest region boundary, and the repulsive force between the particles is also , where is the distance between particles.) This need for the forces to be approximately equal is why it is necessary to scale the magnitude of the potential gradient when the inverse point density is used as the confining potential.

Our method can be applied to even higher dimensional data sets, but it likely will not be possible to bin the data as we did above. Binning the data requires a large amount of space (the 5D data set, 37x30x28x34x36, takes MB as single precision), and is not necessary when the density is used as the confining potential. The benefit of binning the data is computational speed, as it allows us to pre-compute the density and its gradient everywhere. If the data cannot be binned, then the gradient of the density will need to be calculated (using a nearest neighbors range search) while solving the differential equations; this could be quite slow. Additionally, the magnitude of the gradient should be scaled; thus, before modeling the particle dynamics, a random sampling of the gradient magnitude will need to be performed so that the scale factor can be determined.

There are still many aspects of SALR clustering that are missing or can be improved and expanded upon, such as using an asymmetric particle interaction, developing a performance metric, implementing a clustering method (where each scatter point is placed into a specific cluster), and using multiple confining potentials and types of particles. These are each briefly discussed in Supplemental Note 10.

6 Conclusion

SALR clustering can represent a significant improvement in locating the centers of overlapping convex objects: it locates the correct number of nuclei more often and the nuclei centers more accurately than standard and leading methods; it can significantly improve the performance of previous methods; and it is able to determine, not only the number of clusters, but the correct position of the cluster centers in data clustering while not requiring a cluster to have a local density maximum.

Acknowledgments

The authors would like to acknowledge Guillermina Garcia from Sanford-Burnham Medical Research Institute for providing the whole slide fluorescence imaging. J.K. acknowledges Prof. Sylwia Ptasinska for the opportunity and resources to work on this project. J.K. and X.H. acknowledge the support from the US Department of Energy Office of Science, Office of Basic Energy Sciences, under Award Number DE-FC02-04ER15533. This is contribution number NDRL 5175 from the Notre Dame Radiation Laboratory.

Computer specifications

Computation times related to locating the nuclei centers are based on using a 64-bit i7-4790 CPU 3.60 GHz with 32.0 GB ram. Computations times reported for the 5D data set are based on using a 64-bit i7-3770 CPU 3.40 GHz with 12.0 GB ram.

Code & data availability

The SALR clustering code (written in Matlab) and documentation, as well as all images, truth data, and scatter point data used in this work, are in Supplementary Software and Data and available on GitHub (https://github.com/jkpld/SALR_Clustering). In particular, we include example scripts that directly reproduce the results of this work.

Author contributions

J.K. conceived the project, implemented the method, created validation truth data, analyzed the results, and wrote the manuscript. X.H. performed all tasks relating to the cell work (cell culture, cell experiments, staining the cells, having them imaged) and provided helpful discussion throughout the course of the project. D.M. provided valuable support and feedback in elevating the project and improving the manuscript.

References

  • [1] F. Xing and L. Yang, “Robust Nucleus/Cell Detection and Segmentation in Digital Pathology and Microscopy Images: A Comprehensive Review,” IEEE Reviews in Biomedical Engineering, vol. 3333, no. c, pp. 1–1, 2016.
  • [2] H. Irshad, A. Veillard, L. Roux, and D. Racoceanu, “Methods for Nuclei Detection, Segmentation and Classification in Digital Histopathology: A Review. Current Status and Future Potential,” IEEE Reviews in Biomedical Engineering, vol. PP, no. 99, pp. 1–1, 2013.
  • [3] T. J. Fuchs and J. M. Buhmann, “Computational pathology: Challenges and promises for tissue analysis,” Computerized Medical Imaging and Graphics, vol. 35, no. 7-8, pp. 515–530, 2011.
  • [4] S. Kothari, J. H. Phan, T. H. Stokes, and M. D. Wang, “Pathology imaging informatics for quantitative analysis of whole-slide images.” Journal of the American Medical Informatics Association : JAMIA, vol. 20, no. 6, pp. 1099–108, 2013.
  • [5] C. Sommer and D. W. Gerlich, “Machine learning in cell biology – teaching computers to recognize phenotypes,” Journal of Cell Sciience, vol. 126, no. Pt 24, pp. 5529–5539, 2013.
  • [6]

    P. J. Navarro, F. Pérez, J. Weiss, and M. Egea-Cortines, “Machine learning and computer vision system for phenotype data acquisition and analysis in plants,”

    Sensors (Switzerland), vol. 16, no. 5, 2016.
  • [7] L. D. Cohen, “Contour Models,” CVGIP: Graphical Models and Image Processing, vol. 53, no. 2, pp. 211–218, 1991.
  • [8] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: Active contour models,” International Journal of Computer Vision, vol. 1, no. 4, pp. 321–331, 1988.
  • [9] C. Zimmer, E. Labruyère, V. Meas-Yedid, N. Guillén, and J. C. Olivo-Marin, “Segmentation and tracking of migrating cells in videomicroscopy with parametric active contours: A tool for cell-based drug testing,” IEEE Transactions on Medical Imaging, vol. 21, no. 10, pp. 1212–1221, 2002.
  • [10] X. Qi, F. Xing, D. J. Foran, and L. Yang, “Robust segmentation of overlapping cells in histopathology specimens using parallel seed detection and repulsive level set,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 3, pp. 754–765, 2012.
  • [11] B. T. Grys, D. S. Lo, N. Sahin, O. Z. Kraus, Q. Morris, C. Boone, and B. J. Andrews, “Machine learning and computer vision approaches for phenotypic profiling,” Journal of Cell Biology, vol. 216, no. 1, pp. 65–71, 2017.
  • [12] T. Blasi, H. Hennig, H. D. Summers, F. J. Theis, J. Cerveira, J. O. Patterson, D. Davies, A. Filby, A. E. Carpenter, and P. Rees, “Label-free cell cycle analysis for high-throughput imaging flow cytometry,” Nature Communications, vol. 7, p. 10256, 2016.
  • [13] D. Chen, S. Sarkar, J. Candia, S. J. Florczyk, S. Bodhak, M. K. Driscoll, C. G. Simon, J. P. Dunkers, and W. Losert, “Machine learning based methodology to identify cell shape phenotypes associated with microenvironmental cues,” Biomaterials, vol. 104, pp. 104–118, 2016.
  • [14] Q. Zhong, A. G. Busetto, J. P. Fededa, J. M. Buhmann, and D. W. Gerlich, “Unsupervised modeling of cell morphology dynamics for time-lapse microscopy,” Nature Methods, vol. 9, no. 7, pp. 711–713, 2012.
  • [15] J. Wang, X. Zhou, P. L. Bradley, S.-F. Chang, N. Perrimon, and S. T. Wong, “Cellular Phenotype Recognition for High-Content RNA Interference Genome-Wide Screening,” Journal of Biomolecular Screening, vol. 13, no. 1, pp. 29–39, 2007.
  • [16] B. Parvin, Q. Yang, J. Han, H. Chang, B. Rydberg, and M. H. Barcellos-Hoff, “Iterative voting for inference of structural saliency and characterization of subcellular events,” IEEE Transactions on Image Processing, vol. 16, no. 3, pp. 615–623, 2007.
  • [17] X. Zhang, H. Su, L. Yang, and S. Zhang, “Fine-grained histopathological image analysis via robust segmentation and large-scale retrieval,”

    Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

    , vol. 07-12-June, pp. 5361–5368, 2015.
  • [18] H. Xu, C. Lu, and M. Mandal, “An Efficient Technique for Nuclei Segmentation Based on Ellipse Descriptor Analysis and Improved Seed Detection Algorithm,” IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 5, pp. 1729–1741, 2014.
  • [19] P. Quelhas, M. Marcuzzo, A. M. Mendonça, and A. Campilho, “Cell nuclei and cytoplasm joint segmentation using the sliding band filter,” IEEE Transactions on Medical Imaging, vol. 29, no. 8, pp. 1463–1473, 2010.
  • [20] T. Esteves, P. Quelhas, A. M. Mendonça, and A. Campilho, “Gradient convergence filters and a phase congruency approach for in vivo cell nuclei detection,” Machine Vision and Applications, vol. 23, no. 4, pp. 623–638, 2012.
  • [21] T. Lindeberg, “Feature Detection with Automatic Scale Selection,” International Journal of Computer Vision, vol. 30, no. 2, pp. 79 – 116, 1998.
  • [22] Y. Al-Kofahi, W. Lassoued, W. Lee, and B. Roysam, “Improved automatic detection and segmentation of cell nuclei in histopathology images,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 4, pp. 841–852, 2010.
  • [23] H. Kong, H. C. Akakin, and S. E. Sarma, “A generalized laplacian of gaussian filter for blob detection and its applications,” IEEE Transactions on Cybernetics, vol. 43, no. 6, pp. 1719–1733, 2013.
  • [24] H. Xu, C. Lu, R. Berendt, N. Jha, M. Mandal, and S. Member, “Automatic Nuclei Detection based on Generalized Laplacian of Gaussian Filters,” Journal of Biomedical and Health Informatics, vol. XX, no. XX, pp. 1–12, 2016.
  • [25] J. MacQueen, “Some methods for classification and analysis of multivariate observations,”

    Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability

    , vol. 1, no. 233, pp. 281–297, 1967.
  • [26] J. C. Dunn, “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters,” Journal of Cybernetics, vol. 3, no. 3, pp. 32–57, 1973.
  • [27] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, 1981.
  • [28] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, 1977, vol. 39, no. 1.
  • [29] S. C. Johnson, “Hierarchical clustering schemes,” Psychometrika, vol. 32, no. 3, pp. 241–254, 1967.
  • [30] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.
  • [31] F. Bolton and U. Rössler, “Classical model of a Wigner crystal in a quantum dot,” Superlattices and Microstructures, vol. 13, no. 2, p. 139, 1993.
  • [32] S. Mossa, F. Sciortino, P. Tartaglia, and E. Zaccarelli, “Ground-state clusters for short-range attractive and long-range repulsive potentials,” Langmuir, vol. 20, no. 24, pp. 10 756–10 763, 2004.
  • [33] F. Sciortino, S. Mossa, E. Zaccarelli, and P. Tartaglia, “Equilibrium cluster phases and low-density arrested disordered states: The role of short-range attraction and long-range repulsion,” Physical Review Letters, vol. 93, no. 5, pp. 5–8, 2004.
  • [34] P. Bogacki and L. Shampine, “A 3(2) Pair of Runge-Kutta Formulas,” Appl Math Lett, vol. 2, no. 4, pp. 321–325, 1989.
  • [35] S. A. Blundell and S. Chacko, “Excited states of incipient Wigner molecules,” Physical Review B, vol. 83, no. 19, p. 195444, 2011.
  • [36] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE transactions on systems, man, and cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
  • [37] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust Wide Baseline Stereo from Maximally Stable Extremal,” In British Machine Vision Conference, pp. 384–393, 2002.