
svcR: An R Package for Support Vector Clustering improved with Geometric Hashing applied to Lexical Pattern Discovery

04/23/2015
by   Nicolas Turenne, et al.

We present a new R package which takes a numerical matrix as data input and computes clusters using a support vector clustering (SVC) method. We have implemented an original 2-D grid labeling approach to speed up cluster extraction; in this sense, SVC can be seen as an efficient cluster-extraction method when clusters are separable in a 2-D map. Secondly, we show that this SVC approach, used with a Jaccard-radial base kernel, can classify a set of terms into ontological classes well enough to help define regular expression rules for information extraction in documents; our case study concerns a set of terms and documents about developmental and molecular biology.


1 Introduction

Mining text archives is a great challenge: many documents are available and their number grows as fast as the capacity of computer storage. Building domain rules for knowledge extraction requires efficient features with low semantic ambiguity. This is not an easy task, and we address it by representing linguistic expressions (i.e. terms) as feature vectors and using a scalable density-based distance to cluster the terms.
The first idea for our problem concerns the choice of a density-based method and the improvement of its scalability. Clustering can be a useful knowledge-poor technique to induce organization into scattered data [20]. Non-parametric methods such as support vector machines are interesting for analyzing noisy data through density processing. [1, 37, 18] proposed an unsupervised support vector algorithm, based on a radial kernel, that encloses data clusters with contours. Diverse applications have been tested: novelty detection [11, 25], rule extraction [44], deoxyribonucleic acid (DNA) and chemical compounds [3, 12], or image processing [6]. The method assigning points to contours and related clusters tests adjacent points between each pair and is time-consuming. Some studies [32, 34] have proposed ways to speed up the method; in particular [43] and [26] proposed improved methods to label clusters, i.e. to assign points to clusters by graph analysis.
We present a new robust method based on the computation of a hash function through surrounding points, working with a grid that we map to the data using a k-nearest-neighbor method. We developed this clustering method under the R platform [40] as a package called svcR, and we compared our approach to others, especially graph-based ones, on the Iris dataset (svcR is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=svcR). [23] also developed a support vector method for clustering, but in a divisive way, iteratively searching for a classical hyperplane separator based on a classical support vector machine. Its first step tries to separate the data set from a set built artificially in the same attribute space and with the same size as the data set, which is not theoretically justified; it seems that if few classes are present (2 or 3) and few attributes describe the data, the algorithm finds groups, whereas in other cases it tends to find a number of final clusters equal to the number of iterations.


The second idea presented in this paper concerns an original usage of the support vector clustering (SVC) methodology cited above for solving a certain form of ambiguity in natural language. Information retrieval [36, 8] and information extraction [24, 39] are key methodologies to retrieve information from text archives. But simple keywords may have several senses, so the assignment of terms to conceptual classes is important [14, 17, 35, 15, 13]. Clustering may be used to reduce the number of variables to take into account in rules for information mining in documents. We base our assumption on two lines of work: firstly, that collocation analysis is useful to understand morphological structure and its link to a conceptual space [16, 38]; secondly, that clustering is a good approach to build semantic classes with the help of a similarity distance [33, 31]. Clustering linguistic terms can thus help to find common features with which to build rules for information mining in document archives. Classifying a set of terms requires representing the data as morphological information vectors (the terms themselves, or parts of terms, and in which way?) and determining which kernel has to be used to achieve SVC. We use whole terms, morphological primitives and bigrams as morphological information, and we use the Levenshtein distance, the Jaccard similarity index, the radial basis function, and combined Levenshtein-radial or Jaccard-radial kernels to study the clustering effect.


In Section 2, we introduce the methodology of support vector clustering. Section 3 presents the labeling approach and Section 4 studies vector representations and different kernels for term clustering. Finally, Section 5 compares the technique with other approaches.

2 Support vector clustering

In this section, we recall the clustering approach.

2.1 Kernel trick

In the supervised setting we know a priori the classes of the items and we search for a linear frontier in a higher-dimensional space. For that, the data are transformed using a kernel function (a dot product). Preprocess the data with a feature map

Φ : X → H,  x ↦ Φ(x),   (1)

where ⟨·,·⟩ is the dot product of the space H (a Hilbert space), and learn the mapping from Φ(x_i) to the class y_i:

(Φ(x_1), y_1), …, (Φ(x_n), y_n).   (2)

The kernel computes the dot product in H directly from the input points:

k(x, x′) = ⟨Φ(x), Φ(x′)⟩.   (3)

Usually a Gaussian (radial basis) kernel is used, k(x, x′) = exp(−q ‖x − x′‖²).
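As a small illustration (independent of the svcR internals), the Gaussian kernel matrix of a numeric data set can be computed in a few lines of R; the function name and the value q = 40 are only examples:

# Minimal sketch: Gaussian (RBF) kernel matrix k(x, x') = exp(-q * ||x - x'||^2).
rbf_kernel_matrix <- function(X, q = 1) {
  D2 <- as.matrix(dist(X))^2    # squared Euclidean distances between rows of X
  exp(-q * D2)
}

data("iris")
K <- rbf_kernel_matrix(as.matrix(iris[, 1:4]), q = 40)
dim(K)    # 150 x 150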

2.2 Optimization

At one extreme, the distribution of data in the scope of unsupervised learning can be interpreted as density estimation. In our case, however, the approach estimates quantiles of a distribution, not its density. In SVC, we determine support vectors to delimit the distribution of points, and the goal is to find the minimal sphere which surrounds the data. One can show that if Φ(x_1), …, Φ(x_n) is separable from the origin in H, then the solution of the margin minimization between the two classes corresponds to the normal vector of the hyperplane separating the data from the origin with maximum margin.

In our case we try to encapsulate the data in a ball. The points inside the ball represent the data to classify (first class) and the origin represents the second class. The primal problem is written as follows. Let a be the (non-fixed) center of the ball, R its radius and C a fixed penalty constant controlling the number of data points allowed to lie outside the ball. Let us minimize

R² + C Σ_j ξ_j   (4)

under the constraints

‖Φ(x_j) − a‖² ≤ R² + ξ_j,  ξ_j ≥ 0,   (5)

where a is the center of the ball. The dual problem (for a convex problem) is obtained from the Lagrangian

L = R² + C Σ_j ξ_j − Σ_j β_j (R² + ξ_j − ‖Φ(x_j) − a‖²) − Σ_j μ_j ξ_j,

where β_j ≥ 0 and μ_j ≥ 0 are Lagrange multipliers; see [1, 37, 18] for details about the computation of the β_j. The multipliers are used to compute a (i.e. a = Σ_j β_j Φ(x_j)) and R.

If x_i is a support vector, the radius is

R² = K(x_i, x_i) − 2 Σ_j β_j K(x_j, x_i) + Σ_{j,l} β_j β_l K(x_j, x_l).   (6)

For any point x the squared distance from the center is

R²(x) = K(x, x) − 2 Σ_j β_j K(x_j, x) + Σ_{j,l} β_j β_l K(x_j, x_l).   (7)

Hence it is possible to test whether x lies inside the sphere or not by comparing R²(x) and R².
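Given the kernel matrix and the Lagrange multipliers β, Equation 7 can be evaluated directly. The following is a minimal sketch in R, independent of how svcR implements it:

# Squared distance R^2(x_i) of every training point from the sphere centre,
# computed from the kernel matrix K and the multiplier vector beta (Equation 7).
sphere_distance2 <- function(K, beta) {
  diag(K) - 2 * as.vector(K %*% beta) + as.numeric(t(beta) %*% K %*% beta)
}

# A point x_i lies inside the sphere when sphere_distance2(K, beta)[i] <= R^2,
# where R^2 is the value of Equation 6 evaluated at any support vector.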

2.3 Contour deformation

The value of the parameter ν asymptotically represents an upper bound on the rate of bounded support vectors (BSVs). ν takes values in (0, 1]; as the penalty constant C is reduced, more and more points are labeled as outliers. If q increases, the Gaussian width decreases and the number of support vectors increases. Consequently, if one or more points of a cluster become support vectors, a specific contour is generated for that cluster. From a certain value of q, support vectors appear around each cluster.

3 Geometric hashing based labeling

In this section, we describe our mapping methodology to assign data points to clusters.

3.1 2-d grid assumption

In the previous method only the support vectors guide the processing that builds the contours, but they do not tell whether a given point lies inside or outside a contour. Methods such as the one described in the foundational work of [1], and in [43], work with an adjacency matrix defined as follows. Given two data points x_i and x_j, and R (the radius of the ball), the adjacency matrix A is such that

A_ij = 1 if, for all points y on the line segment connecting x_i and x_j, R(y) ≤ R, and A_ij = 0 otherwise.   (8)

Hence [1] define a set of points between each pair, check whether all these points belong to the sphere, and so assign the pair of points to a cluster. In the second method, [43] use a graph method to analyze the dense areas of the graph defined by the adjacency matrix. We have compared our approach to these two, which we call in the following nearest-neighbors (NN) and minimum spanning tree (MST). These methods are time-consuming, so we designed a method based on a geometric hashing function, achieved with a grid surrounding the data points in the attribute space. Basically, following the SVC method, we compute the radius only for the points of the grid (which act as hash keys) to build clusters, and as in [9] we assume that nearly all closest points can be associated with the same hash. We use a nearest-neighbor method [7] to associate data points with their hash.
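The hashing step itself amounts to snapping each 2-D point to a grid cell index. The following is a minimal sketch of such a hash in R (the function and variable names are illustrative, not the package code):

# Map points given by COA coordinates (x, y) to a cell index of a G x G grid.
grid_hash <- function(x, y, min_x, max_x, min_y, max_y, G) {
  # scale each coordinate to {0, ..., G-1} and combine the two indices into one cell id
  i <- pmin(floor((x - min_x) / (max_x - min_x) * G), G - 1)
  j <- pmin(floor((y - min_y) / (max_y - min_y) * G), G - 1)
  i * G + j
}

# The sphere radius is evaluated only at the G x G grid points; data points sharing
# a cell (the same hash) are then candidates for the same cluster.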

3.2 Algorithm

The basic idea behind random projections is a class of hash functions that are locality-sensitive; that is, if two points are close, they have small distance values and they hash to the same value with high probability, whereas if they are distant they collide with small probability. We use the following definitions. Let G be the size of the grid, fixed by the user. A 2-dimensional grid is characterised by a step. The step is defined according to the minimal/maximal values of the first two coordinates obtained by correspondence analysis (COA), c_1 being the first coordinate and c_2 the second. We use the ade4 package of R to compute the COA [10]. Let P be a grid point.

Definition 3.1

Let the scale of the grid be defined from the correspondence coordinates c_k by:

(9)

We can then define the set of grid points, each grid point being given by:

Definition 3.2

Let the set of grid points be defined by:

(10)

For each grid point we can assess membership to a cluster, without yet specifying which one.

Definition 3.3

Let the set of grid points belonging to clusters be defined, knowing the radius R according to Equation 7, by:

(11)

We now define the set of clusters from the grid points:

Definition 3.4

(C)
We call C the set of clusters. A cluster consists of a grid point and all neighbouring grid points:

(12)

Now we define the ball as the neighborhood of a hash key, from which a specific cluster reference is assigned using a k-nearest-neighbor threshold:

Definition 3.5


Let G be the grid, C the set of clusters and P a point with coordinates (X, Y) in the grid space GxG. Then the ball of P is defined by at least k neighbours belonging to the same cluster:

(13)

A family of hash functions is called locality-sensitive if, for any pair of points, the collision probability is defined as follows:

Definition 3.6


(14)

The function decreases with the distance; that is, the probability of collision of two points decreases with the distance between them.
After defining a grid on the data space, the ClusterLabeling function performs the first stage, assigning a cluster number to each point of the grid. The calculation of the Lagrange coefficients gives the kernel matrix (MK). The user sets the grid size G, and the min/max values in the data space can be computed. The main function (findSvcModel, described in the next section) outputs a matrix called NumPoints linking each grid point to a cluster id. The radius Rc is computed according to the algorithm shown in Table 1.

Input:  kernel matrix MK, grid size G, MinMaxX min/max value of x in the data, MinMaxY min/max value of y in the data.
Output: NumPoints, a GxG vector giving, for each grid point, its membership to a cluster id.

while each GxG grid point P do   {we identify if a point belongs to a possible cluster}
      Associate x, y values to P from MinMaxX and MinMaxY
      Calculate radius Rp of P; if Rp <= Rc, give ball membership to P
end while
while each GxG grid point P(i) do   {we identify cluster id(s)}
      while each P(k) around P(i) of one step do
            if all points P(k) have no cluster membership then
                  Create a new cluster vector CV with a new cluster id Cm
                  Put CV in a list of cluster vector memberships LCV
                  Put P(i) in CV and associate Cm in NumPoints
            else
                  Associate cluster id of P(k) to P(i) in NumPoints
            end if
      end while
end while
while each CV(i) in LCV do   {we merge close clusters}
      while each other CV(j) in LCV != CV(i) do
            if CV(i) has distance of one step from CV(j) then
                  Merge CV(i) and CV(j)
                  Update NumPoints
            end if
      end while
end while

Algorithm 1
Table 1: Calculate Rc radius of the ball using MK.

Finally we can assign a cluster label to any point of the data set according to the hash function and the corresponding ball value, defined in Definition 3.5.

(15)

The MatchGridPoint function, presented below, achieves the second stage, the computation of Equation 15. It returns a vector we call ClassPoints, associating a cluster id to each data point of the initial dataset (see Table 2).

Input:  data matrix MD, grid size G, MinMaxX, MinMaxY, NumPoints, neighbourhood k of a data point.
Output: ClassPoints, a vector giving, for each data point, its membership to a cluster id.

1:  for each point D(i) in MD do
2:        Calculate the grid coordinates of D(i), using MinMaxX and MinMaxY
3:  end for
4:  for each point D(i) in MD do
5:        Init a score vector SV(i) with the dimension of the cluster id(s)
6:        for each grid point P(j) in NumPoints do
7:              if a cluster id k is found for P(j) and the distance between P(j) and D(i) = k then
8:                    Increment SV(i)(k)
9:                    Associate Max(SV) to ClassPoints(i)
10:             end if
11:       end for
12: end for

Algorithm 2
Table 2: MatchGridPoint routine.

3.3 Usage of the svcR package

The main function is findSvcModel. It computes a clustering model and returns it as an R object usable by other functions for display and export. Calling the returned object ret, it carries information about the model parameters such as the Lagrange coefficients (getlagrangeCoeff(ret)$A attribute), the kernel matrix (getMatrixK(ret) attribute) and the cluster memberships (getClassPoints(ret) attribute). findSvcModel takes 10 arguments:

  • data.frame, the data: a data.frame parameter in standard use, or a data.frame in loadMat use, or DatMat in Eval use, a matrix given as unique argument

  • MetOpt, optimization method: optimStoch (stochastic optimization) or optimQuad (quadratic programming optimization)

  • MetLab, labelling method: gridLabeling (grid labelling) or mstLabeling (mst labelling) or knnLabeling (knn labelling)

  • KernChoice, kernel choice: KernLinear (Euclidean) or KernGaussian (RBF) or KernGaussianDist (Exponential) or KernDist (matrix data used as kernel values)

  • Nu, the ν parameter of the SVC model (Section 2.3)

  • q, the width parameter of the Gaussian kernel

  • k, number of nearest neighbours for the grid

  • G, grid size

  • Cx, component to display (1 for 1st attribute)

  • Cy, component to display (2 for 2nd attribute)

If Cx and Cy are 0, the correspondence analysis is used. The data is given as the first argument; the format is a data.frame() (i.e. a list), as for the well-known iris dataset. Some R libraries are required: quadprog [2] for optimization, ade4 [10] and spdep [5] for component analysis. This is an example of usage in R:


R> library("svcR")
R> data("iris")
R> retA <- findSvcModel(iris, MetOpt = "optimStoch", MetLab = "gridLabeling",
+     KernChoice = "KernGaussian", Nu = 0.5, q = 40, K = 1, G = 5,
+     Cx = 0, Cy = 0)
R> plot(retA)
R> ExportClusters(retA, "iris")
R> findSvcModel.summary(retA)

Here the data is the iris data frame. The kernel choice is radial-based, and the parameters of the SVC technique are Nu = 0.5 and q = 40. The parameters for cluster labeling are k = 1 neighbour and a grid size of G = 5 points. Cx = Cy = 0 means that the first two components of the correspondence analysis are used. The MetLab value "gridLabeling" means that the geometric-hashing method is used. The plot function permits visualizing the clusters. ExportClusters outputs the clusters to a file with the variable names. findSvcModel.summary displays the size and number of clusters, and the averaged attributes of each cluster. Some functions help the user navigate the clusters: ShowClusters(retA) returns all clusters ordered by their id (cluster 0 is a bag of variables that are not clusterable), GetClustersTerm(retA, term = "121") returns the clusters containing a member whose name includes "121" as a substring, and GetClusterID(retA, Id = 1) returns the cluster with id 1.
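For instance, a session inspecting the model computed above might look as follows (output omitted; it depends on the run):

R> ShowClusters(retA)
R> GetClustersTerm(retA, term = "121")
R> GetClusterID(retA, Id = 1)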

3.4 Toy example

We used the famous Fisher's Iris data set. It contains 3 classes, 150 instances and 4 attributes. Our cluster extraction is largely based on the topology of the points localized on a 2-D map. The dimensions of the map are found by a correspondence analysis, of which we keep the first two coordinates. In this projection, classes 2 and 3 of the Iris data are not well separated, as shown in Figure 1. So the method catches class 1 well, but from time to time a "bridge" occurs between classes 2 and 3 that links them into one cluster (Figure 1). The system is not robust enough to force such a weak topological boundary, so several runs may be needed for cluster 2 and cluster 3 to appear. For a given grid size, we obtain 50% success over a number of runs.

Figure 1: 2-D displays showing the data, the clustered grid, and the data superimposed with the clustered grid. Top left: data plotted with the COA coordinates; top middle: 17 unclassified points and 9 misclassified points; top right: 2 unclassified points and 7 misclassified points; bottom: data superimposed with the clustered grid.

The nearest-neighbour parameter k is used to find the closest cluster for a given data point. Low values such as k = 1 or 2 give the same level of precision and yield 3 clusters. But this approach is not sufficient for a good level of precision when the grid size is high, because peripheral data points are then too far from their cluster. We compare the precision of our approach ("GRID") with two other variants based on the construction of an adjacency matrix: the first builds the adjacency matrix with a minimum spanning tree ("MST-adj"), the second uses k-nearest neighbours ("KNN-adj"). All three approaches have an order parameter we call k, that is, the number of nearest neighbours for GRID and KNN-adj, and the number of links of a node in the tree for MST-adj. Mainly two clusters are captured by every approach, and precision is computed as the number of points of the majority class included in a cluster divided by the total number of points (150). For GRID, precision is stable and competitive with the other approaches when k is small (between 1 and 3) (Figure 2).
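Concretely, this precision can be computed from a contingency table of clusters against classes; a small sketch in R (the call on the last line is hypothetical and assumes getClassPoints returns one cluster id per data point):

# Precision: points of the majority class of each cluster, summed over clusters,
# divided by the total number of points.
cluster_precision <- function(clusters, classes) {
  tab <- table(clusters, classes)    # contingency table cluster x class
  sum(apply(tab, 1, max)) / length(classes)
}

# e.g. cluster_precision(getClassPoints(retA), iris$Species)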

Figure 2: Clustering precision on the iris dataset. K is the number of neighbours for the GRID and KNN-adjacency approaches, or the number of links for the MST-adjacency approach.

A second stage of labelling using a larger-distance nearest neighbour should perform well at such grid sizes. But as we can see in Figure 3 (bottom), the running time of svcR becomes less attractive when G grows: when G is between 5 and 25 the run time can increase by 25%. We generally choose G in this range and the performance is not damaged compared to other approaches. Looking at Figure 3 (top), MST-adj is faster than KNN, and with 150 points it is 2.05 times faster than svcR; even so, run times are divided by 1.65, hence we get back at least 40% of the time. In summary, for small data sizes the run time is almost the same for every method, but it increases very fast for the NN method when the data size increases; our approach becomes interesting for much larger amounts of data. For the whole Iris data set our approach is two times faster, but the run time depends on the grid size. Figure 3 shows that it becomes less competitive when G is large: the run time can then be ten or even twenty times higher. We used the quadprog package of R for the optimization.

Figure 3: Running time of svcR. At the top, the figure shows the normalized speed of our Grid approach compared with MST-adj and KNN-adj as the data size grows from 1 point to 150 points (svcR parameters: Nu = 0.7, q = 1200, G = 13). At the bottom, the figure shows the normalized speed when the grid size parameter of svcR increases from 5 to 40 points (number of points = 150, Nu = 0.7, q = 1200).

4 Representation of term sets and kernels

In this section, we look for a suitable representation to classify terms from a specific domain with an adapted kernel.

4.1 Data, language models and domain knowledge

In the previous section we have shown that a radial basis function can produce a suitable clustering. But the data had only a few attributes and did not come from natural language, with its sense ambiguities. We now attempt to classify terms coming from a specific domain: molecular and developmental biology.

Our linguistic data set consists of 1,893 terms (linguistic phrases) manually extracted from an annotation of 1,471 documents (5,730 sentences), where the annotated linguistic phrases describe temporal stages of biological development. The corpus itself has been built manually from the Medline document database, gathering documents about spore coat formation and gene transcription specifically for the Bacillus subtilis species. Let us define some notions about the language models studied in the next section. Consider the phrase "septal localisation of the division"; it is supposed to be a term. From this term we can consider different sub-structures: "septal" and "localisation" are distinct words, and "sept" is a radical, i.e. a sequence of characters which can also be found in other words; "septal localisation" is a bigram, i.e. a sequence of two words; "localisation of the" is a trigram, i.e. a sequence of three words.

The textual corpus we used describes biological knowledge, especially a well-known biological model called sporulation. This biological process is activated by a microorganism to resist an environment of starvation: the bacterium is transformed into a resistant sphere with minimal needs and activity. In information extraction from texts, gene network reconstruction is an interesting field for understanding how a gene network is activated; temporal and spatial information are complementary cues useful to understand when gene interactions occurred. A well-studied biological process such as sporulation can serve as a reference model, with two points of interest:

  • enough molecular information about gene-gene interactions has been gathered in texts over the last ten years;

  • it is a biological model well described across different stages.

Six main stages describe the sporulation process. At the beginning of the process a frontier called the septum is created, and at the last stage an engulfment is created to release the bacterial spore. The 1,893 terms have also been classified manually into the 6 biological stages; on average about 600 terms cover a given stage. The problem is related to the morphological and fuzzy description of language. Where a strict formal description would use, for instance, "stage II" for the second stage of biological development, an expert could write "during the first stages of sporulation", "at the onset of sporulation", "at stage I-II", "after septum formation", etc. A further complexity of description, which we can guess from having about 600 terms per class out of only 1,893, is that many terms are not exclusive to one stage (i.e. one class): many expressions can designate a stage, and often several stages at the same time.

Why could a clustering method such as SVC be of interest? We observed that:

  1. Most terms describing the occurrence of a gene activation/inhibition/regulation are not expressed in a simple regular way such as "at stage 2" or "at stage 3"; the terminology of temporal knowledge has a variable expressivity;

  2. Lots of terms are not exclusive to a stage.

In such a usage context, the svcR technique could help an expert to build rules about expressions, i.e. to establish an equivalence between a set of expressions and the mapping of a rule to a specific class. We decided to compare which language model brings a benefit for term description and, for each language model, which kernel is the most relevant. We manually selected a list of simple morphological radicals (11 tokens), word bigrams (a restricted sample of 500 out of 1,477) and word trigrams (a restricted sample of 500 out of 2,179) from the whole set of terms. Figure 4 gives a sample of some linguistic expressions. In our clustering experiments we first made a sample of 98 terms and 4 classes, similar in size to the iris data (Section 3.4).

Terms (TM) Radicals (RD) Bigrams (BG) Trigrams (TG)
class 1 class 2 class 3 class 4
insertion into the septum prespore development cortex layer , synthesized between , the forespore inner and , outer membranes coat, encases the spore init cell specific the mother cell
integrity of the septum prespore gene expression cortex peptidoglycan in spores coats sept spore coat mother cell specific
septal compartment prespore programme of gene expression cortex structure coats of wild-type spores prespore during sporulation in the mother
septal localization of the division prespore-like cells cortexless spores compartment endospore and sporulation mother cell compartment
septal peptidoglycan during cell division prespore-specific cortical or vegetative peptidoglycan synthesis compartment-specific engulfment of sporulation growth and sporulation
Figure 4: Samples of terminological data sets.

After identifying which language model (term-radical, term-term, term-bigram, term-trigram) and which kernel are efficient enough, we apply them to the whole set of 1,893 terms.
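As an illustration of these language models (this helper is not part of svcR), word n-grams can be extracted from a term with a few lines of R:

# Extract word n-grams (bigrams, trigrams, ...) from a term.
ngrams <- function(term, n) {
  words <- strsplit(term, "\\s+")[[1]]    # split the term into words
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

ngrams("septal localisation of the division", 2)   # bigrams
ngrams("septal localisation of the division", 3)   # trigrams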

4.2 Kernels

Terms (which are strings) can, intrinsically and without textual context, be compared statistically pairwise (the Levenshtein way) or via a bag of words (the Jaccard way); we compared these approaches, and additionally tested robustness to randomized non-null values in the Jaccard case. The Levenshtein distance is an editing distance based on the cost of transforming one string into another [27]. Let a and b be two strings, a_{1..i} the sub-string consisting of the first i symbols of a (with i ≤ |a|) and b_{1..j} the sub-string consisting of the first j symbols of b. Iteratively we obtain the Levenshtein distance at positions i and j:

d(i, j) = min( d(i−1, j) + w_del , d(i, j−1) + w_ins , d(i−1, j−1) + w_sub·1[a_i ≠ b_j] )   (16)

where w_ins, w_del and w_sub are the weights of the insertion, deletion and substitution operations respectively, and d(|a|, |b|) finally represents the weighted Levenshtein distance between a and b. From this expression we define the Levenshtein-radial base kernel:

(17)

We also define a kernel using only the pairwise Levenshtein distance between two terms:

(18)

Equation 17 and Equation 18 are compositions with a semi-positive definite kernel (the radial basis function), so the final kernels are also semi-positive definite. The Jaccard index is a similarity index [19] useful to assess the similarity between two objects knowing only the sets of their attributes, rather than the whole attribute space, which is often huge and does not describe the given objects. Knowing that a string a is composed of tokens a_1, …, a_n and a string b of tokens b_1, …, b_m, its expression is:

J(a, b) = |{a_1, …, a_n} ∩ {b_1, …, b_m}| / |{a_1, …, a_n} ∪ {b_1, …, b_m}|   (19)

Hence we define a Jaccard-radial base kernel (JRB) according to the vector of Jaccard indices of a term with the other terms (the data matrix is symmetric):

(20)

We also define a kernel using only the pairwise Jaccard index between two terms:

(21)

Equation 20 and Equation 21 are compositions with a semi-positive definite kernel (the radial basis function), so the final kernels are also semi-positive definite. [42] and [4] respectively adapted a kernel approach with a Levenshtein and a Jaccard similarity coefficient and proved their robustness despite their classical simplicity. In our data representation we have used four kernels: the Levenshtein-radial base (LRB), the radial base-Levenshtein (RBL), the Jaccard-radial base (JRB) and the radial base-Jaccard (RBJ). We also introduced noise in the data matrix: if the Jaccard coefficient gives 0, we assign a random non-null value to the corresponding data matrix component; we call this fuzzy variant Jaccard+. The vector approach using such distance and index heuristics in natural language processing represents each description by a set of words, but the nature of such sets can be modulated: for instance co-frequency in textual context (left and right collocations; lexical-based similarity), string inclusion between two terms (dictionary-based similarity), or ontological nodes shared by two terms (conceptual-based similarity). We focused on the second option and compared several dictionaries to build the similarity matrix. As variables to classify we naturally used the set of terms, and as attributes the set of radicals (RD), the set of terms itself (TM), the set of bigrams (BG) and the set of trigrams (TG).
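For illustration, the following sketch shows how pairwise Levenshtein distances, Jaccard indices over word tokens and a Jaccard-radial base style kernel matrix could be computed in base R. It is not the svcR implementation, and the RBF-over-Jaccard-vectors composition on the last lines is only one possible reading of Equation 20:

terms <- c("insertion into the septum", "integrity of the septum",
           "prespore development", "prespore gene expression")

# Pairwise Levenshtein (edit) distances; adist() is provided by base R (utils).
lev <- adist(terms)

# Jaccard index on word tokens.
jaccard <- function(a, b) {
  A <- unique(strsplit(a, "\\s+")[[1]])
  B <- unique(strsplit(b, "\\s+")[[1]])
  length(intersect(A, B)) / length(union(A, B))
}
J <- outer(seq_along(terms), seq_along(terms),
           Vectorize(function(i, j) jaccard(terms[i], terms[j])))

# Jaccard-radial base style kernel: RBF applied to the rows of the Jaccard matrix.
q <- 1
K <- exp(-q * as.matrix(dist(J))^2)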

4.3 Results

As seen in Section 2, where the SVC method coupled with our geometric approach for cluster extraction is presented, if no clear geometric separation of the data occurs on the 2-D map of correspondence analysis coordinates, the method is unsuccessful. Figure 5 shows plots of the different combinations of data attributes and distances. We see on these maps that TM-TM Levenshtein, TM-RD Jaccard and TM-RD Jaccard+ can produce interesting maps for SVC application. Thus, on each such data matrix we applied the cluster extraction to compare the efficiency of class retrieval. Figure 6 shows the performance of the method. The Jaccard kernel gives the best results, with a good separation and extraction of the classes, and the variant introducing random noise in the matrix is still successful, with 2 misclassified items out of 98.

Figure 5: Each display depends on the features describing the linguistic phrases (TM, or terms), i.e. the terms themselves (TM-TM), radicals (TM-RD), bigrams (TM-BG) or trigrams (TM-TG), and on the kernel used for clustering: Levenshtein, Jaccard or Jaccard with artificial noise. The displays show the data classes in green, red, blue and yellow on 2-D maps of the COA components.
Figure 6: Clustering with SVC-geometric hashing (kernel parameter chosen at best between 1 and 10,000).

We now adopt the best clustering setting obtained, that is, a term-radical matrix (language model) and the Jaccard-radial base kernel. To study scalability, we expand the number of terms and radicals taken into account in the Jaccard distance computation. Independently of cluster purity (class homogeneity), the impact of the features (radicals) is what warrants a good separation between similar terms. Recall that the support vector machine is a non-linear method which is efficient only if the data are separable. The role of the features is to give similarity clues between terms, the role of the Jaccard index is to capture this similarity, the role of the 2-D component analysis is to capture the main features that separate the data, and finally the role of support vector clustering is to capture the boundaries of the clusters thanks to their geometric separation. Figure 7 shows that too many features do not separate the data (the attribute DName changes for each of the four data sets), but too few features yield too few clusters. A medium-sized set of features can lead to a good number of clusters: in our case, 38 features describing the structure of the terms induce 15 clusters that are easily distinguishable visually. Recall that the set of terms is made of 1,893 terms describing the 6 stages of the sporulation process, as mentioned in Section 4.1.

Figure 7: Clustering with SVC-geometric hashing (JRB+ kernel); each column corresponds to a different number of terms and features, with dataset sizes increasing from left to right. Below each column is the number of clusters extracted with svcR. Yellow represents the clusters, red marks data points belonging to the majority class of their cluster, and green marks data points not belonging to the majority class of their cluster.

Many terms belong to several stages (in the sense of classes); even a typical string token relevant to a class can belong to different stages. This is mainly because biological stages result from microscopy studies, so visual patterns, and often a co-occurrence of patterns, can be simultaneously typical of one stage, while individually a pattern can occur during several stages: mother cell and compartmentalization (beginning at stage 2 and persisting at stages 3, 4, 5, 6), engulfment (beginning at stage 3 and persisting at stages 4, 5, 6), septum (beginning at stage 1 and persisting at stages 2, 3, 4). This cross-membership property makes it hard to compute a mapping from a specific term to a unique class. In our results (Figure 7), getting more clusters (15) than classes (6) implies that terms can be misclassified (green points), but it yields a variety of specific clusters from which we expect to capture pattern associations that can be used to define the rules of an automaton. For instance, among the 15 clusters, a specific one gives the following members:

compartmentalized activation , compartment-specific activation

We can imagine a rule associated with stages 2, 3, 4 and 5, matching "compartment" followed by "activation". Another cluster gives the following members:

slow postseptational, prespore-specific SpoIIIE synthesis, endospore coat,

endospore coat assembly, endospore coat component, forespore coat, from the endospore coat,

cortex and/or coat assembly, spore coat and cortex.

We can imagine rules associated with stages 3, 4, 5 and 6: "endospore coat", "coat" followed by "cortex", "cortex" followed by "coat". In these clusters of terms, coat, cort, prespore, endospore, postseptational and forespore belong to the set of features. In this process of lexical rule definition the user plays an important role, in the sense that a cluster does not give information directly exportable as an automatically defined rule. Visualization of the clusters by an expert leads to the identification of pattern associations to include in lexical rules, especially because the elements taken into account are features, and knowledge about the features is required to state that these rules will be applicable to a set of classes (biological stages). The methodology makes us understand, although it is not a discovery, that clustering mixes several components of different categories. Nevertheless it can be efficient at identifying relevant features to use as lexical patterns when building rules for information extraction, in our case the extraction of information about a biomolecular process described linguistically and formally by several stages (i.e. a scenario in the domain of biology).
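As an illustration only (svcR does not define the rule syntax; the patterns and sentences below are hypothetical), such lexical rules could be written as regular expressions and applied to sentences in R:

# Hypothetical lexical rules suggested by the clusters above.
rules <- c(stage2_5 = "compartment\\S* activation",
           stage3_6 = "endospore coat|coat .* cortex|cortex .* coat")

sentences <- c("compartment-specific activation of gene expression",
               "assembly of the endospore coat begins later")

# Which rule fires on which sentence?
sapply(rules, function(r) grepl(r, sentences, perl = TRUE))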

5 Comparison with other techniques

In this section, we discuss the behavior of competing clustering methods, existing kernels, and the interpretation of SVC's clustering capacity. Below is a simple general R utility function; it takes the output of the R clustering functions we used (k-means, svcR, hierarchical) and exploits a property of the data, namely that the class number is inserted in each term name (e.g. "4 coat protein" means that "coat protein" belongs to class 4). Hence, using the grep function, it is easy to find the repartition of classes over clusters:

TabEval <- function(Dat) {
  ## Dat: named vector of cluster ids; each name contains the class number of the term.
  M <- matrix(0, nrow = max(Dat[]) + 1, ncol = 8)
  for (k in 1:max(Dat[])) {
    Stat <- c()
    Size <- length(Dat[Dat == k])
    for (i in 1:6) {
      GR   <- grep(i, names(Dat[Dat == k]))
      Stat <- c(Stat, 100 * length(GR) / Size)
    }
    Stat   <- c(Stat, 0, Size)
    M[k, ] <- Stat
  }
  ## last row: class distribution over the whole data set (baseline)
  Stat <- c()
  for (i in 1:6) {
    Size <- length(Dat[])
    GR   <- grep(i, names(Dat[]))
    Stat <- c(Stat, 100 * length(GR) / Size)
  }
  Stat <- c(Stat, 0, Size)
  M[nrow(M), ] <- Stat
  print(M, digits = 1)
}
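For example, assuming that getClassPoints(retA) returns such a named vector of cluster ids (a hypothetical call, for illustration):

R> TabEval(getClassPoints(retA))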

5.1 Classical clustering

Algorithms such as k-means (KM) and hierarchical clustering (HC) are widespread knowledge-poor techniques that use metrics to find clusters automatically in any kind of data. Figure 8 shows graphically how such clusters can be represented.

Figure 8: Clustering with SVC-geometric hashing (left), hierarchical agglomerative clustering (centre), k-means (right); Data are 1893 terms with 6 classes and 37 features using a Jaccard-radial base kernel.

For svcR and KM, the 2-dimensional coordinates come from the component analysis. On the KM map only the centroids represent the clusters (as star plotting characters). HC (Figure 8, centre) displays a dendrogram whose branches are the clustered points and which requires a cut-off at some level of the tree to obtain the clusters. In R, we used the kmeans function from the stats package [40] and the hclusterpar function from the amap package [29].

As the data contains 6 classes and the svcR approach with the JRB kernel induces the extraction of 17 clusters, we set the extraction of 30 clusters as the setting for both the KM and HC functions. Figure 9 shows the content of the clusters and the class distribution for each approach (hierarchical, k-means and SVC). The right column of each result set gives the size of each cluster. The last line gives the distribution over classes for the whole dataset as a baseline (12% of the terms belong to class 1, 19% to class 2, 20% to class 3, 20% to class 4, 16% to class 5, 12% to class 6, and the size of the set is 1,893 terms). First, we observe that the distribution profile of cluster sizes is similar between HC and svcR. Secondly, looking at the over-representation of classes over clusters, HC and KM do not achieve a better discrimination of the terms across the 6 classes, even if some clusters are better than others. Language ambiguities seem to be a real bottleneck for all methods when they are based on a Jaccard-radial kernel. But what happens when string kernels are used?
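For reference, the comparison runs can be reproduced along these lines (a sketch; coords stands for a hypothetical n x 2 matrix of COA coordinates of the terms):

library(amap)

km <- kmeans(coords, centers = 30)    # k-means from the stats package
km$cluster                            # cluster id of each term

hc <- hclusterpar(coords)             # hierarchical clustering from amap
cl <- cutree(hc, k = 30)              # cut the dendrogram into 30 clusters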

5.2 String kernels

[28] and [30] promoted string kernels to extract semantic knowledge from texts. String kernels calculate the similarity between two strings by matching their common substrings. A standard string kernel is the constant one (SK-constant); it assesses similarity even if the matching characters appear in any order, and the returned value is higher when the order is respected and the matched substrings are longer. Exact matching of character subsequences is called the spectrum kernel (SK-spectrum) [41]. For instance, for a string of 29 characters evaluated against itself, SK-constant returns 3165 and SK-spectrum returns 430.

         HC                       KM                       svcR
 C1    C2    C3    #     |  C1    C2    C3    #     |  C1    C2    C3    #
 0.17  0.17  0.16  481   |  0.23  0.27  0.16  153   |  0.17  0.17  0.16  481
 0.22  0.24  0.16  152   |  0.04  0.18  0.24  49    |  0.11  0.17  0.23  283
 0.14  0.16  0.25  199   |  0.22  0.16  0.20  64    |  0.03  0.25  0.22  156
 0.27  0.27  0     15    |  0.03  0.22  0.22  103   |  0.04  0.20  0.21  113
 0.17  0.17  0.17  78    |  0.09  0.27  0.18  11    |  0.15  0.31  0.26  81
 0.30  0.49  0.07  61    |  0     0.10  0.31  29    |  0.06  0.21  0.19  63
 0.13  0.13  0.20  15    |  0.17  0.11  0.19  54    |  0.17  0.17  0.17  78
 0.12  0.08  0.27  26    |  0.04  0.28  0.20  50    |  0     0.28  0.39  18
 0.04  0.24  0.21  219   |  52    0     0     44    |  0     0     0.25  4
 0.25  0     0.25  4     |  0.17  0.24  17    41    |  0.08  0.75  0.17  12
 0.12  0.19  0.20  1893  |  0.12  0.19  0.20  1893  |  0.12  0.19  0.20  1893
Figure 9: Class distribution over the clusters resulting from HC (left 4 columns), KM (centre 4 columns) and svcR (right 4 columns). Only the first three classes (out of 6), C1 to C3, are displayed. These classes were built by hand and each term is tagged in the data matrix with one of them. After clustering, we collect the class membership of the terms for a given cluster. In the table, each line is the distribution of a given cluster (different for each method HC, KM or svcR) and shows the weight in percent of each class in the cluster. The last line represents a baseline, showing what the weight of a class would be if all terms formed a single cluster; for instance, class 1 represents 12% of the set of terms. On each line we also display (column #) the number of terms belonging to the cluster.
Figure 10: SVC using string kernel-constant (left), or using string kernel-spectrum (right).

If we pick two terms from our biology term data set: SK-constant("inner coat", "in the mother cell") = 22 and SK-spectrum("inner coat", "in the mother cell") = 15; another pair gives SK-constant("inner coat", "initiation of sporulation") = 27 and SK-spectrum("inner coat", "initiation of sporulation") = 24. The variation between the two pairs is small according to the string kernels, although the terms of the first pair come from one class (class 2) whereas the second pair compares terms from different classes (class 1 and class 2). We built a kernel matrix using both string kernels and achieved cluster labelling with this similarity information. The result is shown in Figure 10:

Even if SK-constant shows some capability to isolate clusters, one big cluster contains 1600 items, hence 85% of the information. Such a kernel nevertheless remains interesting, perhaps by including more lexical knowledge.
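For reference, similar string-kernel values can be computed with the kernlab package [22]; the sketch below uses its stringdot and kernelMatrix functions (kernel settings are illustrative, so the values will not match the numbers quoted above exactly):

library(kernlab)

sk_const <- stringdot(type = "constant", normalized = FALSE)
sk_spec  <- stringdot(type = "spectrum", length = 4, normalized = FALSE)

sk_const("inner coat", "in the mother cell")
sk_spec("inner coat", "in the mother cell")

# A kernel matrix over a list of terms can then feed the cluster labelling step.
terms <- list("inner coat", "in the mother cell", "initiation of sporulation")
K <- kernelMatrix(sk_spec, terms)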

5.3 Clusterability of a SVC model

Section 2.1 presents the general framework of a kernel method; it does not carry any assumption about clustering, but rather about classification. Nevertheless, SVC is not a new technique in itself. SVC has been seen as one-cluster discovery, since a ball in the feature space is targeted; hence it has long been described in detail as a one-class approach applied to novelty detection, when information deviates from a block of well-known information. In R, the kernlab package [22, 21] implements the novelty detection task. Running a one-class kernel on our dataset returns a model with 394 support vectors and a cross-validation error of 0.205. Our observation is that SVC performs cluster extraction (or labeling) well from a 2-dimensional map, depending on an existence condition for the clusters: the data ought to be separable in the 2-D map. Separability can be managed by composing a metric with a radial-based function over all the input matrix dimensions. A possible explanation for the capability of SVC to identify clusters is related to the problem of trying to flatten the skin of an orange onto a tabletop. In that case, a projection is a procedure to transfer locations and features from the three-dimensional surface onto the two-dimensional paper in a defined and consistent way; the result is some slight bulges and a lot of gaps. The transformation of map information from a sphere to a flat sheet can be accomplished in many ways, but no single projection can accurately preserve area, shape, scale and direction. SVC clustering takes its origin from this capacity of projections to distort.
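The one-class baseline mentioned above can be reproduced along these lines with kernlab (a sketch; X stands for a hypothetical numeric feature matrix of the terms, and the parameter values are illustrative):

library(kernlab)

oc <- ksvm(X, type = "one-svc", kernel = "rbfdot",
           kpar = list(sigma = 0.1), nu = 0.2, cross = 5)
nSV(oc)      # number of support vectors
cross(oc)    # cross-validation error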

6 Conclusion

We developed, improved and applied a density- and kernel-based method called support vector clustering (SVC), implemented as an R package (svcR). The package is available from the CRAN server (http://cran.r-project.org/, see Software, Packages; svcR version 1.4) and can be installed from the R graphical user interface (required R libraries: quadprog, ade4 and spdep). First, we showed that mapping points of the data space to a grid, and using the sphere radius from the attribute space together with a k-nearest-neighbor approach, reduces the time needed for cluster labeling. In this sense, SVC can be seen as an efficient cluster-extraction method when the clusters are separable in a 2-D map. Secondly, we found a representation for term clustering using a mixed Jaccard-radial base kernel and showed its efficiency with SVC for term clustering in a natural language processing task, namely lexical classification (i.e. ontology-oriented knowledge acquisition). Some work remains in the R implementation to integrate C functions for matrix acquisition, so as to make the toolkit more scalable in data size. Semantic and lexical-based kernels are promising for application in text mining frameworks, yet one must still understand how to select and integrate attributes for the description of terms. In future work we aim to investigate the extraction of clusters over more than 2 dimensions, and to test robustness for non-separable data.

Acknowledgments

Special thanks to Roy Varshavsky, Marine Campedel, Dunyon Lee and Olivier Chapelle for their discussion. The methodology discussed in this paper has been supported by the INRA-1077-SE grant from the French Institute for Agricultural Research (agriculture, food & nutrition, environment and basic biology).

References

  • [1] Asa Ben-Hur, David Horn, Hava Siegelmann, and Vladimir Vapnik. Support vector clustering. Journal of Machine Learning Research, 2:125–137, 2001.
  • [2] Berwin A. Turlach and Andreas Weingessel. quadprog: Functions to Solve Quadratic Programming Problems, 2007. R package version 1.5-3.
  • [3] Biter Bilen. Support vector clustering of microarray gene expression data. Technical Report, Bilkent University, Turkey, 2005.
  • [4] Mikhail Bilenko and Raymond J. Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02-296, Artificial Intelligence Lab, University of Texas at Austin, 2002.

  • [5] Roger Bivand. spdep: Spatial Dependence, Weighting Schemes, Statistics and Models, 2010. R-package version 0.5-14.
  • [6] Marine Campedel and Eric Moulines. Méthodologie de sélection de caractéristiques pour la classification d'images satellitaires. In Proceedings of the Conférence Nationale sur l'Apprentissage (CAP'05), Nice, France, pages 107–108, 2005.
  • [7] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
  • [8] Béatrice Daille. Conceptual structuring through term variations. In D. MacCarthy F. Bond, A. Korhonen and A. Villacicencio, editors, Proceedings of the 38th International Conference of Association for Computational Linguistics (ACL'03), Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, 2003.
  • [9] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th Annual Symposium on Computational Geometry (SCG’04), New York, USA, pages 253–262. ACM New York, NY, USA, 2004.
  • [10] S. Dray and A.B. Dufour. The ade4 package: Implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1–20, 2007.
  • [11] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Salvatore Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In D. Barbara and S. Jajodia, editors, Applications of Data Mining in Computer Security, pages 77–102. Kluwer, 2002.
  • [12] Damien Eveillard and Yann Guermeur. Statistical processing of selex results. Proceedings of the 10th International Conference on Intelligent Systems for Molecular Biology (ISMB’02), Poster Session, Edmonton, Canada, 2002.
  • [13] Christiane Fellbaum. WordNet : An Electronic Lexical Database. MIT Press, 1998.
  • [14] William A. Gale, Kenneth W. Church, and David Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439, 1992.
  • [15] Gregory Grefenstette. Sextant: Extracting semantics from raw text: Implementation details, heuristics. Integrated Computer-Aided Engineering, 1:527–536, 1994.
  • [16] Zellig Harris. Mathematical Structure of Language. John Wiley & Sons, 1968.
  • [17] Marti Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING’92), Nantes, France, pages 539–545, 1992.
  • [18] David Horn. Clustering via Hilbert space. Physica A, 302(1):70–79, 2001.
  • [19] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
  • [20] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Englewood Cliffs, New Jersey, Prentice Hall, 1988.
  • [21] Alexandros Karatzoglou, David Meyer, and Kurt Hornik. Support vector machines in R. Journal of Statistical Software, 15(9):1–28, 2006.
  • [22] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab – An S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004.
  • [23] Kees Jong, Elena Marchiori, and Aad van der Vaart. Finding clusters using support vector classifiers. In Proceedings of the 11th European Symposium on Artificial Neural Networks (ESANN'03), Bruges, Belgium, pages 23–25, 2003.
  • [24] Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1):15–68, 2000.
  • [25] Aleksandar Lazarevic, Aysel Ozgur, Levent Ertoz, Jaideep Srivastava, and Vipin Kumar. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the 3rd SIAM International Conference on Data Mining (ICDM’03), San Francisco, USA, pages 25–36, 2003.
  • [26] Jaewook Lee and Daewon Lee. An improved cluster labeling method for support vector clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):461–464, 2005.
  • [27] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.
  • [28] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
  • [29] Antoine Lucas. amap: Another Multidimensional Analysis Package, 2007. R package version 0.8-2.
  • [30] Alessandro Moschitti. Syntactic and semantic kernels for short text pair categorization. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL’09) , Athens, Greece, pages 576–584. Association for Computational Linguistics Morristown, NJ, USA, 2009.
  • [31] Adeline Nazarenko, Pierre Zweigenbaum, Jacques Bouaud, and Benoît Habert. Corpus-based identification and refinement of semantic classes. Journal of the American Medical Informatics Association, 4(suppl):585–589, 1997.
  • [32] JinHyeong Park, Xiang Ji, Hongyuan Zha, and Rangachar Kasturi. Support vector clustering combined with spectral graph partitioning. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Cambridge, UK, pages 581–584, 2004.
  • [33] Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of English words. In Proceedings of the 31st Conference of the Association for Computational Linguistics (ACL'93), Jerusalem, Israel, pages 183–190, 1993.
  • [34] Wilfredo J. Puma-Villanueva, George B. Bezerra, Clodoaldo A.M. Lima, and Fernando J. Von Zuben. Improving support vector clustering with ensembles. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN’05), Montreal, Canada, pages 13–15, 2005.
  • [35] Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. In Proceedings of the 11th National Conference on Artificial Intelligence (NCAI’93) , Washington, USA, pages 811–816. AAAI Press/The MIT Press, 1993.
  • [36] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw Hill, New York, 1983.
  • [37] Bernhard Schölkopf, John Platt, John Shawe-Taylor, Alex Smola, and Robert Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
  • [38] Frank A. Smadja and Kathleen R. McKeown. Automatically extracting and representing collocations for language generation. In Proceedings of the 28th International Conference of Association for Computational Linguistics (ACL’90), Pittsburgh, USA, pages 574–579. Association for Computational Linguistics Morristown, NJ, USA, 1990.
  • [39] Steven Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
  • [40] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010. ISBN 3-900051-07-0.
  • [41] C. H. Teo and S. V. N. Vishwanathan. Fast and space efficient string kernels using suffix arrays. In Proceedings of the International Conference on Machine Learning (ICML’06) Pittsburgh, USA, pages 929–936. ACM New York, NY, USA, 2006.
  • [42] Jianhua Xu and Xuegong Zhang. Kernels based on weighted levenshtein distance. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN’04), Budapest, Hungary, pages 3015–3018, 2004.
  • [43] Jianhua Yang, Vladimir Estivill-Castro, and Stephan Chalup. Support vector clustering through proximity graph modelling. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), Singapore, pages 898–903, 2002.
  • [44] Ying Zhang, Hongye Su, Tao Jia, and Jian Chu. Rule extraction from trained support vector machines. In Proceedings of the 9th International Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD’05), Hanoi, Vietnam, pages 61–70, 2005.