1 Introduction
Mining text archives is a major challenge: large numbers of documents are available, and their amount grows as fast as the capacity of computer storage. Building domain rules for knowledge extraction requires efficient features with low semantic ambiguity. This is not an easy task; we address it by representing linguistic expressions (i.e. terms) as feature vectors and using a scalable density-based distance to cluster the terms.
The first idea of our work concerns the choice of a density-based method and the improvement of its scalability. Clustering can be a useful knowledge-poor technique to induce organization into scattered data [20]. Nonparametric methods such as support vector machines are well suited to analyzing noisy data by density processing.
[1, 37, 18] proposed an unsupervised support vector algorithm that encloses data clusters with contours, based on a radial kernel. Diverse applications have been tested: novelty detection [11, 25], rule extraction [44], deoxyribonucleic acid (DNA) and chemical compound analysis [3, 12], and image processing [6]. The assignment of points to contours and related clusters is based on sampling points between each pair and is time-consuming. Some studies [32, 34] have been proposed to speed up the method. In particular, [43] and [26] proposed improved methods to label clusters, i.e. to assign points to clusters by graph analysis. We present a new robust method based on the computation of a hash function through surrounding points, working with a grid that we map to the data using a k-nearest neighbor method. We developed this clustering method under the R platform [40] as a package called svcR, and we compared our approach to others, especially graph-based ones, on the Iris dataset (svcR is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=svcR).
[23] also developed a support vector method for clustering, but in a divisive way, iteratively searching for a classical hyperplane separator based on a standard support vector machine. The first step tries to separate the data set from an artificially built set spanning the same space of attribute values and of the same size as the data set, which is not theoretically justified. When few classes are present (2 or 3) and few attributes describe the data, the algorithm seems to find groups; in other cases it tends to find a number of final clusters equal to the number of iterations.
The second idea presented in this paper concerns an original usage of the support vector clustering (SVC) methodology cited above for solving a certain form of ambiguity in natural language. Information retrieval [36, 8] and information extraction [24, 39] are key methodologies to retrieve information from text archives. But simple keywords may have several senses, and the assignment of terms to conceptual classes is therefore important [14, 17, 35, 15, 13]. Clustering may be used to reduce the number of variables to take into account in rules for information mining in documents. We base our assumption on two previous results: first, that collocation analysis is useful to understand morphological structure and its link to a conceptual space [16, 38]; second, that clustering is a good approach to build semantic classes with the help of a similarity distance [33, 31]. Clustering linguistic terms in this way can provide common features for building rules for information mining in document archives. Classifying a set of terms requires representing the data as morphological information vectors (the terms themselves, or parts of terms, and how?) and determining which kernel should be used to achieve SVC. We use whole terms, morphological primitives and bigrams as morphological information, and we use the Levenshtein distance, the Jaccard similarity index, the radial basis function, and combined Levenshtein-radial or Jaccard-radial kernels to study the clustering effect.
In Section 2, we introduce the methodology of support vector clustering. Section 3 presents the labeling approach, and Section 4 studies vector representations and different kernels for term clustering. Finally, Section 5 presents an evaluation of the technique.
2 Support vector clustering
In this section, we recall the clustering approach.
2.1 Kernel trick
Suppose we know a priori the classes of items (red circles and yellow squares) and we search for a linear frontier in a higher-dimensional space. To that end, the data are transformed using a kernel function (a dot product). The data are preprocessed with a mapping Φ:

(1) x ↦ Φ(x)

where ⟨Φ(x), Φ(y)⟩ is a dot product of the feature space (a Hilbert space H), and we learn the mapping from Φ(x) to y (the class). The kernel computes this dot product directly in input space:

(2) k(x, y) = ⟨Φ(x), Φ(y)⟩

(3) k(x, y) = exp(−q ||x − y||²)

Usually the Gaussian kernel of Equation 3, with width parameter q, is chosen.
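As an illustration of the kernel trick, the Gaussian kernel can be computed directly on a small data set. This is a hedged sketch in Python (not part of svcR); the function name and the role of q as width parameter are our own choices:

```python
import numpy as np

def gaussian_kernel_matrix(X, q=1.0):
    """K[i, j] = exp(-q * ||x_i - x_j||^2): a dot product in feature space."""
    # pairwise squared Euclidean distances via broadcasting
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-q * sq)

# three 2-D points; the kernel matrix is symmetric with a unit diagonal
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = gaussian_kernel_matrix(X, q=0.5)
```

Larger q shrinks the effective radius of the Gaussian, which is the behaviour discussed in Section 2.3.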
2.2 Optimization
As an extreme view, the distribution of data under the scope of unsupervised learning can be interpreted as density estimation. In our case, however, the approach estimates
quantiles of a distribution, not its density. In the case of SVC, we determine support vectors to delimit the distribution of points. The goal is then to find the minimal sphere surrounding the data. One can show that if x_1,…,x_N is separable from the origin in feature space, then the solution of the margin minimization between the two classes corresponds to the normal vector of the hyperplane separating the data from the origin with maximum margin. In our case we try to encapsulate the data into a ball: the points inside the ball represent the data to classify (first class) and the origin represents the second class. The primal problem is written as follows. Let a be the (non-fixed) center of the ball, R the radius of the ball, and C a fixed penalty constant controlling the number of data points allowed near the ball. With slack variables ξ_i, we minimize:
(4) R² + C Σ_i ξ_i
Under the constraints:
(5) ||Φ(x_i) − a||² ≤ R² + ξ_i,  ξ_i ≥ 0  for all i
where a is the center of the ball. The dual problem (the problem being convex) is obtained from the Lagrangian L, where the β_i and μ_i are Lagrange multipliers; see [1, 37, 18] for details about the computation of L. The multipliers are used to compute a (i.e. a = Σ_i β_i Φ(x_i)) and R.
If x_s is a support vector, the radius is:
(6) R² = ||Φ(x_s) − a||²
For any point x, the distance from the center is:
(7) d²(x) = ||Φ(x) − a||²
Hence it is possible to test whether x is inside the sphere or not by comparing d(x) and R.
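The membership test of Equations 6 and 7 can be sketched numerically. This is an illustrative Python sketch with made-up multipliers beta, not ones obtained from the dual optimization; the expansion a = Σ_j β_j Φ(x_j) lets the distance be computed with kernel evaluations only:

```python
import numpy as np

def rbf(x, y, q=1.0):
    # Gaussian kernel of Equation 3
    return np.exp(-q * np.sum((x - y) ** 2))

def dist2_to_center(x, data, beta, q=1.0):
    """||Phi(x) - a||^2 with a = sum_j beta_j Phi(x_j) (Equation 7)."""
    const = sum(bi * bj * rbf(xi, xj, q)
                for bi, xi in zip(beta, data) for bj, xj in zip(beta, data))
    return rbf(x, x, q) - 2 * sum(bj * rbf(xj, x, q)
                                  for bj, xj in zip(beta, data)) + const

data = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
beta = [0.5, 0.5]  # assumed multipliers (they sum to 1)
R2 = dist2_to_center(data[0], data, beta)   # radius^2 if data[0] is a support vector
inside = dist2_to_center(np.array([0.5, 0.0]), data, beta) <= R2
```

A point between the two data points falls inside the sphere, as expected.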
2.3 Contour deformation
The value of the parameter ν asymptotically represents an upper bound on the rate of bounded support vectors (BSV). The parameter ν takes values in (0, 1]; as ν is reduced, more and more points are labeled as outliers. If q increases, the Gaussian radius decreases and the number of support vectors (SV) increases. Subsequently, if one or more points of a cluster become support vectors, a specific contour is generated for the cluster. From a certain value of q, support vectors appear around each cluster.

3 Geometric hashing based labeling
In this section, we describe our mapping methodology to assign data points to clusters.
3.1 2d grid assumption
In the previous method, only support vectors guide the processing to draw contours, but they do not tell whether a given point lies inside or outside a contour. Some methods, such as those described in the foundation work by [1] and in [43], work with an adjacency matrix defined as follows. Given two data points x_i and x_j and R (the radius of the ball), the adjacency matrix A is such that:

(8) A_{ij} = 1 if, for every point y on the line segment connecting x_i and x_j, d(y) ≤ R; A_{ij} = 0 otherwise.
Hence [1] define a set of points between each pair and check whether all these points belong to the sphere or not, thus assigning the pair of points to a cluster. In the second method, [43] use a graph method to analyze the density areas of the graph defined by the adjacency matrix. We have compared our approach to these two, which we call in the following nearest-neighbors (NN) and minimum spanning tree (MST). These methods are time-consuming, so we devised a method based on a geometric hashing function achieved with a grid surrounding the data points in attribute space. Basically, following the SVC method, we only compute the radius for the points of the grid (which are hash keys) to build clusters, and as in [9] we assume that nearly closest points can be associated with the same hash. We use a nearest-neighbor method [7] to associate data points with their hash.
3.2 Algorithm
The basic idea behind random projections is a class of hash functions that are locality-sensitive; that is, if two points are close, they will hash to the same value with high probability, and if they are distant, they collide with small probability. We have the following definitions. Let G be the size of the grid, fixed by the user. A 2-dimensional grid is characterised by a step in each dimension. The step is defined according to the minimal/maximal values of the first two coordinates obtained by correspondence analysis (COA), c_1 being the first coordinate and c_2 the second. We use the ade4 package of the R project to compute the COA [10]. Let g be a grid point.

Definition 3.1
(s_k)
Let s_k be the scale of the grid along correspondence coordinate c_k:

(9) s_k = (max(c_k) − min(c_k)) / G
We can define the set of grid points, with each point g_{ij}, by:
Definition 3.2
(Grid)
Let Grid be the set defined by:

(10) Grid = { g_{ij} = (min(c_1) + i·s_1, min(c_2) + j·s_2), 0 ≤ i, j ≤ G }
For each grid point we can assess membership to clusters, without yet specifying which one.
Definition 3.3
(Cl)
Let Cl be the set of clustered grid points, knowing the radius R according to Equation 7:

(11) Cl = { g ∈ Grid : d(g) ≤ R }
We now define the set of clusters from grid points:
Definition 3.4
(C)
We call C the set of clusters. A cluster consists of a grid point of Cl together with all its neighbouring grid points in Cl:

(12) c(g) = { g' ∈ Cl : g' is a grid neighbour of g }, for g ∈ Cl
Now we define the ball as the neighborhood of a hash key, from which a point is assigned a specific cluster reference using a k-nearest-neighbor threshold:

Definition 3.5
(Ball)
Let Grid be the grid, C the set of clusters, and P a point with coordinates (X, Y) in the grid space G×G. The ball of P is defined by at least k neighbours belonging to the same cluster:

(13) B(P) = { g ∈ Grid : g is one of the k nearest grid points of P and g belongs to a cluster of C }
A family of hash functions h is called locality-sensitive if, for any pair of points P_1 and P_2, the collision probability function p is defined as follows:

Definition 3.6

(14) p(t) = Pr[ h(P_1) = h(P_2) ], with t = ||P_1 − P_2||

and p decreases in t; that is, the probability of collision of points P_1 and P_2 decreases with the distance between them.
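A minimal sketch of such a grid hash in Python (illustrative only; the function name and the clamping of boundary points are our own choices). Points that are close land in the same cell, which is the collision property of Definition 3.6:

```python
def grid_hash(x, y, min_x, max_x, min_y, max_y, G):
    """Map a 2-D point to a grid cell index (assumes min < max in each axis)."""
    # step sizes, in the spirit of Definition 3.1
    sx = (max_x - min_x) / G
    sy = (max_y - min_y) / G
    # clamp so that points on the upper boundary fall in the last cell
    i = min(int((x - min_x) / sx), G - 1)
    j = min(int((y - min_y) / sy), G - 1)
    return i, j
```

For example, (0.10, 0.10) and (0.15, 0.12) hash to the same cell of a 5x5 grid over [0, 1]², while (0.99, 0.99) hashes to a distant cell.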
After defining a grid on the data space, the ClusterLabeling function achieves the first stage, assigning a cluster number to each point of the grid.
The calculation of the Lagrange coefficients gives the kernel matrix (MK). The user sets the grid size G, and the min/max values in the data space can be computed. The main function (findSvcModel, described in the next section) outputs a matrix called NumPoints linking each grid point to a cluster id. The radius Rc can be computed according to the algorithm shown in Table 1.
Require: kernel matrix MK, grid size G, MinMaxX and MinMaxY the min/max of the x and y values in the data.
Ensure: NumPoints, a G×G vector giving, for each grid point, its membership to a cluster id.

while each G×G grid point P do {we identify if a point belongs to a possible cluster}
  Associate x, y values to P from MinMaxX and MinMaxY
  Calculate radius Rp of P; if Rp <= Rc, give ball membership to P
end while
while each G×G grid point P(i) do {we identify cluster id(s)}
  while each P(k) around P(i) at one step do
    if all points P(k) have no cluster membership then
      Create a new cluster vector CV with a new cluster id Cm
      Put CV in a list of cluster vector memberships LCV
      Put P(i) in CV and associate Cm in NumPoints
    else
      Associate the cluster id of P(k) to P(i) in NumPoints
    end if
  end while
end while
while each CV(i) in LCV do {we merge close clusters}
  while each other CV(j) in LCV != CV(i) do
    if CV(i) is at a distance of one step from CV(j) then
      Merge CV(i) and CV(j)
      Update NumPoints
    end if
  end while
end while
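The labeling stage of Table 1 can be sketched as follows. This is an illustrative Python re-implementation, not svcR code: `in_ball` stands in for the SVC radius test Rp <= Rc, and the merging of adjacent clusters is done directly by flood fill rather than as a separate pass:

```python
from collections import deque

def cluster_labeling(in_ball, G):
    """Assign a cluster id to every grid cell; 0 means outside every cluster."""
    num_points = [[0] * G for _ in range(G)]
    next_id = 1
    for i in range(G):
        for j in range(G):
            if in_ball(i, j) and num_points[i][j] == 0:
                # flood-fill one cluster over 8-connected neighbouring cells
                queue = deque([(i, j)])
                num_points[i][j] = next_id
                while queue:
                    a, b = queue.popleft()
                    for da in (-1, 0, 1):
                        for db in (-1, 0, 1):
                            na, nb = a + da, b + db
                            if (0 <= na < G and 0 <= nb < G
                                    and in_ball(na, nb)
                                    and num_points[na][nb] == 0):
                                num_points[na][nb] = next_id
                                queue.append((na, nb))
                next_id += 1
    return num_points
```

With two separated groups of in-ball cells, two distinct cluster ids are produced.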
Finally, we can assign a cluster label to any point of the data set according to the hash function and the corresponding ball value defined in Definition 3.5:

(15) ClassPoints(x) = cluster id of the ball B(h(x)), where h(x) is the grid hash of x
The MatchGridPoint function, presented below, achieves the second stage: the computation of the labels of Equation 15. It returns a vector called ClassPoints associating a cluster id to each data point of the initial dataset (see Table 2).
Require: data matrix MD, grid size G, MinMaxX, MinMaxY, NumPoints, neighbourhood k of a data point.
Ensure: ClassPoints, a vector giving, for each data point, its membership to a cluster id.
1: for each point D(i) in MD do
2:   Calculate the grid coordinates of D(i), with MinMaxX and MinMaxY
3: end for
4: for each point D(i) in MD do
5:   Init a score vector SV(i) with the dimension of the number of cluster ids
6:   for each grid point P(j) in NumPoints do
7:     if a cluster id c is found for P(j) and the distance between P(j) and D(i) <= k then
8:       Increment SV(i)(c)
9:     end if
10:  end for
11:  Associate Max(SV) to ClassPoints(i)
12: end for
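The second stage of Table 2 amounts to a majority vote over labeled grid cells near each data point. A hedged Python sketch (names and the neighbourhood shape are our own; svcR may differ in detail):

```python
from collections import Counter

def match_grid_point(point_cell, num_points, k):
    """Majority cluster id among labeled grid cells within k cells; 0 if none."""
    i0, j0 = point_cell          # grid cell of the data point
    G = len(num_points)
    votes = Counter()
    for i in range(max(0, i0 - k), min(G, i0 + k + 1)):
        for j in range(max(0, j0 - k), min(G, j0 + k + 1)):
            if num_points[i][j] != 0:
                votes[num_points[i][j]] += 1
    return votes.most_common(1)[0][0] if votes else 0
```

A data point whose cell neighbours a labeled cell inherits that cluster id; isolated points get id 0, matching the "bag of non-clusterable variables" of cluster 0 below.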
3.3 Usage of the svcR package
The main function is findSvcModel. It computes a clustering model and returns it as an R object usable by other functions for display and export. Calling the returned object ret, it carries information about the model parameters, such as the Lagrange coefficients (getlagrangeCoeff(ret)$A attribute), the kernel matrix (getMatrixK(ret) attribute) and the cluster memberships (getClassPoints(ret) attribute). findSvcModel takes 10 arguments:

- data, the data set: a data.frame in standard use, a data.frame in loadMat use, or DatMat in Eval use (a matrix given as the unique argument)
- MetOpt, optimization control parameter: optimStoch (stochastic optimization) or optimQuad (quadratic optimization)
- MetLab, labelling method: gridLabeling (grid labelling), mstLabeling (MST labelling) or knnLabeling (KNN labelling)
- KernChoice, kernel choice: KernLinear (Euclidean), KernGaussian (RBF), KernGaussianDist (exponential) or KernDist (matrix data as kernel values)
- Nu, the ν parameter
- q, the kernel width parameter
- k, the number of nearest neighbours for the grid
- G, the grid size
- Cx, component to display (1 for the 1st attribute)
- Cy, component to display (2 for the 2nd attribute)
If Cx and Cy are 0, the correspondence analysis is used. The data is given as the first argument. The format is data.frame() (i.e. a list), as in the well-known iris dataset. Some R libraries are required: quadprog [2] for optimization, and ade4 [10] and spdep [5] for the factorial coordinate analysis. Here is an example of usage in R:
R> library("svcR")
R> data("iris")
R> retA <- findSvcModel(iris, MetOpt = "optimStoch", MetLab = "gridLabeling",
+    KernChoice = "KernGaussian", Nu = 0.5, q = 40, K = 1, G = 5,
+    Cx = 0, Cy = 0)
R> plot(retA)
R> ExportClusters(retA, "iris")
R> findSvcModel.summary(retA)
Here the data is the iris data frame. The kernel choice is radial-based, and the parameters of the SVC technique are Nu = 0.5 and q = 40. The parameters for cluster labeling are K = 1 neighbor and a grid size of G = 5 points. Cx = Cy = 0 means that the first two principal components are used. The MetLab value means that the geometric-hashing method is used. The plot function permits visualizing the clusters. ExportClusters outputs the clusters in a file with variable names. findSvcModel.summary displays the size and number of clusters, and the averaged attributes for each cluster. Some functions help the user navigate the clusters: ShowClusters(retA) returns all clusters ordered by their id (cluster 0 is a bag of non-clusterable variables), GetClustersTerm(retA, term = "121") returns the clusters having a member whose name contains "121" as a substring, and GetClusterID(retA, Id = 1) returns the cluster with id 1.
3.4 Toy example
We used the famous Fisher's Iris data set. It contains 3 classes, 150 instances and 4 attributes. Our clustering extraction is largely based on the topology of points localized on a 2D map. The dimensions of the map are found by correspondence analysis, and we keep the first two coordinates. In this projection of the Iris data, classes 2 and 3 are not well separated, as shown in Figure 1. The method can thus capture class 1 well, but from time to time a "bridge" occurs between classes 2 and 3 that links them into one cluster (Figure 1).
The system is not robust enough to force so weak a topological boundary, so several iterations can be needed for cluster 2 and cluster 3 to appear. At this grid size, we obtain 50% success over a number of run executions.
The nearest neighbour parameter k is used to find the closest cluster for a given data point. Low values of k give the same level of the precision evaluation parameter for obtaining 3 clusters. But this approach is not sufficient for a good level of precision when the grid size is high, because peripheral data points then lie too far from their cluster. We compare the precision of our approach ("GRID") with two other variants based on an adjacency matrix construction. The first variant builds the adjacency matrix with a minimum spanning tree ("MST-adj"), the second uses k-nearest neighbours ("KNN-adj"). All three approaches have an order parameter we call k: the number of nearest neighbours for GRID and KNN-adj, and the number of links of a node in the tree for MST-adj. Mainly two clusters are captured by all approaches, and precision is computed as the number of points of the majority class included in a cluster divided by the total number of points (150). For GRID, precision when k is small (between 1 and 3) is stable and competitive with the other approaches (Figure 2).
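The precision measure described above (majority-class points per cluster divided by the total number of points) can be sketched as follows; this is an illustrative Python helper, not svcR code:

```python
from collections import Counter

def precision(labels, classes):
    """Fraction of points matching the majority class of their cluster."""
    by_cluster = {}
    for lab, cls in zip(labels, classes):
        by_cluster.setdefault(lab, []).append(cls)
    # for each cluster, count the members of its majority class
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in by_cluster.values())
    return correct / len(labels)
```

For example, two clusters of two points each, one of them containing a point of a foreign class, give a precision of 3/4.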
A second labelling stage using high-distance nearest neighbours should perform well at this grid size. But as we can see in Figure 3 (bottom), the running time of svcR is less attractive as G becomes higher: when G is between 5 and 25, the run time can increase by 25%. We generally choose G in this range, and performance is not damaged compared to the other approaches. Looking at Figure 3 (top), MST-adj is faster than KNN, and with 150 points it is 2.05 times faster than svcR. Even so, svcR divides the time by 1.65, hence we get back at least 40% of the time. In summary, for small G and small data sizes the run time is almost the same for any method, but it increases very fast for the NN method as the data size increases; our approach becomes interesting for a much larger amount of data. For the whole Iris data set our approach is two times faster, but the run time depends on the grid size, and Figure 3 shows that it becomes markedly less competitive as G grows further. We used the quadprog package of the R project for optimization.
4 Representation of term sets and kernels
In this section, we describe a suitable representation for classifying terms from a specific domain with an adapted kernel.
4.1 Data, language models and domain knowledge
In the previous section we showed that a radial basis function can produce a suitable clustering. But the data had few attributes and did not come from natural language, with its surrounding sense ambiguities. We now attempt to classify terms coming from a specific domain: molecular and developmental biology.
Our linguistic data set consists of 1,893 terms (linguistic phrases) manually extracted from an annotation of 1,471 documents (5,730 sentences), where the annotated linguistic phrases describe temporal stages of biological development. The corpus itself was built manually from the Medline document database, about spore coat formation and gene transcription, specifically for the Bacillus subtilis species. We now fix some ideas about the language model studied in the next section. Suppose the phrase "septal localisation of the division" is a term. From this term we can consider different substructures: "septal" and "localisation" are distinct words, and "sept" is a radical, i.e. a sequence of characters that can be found in other words. "septal localisation" is a bigram, i.e. a sequence of two words, and "localisation of the" is a trigram, i.e. a sequence of three words.
The textual corpus we used describes biological knowledge, in particular a well-known biological model called sporulation. This biological process is activated by a microorganism to resist an environment with starvation: the bacterium is transformed into a resistant sphere with minimum needs and activity. In information extraction from texts, gene network reconstruction is a quite interesting field for understanding how a gene network is activated. Temporal and spatial information are complementary pieces of information, useful to understand when gene interactions occurred. A well-studied biological process such as sporulation can be a reference model with two points of interest:

- enough molecular information about gene-gene interactions has been gathered in texts over the past ten years;
- it is a well-described biological model across different stages.
Six main stages describe the sporulation process. At the beginning of the process a frontier called the septum is created, and at the last stage an engulfment is created to release the bacterial spore. The 1,893 terms have also been classified manually into the 6 biological stages; on average, about 600 terms cover a given stage. The problem is related to the morphological and fuzzy description of language. Where a strict formal description would use, for instance, "stage II" for the second stage of biological development, an expert may write "during the first stages of sporulation", "at the onset of sporulation", "at stage III", "after septum formation", etc. The complexity is compounded, as one can guess from 600 terms per class out of only 1,893, by the fact that many terms are not exclusive to one stage (i.e. one class): many expressions can designate a stage, and often several stages at the same time.
Why could a clustering method such as SVC be of interest? We observed that:

- most terms describing the occurrence of a gene activation/inhibition/regulation are not expressed in a simple regular way such as "at stage 2" or "at stage 3"; the terminology of temporal knowledge has a variable expressivity;
- many terms are not exclusive to a stage.
In such a usage context, the svcR technique could help an expert build rules about expressions, i.e. an equivalence between a set of expressions and the mapping of a rule to a specific class. We decided to compare which language model brings the most benefit for term description and, for each language model, which kernel is most relevant. We manually selected a list of simple morphological radicals (11 tokens), word bigrams (a restricted sample of 500 out of 1,477) and word trigrams (a restricted sample of 500 out of 2,179) from the whole set of terms. Figure 4 gives a sample of some linguistic expressions. In our clustering experiments we first made a sample of 98 terms and 4 classes, similar in size to the iris data (Section 3.4).
| Terms (TM) class 1 | TM class 2 | TM class 3 | TM class 4 | Radicals (RD) | Bigrams (BG) | Trigrams (TG) |
|---|---|---|---|---|---|---|
| insertion into the septum | prespore development | cortex layer, synthesized between the forespore inner and outer membranes | coat, encases the spore | init | cell specific | the mother cell |
| integrity of the septum | prespore gene expression | cortex peptidoglycan in spores | coats | sept | spore coat | mother cell specific |
| septal compartment | prespore programme of gene expression | cortex structure | coats of wildtype spores | prespore | during sporulation | in the mother |
| septal localization of the division | presporelike cells | cortexless spores | compartment | endospore | and sporulation | mother cell compartment |
| septal peptidoglycan during cell division | presporespecific | cortical or vegetative peptidoglycan synthesis | compartmentspecific | engulfment | of sporulation | growth and sporulation |
After assessing which language model (term-radical, term-term, term-bigram, term-trigram) and which kernel are efficient enough, we apply the chosen language model and kernel to the whole set of 1,893 terms.
4.2 Kernels
Since terms (which are strings) can, intrinsically and without textual context, be statistically compared pairwise (in the Levenshtein way) or using a bag of words (in the Jaccard way), we compared these approaches, in addition to testing robustness with randomized non-null values in the Jaccard case. The Levenshtein distance is an edit distance based on the cost of transforming one string into another [27]. Let a and b be two strings, with a_{1..i} the substring consisting of the first i symbols of a and b_{1..j} the substring consisting of the first j symbols of b. Iteratively, we obtain the Levenshtein distance at position i and j:
(16) D(i, j) = min( D(i−1, j) + w_del, D(i, j−1) + w_ins, D(i−1, j−1) + w_sub·1[a_i ≠ b_j] )
where w_ins, w_del and w_sub are the weights of the insertion, deletion and substitution operations, respectively. Finally, Lev(a, b) = D(|a|, |b|) represents the weighted Levenshtein distance. From its expression we define the Levenshtein radial basis kernel over the vector L_x of Levenshtein distances of a term x to all other terms:

(17) K_LRB(x, y) = exp(−q ||L_x − L_y||²)
We also define a kernel using only the component of the Levenshtein distance between a pair of terms:

(18) K_RBL(x, y) = exp(−q Lev(x, y)²)
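A minimal sketch of the weighted Levenshtein recurrence of Equation 16, and of a pairwise radial kernel in the spirit of Equation 18 (illustrative Python; the unit weights and the role of q are assumptions):

```python
import math

def levenshtein(a, b, w_ins=1, w_del=1, w_sub=1):
    """Weighted edit distance by dynamic programming (Equation 16)."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * w_del          # delete all of a's prefix
    for j in range(1, n + 1):
        D[0][j] = j * w_ins          # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + w_del,
                          D[i][j - 1] + w_ins,
                          D[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else w_sub))
    return D[m][n]

def k_rbl(a, b, q=0.1):
    # radial basis function applied to the pairwise edit distance
    return math.exp(-q * levenshtein(a, b) ** 2)
```

For instance, "septal" and "septum" differ by two substitutions, so their distance is 2, and the kernel value of a term with itself is 1.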
Equations 17 and 18 are compositions with a semi-positive definite kernel (the radial basis function), so the final kernels are also semi-definite positive. The Jaccard index is a similarity index [19] useful to assess the similarity between two objects knowing only the sets of their attributes, rather than the whole attribute set, which is often huge and does not describe the given objects. Its expression is the following, knowing that string a is composed of tokens a_1,…,a_n and string b of tokens b_1,…,b_m:

(19) J(a, b) = |{a_1,…,a_n} ∩ {b_1,…,b_m}| / |{a_1,…,a_n} ∪ {b_1,…,b_m}|
Hence we define a Jaccard-radial basis kernel (JRB) according to the vector J_x of Jaccard indices of a term x with the other terms (the data matrix is symmetric):

(20) K_JRB(x, y) = exp(−q ||J_x − J_y||²)
We also define a kernel using only the component of the Jaccard index between a pair of terms:

(21) K_RBJ(x, y) = exp(−q (1 − J(x, y))²)
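A sketch of the Jaccard index of Equation 19 over token sets, and of a pairwise radial kernel in the spirit of Equation 21 (illustrative Python; treating 1 − J as a distance inside the RBF is our reading):

```python
import math

def jaccard(a, b):
    """Jaccard index of the word-token sets of two terms (Equation 19)."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

def k_rbj(a, b, q=1.0):
    # 1 - J(a, b) plays the role of a distance inside the radial basis function
    return math.exp(-q * (1.0 - jaccard(a, b)) ** 2)
```

For example, "spore coat" and "coat protein" share one token out of three, giving J = 1/3, and the kernel value of a term with itself is 1.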
Equations 20 and 21 are compositions with a semi-positive definite kernel (the radial basis function), so the final kernels are also semi-definite positive. [42] and [4] respectively adapted a kernel approach with a Levenshtein distance and with a Jaccard similarity coefficient, and proved their robustness despite their classical simplicity. In our data representation we have used four kernels: the Levenshtein-radial basis (LRB), the radial basis-Levenshtein (RBL), the Jaccard-radial basis (JRB) and the radial basis-Jaccard (RBJ). We have also introduced noise into the data matrix such that, when the Jaccard coefficient gives 0, we assign a random non-null value to the data matrix component; we call this fuzzified variant Jaccard+. The vector approach using such distance and index heuristics in natural language processing fixes the representation of a description as sets of words, but the properties of such sets can be modulated: for instance, co-frequency in textual context (with left and right collocations) (lexical-based similarity), string inclusion between two terms (dictionary-based similarity), or ontological nodes shared between two terms (conceptual-based similarity). We focused on the second and compared several cases of dictionary to build the similarity matrix. As variables to classify we of course used the set of terms, and as attributes the set of radicals (RD), the set of terms itself (TM), the set of bigrams (BG) and the set of trigrams (TG).
4.3 Results
As we saw in Section 2 with the presentation of the SVC method coupled to our geometric approach for cluster extraction, if no clear geometric separation of the data occurs on the 2D map of correspondence analysis coordinates, the method is unsuccessful. Figure 5 shows plots of the different crossings of data attributes and distances. We see on these maps that TM-TM Levenshtein, TM-RD Jaccard and TM-RD Jaccard+ can produce interesting maps for SVC application.
Thus, on each version of the data matrix, we applied the cluster extraction to compare the efficiency of class retrieval. Figure 6 shows the performance of the method. The Jaccard kernel gives the best results, with a good separation and extraction of classes, and the variant introducing random noise into the matrix is still successful, with 2 misclassified items out of 98.
We now adopt the best clustering setting obtained, namely a term-radical matrix (language model) and the Jaccard-radial basis kernel. To study scalability, we expand the number of terms and radicals taken into account in the Jaccard distance computation. Independently of cluster purity (class homogeneity), the impact of the features (radicals) is what warrants a good separation between similar terms. Remember that the support vector machine is a non-linear method which is efficient only if the data are separable. Hence, the role of the features is to provide similarity clues between terms, the role of the Jaccard index is to capture similarity, the role of the 2D component analysis is to capture the main features that separate the data, and finally the role of support vector clustering is to capture the bounds of clusters thanks to their geometric separation. Figure 7 shows that too many features fail to separate the data (the attribute DName changes for each of the four data sets),
while too few features yield too few clusters. A medium-sized set of features can lead to a good number of clusters: in our case, 38 features describing the structure of terms induce 15 clusters that are easily distinguishable visually. Recall that the set of terms is made of 1,893 terms describing 6 stages of the sporulation process, as mentioned in Section 4.1.
Many terms belong to several stages (in the sense of classes); even a typical string token relevant to a class can belong to different stages. This is mainly because biological stages result from microscopy studies: visual patterns, and often a co-occurrence of patterns, can be simultaneously typical of one stage, while individually a pattern may occur during several stages, e.g. mother cell and compartmentalization (beginning at stage 2 and persisting at stages 3, 4, 5, 6), engulfment (beginning at stage 3 and persisting at stages 4, 5, 6), septum (beginning at stage 1 and persisting at stages 2, 3, 4). This property of cross-membership makes it hard to compute a mapping from a specific term to a unique class. In our results (Figure 7), getting more clusters (15) than classes (6) means that terms can be misclassified (green points), but it yields a variety of specific clusters from which we expect to capture pattern associations that could be used to define the rules of an automaton. For instance, among the 15 clusters, a specific one contains the following members:
compartmentalized activation , compartmentspecific activation
We can imagine a rule associated with stages 2, 3, 4 and 5: "compartment activation". Another cluster contains the following members:
slow postseptational, presporespecific SpoIIIE synthesis, endospore coat,
endospore coat assembly, endospore coat component, forespore coat, from the endospore coat,
cortex and/or coat assembly, spore coat and cortex.
We can imagine rules associated with stages 3, 4, 5 and 6: "endospore coat", "coat cortex", "cortex coat". From these clusters of terms, coat, cort, prespore, endospore, postseptational and forespore are in the sets of features. In this process of lexical rule definition, the user plays an important role, in the sense that a cluster does not give information directly exportable as an automatically defined rule. Visualization of the clusters by an expert leads to the identification of pattern associations to include in lexical rules, especially because the elements taken into account are features, and knowledge about the features is required to state that these rules will be applicable to a set of classes (biological stages). The methodology makes us understand, although it is not a discovery, that clustering mixes several components of different categories. Nevertheless it can be efficient for identifying relevant features to use as lexical patterns in building rules for information extraction; in our case, extraction of information about a biomolecular process described linguistically and formally by several stages (i.e. a scenario in the domain of biology).
5 Comparison with other techniques
In this section, we discuss the behavior of competing clustering methods, existing kernels and the interpretation of SVC clustering capacity. The simple, general R utility function below takes the output of the R clustering functions used (kmeans, svcR, hierarchical) and exploits a property of the data, namely the insertion of the class number in each term (e.g. "4 coat protein" means "coat protein" belongs to class 4). Hence, using the grep function, it is very easy to find the repartition of classes over clusters:
TabEval <- function(Dat) {
  M <- matrix(nrow = (max(Dat[]) + 1), ncol = 8, 0)
  for (k in 1:max(Dat[])) {
    Stat <- c()
    Size <- length(Dat[Dat == k])
    for (i in 1:6) {
      GR   <- grep(i, names(Dat[Dat == k]))
      Stat <- c(Stat, 100 * length(GR) / Size)
    }
    Stat <- c(Stat, 0, Size)
    M[k, ] <- Stat
  }
  Stat <- c()
  for (i in 1:6) {
    Size <- length(Dat[])
    GR   <- grep(i, names(Dat[]))
    Stat <- c(Stat, 100 * length(GR) / Size)
  }
  Stat <- c(Stat, 0, Size)
  M[nrow(M), ] <- Stat
  print(M, digits = 1)
}
5.1 Classical clustering
Algorithms such as k-means (KM) and hierarchical clustering (HC) are widespread knowledge-poor techniques that use metrics to find clusters automatically in any kind of data. Figure 8 shows graphically how such clusters can be represented. For svcR and KM, the 2-dimensional coordinates come from component analysis. On the KM map, only centroids represent clusters (plotted as star characters). HC (Figure 8, center) displays a dendrogram whose branches denote clustered points and requires a cutoff at some level of the tree to obtain clusters. In R, we used the kmeans function from the stats package [40] and the hclusterpar function from the amap package [29].
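For reference, a minimal base-R sketch of the two calls (we use stats::hclust as a stand-in for amap's hclusterpar; the toy data and settings are ours):

```r
set.seed(1)
X  <- matrix(rnorm(60), ncol = 2)          # 30 toy 2-d points
km <- kmeans(X, centers = 3)               # KM: centroids in km$centers
hc <- hclust(dist(X), method = "ward.D2")  # HC: full dendrogram
cl <- cutree(hc, k = 3)                    # cutoff to obtain 3 clusters
```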
Since the data set contains 6 classes and the svcR approach with the JRB kernel extracts 17 clusters, we set 30 clusters as the setting for both the KM and HC functions. Figure 9 shows the content of clusters and the class distribution for each approach (hierarchical, k-means and SVC). The right column of each result set gives the size of each cluster. The last line gives the class distribution over the whole dataset as a baseline (12% of terms belong to class 1, 19% to class 2, 20% to class 3, 20% to class 4, 16% to class 5, 12% to class 6, and the set contains 1893 terms). First, we can observe that the distribution profile of cluster sizes is similar between HC and svcR. Secondly, looking at the over-representation of classes over clusters, HC and KM do not achieve better discrimination of terms across the 6 classes, even though some clusters are better discriminated than others. Language ambiguities seem to be a real bottleneck for all methods when similarity is based on a Jaccard-Radial base kernel. But what happens when string kernels are used?
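The over-representation reading can be made concrete. In the sketch below, the baseline percentages come from the text; the cluster profile and the 1.5x threshold are illustrative choices of ours:

```r
baseline <- c(0.12, 0.19, 0.20, 0.20, 0.16, 0.12)  # classes 1..6, whole set
cluster  <- c(0.30, 0.49, 0.07, 0.05, 0.05, 0.04)  # one cluster's profile
over     <- which(cluster > 1.5 * baseline)        # clearly over-represented
over  # classes 1 and 2 dominate this hypothetical cluster
```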
5.2 String kernels
[28] and [30] promoted string kernels to extract semantic knowledge from texts. String kernels calculate similarity between two strings by matching their common substrings. A basic string kernel is the constant one (SK-constant), which assesses similarity even if characters match in any order; the returned value is higher when order is respected and the matched substrings are longer. Exact matching of contiguous characters gives the spectrum kernel (SK-spectrum) [41]. For instance, for a string of 29 characters evaluated against itself, SK-constant returns 3165 and SK-spectrum returns 430.
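A minimal base-R sketch of a p-spectrum kernel in the sense of [41], counting shared contiguous substrings of length p (the function name is ours, and the fast suffix-array computation of [41] is not reproduced):

```r
spectrum_kernel <- function(x, y, p = 3) {
  grams <- function(s) {                     # all contiguous p-grams of s
    n <- nchar(s)
    if (n < p) return(character(0))
    substring(s, 1:(n - p + 1), p:n)
  }
  gx <- table(grams(x))
  gy <- table(grams(y))
  shared <- intersect(names(gx), names(gy))  # p-grams present in both
  sum(as.numeric(gx[shared]) * as.numeric(gy[shared]))
}
spectrum_kernel("inner coat", "inner coat")          # 8 distinct 3-grams
spectrum_kernel("inner coat", "in the mother cell")  # only "er " and "r c" shared
```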
HC                          KM                          svcR
C1    C2    C3    #         C1    C2    C3    #         C1    C2    C3    #
0.17  0.17  0.16  481       0.23  0.27  0.16  153       0.17  0.17  0.16  481
0.22  0.24  0.16  152       0.04  0.18  0.24  49        0.11  0.17  0.23  283
0.14  0.16  0.25  199       0.22  0.16  0.20  64        0.03  0.25  0.22  156
0.27  0.27  0     15        0.03  0.22  0.22  103       0.04  0.20  0.21  113
0.17  0.17  0.17  78        0.09  0.27  0.18  11        0.15  0.31  0.26  81
0.30  0.49  0.07  61        0     0.10  0.31  29        0.06  0.21  0.19  63
0.13  0.13  0.20  15        0.17  0.11  0.19  54        0.17  0.17  0.17  78
0.12  0.08  0.27  26        0.04  0.28  0.20  50        0     0.28  0.39  18
0.04  0.24  0.21  219       52    0     0     44        0     0     0.25  4
0.25  0     0.25  4         0.17  0.24  17    41        0.08  0.75  0.17  12
0.12  0.19  0.20  1893      0.12  0.19  0.20  1893      0.12  0.19  0.20  1893
If we pick two terms from our biology term data set: SK-constant("inner coat", "in the mother cell") = 22 and SK-spectrum("inner coat", "in the mother cell") = 15; another pair gives SK-constant("inner coat", "initiation of sporulation") = 27 and SK-spectrum("inner coat", "initiation of sporulation") = 24. The values for the two pairs are close according to the string kernels, although the terms of the first pair come from one class (class 2) while the second pair compares terms from different classes (class 1 and class 2). We built a kernel matrix using both string kernels and performed cluster labeling with this similarity information. The result is shown in Figure 10:
Even if SK-constant shows some capability to isolate clusters, one big cluster contains 1600 items, hence 85% of the information. Such a kernel nevertheless remains worth pursuing, perhaps by including more lexical knowledge.
5.3 Clusterability of an SVC model
Section 2.1 presents the general framework of a kernel method. It makes no assumption about clustering, but rather about classification. Nevertheless, SVC is not a new technique in itself. SVC has been seen as one-cluster discovery, since a ball in the dual space is targeted. Hence it was long described as a one-class approach applied to novelty detection, where information deviates from a block of well-known information. In R, the kernlab package [22, 21] implements this novelty-detection task. Running a one-class kernel on our dataset returns a model of 394 support vectors, with a cross-validation error of 0.205. Our observation is that SVC performs cluster extraction (or labeling) well from a 2-dimensional map, depending on an existence condition for clusters: the data must be separable in the 2-d map. Separability can be managed by composing a metric with a radial-based function over all dimensions of the input matrix. A possible explanation of SVC's capability to identify clusters is related to the problem of trying to flatten the skin of an orange onto a tabletop. In cartography, projection is a procedure that transforms locations and features from the three-dimensional surface onto two-dimensional paper in a defined and consistent way; the result has slight bulges and many gaps. The transformation of map information from a sphere to a flat sheet can be accomplished in many ways, but no single projection can accurately portray area, shape, scale, and direction at once. SVC clustering draws on this capacity of projections to distort.
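The separability condition above concerns the 2-d map itself. A sketch of obtaining such a map by component analysis (here with base R's prcomp; the toy data and dimensions are ours):

```r
set.seed(1)
X   <- matrix(rnorm(50 * 4), ncol = 4)  # toy data with 4 attributes
map <- prcomp(X)$x[, 1:2]               # project onto first two components
dim(map)                                # a 50-point 2-d map for labeling
```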
6 Conclusion
We developed, improved and applied a density- and kernel-based method called support vector clustering (SVC), which we implemented as an R package (svcR). The package is available from the CRAN server (http://cran.r-project.org/, see Software, Packages; svcR version 1.4) and downloadable from the R graphical user interface (required R libraries: quadprog, ade4 and spdep). First, we showed that mapping points in the data space to a grid, and using the sphere radius from the attribute space together with a k-nearest-neighbor approach, reduces the time consumed by cluster labeling. In this sense, SVC can be seen as an efficient cluster-extraction method if clusters are separable in a 2-d map. Secondly, we found a representation for term clustering using a mixed Jaccard-Radial base kernel and we showed its efficiency with SVC for term clustering in a natural language processing task, namely lexical classification (i.e. ontology-oriented knowledge acquisition). Some work remains on the R implementation to integrate C functions for matrix acquisition, so as to make the toolkit more scalable with data size. Semantic and lexical kernels are promising for application in text mining frameworks, yet one must understand how to select and integrate attributes for the description of terms. In future work, we aim to investigate the extraction of clusters over more than 2 dimensions and to test robustness on non-separable data.
Acknowledgments
Special thanks to Roy Varshavsky, Marine Campedel, Dunyon Lee and Olivier Chapelle for helpful discussions. The methodology discussed in this paper was supported by the INRA1077SE grant from the French Institute for Agricultural Research (agriculture, food & nutrition, environment and basic biology).
References

[1] Asa Ben-Hur, David Horn, Hava Siegelmann, and Vladimir Vapnik. Support vector clustering. Journal of Machine Learning Research, 2:125–137, 2001.
 [2] Berwin A. Turlach and Andreas Weingessel. quadprog: Functions to Solve Quadratic Programming Problems, 2007. R package version 1.5-3.
 [3] Biter Bilen. Support vector clustering of microarray gene expression data. Technical Report, Bilkent University, Turkey, 2005.

[4] Mikhail Bilenko and Raymond J. Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02-296, Artificial Intelligence Lab, University of Texas at Austin, 2002.
 [5] Roger Bivand. spdep: Spatial Dependence, Weighting Schemes, Statistics and Models, 2010. R package version 0.5-14.
 [6] Marine Campedel and Eric Moulines. Méthodologie de sélection de caractéristiques pour la classification d'images satellitaires. In Proceedings of the Conférence Nationale sur l'Apprentissage (CAP'05), Nice, France, pages 107–108, 2005.
 [7] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
 [8] Béatrice Daille. Conceptual structuring through term variations. In F. Bond, A. Korhonen, D. McCarthy, and A. Villavicencio, editors, Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, 2003.
 [9] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th Annual Symposium on Computational Geometry (SCG'04), New York, USA, pages 253–262. ACM New York, NY, USA, 2004.
 [10] S. Dray and A.B. Dufour. The ade4 package: Implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1–20, 2007.

[11] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Salvatore Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In D. Barbara and S. Jajodia, editors, Applications of Data Mining in Computer Security, pages 77–102. Kluwer, 2002.
 [12] Damien Eveillard and Yann Guermeur. Statistical processing of SELEX results. In Proceedings of the 10th International Conference on Intelligent Systems for Molecular Biology (ISMB'02), Poster Session, Edmonton, Canada, 2002.
 [13] Christiane Fellbaum. WordNet : An Electronic Lexical Database. MIT Press, 1998.
 [14] William A. Gale, Kenneth W. Church, and David Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439, 1992.
 [15] Gregory Grefenstette. Sextant: Extracting semantics from raw text: Implementation details, heuristics. Integrated ComputerAided Engineering, 1:527–536, 1994.
 [16] Zellig Harris. Mathematical Structure of Language. John Wiley & Sons, 1968.
 [17] Marti Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING’92), Nantes, France, pages 539–545, 1992.
 [18] David Horn. Clustering via Hilbert space. Physica A, 302(1):70–79, 2001.
 [19] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
 [20] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Englewood Cliffs, New Jersey, Prentice Hall, 1988.
 [21] Alexandros Karatzoglou, David Meyer, and Kurt Hornik. Support vector machines in R. Journal of Statistical Software, 15(9):1–28, 2006.
 [22] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab – An S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004.

[23] Kees Jong, Elena Marchiori, and Aad van der Vaart. Finding clusters using support vector classifiers. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN'03), Bruges, Belgium, pages 23–25, 2003.
 [24] Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1):15–68, 2000.
 [25] Aleksandar Lazarevic, Aysel Ozgur, Levent Ertoz, Jaideep Srivastava, and Vipin Kumar. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM'03), San Francisco, USA, pages 25–36, 2003.
 [26] Jaewook Lee and Daewon Lee. An improved cluster labeling method for support vector clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):461–464, 2005.
 [27] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.
 [28] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
 [29] Antoine Lucas. amap: Another Multidimensional Analysis Package, 2007. R package version 0.82.
 [30] Alessandro Moschitti. Syntactic and semantic kernels for short text pair categorization. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL'09), Athens, Greece, pages 576–584. Association for Computational Linguistics Morristown, NJ, USA, 2009.
 [31] Adeline Nazarenko, Pierre Zweigenbaum, Jacques Bouaud, and Benoît Habert. Corpus-based identification and refinement of semantic classes. Journal of the American Medical Informatics Association, 4(suppl):585–589, 1997.

[32] Jin-Hyeong Park, Xiang Ji, Hongyuan Zha, and Rangachar Kasturi. Support vector clustering combined with spectral graph partitioning. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Cambridge, UK, pages 581–584, 2004.
 [33] Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL'93), Jerusalem, Israel, pages 183–190, 1993.
 [34] Wilfredo J. PumaVillanueva, George B. Bezerra, Clodoaldo A.M. Lima, and Fernando J. Von Zuben. Improving support vector clustering with ensembles. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN’05), Montreal, Canada, pages 13–15, 2005.
 [35] Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. In Proceedings of the 11th National Conference on Artificial Intelligence (NCAI’93) , Washington, USA, pages 811–816. AAAI Press/The MIT Press, 1993.
 [36] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw Hill, New York, 1983.
 [37] Bernhard Schölkopf, John Platt, John Shawe-Taylor, Alex Smola, and Robert Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
 [38] Frank A. Smadja and Kathleen R. McKeown. Automatically extracting and representing collocations for language generation. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics (ACL'90), Pittsburgh, USA, pages 574–579. Association for Computational Linguistics Morristown, NJ, USA, 1990.
 [39] Steven Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
 [40] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010. ISBN 3900051070.
 [41] C. H. Teo and S. V. N. Vishwanathan. Fast and space efficient string kernels using suffix arrays. In Proceedings of the International Conference on Machine Learning (ICML'06), Pittsburgh, USA, pages 929–936. ACM New York, NY, USA, 2006.
 [42] Jianhua Xu and Xuegong Zhang. Kernels based on weighted levenshtein distance. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN’04), Budapest, Hungary, pages 3015–3018, 2004.
 [43] Jianhua Yang, Vladimir Estivill-Castro, and Stephan Chalup. Support vector clustering through proximity graph modelling. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), Singapore, pages 898–903, 2002.
 [44] Ying Zhang, Hongye Su, Tao Jia, and Jian Chu. Rule extraction from trained support vector machines. In Proceedings of the 9th International PacificAsia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD’05), Hanoi, Vietnam, pages 61–70, 2005.