In many real-world applications of machine learning and related techniques, the raw data are no longer in a standard and simple tabular format in which each object is described by a common and fixed set of numerical attributes. This standard vector model, while useful and efficient, has some obvious limitations: it is limited to numerical attributes, and it cannot handle objects with non-uniform descriptions (e.g., situations in which some objects have a richer description than others), relations between objects (e.g., persons involved in a social network), etc.
In addition, it is quite common for real-world applications to have some dynamic aspect, in the sense that the data under study are the results of a temporal process. In that case, the traditional hypothesis of statistical independence between observations no longer holds: new hypotheses and theoretical analyses are needed to justify the mathematical soundness of machine learning methods in this context.
Artificial neural networks provide some of the most efficient techniques for machine learning and data mining osowskiEtAl2004. Like other solutions, they were mainly developed to handle vector data and analyzed theoretically in the context of statistically independent observations. However, the last decade has seen numerous efforts to overcome those two limitations hammerJain04NonStandardData. We survey in this article some of the resulting solutions. We focus our attention on the two major artificial neural network models: the Multi-Layer Perceptron (MLP) and the Self-Organizing Map (SOM).
2 Multi-Layer Perceptrons
The Multi-Layer Perceptron (MLP) is one of the best-known artificial neural network models (see e.g., bishop_NNPR1995). From a statistical point of view, an MLP can be considered as a parametric family of regression functions. Technically, if the data set consists of vector observations in $\mathbb{R}^p$, that is, if each object is described by a vector $x \in \mathbb{R}^p$, the output of a one hidden layer perceptron with $k$ hidden neurons is given by

$$F_\theta(x) = \sum_{i=1}^{k} a_i\,\phi\left(w_i^\top x + b_i\right) \tag{1}$$

where the $w_i$ are vectors of $\mathbb{R}^p$, and the $a_i$ and $b_i$ are real numbers ($\theta$ denotes the vector of all parameters obtained by concatenating the $w_i$, $a_i$ and $b_i$). In this equation, $\phi$ is a bounded transfer function which introduces some non-linearity in $F_\theta$. Given a set of training examples, that is, pairs $(x_j, y_j)_{1 \le j \le n}$, the learning process consists in minimizing over $\theta$ a distance between the target value $y_j$ and the predicted value $F_\theta(x_j)$. Given an error criterion (such as the mean squared error), an optimal value for $\theta$ is determined by any optimization algorithm (such as quasi-Newton methods, see e.g. boydVandenbergheCVBook2004), leveraging the well-known backpropagation algorithm werbos74Thesis, which enables a fast computation of the derivatives of $F_\theta$ with respect to $\theta$. The use of a one hidden layer perceptron model is motivated by approximation results such as hornik_stinchcombe_white_NN1989 and by learnability results such as white1990NeuralNetworks (in the statistical community, learning is called estimation and learnability is called consistency).
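As an illustration of the model and of backpropagation, the following minimal sketch computes the output of a one hidden layer perceptron and the gradient of the mean squared error with respect to all parameters. It assumes a hyperbolic tangent transfer function and no output bias; the function names (`mlp_forward`, `mse_and_gradients`) and array layouts are illustrative choices, not taken from the cited references.

```python
import numpy as np

def mlp_forward(x, w, b, a):
    """Output of a one hidden layer perceptron for inputs x of shape (n, p).

    w: (k, p) hidden weights, b: (k,) hidden shifts, a: (k,) output weights.
    """
    hidden = np.tanh(x @ w.T + b)      # bounded transfer function, shape (n, k)
    return hidden @ a, hidden

def mse_and_gradients(x, y, w, b, a):
    """Mean squared error and its gradients, computed by backpropagation."""
    n = x.shape[0]
    pred, hidden = mlp_forward(x, w, b, a)
    residual = pred - y                                  # shape (n,)
    mse = np.mean(residual ** 2)
    d_pred = 2.0 * residual / n                          # derivative of the MSE w.r.t. predictions
    grad_a = hidden.T @ d_pred                           # shape (k,)
    d_pre = np.outer(d_pred, a) * (1.0 - hidden ** 2)    # back through tanh, shape (n, k)
    grad_w = d_pre.T @ x                                 # shape (k, p)
    grad_b = d_pre.sum(axis=0)                           # shape (k,)
    return mse, grad_w, grad_b, grad_a
```

The returned gradients can be passed to any gradient-based optimizer, including quasi-Newton routines.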
2.1 Model selection issues for MLP
It is well known since the seminal paper of lapedes_Farber_1987 that MLPs are an efficient solution for modeling time series whenever the linear model proves to be inadequate. The simplest approach consists in building a non-linear auto-regressive model: given a real valued time series $(y_t)_{t}$, one builds training pairs $(x_t, y_t)$, where $x_t$ is a vector in $\mathbb{R}^p$ defined by $x_t = (y_{t-1}, \dots, y_{t-p})$. Then an MLP is used to learn the mapping between the $x_t$ (the past of the time series in a time window of length $p$) and $y_t$ (the current value of the time series), as in any regression problem.
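The construction of the training pairs can be sketched as follows. This is a minimal illustration; the function name `autoregressive_pairs` and the ordering of the lagged values inside each input vector (most recent first) are assumptions made for the example.

```python
import numpy as np

def autoregressive_pairs(series, p):
    """Build (past window, current value) pairs from a real valued time series."""
    series = np.asarray(series, dtype=float)
    if p < 1 or p >= len(series):
        raise ValueError("p must satisfy 1 <= p < len(series)")
    # The row for time t holds the p previous values, most recent first;
    # the associated target is the value at time t.
    inputs = np.stack([series[t - p:t][::-1] for t in range(p, len(series))])
    targets = series[p:]
    return inputs, targets
```

The resulting arrays can be fed to any regression model, an MLP in the setting discussed here.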
In order to avoid overfitting and/or large computation times, the question of selecting the correct number of neurons or, more generally, the question of model selection arises immediately. Standard methods used by the neural network community are based on pruning: one trains a possibly too large MLP and then removes useless neurons and/or connection weights. Heuristic solutions include Optimal Brain Damage leCunEtAl1990OBD and Optimal Brain Surgeon hassibiEtal1993OBS, but a statistically founded method, SSM (Statistical Stepwise Method), was introduced by cottrellEtAl1995SSM. The method relies on the minimization of the Bayesian Information Criterion (BIC). Shortly after, yao2000 and rynkiewiczEtAl2001 proved the (almost sure) consistency of BIC in the case of MLPs with one hidden layer. These results, established for time series, generalize the consistency results in white1990NeuralNetworks for the iid case.
The convergence properties of BIC may be generalized even further. A first extension is given in rynkiewicz2008. The noise is assumed to be Gaussian and the transfer function is assumed to be bounded and three times differentiable. Then rynkiewicz2008 shows that, under some mild hypotheses, the maximum of the likelihood-ratio test statistic (LRTS) converges toward the maximum of the square of a Gaussian process indexed by a class of limit score functions. The theorem establishes the tightness of the likelihood-ratio test statistic and, in particular, the consistency of penalized likelihood criteria such as BIC. Some practical applications of such methods can be found in mangeas1997. The hypothesis on the noise was relaxed in rynkiewicz2012.
On the basis of the theoretical results above, a practical procedure for MLP identification is proposed. For a one hidden layer perceptron with $k$ hidden units, we first introduce

$$T_n(k) = \min_{\theta} E_n(\theta) + a_n(k)$$

where $E_n(\theta)$ is the mean squared error of the MLP for parameter $\theta$ and $a_n(k)$ is a penalty term. Then we proceed as follows:
1. Determination of the right number of hidden units.
(a) begin with one hidden unit, compute $T_n(1)$,
(b) add one hidden unit if $T_n(k+1) \le T_n(k)$,
(c) if $T_n(k+1) > T_n(k)$ then stop and keep $k$ hidden units for the model.
2. Prune the weights of the MLP using classical techniques like SSM cottrellEtAl1995SSM.
Note that the choice of the penalty term is very important. On simulated data, good results have been reported for from to (see rynkiewiczEtAl2001, rynkiewicz2006).
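The first step of the procedure can be sketched as a simple loop. The sketch assumes that the criterion is the sum of the minimal mean squared error and of the penalty term; `min_mse` and `penalty` are hypothetical callables supplied by the user (the first one would typically train an MLP with the given number of hidden units and return its training error), and `max_units` is a safeguard added for the example.

```python
from typing import Callable

def select_hidden_units(min_mse: Callable[[int], float],
                        penalty: Callable[[int], float],
                        max_units: int = 50) -> int:
    """Add hidden units one at a time until the penalized criterion increases."""
    k = 1
    current = min_mse(k) + penalty(k)            # criterion with one hidden unit
    while k < max_units:
        candidate = min_mse(k + 1) + penalty(k + 1)
        if candidate > current:                  # criterion got worse: keep k units
            break
        k, current = k + 1, candidate
    return k
```

For instance, with an error decreasing as `1 / k` and a penalty of `0.1 * k`, the loop stops at three hidden units.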
Let us also mention that the tightness of the LRTS and, in particular, the consistency of the BIC criterion were recently established for more complex neural network models such as mixtures of MLPs olteanuRynkiewicz2008 and mixtures of experts olteanuRynkiewicz2011.
2.2 Modeling and forecasting nonstationary time series
As mentioned in the previous section, MLPs are a useful tool for modeling time series. However, most of the results cited above are available for iid data or for stationary time series. In order to deal with highly nonlinear or nonstationary time series, a hybrid model involving hidden Markov models (HMM) and multilayer perceptrons was proposed in rynkiewicz1999. Let us consider $(X_t)_{t \in \mathbb{N}}$, a homogeneous Markov chain valued in a finite state-space $E = \{e_1, \dots, e_N\}$, and $(Y_t)_{t \in \mathbb{N}}$, the observed time series. The hybrid HMM/MLP model can be written as follows:

$$Y_{t+1} = F_{X_{t+1}}\left(Y_t, \dots, Y_{t-p+1}\right) + \sigma_{X_{t+1}}\,\varepsilon_{t+1}$$

where $F_{X_{t+1}} \in \{F_{e_1}, \dots, F_{e_N}\}$ is a regression function of order $p$. In this case, $F_{e_i}$ is the $i$-th MLP of the model, parameterized by the weight vector $\theta_i$. $\sigma_{X_{t+1}} \in \{\sigma_{e_1}, \dots, \sigma_{e_N}\}$ is a strictly positive number and $(\varepsilon_t)_{t \in \mathbb{N}}$ is an iid sequence of standard Gaussian variables.
The estimation procedure as well as the statistical properties of the parameter estimates were established in rynkiewicz2001. The proposed model was successfully applied to the modeling of difficult data sets such as ozone peaks dutotEtAl2007 or financial shocks mailletEtAl2004.
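To make the generative mechanism concrete, the following sketch simulates a series from such a regime-switching model. It only illustrates the data-generating process, not the estimation procedure of the cited works; the helper `tiny_mlp`, the zero initial history and the choice of state 0 as starting regime are assumptions made for the example.

```python
import numpy as np

def tiny_mlp(w, b, a):
    """Hypothetical one hidden layer regression function of the p past values."""
    return lambda past: float(a @ np.tanh(w @ past + b))

def simulate_hybrid_hmm_mlp(n_steps, transition, regressions, sigmas, p, seed=0):
    """Simulate a series whose regression function is selected by a hidden Markov chain."""
    rng = np.random.default_rng(seed)
    n_states = transition.shape[0]
    y = np.zeros(n_steps + p)                  # the first p values form a zero history
    states = np.zeros(n_steps + p, dtype=int)  # the chain starts in state 0
    for t in range(p, n_steps + p):
        # Homogeneous Markov chain: the next state depends only on the previous one.
        states[t] = rng.choice(n_states, p=transition[states[t - 1]])
        past = y[t - p:t][::-1]                # p previous values, most recent first
        regime = states[t]
        y[t] = regressions[regime](past) + sigmas[regime] * rng.standard_normal()
    return y[p:], states[p:]
```

Each regime has its own regression function and noise level, so that the simulated series alternates between different dynamics.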
2.3 Functional data
The original MLP model is limited to vector data for an obvious reason: each neuron computes its output as a non-linear transformation applied to a (shifted) inner product (see equation (1)). However, as first pointed out in sandberg_IEEETCS1996, this general formula applies to any data space on which linear forms can be defined: given a data space $\mathcal{X}$ and a set $\mathcal{W}$ of linear functions from $\mathcal{X}$ to $\mathbb{R}$, one can define a general neuron with the help of $w \in \mathcal{W}$, as calculating $\phi(w(x) + b)$, where $\phi$ is the transfer function and $b$ a real-valued shift.
This generalization is particularly suitable for functional data, that is, for data in which each object is described by one or several functions ramsay_silverman_FDA1997. This type of data is quite common, for instance in multiple time series settings (where each object under study evolves through time and is described by the temporal evolutions of its characteristics) or in spectrometry. A functional neuron rossi_conanguez_NN2005 can then be defined as calculating $\phi\left(b + \int g(t)\,w(t)\,dt\right)$, where $g$ is the observed function and $w$ is a parameter function. Results in rossi_conanguez_NN2005 show that MLPs based on this type of neuron share many of the interesting properties of classical MLPs, from universal approximation to statistical consistency (see also rossi_conanguez_NPL2006 for an alternative functional neuron with similar properties). In addition, the parameter functions can be represented by standard numerical MLPs, leading to a hierarchical solution in which a top level MLP for functional data is obtained by using a numerical MLP in each of its functional neurons. Experimental results in rossi_conanguez_NN2005; rossi_conanguez_fleuret_IJCNN2002 show the practical relevance of this technique.
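In practice, functions are observed on a sampling grid and the integral must be approximated. The following sketch evaluates a functional neuron with the trapezoidal rule; the hyperbolic tangent transfer function, the common sampling grid and the function names are assumptions made for the illustration.

```python
import numpy as np

def trapezoid_integral(values, grid):
    """Trapezoidal approximation of the integral of a function sampled on grid."""
    steps = np.diff(grid)
    return float(np.sum(steps * (values[:-1] + values[1:]) / 2.0))

def functional_neuron(g_values, w_values, grid, bias=0.0):
    """Transfer function applied to the shifted integral of the product g * w."""
    return float(np.tanh(bias + trapezoid_integral(g_values * w_values, grid)))
```

Here `w_values` plays the role of the parameter function sampled on the grid; in the hierarchical construction mentioned above it would itself be produced by a numerical MLP.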
3 Self-Organizing Maps
Like the MLP, Kohonen’s Self-Organizing Map (SOM) is one of the best-known artificial neural network models kohonen_SOM2001. The SOM is a clustering and visualization model in which a set of vector observations in $\mathbb{R}^p$ is mapped to a set of neurons organized in a low dimensional prior structure, mainly a two dimensional grid or a one dimensional string. Each neuron $k$ is associated to a codebook vector $m_k$ in $\mathbb{R}^p$ ($m_k$ is also called a prototype). As in all prototype based clustering methods, each $m_k$ represents the data points that have been assigned to the corresponding neuron, in the sense that $m_k$ is close to those points (according to the Euclidean distance in $\mathbb{R}^p$). The distinctive feature of the SOM is that each prototype is also somewhat representative of data points assigned to other neurons, based on the geometry of the prior structure: if neurons $k$ and $l$ are neighbours in the prior structure, then $m_k$ will be close to data points assigned to neuron $l$ (and vice versa). On the contrary, if $k$ and $l$ are far away from each other in the prior structure, the data points assigned to one neuron will not influence the prototype of the other neuron. This has some very important consequences in terms of visualization capabilities, as illustrated in vesanto1999SomVisu for instance.
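The interplay between prototypes and prior structure can be illustrated by a minimal stochastic SOM on a two dimensional grid. This is a sketch under simplifying assumptions (Gaussian neighbourhood, linearly decreasing learning rate and radius, prototypes initialized on randomly chosen observations); the schedules and function names are illustrative and are not those of a reference implementation.

```python
import numpy as np

def train_som(data, grid_shape=(5, 5), n_epochs=20, seed=0):
    """Train a stochastic SOM on vector data of shape (n, p)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Positions of the neurons in the prior structure (two dimensional grid).
    positions = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    prototypes = data[rng.choice(len(data), size=rows * cols, replace=True)].astype(float)
    n_updates = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_updates
            lr = 0.5 * (1.0 - frac) + 0.01                        # decreasing learning rate
            radius = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5   # shrinking neighbourhood
            winner = np.argmin(np.sum((prototypes - x) ** 2, axis=1))
            grid_dist2 = np.sum((positions - positions[winner]) ** 2, axis=1)
            influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))
            # Neighbours of the winner in the grid also move toward x.
            prototypes += lr * influence[:, None] * (x - prototypes)
            step += 1
    return prototypes, positions

def assign(data, prototypes):
    """Index of the closest prototype (Euclidean distance) for each observation."""
    d2 = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

Because neighbouring neurons are updated together, prototypes that are close in the grid end up close in the data space, which is what makes grid-based displays of the prototypes meaningful.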
The original SOM algorithm was designed for vector data, but numerous adaptations to more complex data have been proposed. We survey here three specific extensions, respectively to time series, functional data and categorical data. Another important extension not covered here is proposed in hammer_etal_N2004, which builds upon the processing of multiple time series with recursive versions of the SOM. The authors show that trees and graphs can be clustered by those versions of the SOM, using a temporal coding of the structure. Recent advances in this line of research include e.g. hagenbuchnerEtAl2009. Other specific adaptations include the symbol strings SOM described in somervuo_NN2004.
3.1 Time series with metadata
While the SOM is a clustering algorithm, it has frequently been used in a supervised context as a component of a complex model. We briefly describe here one such model as an example of complex time series processing with the SOM. Let us consider a time series with two time scales, i.e., that can be written down with two subscripts. The date is denoted by $(j, h)$ where $j$ represents the slow time scale and corresponds for instance to the day (or month or year) while $h$ corresponds to the observed values (e.g. the hours or half-hours of the day, the days of the month, the months of the year, etc.). Then the time series is denoted $(x_{j,h})$. We assume in addition that the slow time scale is associated with metadata. For instance, each $j$ may correspond to a day in a year for which one knows the day of the week, the month, etc. Metadata are supposed to be available prior to a prediction.
The original time series takes values in $\mathbb{R}$, but the dual time scale leads naturally to a vector valued time series representation, that is to the $x_j = (x_{j,h})_{1 \le h \le H}$. From this point of view, given the past of the vector valued time series, one has to predict a future vector value, that is a complete vector of $H$ values. This could be seen as a long term forecasting problem for which a usual solution would be to iterate one-step ahead forecasts. However, this leads generally to unsatisfactory solutions, because of either a squashing behaviour (convergence of the forecasts to the mean value of the series) or a chaotic behaviour (for nonlinear methods).
An alternative solution is explored in cott_1998. It consists in forecasting separately, on the one hand, the mean and variance of the time series on the next slow time scale step (that is, on the next $j$), and on the other hand, the profile of the fast time scale. The prediction of the mean and of the variance is done by any classical technique. For the profile, a SOM is used as follows. The vector values of the time series, i.e., the $x_j = (x_{j,h})_{1 \le h \le H}$ (with $j$ indexing the slow time scale and $h$ the fast one), are centred and normalized with respect to the fast time scale, that is, are transformed into profiles defined by

$$P_{j,h} = \frac{x_{j,h} - \mu_j}{\sigma_j}$$

where $\mu_j$ and $\sigma_j^2$ are respectively the mean and the variance of $(x_{j,h})_{1 \le h \le H}$. The profiles are clustered with a SOM, leading to some prototype profiles $m_k$. Each prototype is associated to the metadata of the profiles that have been assigned to the corresponding neuron.
Then a vector value $x_j$ is predicted as follows: the mean $\mu_j$ and variance $\sigma_j^2$ are obtained by a standard forecasting model for the slow time scale. Then the metadata of the vector to predict is matched against the metadata associated to neurons: assume for instance that metadata are days of the week, and that we try to predict a Sunday. Then one collects all the neurons to which Sunday profiles have been assigned. Finally, a weighted average of the matching prototypes is computed and rescaled according to $\mu_j$ and $\sigma_j$. As shown in cott_1998, this technique yields stable and meaningful full-day predictions, while integrating non-numerical metadata.
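The two building blocks of this approach, profile extraction and metadata-driven prediction, can be sketched as follows. The sketch assumes that the SOM has already been trained and that the weights of the average are proportional to the number of matching profiles assigned to each neuron; this weighting, like the function and argument names, is an assumption made for the example and may differ from the one used in cott_1998.

```python
import numpy as np

def to_profiles(values):
    """Centre and normalize each row (one slow time step) of a (J, H) array.

    Rows with zero variance are not handled in this sketch.
    """
    mu = values.mean(axis=1, keepdims=True)
    sigma = values.std(axis=1, keepdims=True)
    return (values - mu) / sigma, mu.ravel(), sigma.ravel()

def predict_vector(prototypes, neuron_of_profile, metadata, target_meta, mu_hat, sigma_hat):
    """Rescaled weighted average of the prototypes matching the target metadata."""
    matching = neuron_of_profile[metadata == target_meta]   # neurons of matching profiles
    neurons, counts = np.unique(matching, return_counts=True)
    weights = counts / counts.sum()                          # assumed weighting scheme
    profile = weights @ prototypes[neurons]
    return mu_hat + sigma_hat * profile
```

Here `mu_hat` and `sigma_hat` stand for the forecasts of the mean and standard deviation produced by the separate slow time scale model.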
3.2 Functional data
The dual time scale approach described in the previous section has become a standard way of dealing with time series in a functional way, as shown in e.g. besseCardotStephenson00. But as pointed out in Section 2.3, functional data arise naturally in other contexts such as spectrometry, and the SOM has accordingly been adapted to functional data beyond time series. In those contexts, in addition to the normalization technique described above that produces profiles, one can use functional transformations such as derivative calculations in order to drive the clustering process by the shapes of the functions rather than mainly by their average values rossiConanGuezElGolliESANN2004SOMFunc.
Another adaptation consists in integrating the SOM with optimal segmentation techniques that represent functions or time series with simple models, such as piecewise constant functions. The main idea is to apply a SOM to functional data using any functional distance (from the $L^2$ norm to more advanced Sobolev norms villmann2007Sobolev) with the additional constraint that prototypes must be simple, e.g., piecewise constant. This leads to interesting visualization capabilities in which the complexity of the display is automatically and globally adjusted hebrailEtAl2010ClustSeg.
3.3 Categorical data
In surveys, it is quite standard that the collected answers are categorical variables with a finite number of possible values. In this case, a specific adaptation of the SOM algorithm can be defined, in the same way that Multiple Correspondence Analysis is related to Principal Component Analysis. More precisely, useful encoding methods for categorical data are the Burt Table (BT), which is the full contingency table between all pairs of categories of the variables, and the Complete Disjunctive Table (CDT), which contains the answers of each individual coded as 0/1 against dummy variables that correspond to all the categories of all variables. Then, a Multiple Correspondence Analysis of the BT or of the CDT is nothing but a Principal Component Analysis on the BT or CDT, previously transformed to take into account a specific distance between the rows and a weighting of the individuals leRoux:2004. The SOM can be adapted to categorical data using this approach, as described in cott_2005 and cott_2004. The same transformation is applied to the BT or CDT, and a SOM can then be trained on the rows of the transformed table. This training provides an organized clustering of all the possible values of the categorical variables on a prior structure such as a two dimensional grid. Moreover, if a simultaneous representation of the individuals and of the values is needed, two coupled SOMs can be trained and superimposed. The aforementioned articles present various real-world use cases from the socio-economic field.
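The two encodings can be illustrated on a toy survey. The sketch below builds the Complete Disjunctive Table and derives the Burt Table from it as a matrix product; it does not implement the subsequent transformation of the tables nor the SOM training, and the function names are illustrative.

```python
import numpy as np

def complete_disjunctive_table(answers):
    """0/1 coding of categorical answers (one row per individual)."""
    n_vars = len(answers[0])
    columns = []                       # (variable index, category) of each dummy variable
    for v in range(n_vars):
        for cat in sorted({row[v] for row in answers}):
            columns.append((v, cat))
    table = np.array([[1 if row[v] == cat else 0 for (v, cat) in columns]
                      for row in answers])
    return table, columns

def burt_table(cdt):
    """Contingency table between all pairs of categories of all variables."""
    return cdt.T @ cdt
```

Each row of the Complete Disjunctive Table sums to the number of variables, and the diagonal of the Burt Table contains the number of individuals choosing each category.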
4 Kernel and dissimilarity SOM
The extensions of artificial neural network models described in the previous sections are ad hoc in the sense that they are constructed using specific features of the data at hand. This is a strength but also a limitation as they are not universal: given a new data type, one has to design a new adaptation of the general technique. In the present section, we present more general versions of the SOM that are based on a dissimilarity or a kernel on the input data. Assuming the existence of such a measure is far weaker than assuming the data are in a vector format. For instance, it is simple to define a dissimilarity/similarity between the vertices of a graph, a data structure that is very frequent in real-world problems newman2003GraphSurveySIAM, while representing directly those vertices as vectors is generally difficult.
4.1 Dissimilarity SOM
Let us assume that the data under study belong to a set $\mathcal{X}$ on which a dissimilarity $d$ is defined: $d$ is a function from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}^+$ that maps a pair of objects $x$ and $x'$ to a non-negative real number $d(x, x')$ which measures how different $x$ and $x'$ are. Hypotheses on $d$ are minimal: it has to be symmetric ($d(x, x') = d(x', x)$) and such that $d(x, x) = 0$.
As pointed out above, dissimilarities are readily available on sets of non-vector data. A classical example is the string edit distance levenshtein1966, which defines a distance on symbol strings (a distance is a dissimilarity that satisfies in addition the strong hypothesis of the triangle inequality: $d(x, x'') \le d(x, x') + d(x', x'')$). More general edit distances can be defined, such as the graph edit distance, which measures distances between graphs bunke_ICMLDMPR2003.
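As an illustration, the string edit distance with unit costs for insertions, deletions and substitutions can be computed by the classical dynamic programming scheme sketched below (the function name is illustrative).

```python
def edit_distance(s, t):
    """Minimal number of insertions, deletions and substitutions turning s into t."""
    previous = list(range(len(t) + 1))       # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cs != ct)))  # substitution or match
        previous = current
    return previous[-1]
```

Applying this function to all pairs of strings of a data set gives a dissimilarity matrix that can be used directly by the algorithms described in this section.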
As the hypotheses on $d$ are minimal, one can no longer assume that vector calculations are possible in $\mathcal{X}$. The learning rules of the SOM therefore do not apply, as they are based on linear combinations of the prototypes with the data points. To circumvent this difficulty, kohonen_somervuo_N1998 suggest choosing the values of the prototypes in the set of observations $(x_i)_{1 \le i \le N}$. This leads to a batch version of the SOM which proceeds as follows. After a random initialization of the prototypes, each observation is assigned to the neuron with the closest prototype (according to the dissimilarity measure) and the prototypes are then updated. For each neuron $k$, the updated prototype $m_k$ is chosen among the observations as the minimizer of the following distortion

$$\sum_{i=1}^{N} h\left(c(x_i), k\right)\, d(x_i, m_k) \tag{4}$$

where $c(x_i)$ is $x_i$’s neuron and $h$ is a decreasing function of the distance between neurons in the prior structure. This modification of the SOM algorithm is known as the median SOM and is closely related to the earlier median version of the standard k-means algorithm kaufman_rousseeuw_STABLNRM1987.
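The batch algorithm can be sketched in a few lines when the full dissimilarity matrix is available. The sketch assumes a Gaussian neighbourhood function with a fixed radius (practical implementations usually decrease it during training) and a fixed number of iterations; names and default values are illustrative.

```python
import numpy as np

def median_som(dissim, grid_positions, n_iter=10, radius=1.0, seed=0):
    """Batch median SOM: prototypes are indices of observations.

    dissim: (n, n) dissimilarity matrix; grid_positions: (n_neurons, q) coordinates
    of the neurons in the prior structure (requires n >= n_neurons).
    """
    rng = np.random.default_rng(seed)
    n, n_neurons = dissim.shape[0], grid_positions.shape[0]
    grid_d2 = ((grid_positions[:, None, :] - grid_positions[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-grid_d2 / (2.0 * radius ** 2))       # neighbourhood between neurons
    proto = rng.choice(n, size=n_neurons, replace=False)
    for _ in range(n_iter):
        assign = dissim[:, proto].argmin(axis=1)     # closest prototype of each observation
        # cost[k, j]: distortion of neuron k if its prototype were observation j
        cost = h[:, assign] @ dissim
        proto = cost.argmin(axis=1)
    assign = dissim[:, proto].argmin(axis=1)
    return proto, assign
```

The exhaustive search over all candidate observations makes each iteration quadratic in the number of observations per neuron, which is why faster implementations have been studied.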
In the case where $(x_i)_{1 \le i \le N}$ is a small sample, the constraint of choosing the prototypes among the data can be seen as too strong. For this reason, elgolli_etal_RSA2006 suggest associating several prototypes (a given number $q$) to each neuron. A neuron is represented by a subset of size $q$ of $(x_i)_{1 \le i \le N}$ and the different steps of the SOM algorithm are modified accordingly. A fast implementation is described in conanguez_etal_NN2006.
A successful application of the dissimilarity SOM on real-world data concerns school-to-work transitions. In massoniEtAl2009, we were interested in identifying career-path typologies, which is a challenging topic for economists working on the labor market. The data came from the “Generation’98” survey by the CEREQ. The data sample contained information about 16040 young people who graduated in 1998 and were monitored for 94 months after leaving school. The labor-market statuses had nine categories, from permanent contracts to unemployment, including military service, inactivity and higher education.
The dissimilarity matrix was computed using optimal matching distances abbottTsay2000, which are currently mainstream in economics and sociology. The most striking opposition appeared between the career paths leading to stable-employment situations and the “chaotic” ones. The stable positions were mainly situated in the west region of the map. However, the north and south regions were quite different: in the north-west region, the access to a permanent contract (red) was achieved after a fixed-term contract (orange), while the south-west classes were only subject to transitions through military service (purple) or education (pink). The stability of the career paths worsened as we moved to the east of the map. In the north-east region, the initial fixed-term contract grew longer until becoming precarious, while the south-east region was characterized by exclusion trajectories: unemployment (light blue) and inactivity (dark blue).
Two other extensions of the SOM to dissimilarity data have been proposed; they both avoid the use of constrained prototypes. The older one is based on deterministic annealing graepel_etal_N1998 while the more recent one uses the so-called relational approach that relies on pseudo-Euclidean spaces hammerhasenfuss2010neuralcomputation; hammer_etal_WSOM2007. Both approaches lead to better results for datasets where the ratio between the number of observations and the number of neurons is small.
4.2 Kernel SOM
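An alternative approach to dissimilarities is to rely on kernels. Kernels can be seen as a generalization of the notion of similarity. More precisely, a kernel on a set $\mathcal{X}$ is a symmetric function $K$ from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$ that satisfies a positivity property:

$$\forall n \in \mathbb{N},\ \forall (x_i)_{1 \le i \le n} \in \mathcal{X}^n,\ \forall (\alpha_i)_{1 \le i \le n} \in \mathbb{R}^n,\quad \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) \ge 0.$$

For such a kernel, there is a Hilbert space $\mathcal{H}$ (called the feature space of the kernel) and a mapping $\psi$ from $\mathcal{X}$ to $\mathcal{H}$, such that the inner product in $\mathcal{H}$ corresponds to the kernel via the mapping, that is aronszajn_TAMS1950:

$$K(x, x') = \langle \psi(x), \psi(x') \rangle_{\mathcal{H}} \tag{5}$$

Then $K$ can be interpreted as a similarity on $\mathcal{X}$ (values close to zero correspond to unrelated objects) and defines indirectly a distance between objects in $\mathcal{X}$ as follows:

$$d_K(x, x') = \left\| \psi(x) - \psi(x') \right\|_{\mathcal{H}} = \sqrt{K(x, x) + K(x', x') - 2 K(x, x')}$$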
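The distance induced by a kernel only requires kernel evaluations: expanding the squared norm of the difference of two images in the feature space gives the sum of the two self-similarities minus twice the cross similarity. A minimal sketch, assuming the kernel has been evaluated on all pairs of observations and stored in a Gram matrix (the function name is illustrative):

```python
import numpy as np

def kernel_distances(gram):
    """Pairwise feature-space distances computed from a Gram matrix."""
    diag = np.diag(gram)
    d2 = diag[:, None] + diag[None, :] - 2.0 * gram
    return np.sqrt(np.maximum(d2, 0.0))    # clip tiny negative rounding errors
```

With the linear kernel on vector data, this recovers the usual Euclidean distances.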
As shown in e.g. shawetaylor_cristianini_KMPA2004, kernels are a very convenient way to extend standard machine learning methods to arbitrary spaces. Indeed, the feature space $\mathcal{H}$ comes with the same elementary operations as $\mathbb{R}^p$: linear combination, inner product, norm and distance. One then just has to work in the feature space as if it were the original data space. The only difficulty comes from the fact that $\mathcal{H}$ and $\psi$ are not explicit in general, mainly because $\mathcal{H}$ is an infinite dimensional functional space. One therefore has to rely on equation (5) to implement a machine learning algorithm in $\mathcal{H}$ completely indirectly, using only $K$. This is the so-called kernel trick.
In the case of the batch version of the SOM, this is quite simple boulet_etal_N2008. Indeed, assignments of data points to neurons are based on the Euclidean distance in the classical numerical case: this translates directly into the distance in the feature space, which is calculated solely using the kernel (see the kernel-induced distance defined above). Prototype updates are performed as weighted averages of all data points: weights are computed with the function $h$ introduced in equation (4) as a proxy for the prior structure. It can be shown that those weights, which are computed using the assignments only, are sufficient to define the prototypes and that they can be plugged into the distance calculation, without needing an explicit calculation of $\psi$. Variants of this scheme, especially stochastic ones, have been studied in andras_IJNS2002; macdonald_fyfe_ICKIESAT2000. It should also be noted that the relational approach mentioned in the previous section hammerhasenfuss2010neuralcomputation; hammer_etal_WSOM2007 can be seen as a relaxed kernel SOM, that is, an application of a similar algorithm in situations where the function $K$ is not positive.
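One batch step of this scheme can be sketched as follows. Each prototype is kept implicitly as a vector of coefficients over the observations, and the squared distance between an observation and a prototype is obtained by expanding the squared norm in the feature space into kernel values. The sketch assumes a precomputed Gram matrix and a strictly positive neighbourhood matrix `h` between neurons; function names and array layouts are illustrative.

```python
import numpy as np

def distances_to_prototypes(gram, coeffs):
    """Squared feature-space distances between observations and implicit prototypes.

    coeffs[k, j] is the weight of observation j in prototype k (rows sum to one).
    """
    cross = gram @ coeffs.T                                       # inner products with prototypes
    proto_norm = np.einsum("kj,jl,kl->k", coeffs, gram, coeffs)   # squared norms of prototypes
    return np.diag(gram)[:, None] - 2.0 * cross + proto_norm[None, :]

def kernel_som_step(gram, assign, h):
    """One batch step: prototype weights from the assignments, then reassignment."""
    coeffs = h[:, assign]                                 # neighbourhood weight of each observation
    coeffs = coeffs / coeffs.sum(axis=1, keepdims=True)   # weighted averages
    return distances_to_prototypes(gram, coeffs).argmin(axis=1), coeffs
```

Iterating `kernel_som_step` from an initial assignment, while shrinking the neighbourhood encoded in `h`, gives a simple batch kernel SOM.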
While kernels are very convenient, the positivity conditions might seem very strong at first. They are indeed much stronger than the conditions imposed on a dissimilarity, for instance. Nevertheless, numerous kernels have been defined on complex data gartner08:_kernel_struc_data, ranging from kernels on strings based on substrings lodhi_JMLR2002 to kernels between the vertices of a graph such as the heat kernel kondor_lafferty_ICML2002; smola_kondor_COLT2003 (see boulet_etal_N2008 for a SOM based application of this kernel to a medieval data set of notarial acts). Two graphs can also be compared via a kernel based on random walks gartner_etal_ACCLT2003 or on subtree comparisons ramon_gartner_WMGTS2003.
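As an example of a kernel between the vertices of a graph, the heat (or diffusion) kernel is the matrix exponential of minus the graph Laplacian, scaled by a diffusion parameter. The sketch below computes it for an undirected graph through the eigendecomposition of the Laplacian; the function name and the default value of the parameter `beta` are illustrative choices.

```python
import numpy as np

def heat_kernel(adjacency, beta=1.0):
    """Heat kernel exp(-beta * L) of an undirected graph with Laplacian L."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)      # symmetric matrix: real spectrum
    # exp(-beta * L) shares the eigenvectors of L, with eigenvalues exp(-beta * lambda).
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T
```

The resulting matrix is symmetric with strictly positive eigenvalues, so it can be used as a Gram matrix between the vertices in any kernel method, including the kernel SOM.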
Present-day data are becoming more and more complex, according to several criteria: structure (from simple vector data to relational data mixing a network structure with categorical and numerical descriptions), time evolution (from a fixed snapshot of the data to ever-changing dynamical data) and volume (from small datasets with a handful of variables and a thousand objects to datasets of terabytes and more). Adapting artificial neural networks to those new data is a continuous challenge which can be solved only by mixing different strategies as outlined in this paper: adding complexity to the models makes it possible to tackle non-standard behavior (such as non-stationarity), theoretical guarantees limit the risk of overfitting, new models can be tailor-made for some specific data structures such as graphs or functions, while generic kernel/dissimilarity models can handle almost any type of data. The ability to combine all those strategies demonstrates once again the flexibility of the artificial neural network paradigm.