I Introduction and Problem Description
Humankind is unequivocally living in the age of “Big Data”. The rapid increase in connectivity between people, businesses and consumers, the media, and more has led to an explosion of publicly and privately available data. New information is constantly generated by social media, polling, market surveys, digital cameras, government surveillance, smart phones, scientific experimentation, and many other technological sources and innovations of the past few decades. By the year 2020, the rate of production of digital data is projected to be 44 times as high as the rate in the year 2009, and the overall amount of available data is projected to be as high as 44 zettabytes (1 zettabyte = $10^{12}$ gigabytes). This wealth of information available in the digital ecosystem, combined with ever-increasing information storage capacity, has far-reaching implications in diverse applications BigData ; EconData ; Science ; Storage ; Survey ; WageData . In order to realize the potential of the available data, methods for gaining meaningful insights must be developed. As the sheer quantity of available data exceeds human computational capability, efficient computer algorithms must be created and implemented. This is where the field of machine learning comes into play ML .
Broadly, machine learning is the process of allowing computer programs to parse available data and learn (infer) general rules. The notion of fundamental importance in the previous statement lies in the term “general”. The goal of machine learning is to find a model using input data which can be generalized and applied to new data in such a way that model performance increases with increasing amounts of input data. The basic scheme consists of analyzing a set of input data (“training data”) containing many entities (instances) to which we want to assign some value (label). Each instance is described by a set of quantities (features) which, theoretically, allow it to be mapped to a specific label. The problem, then, is to find a mapping algorithm (model) with parameters which the computer can fit to the given input data, and subsequently apply to future data (“testing data”). Towards that end, there are two main types of machine learning algorithms: supervised and unsupervised. Unsupervised learning involves training data with unknown labels or associations. Unsupervised learning algorithms seek to label instances based on their connections or commonalities with other instances, via methods such as clustering CD ; CD1 ; CD2 ; CD5 ; CD6 ; these include clustering methods CD3 ; CD4 that employ objects called “replicas” somewhat similar in spirit to those that we will introduce in the current work for supervised learning. Supervised machine learning corresponds to learning on training data that has known outcomes, i.e., data for which the “right answer” is known. The algorithm aims to fit the model by using the relationship between the features and known labels, to effectively generalize to new data with unknown labels. Since the advent of supervised machine learning, a number of algorithms have been developed. These are of varying complexity and performance, with some of the most popular being “Support Vector Machine” (SVM) methods SVM ; SVM1 . One may wonder why, in light of the plethora of currently available powerful methods, we should be concerned with the development of novel algorithms. Crudely, in addition to the benefits of having a robust “toolbox” of multiple algorithms, it turns out that existing algorithms are not without their faults.
In this paper, we will specifically focus on supervised machine learning corresponding to data with either discrete (classification) or continuous (regression) labels. We will introduce our new algorithm that learns by fitting an ensemble of stochastic series expansions to the training data, and then ‘votes’ on the output of the label. We will demonstrate, through detailed case studies, that this algorithm, which we term the “Stochastic Replica Voting Machines” (SRVM) method, rivals the best performing contemporary models, and additionally surpasses them in various performance metrics. We will demonstrate that the algorithm applies equally well to both classification and regression.
The remainder of this article is organized as follows: In Section II, we describe the statistical physics based inspiration for the current algorithm. In Section III, we provide the essential detailed setup of the algorithm. We then proceed to test our algorithm against various benchmarks (Section IV). Apart from underscoring the high accuracy of the SRVM method, we report on traits such as the dependence of the results on the specific parameters that underlie our algorithm (Section IV.1), its stability (Section IV.2), and the dependence of our results on different methods of pre-processing (scaling) the data (Section IV.3). Of particular interest are the overlaps (Section IV.4) between the different stochastic functions or “replica” solvers that underlie our method. These overlaps correlate with the accuracy of our predictions, thus enabling us to pinpoint optimal values of the parameters defining our algorithm. In Section IV.5, we illustrate that, by its very nature of including numerous independent stochastic solvers, our approach suffers from far less bias than the common SVM machine learning method. We describe (Section IV.6) a trivial generalization of our method to include multiple “layers” wherein the voting between the different solvers allows for various differently chosen weights. In particular, as we explain, one may find the optimal parameter values of our method (potentially including generalized weights) by applying regression machine learning to recursively learn and predict the parameter values that will yield the best accuracy (or replica overlap). In Section V, we further present tests on the residuals of our method to demonstrate its strength also for regression studies. In Section VI, we suggest that SRVM may be employed to obtain algorithm-independent bounds on the accuracy attainable by any algorithm.
We conclude (Section VII) with a summary of our main results and a speculation concerning coordinates in physical systems as emergent features for which the representation of the data is most robust.
II Basic tenets
Recent decades have seen a flurry of advances in computer science that have been triggered and/or aided by various findings in the natural sciences. Indeed, artificial neural networks aim to emulate quintessential aspects of the biological networks of the brain. Neural networks have witnessed tremendous success in advancing artificial learning ANN ; DL . The study of spin-glasses by physicists and materials scientists has led to the development of Hopfield networks. The incorporation of thermodynamic and statistical mechanics principles into these systems led to some of the most sophisticated machine learning models to date Hopfield ; SpinGlass ; SG1 ; SG2 ; SG3 ; Boltzmann . It is evident that natural scientific principles (such as those from biology or physics) may serve as excellent bases for constructing learning algorithms. Recent results demonstrate that a certain theoretical basis may be required in order to enable learning algorithms to apply to scientific data BlackBox . With these notions in mind, we formulated a novel algorithm for supervised learning that is motivated by statistical physics.
In the classical statistical mechanics of N-particle systems (see, e.g., huang ), each particle carries its own phase space degrees of freedom: its position and momentum coordinates (thus in three-dimensional space, the state of each particle is defined by six degrees of freedom). At any given instant, the ‘list’ of all particle coordinates and momenta for all the particles in the system completely specifies its instantaneous state. Thus, for a system of N particles, this ‘microstate’ can be represented as a single point in a 6N-dimensional phase space. The system itself, comprised of an extremely large number of particles, is macroscopic and can be described using only a few bulk degrees of freedom (e.g., temperature, pressure, magnetization, etc.). These bulk degrees of freedom characterize the observable state of the system in what is termed the ‘macrostate’. The dynamical evolution of the particles in the system causes the microstate to constantly change, transitioning to new points in the phase space (a new ‘list’ of 6N coordinates). If the system is in equilibrium, there is no change in macroscopic degrees of freedom with time, and this means that the microstates correspond in some way to the given macrostate. Additionally, the properties of the macrostate can be found by taking an ensemble average over the microstates corresponding to the macrostate. In general, changing external constraints changes the microstates that are available to the system's particles, and the macrostate can also change. This implies that various sets of microstates correspond to specific macrostates, and this is indeed the case. More specifically, each microstate corresponds to only one specific macrostate. In the phase space picture, then, certain regions of phase space (corresponding to sets of microstates) will map directly to a single macrostate, and there will be boundaries in the phase space separating the different regions.
The above description of statistical mechanical phase space is reminiscent of classification problems. As discussed in the Introduction, classification-based learning problems consist of instances (the particles) which are described by a set of features (positions and momenta). These values of the features for a given instance are cast into a feature vector which gives the ‘location’ of the instance in high-dimensional feature space (phase space). Each instance has an associated classification label corresponding to the set of features, such that certain regions of feature space map to specific labels. The goal of the learning algorithm is to find the boundary between the classification labels in feature space, so that new instances (which correspond to some point in feature space) can be appropriately mapped to the proper label. A schematic is provided in Fig. 1.
In order to achieve this goal, we need an appropriate mapping function whose argument is a vector representing a particular point in the space of all attributes (“features”) of the data. In the statistical mechanical framework, mapping to a specific macrostate is done via minimization of an appropriate free energy. Once the free energy is properly extremized, calculating its value for a given point in phase space will allow for the elucidation of the corresponding macrostate or phase. Twentieth century physicist Lev Landau studied free energies that could be expanded in a set of polynomial kernel functions of features (the so-called “order parameters”) and their gradients. The kernel expansion, with coefficients whose values were fixed through optimization, could then be applied to determine which macrostate a region of phase space belonged to (see, e.g., huang ; lev ). Thus, borrowing this idea, since we are interested in identifying classification boundaries, we will assert that the label of a given instance can be expanded, with unknown coefficients, in a set of kernel functions which take as their argument the feature vector. For a binary classification problem, the sign of a voting function weighted by different “replica” functions determines the classification of the feature vector.
The general idea underlying our use of “replicas” is sketched in Fig. 2. In essence, the system may be examined independently by random machine learning solvers (the “replicas”). These replicas may interact with one another in order to produce a collective prediction that is more stable and less biased than that potentially found by a single solver. This idea was used in CD3 ; CD4 for general unsupervised machine learning (clustering), unsupervised image segmentation vision ; vision1 ; vision2 , determining structure in various phases of complex many body systems phase1 ; phase2 , and for examining instances of the Traveling Salesman Problem TSM . In using multiple replicas, we aim to capture an anthropological principle known as the wisdom of the crowds wisdom : the predictions made by a large crowd may be far more accurate than the guess made by a single person (a single solver or “replica”).
III The SRVM algorithm in a nutshell: mathematical details
We will now couch the above intuition in a rather concrete and exceedingly simple mathematical framework. The resulting recipe will lead to an algorithm that may be straightforwardly implemented. Similar to other supervised machine learning approaches, the algorithm that we construct will be trained using instances of known labels. Typically, such training data sparsely cover the space defined by the features of the data. To work around this, our algorithm will employ an “ensemble averaging” technique that randomly samples the feature space. We will generate a stochastic set of feature vectors (associated with points in feature space) that we will term ‘anchor points’. We will then use the proximity of these anchor points to training points to assign a classification label. Essentially, we will employ the known labels corresponding to training points in feature space (instances) with various kernel functions to attempt to classify the space around the known points so as to create general mapping functions. Specifically, we consider an input “training” data set of points, each comprised of features (expressed as a vector whose dimension equals the number of features), together with the corresponding given correct classification of each point. With these preliminaries in place, we now define
and aim, as we will describe below, to set equal to the known correct classification . Here, are fixed random vectors (which we will often term “anchor vectors”) that are different for each “replica” , and is a stochastically chosen function. It may, e.g., be any standard function,
where , , , , and are constants that serve as defining parameters for the functions that appear above: the Gaussian, the exponential, the complementary error function, the Airy function (of the first kind), and the Fermi type distribution (the latter is also known, when , as the Logistic function). The kernels used in Eq. (1) need not all be of the same type. Any linear combination of different kernel types (such as the different explicit functions listed in Eq. (2) or Eqs. (8, 9) and the further forms of Section IV that we will introduce shortly) might also be chosen. Indeed, in order to avoid spurious behavior of the functions of Eq. (1) stemming from the trivial asymptotics of the various individual kernel types, we may select the kernels that appear in Eq. (1) to be of multiple types. The resulting equations for the coefficients will be linear in all of these cases. On all the examples that we studied, the single kernel type expansion fared well (we found modest improvements on including multiple function types; see Section IV.6). However, it is conceivable that on other data sets a heterogeneous set of kernel types may fare substantially better. As we will further explain, in Eq. (1), is a set of viable functions of the variables (different specific functions, either of various types (Eq. (2)) or, more commonly in our simplest analysis, functions of a certain general type having yet different fixed vectors , associated with “different replicas” ). We may trivially re-express the above as where . Thus, inverting Eq. (1), we have
where is the vector (with components ) of correct classification results and is the inverse of the Kernel matrix. With the aid of Eq. (3), we may solve for the coefficients . Typically, the systems that we study are underdetermined. Therefore, the inverse matrix is actually a pseudo-inverse; finding the coefficients involves a least squares fit. The pseudo-inverse of Eq. (3) minimizes, for each replica , the “learning energy”
Here, represent the predicted () results (as given by Eq. (1)) while , as noted above, are the replica independent correct () classification results that a good algorithm aims to uncover. Thus, the coefficients that are calculated for a given replica will appropriately map a given “state” to the correct “phase” label given the phase space sampling information. We repeat the above calculation for multiple stochastic sets of replicas ( in total) in an attempt to “ensemble average” based on knowledge of the actual phase space mapping to appropriately find the correct divisions. As the mapping functions are continuous while the classification labels are discrete, the output of the mapping function for each replica has to be thresholded. Once the system is “trained” with the training data (i.e., given the training data, the coefficients are fixed by Eq. (3)), we examine what occurs for new “test” input vectors . For the binary classification cases that will be largely studied throughout this paper, we will typically set, for each replica ,
This thresholding may be generalized for multi-class classification. In general Receiver Operating Characteristic (ROC) curves ROC can be used to test for the best value of the threshold. Once the output of Eq. (5) is computed for all points in each of the replicas, the overall classification of an instance is found via voting. The “overall” label of a given instance is found by taking the average of the values predicted for that instance across all replicas, and then appropriately thresholding it (as in Eq. (5)).
where is the predicted label by the -th replica. The process of voting based on stochastic replicas allows for the correction of occasional mislabeling due to random fluctuations, and leads to a more reliable final result.
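As a hedged illustration of the single-replica training step described above, the following sketch assumes that Eq. (1) is a linear expansion of the form F(x) = Σ_j a_j K(x, y_j) over the anchor vectors y_j, with a Gaussian kernel of spread sigma, and that the coefficients follow from the pseudo-inverse of the kernel matrix as in Eq. (3). The function names, the uniform sampling of anchors inside the bounding box of the training data, and the zero threshold for labels in {-1, +1} are assumptions made for illustration; they are not details taken from the text.

```python
import numpy as np

def gaussian_kernel(X, anchors, sigma=1.0):
    """Kernel matrix K[i, j] = exp(-|x_i - y_j|^2 / (2 sigma^2)) (assumed Gaussian form)."""
    diff = X[:, None, :] - anchors[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2))

def sample_anchors(X, n_anchors, rng):
    """Draw anchor vectors uniformly inside the bounding box of the data (an assumption)."""
    low, high = X.min(axis=0), X.max(axis=0)
    return rng.uniform(low, high, size=(n_anchors, X.shape[1]))

def fit_replica(X, z, anchors, sigma=1.0):
    """Least-squares coefficients a = pinv(K) z for a single replica."""
    K = gaussian_kernel(X, anchors, sigma)
    return np.linalg.pinv(K) @ z

def predict_replica(X_new, anchors, coeffs, sigma=1.0):
    """Evaluate the continuous output F(x) and threshold it at zero to get -1 or +1."""
    F = gaussian_kernel(X_new, anchors, sigma) @ coeffs
    return np.where(F >= 0.0, 1, -1)

# Toy usage: two well-separated clusters with labels -1 and +1.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
z_train = np.array([-1.0, -1.0, 1.0, 1.0])
rng = np.random.default_rng(0)
anchors = sample_anchors(X_train, n_anchors=3, rng=rng)
coeffs = fit_replica(X_train, z_train, anchors)
labels = predict_replica(X_train, anchors, coeffs)
```

Each replica would repeat this fit with its own independently drawn anchors; only the anchors and the coefficients need to be stored per replica.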
The specific, equal weight, voting scheme of Eq. (6) is one of many possible voting choices that may be employed. As we will briefly touch on later, multiple voting methods could be used to increase the overall performance (Section IV.6); the multi-replica voting schemes qualitatively emulate “interactions” between the different replicas (diagrammatically represented by springs in Fig. 2). Since the averaging implicit in voting leads to a continuous range of voting outcomes, the same thresholding methodology of Eq. (5) will be employed. In the tests that we performed, we contrasted our results with those found by SVM.
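The equal-weight vote can be sketched as follows. This is a minimal illustration that assumes binary labels of +1 and -1 and a threshold at zero; the function names are hypothetical, and the toy predictions are made up purely to show the mechanics.

```python
def threshold(value, cut=0.0):
    """Map a continuous output to a binary label in {-1, +1}."""
    return 1 if value >= cut else -1

def vote(replica_labels):
    """Equal-weight vote: average the per-replica labels of each instance,
    then threshold the average again.

    replica_labels[r][i] is the label (+1 or -1) that replica r assigns to instance i.
    """
    n_replicas = len(replica_labels)
    n_instances = len(replica_labels[0])
    result = []
    for i in range(n_instances):
        mean_vote = sum(replica_labels[r][i] for r in range(n_replicas)) / n_replicas
        result.append(threshold(mean_vote))
    return result

# Three replicas, four instances; each replica errs on a different instance
# relative to the intended labels [+1, -1, -1, -1].
predictions = [
    [+1, -1, +1, -1],
    [+1, +1, -1, -1],
    [-1, -1, -1, -1],
]
print(vote(predictions))  # [1, -1, -1, -1]
```

With an odd number of replicas the average of ±1 votes can never be exactly zero, so the final threshold never has to break a tie.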
For completeness, we remark on possible “phase transitions” that may appear in the data as a function of feature values (and that we largely did not test for in our analysis). In real systems (including physical ones as originally investigated by huang ; lev ), different behavior might appear in distinct regimes of feature values. In the context of other data sets, this may, e.g., be the performance of athletes before and after an injury. When such “phase transitions” are present, then, as the values of the features are varied across these boundaries, non-analyticities will appear; different functional forms will be needed to describe the system in its different phases. (These phase transitions in the data are different from the phase transitions associated with solvability and correct classification (see, e.g., dandan+ ; lenka for transitions in unsupervised clustering/classification).) Towards this end, one may employ Eq. (1) in a subvolume of “feature space” (for all training data points that lie in this region) to see if the accuracies vary and transitions are encountered (sharply distinct functional forms become optimal across phase boundaries), as evinced by striking changes in the overlap between different replicas. If the ultimate function that underlies the correct classification exhibits no (or only mild) such singularities, then good classification may be obtained sans a detailed investigation as to how the classification results for a single point change when training data in different subvolumes of feature space around that point are used in our algorithm.
To close our circle of ideas and description, we return to our main intuition. As noted in Section II and underscored once again here, the guiding principle behind our method is, in a conceptual nutshell, that of
“Wisdom of the Crowds for Fits”.
By this statement, we mean that if different attempted fits (e.g., Eq. (1) with varying kernels) all yield the same prediction for a new data point, then regardless of the “exact” functional form (if such an exact function exists and may be solved for) that describes the data in physics or other problems, practically, the common classification predicted by all of these random fits (“replicas”) for that point is likely to be the correct one. Indeed, the possibility of multiple fits that all yield a similar prediction appears across many fields of science. In numerous problems, the precise underlying functional form explaining the data is unknown, yet various fits all lead to similar predictions at temperatures, pressures, etc., where the experiments can be performed.
The inter-replica voting that we use amongst the outcomes of the random real functions emulates interactions between the individual replica solvers. In physics parlance, we not only minimize the cost function of Eq. (4) for individual solvers given training data; we also take into account the collective (voting) outcome and correlations between the individual solvers. Qualitatively, this emulates the minimization of a “free energy” type function built from an energy and an entropy,
where and here denotes the information about correlations between the replicas (i.e., in our case, the votes of Eq. (6)) as to the correct classification of feature space point . In Eq. (7), the weight emulates the appearance of temperature as it appears in free energy minimization problems. Eq. (7) is only provided for qualitative reference. The classification that we will use is that provided by Eqs. (1 - 6). We will illustrate the utility of the “wisdom of the crowds for fits” maxim in our study of numerous examples that we embark on next.
IV Quantitative analysis of the SRVM Algorithm
To assess the performance of the SRVM algorithm, we will apply it to several test data sets and examine various statistical performance metrics. In order to ascertain the ability of the SRVM algorithm to model the data, we split (as is customary) the data into two parts: a training set and a testing set. The training set was used to construct the model (i.e., the model was found by solving Eqs. (1,4)). Subsequently, the testing data set was used to evaluate the performance of the model. Some of the data sets employed in testing the SRVM algorithm that are discussed in this paper came with explicit testing data sets. For other benchmarks, no explicit test sets were provided; in these cases, five-fold cross validation (CV) was employed to fit and analyze the model. Five-fold CV involves randomly splitting the data set into five equal size subsets or folds, and using four of these folds together as a training set and the fifth fold as a testing set, while iteratively cycling through so that each fold serves as the testing set once. This allowed us to analyze the performance of the model for multiple folds, as well as report average performance metric values across all five folds. Unless explicitly noted otherwise, all accuracies that we report were obtained by five-fold CV.
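The five-fold procedure can be sketched as below. This is a minimal illustration, not the code used for the reported results; the function names, the fixed random seed, and the user-supplied `evaluate` callable are assumptions introduced here.

```python
import random

def k_fold_indices(n_points, n_folds=5, seed=0):
    """Randomly partition range(n_points) into n_folds folds of (nearly) equal size."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]

def cross_validate(n_points, evaluate, n_folds=5, seed=0):
    """Return the per-fold scores and their mean.

    evaluate(train_idx, test_idx) is a hypothetical user-supplied callable that
    fits a model on train_idx and returns its accuracy on test_idx.
    """
    folds = k_fold_indices(n_points, n_folds, seed)
    scores = []
    for f, test_idx in enumerate(folds):
        # Every fold except the f-th one is merged into the training set.
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return scores, sum(scores) / len(scores)
```

Each index appears in exactly one test fold, so every data point is used for testing once and for training four times.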
Here, are constants that may, similar to Eq. (1), be determined by Eq. (3) (the minimization of Eq. (4)). Such a form is natural if the predicted quantity is an analytic function of each of the features; analyticity is expected in physical systems in the absence of phase boundaries. Different replicas may be associated with different orders .
For completeness, although we will not further explore it in the current work, we must underscore that, generally, there is, of course, nothing special about the simple decomposition of Eq. (8); one may replace the single features by any of their functions and consider the trivial generalization
Here, the subscript shorthand . Different choices of (for a given ) lead to additional replicas. Supplanting the global forms of Eqs. (8, 9), one may also readily construct other multinomial approximants to be additional replicas by taking
to be tensor product splines of various generalized orders tensor . One may naturally also consider Laurent type multinomials (possibly also with different anchor points, i.e., coordinates (feature values) shifted by a constant) and, more generally, the Padé type ratios
Given the above, disparate replicas may be defined by the highest powers of the functions and appearing in the above ratio (as well as the choice of the functions and ).
We next explicitly turn to the ten examples that we tested.
The first test case is that of our own synthetic data that allow for a simple linear separation between two sets with non-intersecting convex hulls (the two sets appear in the upper right and lower left sides in Fig. 3). The goal of the algorithm is to detect this structure and correctly classify different points as belonging to either of these two data sets. We used Eqs. (1, 3) with a Gaussian kernel for fixed vectors that were randomly chosen for each of the replicas; this led to an accuracy (as ascertained by the five-fold CV) of 100%. Fig. 3 illustrates the distribution of the two data sets and the boundary formed by the Gaussian kernel SRVM algorithm. The boundary obtained by our method is a smooth surface, not a straight line as found by the other classification algorithms that we tested (e.g., SVM with a linear kernel, logistic regression, and other linear classifiers); the linear kernel SVM algorithm similarly achieved an accuracy of .
In the remainder of this paper, we will focus on far more pertinent non-linearly separable problems and examine nine different benchmarks.
The next data set that we will test is that of the “Four-class” ref:libsvm benchmark, a binary classification problem having two features for each of its 862 data points. Fig. 4 visually depicts the data on a two-dimensional map. Similar to our first example, the goal of the machine learning algorithm is to correctly identify the binary classification of input data (similarly set to be +1 (marked black in Fig. 4) or -1 (red)). We obtained a perfect (i.e., 100%) accuracy when applying SVM with a radial kernel. We studied this system with our SRVM method with the multinomial kernel of Eq. (8). Fig. 5 demonstrates how the prediction accuracy varies with the multinomial order . In the tested range, the accuracy is monotonic with increasing polynomial order. When the multinomial order , the accuracy is 100%. Figures 6, 7, and 8 provide the boundaries found when equals 3, 5 and 7 respectively. Only the training points are shown in these figures. We see that when , a smooth boundary between the two classes results. We similarly applied our algorithm with a Gaussian kernel to the Four-class problem. We first discuss the single replica results. The number of fixed vectors in Eq. (1) plays an important role in predicting the results. We initially randomly produced fixed vectors (less than a tenth of the number of data set points). This led to an average accuracy of 99.09%. Reducing the number of fixed vectors to only resulted in an accuracy decrease to 81.18%. In this and other instances, we saw that (not unexpectedly) when the number of fixed vectors became too small, the prediction accuracy diminished. In Section IV.1, we will discuss this trend in greater depth. As discussed in Section III, the SRVM combines the single replica results via voting (Eq. (6)). To avoid a gridlock when performing such a vote, we chose the number of replicas to be an odd number (we picked here). Each replica corresponds to a possible predictor that is related to a different set of fixed vectors . Averaging over replicas (Eq. (6)) produced an accuracy of .
Our subsequent test case is that of the “svmguide1” benchmark ref:libsvm . This well studied benchmark problem (originating from astroparticle physics) consists of a training file and a testing file (i.e., there is no need to perform CV). The numbers of data points in the training file and the testing file are, respectively, 3089 and 4000; each data point has four features. Optimizing and using the best parameters for a radial basis SVM kernel enabled an accuracy of . We applied our SRVM algorithm with a polynomial kernel (see Fig. 9) to this benchmark. Contrary to the Four-class problem, the accuracy initially grew with increasing polynomial order ; however, at larger the accuracy diminished. The peak prediction accuracy for the test data is . In Section IV.4, we will discuss how the best value of may be ascertained from replica overlap (without being given the results for the test data). We further applied the Gaussian kernel algorithm to the svmguide1 problem and tested three different values of the number of fixed vectors (). In single replica tests, the highest accuracy () was realized for fixed vectors. Setting gave rise to accuracies of and respectively. Using replicas in the Gaussian kernel algorithm improved the accuracy to .
The “Liver disorder” data set ref:libsvm is a benchmark problem that has 345 data points with features for each input. It has no testing file, so we performed the CV tests as before. We first investigated the performance of SVM. Optimizing the SVM parameters in a radial basis enabled an average CV accuracy of . Next, we applied the () multinomial SRVM. This led to an average CV accuracy of . Lastly, we applied the Gaussian kernel SRVM algorithm to the problem. We found that the optimal number of fixed vectors is . This led, for the single replica variant, to an accuracy of . We then coupled different replicas (, and ). The results illustrate that replica voting indeed improves the accuracy. Specifically, replicas led to an accuracy of . In the case of replicas, we achieved an accuracy of . For , the average CV accuracy became .
As another example, we also tested the Heart disease data set from the UCI machine learning repository database Heart . This is a binary classification problem consisting of 270 data points with features. For calculations in this paper, the data will be scaled (to lie in the interval). We will present various aspects of our results for this prominent benchmark in later sections.
The results from the Statlog Australian Credit Approval data set Australian (hereafter abbreviated to “Australian”) will, similarly, also be presented. This benchmark is comprised of 690 binary-classified instances with 14 features and, as we will demonstrate, possesses characteristics which make it an excellent representative data set. Similar to the Heart benchmark, the data presented for the Australian data set are also scaled such that each of the features spans the interval .
An additional example on which we performed detailed analysis is that of LSVT voice rehabilitation data set LSVT . This is a binary classification problem in which each of the 126 instances has 309 features.
One more binary classification benchmark on which we tested our algorithm is that of “Internet Advertisement Data Set” ads . This benchmark contains 3279 instances each of which has 1558 attributes.
Another data set that we examined was the “IRIS” flower data set IRIS . This benchmark tabulates four features (the length and the width of the sepals and petals of the flowers) for three different types of irises.
The last benchmark on which we tried our SRVM algorithm was that of the “Breast Cancer Wisconsin” data set Wisconsin . This is a binary classification problem. In this benchmark, given ten different (geometrical and texture) features of cell nuclei that are seen in a digitized image of a fine needle aspirate (FNA) of a breast mass, a tumor is to be classified as being benign or malignant. The original data set contains a few points with missing features; these points were excluded from our study.
When analyzing data sets using classification or regression algorithms, it is important to begin by pre-processing the data to be studied. In many data sets, it is common to have various instances which are missing values corresponding to certain features. Numerous methods exist to deal with missing values through various types of imputationimputation1 ; imputation2 . Typically the act of imputing data for missing values is itself a learning step, which inherently adds complexity to the analysis process. In the data sets studied here, the number of instances with missing values was small enough that these instances were discarded.
In addition to handling missing values, the pre-processing step also typically involves scaling of the data, so that the values corresponding to a given feature are of the same scale as those of all other features. This prevents a feature with high variance and magnitude from dwarfing features with smaller variances and magnitudes. The three main feature scaling types are (i) scaling to the range [0,1], (ii) scaling to the range [-1,1], and (iii) normalization of the values of a given feature such that they have a mean of zero and a variance of one. In Section IV.3, we will test whether there is any statistically significant difference in the performance of the algorithm with different feature scaling types.
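The three scaling types listed above can be illustrated with a minimal sketch that uses only the Python standard library. The function names and the sample feature column are hypothetical and chosen for illustration; they are not taken from the implementation used in this work.

```python
from statistics import mean, pstdev

def scale_unit_interval(column):
    """Type (i): map the values of one feature linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    span = hi - lo
    # A constant feature has no spread; map it to the lower bound.
    return [0.0 if span == 0 else (x - lo) / span for x in column]

def scale_symmetric(column):
    """Type (ii): map the values of one feature linearly onto [-1, 1]."""
    return [2.0 * u - 1.0 for u in scale_unit_interval(column)]

def standardize(column):
    """Type (iii): shift and rescale to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in column]

# Hypothetical feature column (not taken from any benchmark in the text).
feature = [2.0, 4.0, 6.0, 10.0]
unit = scale_unit_interval(feature)
sym = scale_symmetric(feature)
z = standardize(feature)
```

Each function acts on a single feature column, so a full data set would be scaled by applying the chosen function to every feature separately.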
Table 1 provides a synopsis of the accuracies obtained by the SRVM method for the nine common benchmarks that we examined above. As seen therein, the accuracy of our algorithm was, on average, better than that of SVM by an insignificant margin.
IV.1 Accuracy Dependence on the Number of Replicas and Anchor Vectors
When evaluating the performance of a binary classification model, the first step is typically to measure the accuracy of the classifier when applied to the testing data of known labels. The accuracy is simply defined as the percentage of correctly labeled instances in the testing set. In analyzing the LSVT data set LSVT , we primarily used the Gaussian kernel of Eq. (2). A priori, the spread () of this Gaussian may assume any value. We observed that setting yielded the best results. Consequently, this was the value used in our analysis. We employed five-fold CV and examined the average accuracy, , across all five folds for various numbers of anchor points () and replicas (). The results of this analysis are presented in Fig. 10. In panel (9(a)), we show a 3D surface plot of the average accuracy as a function of the number of anchor points and number of replicas. In panels (9(b)) and (9(c)), we show projections of the 3D plot for constant and , respectively. It is evident from these plots that the accuracy quickly reaches an asymptotic value with increasing replica number. Once a maximum is reached, further changes in the number of replicas have little net impact on the accuracy. Additionally, it is evident that (regardless of the number of replicas used) the accuracy increases rapidly with the number of anchor points, levels off at a maximum, and then decays with further increasing . The decay of the average accuracy with increasing beyond a certain value is indicative of over-fitting. Analysis of the accuracy data presented in Fig. 10 suggests that a maximum accuracy of =88.9% for the LSVT data set occurs at and .
|Data Set||Classes||Number of Instances||Number of Features||SVM||SRVM|
We now turn to a similar analysis for the “Heart” data set Heart . For simplicity, we set the value of to unity. In order to find the optimal number of anchor points for these data, we increased the number of anchor points from 10 to 250 in increments of 10 (see panel (a) of Figure 11). The resulting accuracy was averaged over 10 different sets of replicas analyzed with a five-fold CV. The highest accuracy was achieved when . A further minimum in the accuracy appears for anchor points. For anchor points as low in number as , our procedure yields an accuracy above 80% (a value quite close to the highest accuracy of 82% that we obtained when using anchor points). In panel (b) of Figure 11, we show the effect of increasing replica number on the average accuracy in the Heart example. The range of the number of replicas is quite wide, . Both curves in this panel (corresponding to the average accuracy and the replica overlap) display an oscillatory behavior about the averaged result, and the amplitude of the oscillations decreases as the number of replicas increases. Already for replicas, we achieved an average accuracy of 82%. Considering that the highest accuracy that we reached (as is seen in the graph) is 83.3% for replicas, in further analysis of the Heart data set, we used the more modest number of replicas.
An important point that we will underscore and reiterate throughout this work (and discuss, more specifically, in Section IV.4) is that we may determine the optimal number of replicas , number of anchor points , and any other undetermined quantity by noting when the average inter-replica overlap is (near) maximal as a function of these parameters.
We return to our analysis of the Australian data set. The dependence of accuracy on number of anchor points is tested on the Australian data set with Gaussian kernel models with replica number . Each point of the plot is the average of 20 randomly generated models; see Fig. 12(a).
We observe that as the number of fixed vectors is increased, the fitted model initially becomes more sophisticated and the prediction accuracy rises rapidly. This shows that the model can be quite accurate even with a low number of fixed vectors. Beyond a certain point, increasing the number of fixed vectors starts leading to over-fitting and the prediction accuracy drops; however, the drop is rather gradual, indicating that the model is robust against over-fitting.
The dependence of the accuracy on the replica number was tested in the Australian data set by performing 50 five-fold CVs and taking the average accuracies across the SRVM results with anchor points for the Gaussian kernel and investigating the results when the number of replicas was varied from 1 to 89. The results are plotted in Fig. 12(b).
In addition to assessing the accuracy of the SRVM algorithm, it is important to compare its performance to established learning algorithms and to try to quantify any relative advantages and/or deficiencies. To that end, as we noted earlier, we took the Support Vector Machines (SVM) algorithm SVM ; SVM1 as a baseline for comparison. For the LSVT data set, we used a ‘brute force’ method of finding the optimal parameters for the SVM model by running it for all values in a grid in parameter space. Once the optimal parameters were found, it was observed that the maximum accuracy for SVM was 0.873. The difference in accuracy between our SRVM method (in which the optimized parameters were found by replica overlap, not by comparing to the solution) and the standard SVM algorithm (now optimized to achieve the highest accuracy) is 0.016. This difference is not statistically significant, so the relative advantage of either method might not be immediately clear.
) then one may estimate the requisite runtime of anchor points from the known runtime from smaller.
To dig further into the comparison, while simultaneously exploring the SRVM performance on a deeper level, we next examine the runtime of both the SVM and SRVM algorithms. We ran the SRVM algorithm on the LSVT data set for various values of and . The runtime is considered to be the time that it takes to calculate the average CV accuracy, and does not include finding the optimal number of parameters, or pre-processing steps. Figure (12(a)) shows a 3D surface plot of the runtime versus the number of anchor points and replicas. In figures (12(b)) and (12(c)), the runtime is exhibited as a function of the number of anchor points and the number of replicas . The data make clear that the runtime increases linearly with increasing and . This observation suggests that it is possible to find the runtime at low numbers of both variables in order to assess how long a run will take with larger values.
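The extrapolation suggested above can be sketched with an ordinary least-squares line fitted to a few cheap runs. This is only an illustration under the linearity observed in the text: the helper `fit_line` and the timing values below are hypothetical, not measurements from this work.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical runtimes (seconds) measured at small numbers of anchor points
# with the number of replicas held fixed.
anchors = [10, 20, 30, 40]
seconds = [0.7, 1.2, 1.7, 2.2]

slope, intercept = fit_line(anchors, seconds)
# Estimated runtime of a much larger run with 200 anchor points.
predicted_seconds = intercept + slope * 200
```

The same fit can be repeated with the number of replicas as the independent variable, since the runtime is reported to be linear in both quantities.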
The general optimization of model performance involves maximizing accuracy while simultaneously minimizing the necessary runtime. Therefore, it would be beneficial to have a measure of the compounding of these two goals. To assess the intersection of accuracy and run time, we can define a metric which we call the coefficient of performance, , which we define as
This metric provides an efficient way of simultaneously assessing accuracy and run time. Using the results of the runtime and accuracy measures discussed above for the LSVT data set, we calculated the values of as a function of and . Detailed results highlighting various aspects are shown in panels (13(a), 13(b), 13(c)).
It is clear that the COP decays as a function of both the number of anchor points and the number of replicas . This is consistent with the linearity of the runtime and the asymptotic behavior of the accuracy. Locating the ‘knee’ in the COP data allows for extracting reasonable values of the parameters for the trade-off between accuracy and runtime.
anchor points for variable numbers of replicas, simulated 20 times each with different random replica generation seeds in each simulation. Associated standard deviations are shown as error bars. (b) Plot of these standard deviations in the accuracy that are associated with runs for various replica numbers. The monotonic decrease in the standard deviation with increasing number of replicas demonstrates that prediction results become more stable with increasing replica number; when additional replicas (for an increasing yet still small ) vote, the final outcome becomes progressively more stable to statistical fluctuations from the stochastic generation of the anchor point vectors. The plot further makes clear that the standard deviation quickly reaches a leveling-off point at which further replica increase does not have a statistically significant impact on stability.
In addition to calculating COPs for the SRVM algorithm, we also calculated them for SVM. Broadly, the SVM COP is considerably better than that of SRVM, and this is entirely due to the fact that SVM runs much quicker. This is likely due to the fact that the SVM algorithm has been highly streamlined and optimized in various software packages over the decades, whereas our algorithm is new. It could also be due to the fact that, within all implementations of our algorithm, we computed the pseudo inverse exactly (Eq. (3)) instead of approximating it with methods such as gradient descent. Furthermore, we have not employed other methods such as regularization to increase the accuracy with all of the other parameters fixed. Additional improvements of the algorithm structure will likely decrease the runtime.
A central characteristic of the SRVM algorithm is the use of voting between replicas to increase the accuracy. Intuitively, one would expect that the number of replicas and the accuracy should be positively correlated: the use of more replicas leads to improved accuracy. Another quintessential feature of the SRVM is that the anchor vectors associated with each individual replica are generated stochastically. This allows for a robust classification of new instances. This also means that each run of the algorithm will be different, with different outcomes possible. Therefore, it is important to examine the stability of the output. It is expected that for a low number of replicas (), the overall vote can change rather dramatically with different runs, so the accuracy can fluctuate. It is further expected that as the number of replicas increases, the fluctuations will be suppressed by the presence of more information in the overall vote. To test this, we ran the SRVM algorithm on the LSVT data set with =30 anchor vectors per replica 20 times each, for varying number of replicas. In panel (a) of Fig. (15), we display the average accuracy across all 20 runs with a fixed number of replicas as this number () increases. The error bars in the figure reflect the standard deviation in accuracy. In panel (b) of Fig. (15), we plot the standard deviation in accuracy versus number of replicas. From the panels of Fig. (15), it is clear that the standard deviation decreases rapidly with increasing number of replicas and eventually levels off to a roughly constant value. This is consistent with the earlier observation that the runtime is linear and the accuracy approaches a leveling off before decreasing. Further, the result implies that beyond a certain number of replicas, the overall accuracy is largely stable to fluctuations associated with stochastic generation of anchor vectors, thus alleviating a potential weakness of the method.
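The suppression of fluctuations by voting can be illustrated with a toy simulation. This sketch does not run SRVM: it assumes, purely for illustration, that every replica labels every test point correctly with the same probability `p_correct`, independently of all other replicas. Real replicas are correlated, so the gain shown here is an optimistic upper bound on the trend. All names and parameter values are hypothetical.

```python
import random
from statistics import mean, pstdev

def voted_accuracy(n_replicas, n_points, p_correct, rng):
    """Accuracy of a simple majority vote over independent toy replicas."""
    correct = 0
    for _ in range(n_points):
        # Number of replicas that label this point correctly.
        votes = sum(1 for _ in range(n_replicas) if rng.random() < p_correct)
        if 2 * votes > n_replicas:  # strict majority; n_replicas is odd
            correct += 1
    return correct / n_points

def accuracy_spread(n_replicas, n_runs=20, n_points=126, p_correct=0.7, seed=0):
    """Mean and standard deviation of the voted accuracy over repeated runs."""
    rng = random.Random(seed)
    runs = [voted_accuracy(n_replicas, n_points, p_correct, rng)
            for _ in range(n_runs)]
    return mean(runs), pstdev(runs)

# Odd replica numbers only, so that no ties occur in the vote.
results = {r: accuracy_spread(r) for r in (1, 5, 15, 51)}
```

In this toy setting the standard deviation across runs shrinks as the number of replicas grows, mirroring the qualitative behavior described above.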
The stability was further tested using the Australian data set by performing 50 five-fold CVs and computing the average accuracies across models with replica numbers ranging from 1 to 89. The results are provided in Fig. 12. As this figure makes evident, for this data set, the average accuracy rises relatively quickly from its single-replica value and maintains a generally monotonic trend as the replica number increases. We begin to observe diminishing returns somewhere after 15 replicas. This is to be expected, as the amount of available information in the data set is limited, so there is a cap on the achievable accuracy. Note that since we use a simple majority vote, the replica numbers are all odd to ensure that no ties appear during voting. Another example where we tested the accuracy as a function of the number of replicas is shown in Fig. 11(b) for the Heart benchmark. As seen therein, the accuracy and replica overlap achieve their maximal values when we used replicas. One may expect that as the number of fixed vectors increases, initially the fitted model becomes more sophisticated and the prediction accuracy rises. Above a certain value, increasing the number of fixed vectors starts leading to over-fitting and the prediction accuracy drops. Using more fixed vectors also results in a slower algorithm. Therefore, it would be very useful to have a way of estimating how many fixed vectors are appropriate for a certain problem. In Fig. 12(a), we notice that the curves for both the average accuracy and the average replica overlap rise rapidly from their values for a single fixed vector () to a nearly flat maximum that appears when the number of fixed vectors is around ; when , the accuracy begins to taper off due to the aforementioned over-fitting. The two curves indeed follow each other closely, supporting the notion of using replica overlap to estimate the dependence of the expected accuracy on .
We found similar behaviors for parameters other than the anchor vector number . There are some other outstanding features of the figure: the rapid rise of the two curves at low fixed vector number shows that the model can be quite accurate even with a low fixed vector number, and the slow tapering off of the two curves indicates that the model is robust against over-fitting.
IV.3 Impact of pre-processing
In the beginning of this section, we alluded to the possibility that the specific pre-processing method employed may have an impact on the performance of the SRVM algorithm. In this subsection, we will examine the impact of pre-processing the data using feature scaling on our final results. To that end, we preprocessed the data for the LSVT data set in three different ways:
(1) Linearly transforming the data such that the domain of each feature over the entire data set ranges from 0 to 1,
(2) Linearly transforming the data such that each feature assumes values in [-1,1] (i.e., scaling the data to have a difference of two between the maximal and minimal value of each feature), and
(3) Normalizing the data such that each feature has zero mean and a standard deviation equal to unity.
We examined the average accuracies (and their variances) associated with these three different pre-processing methods using statistical tests. The results (see Table 2) demonstrate that one must reject the null hypothesis that all of the means are equal: the disparate pre-processing methods lead to different results. The testing of the means was performed both (i) assuming a normal distribution of the averages (the F-statistic) and (ii) without this assumption (the H-statistic of the Kruskal-Wallis test H ). Both tests revealed that the average accuracy was not statistically uniform across all methods of feature scaling. To quantitatively investigate which methods were intrinsically different from one another, individual t-tests of the means were undertaken. We performed three different t-tests student-t-test . These tests demonstrated that there is no difference between pre-processing types (1) and (2). These two cases, however, differ from the normalization (pre-processing type (3)). The normalization pre-processing tends to be the most accurate of the three for lower numbers of anchor points () and replicas . In general, the data sets that were normalized (pre-processing (3)) had higher accuracy with lower numbers of anchor points. We further tested whether the variances were equivalent using a Levene test Levene ; the results indicated no statistically significant difference between the variances across all pre-processing methods employed. Taken together, these outcomes indicate that one must consider the specific type of pre-processing undertaken when assessing the performance of the SRVM algorithm.
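For concreteness, the two test statistics for comparing the mean accuracies of several groups can be computed by hand as sketched below. This is a textbook construction of the one-way ANOVA F-statistic and the Kruskal-Wallis H-statistic (without tie correction), not the code used for Table 2. The accuracy values are hypothetical, and converting the statistics into p-values requires the F and chi-squared distributions, which are omitted here.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic (assumes normally distributed groups)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def kruskal_h(groups):
    """Kruskal-Wallis H-statistic, rank based, without tie correction."""
    pooled = sorted(x for g in groups for x in g)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the mean of the 1-based ranks i+1, ..., j.
        rank[pooled[i]] = (i + 1 + j) / 2
        i = j
    n = len(pooled)
    total = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical accuracies for the three pre-processing types.
type1 = [0.81, 0.82, 0.83]
type2 = [0.82, 0.83, 0.84]
type3 = [0.85, 0.86, 0.87]
f_stat = anova_f([type1, type2, type3])
h_stat = kruskal_h([type1, type2, type3])
```

A large value of either statistic argues against the hypothesis that all groups share the same mean; pairwise t-tests then identify which groups differ.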
|Null Hypothesis||Test Statistic||p-Value||Conclusion|
|==||w=2.1240||p=0.1289||Fail to Reject H|
|=||t=0||p=1.0||Fail to Reject H|
LSVT data set. Results of statistical hypothesis tests undertaken to assess whether different pre-processing techniques impact algorithm accuracy for the same sets of parameters (and ). Here, denotes the hypothesis that all of the three pre-processing methods yield identical results.
IV.4 Optimization via replica overlap metrics
When choosing the optimal values of the parameters for a learning algorithm, it is helpful to have a reference function which does not require the calculation of the accuracy yet still relays information about model performance. This gives a more ‘fair’ way of choosing the best values of the parameters without a brute force method. In the SRVM algorithm, because we have many replicas which are voting together by a simple majority vote, it seems reasonable that some measure of the overlap between the replicas would be a measure of model performance. Indeed, when all of the replicas are largely in agreement, it should imply that the model is performing optimally, and vice versa. However, it is possible that all of the replicas could be in agreement, with all of them being incorrect. Therefore, it is important to test whether the proposed replica overlaps are in agreement with the accuracy. To test this, we propose two different replica overlap functions and test them on the LSVT data set. The first overlap function is defined as
and measures the total overlap of the predicted labels of all replicas. For each of the replicas , the vectors have components (where is the number of distinct predicted (or, in some rare cases, fitted) data points in each replica ). This metric may be trivially averaged by dividing by the total number of distinct replica pairs, i.e., by multiplying the right-hand side of Eq. (11) by . A similar, but computationally more efficient, overlap function is defined as
which measures the overlap of each replica with the overall vote as determined by Eq. (6). The calculation of Eq. (12) requires storing less information than computing the overlap of Eq. (11). Similar to each vector in the set , the vector has components (one component (the voted prediction) for each of the distinct examined data points ). (Similar to Eq. (11), an average is trivially calculated by dividing the right-hand side of Eq. (12) by .) In Fig. 17, we plot the results of the overlap measures of the LSVT data set. In these panels, the number of anchor points, , is plotted on the x-axis, with one data set shown for each replica number, ; data sets with larger lie at larger y-axis values. The data in Fig. 17 seem to display the same overall characteristics as those of the accuracy shown earlier, but it is important that we compare them directly. In Fig. 16, we plot the average accuracy and the two overlap functions for =29 replicas and various numbers of anchor points, all scaled by their maximum values so that they fall on the same plot. It is clear from the data that the two definitions of the replica overlap and the accuracy scale in a one-to-one fashion, making either overlap function an excellent candidate for finding the optimal parameter values without necessitating the calculation of model accuracy. Similar trends are seen in Figs. (11(b), 17, 18, 19, 20).
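The two overlap measures can be sketched as follows. This is one plausible reading under stated assumptions: each replica's predictions are stored as a vector of labels in {+1, -1}, the first measure sums the inner products over all distinct replica pairs, and the second sums the inner product of each replica with the majority vote. The normalizations used here (dividing by the number of pairs or replicas and by the number of data points) and all function names are assumptions for illustration and may differ from Eqs. (11) and (12).

```python
def majority_vote(predictions):
    """Simple majority vote over replicas (odd replica number assumed)."""
    n_points = len(predictions[0])
    return [1 if sum(rep[i] for rep in predictions) > 0 else -1
            for i in range(n_points)]

def pairwise_overlap(predictions):
    """Average inner product over all distinct replica pairs, per data point."""
    r = len(predictions)
    n = len(predictions[0])
    total = 0
    for a in range(r):
        for b in range(a + 1, r):
            total += sum(x * y for x, y in zip(predictions[a], predictions[b]))
    return total / (n * r * (r - 1) / 2)

def vote_overlap(predictions):
    """Average inner product of each replica with the voted prediction."""
    vote = majority_vote(predictions)
    r = len(predictions)
    n = len(vote)
    total = sum(x * v for rep in predictions for x, v in zip(rep, vote))
    return total / (n * r)

# Three hypothetical replicas predicting labels for four data points.
preds = [[1, 1, -1, 1],
         [1, -1, -1, 1],
         [1, 1, -1, -1]]
o_pairs = pairwise_overlap(preds)
o_vote = vote_overlap(preds)
```

The pairwise measure requires all replica predictions to be compared with one another, whereas the vote-based measure only needs each replica and the single voted vector, which is the source of its lower cost.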
In numerous situations, having a measure of the likelihood that a specific data point will be correctly labeled is important for identifying the location of classification boundaries. The existence of replicas in the SRVM algorithm affords exactly such a measure. To this end, we may define a single data point overlap across replicas which can approximate the probabilities of classification. This allows us to determine whether a given point is near a classification boundary: when different replicas yield different labels, the point is likely near a boundary. We call this single instance overlap the ‘Agreement’, and define it as
The bar chart in Fig. 21(a) shows how well the predictions of different replicas agree. The majority of points fall in the last bin (meaning that for most points, different replicas predicted the same result). We also calculated the corresponding accuracy for each bin in Fig. 21(b). Again, it is apparent from the bar chart that the points with the maximum replica voting agreement have the highest accuracy, as expected. As we mentioned earlier, the Heart data consist of 270 data points, each with 13 features. Each replica consists of 50 points in thirteen dimensions. The average accuracy after 10 runs of voting was 81.25%; the corresponding accuracy of SVM is 82.4%. Thus, in this example (as in others), the accuracies obtained by both SRVM and SVM are nearly the same.
The correlation between replica agreement and classification accuracy was also investigated using the Australian data set. For this test, we used Gaussian kernel models containing 31 replicas with 50 anchor points each. A total of 10 models were generated, and a five-fold cross-validation on the data set was run on each model. For each five-fold cross-validation, every data point in the data set is a test data point exactly once. Each test data point was characterized by its replica agreement . In Fig. 22, we binned the test data points based on their agreement values (multiplied by the replica number () for clarity of presentation) and calculated an aggregate accuracy for each bin. We see that the data points with higher replica agreement are indeed, in general, also predicted with higher accuracy, showing a clear positive correlation between replica agreement and prediction accuracy. Another feature of Figs. (12, 26) is that the vast majority of data points have good replica agreement. In Figs. (23, 24, 25), we report on similar tendencies found for the Four-class, svmguide1, and liver disorder benchmarks.
The Australian data set was also used to show that the average agreement across the data set is also a useful replica overlap function. Using the same procedure as for Fig. 12, we used Gaussian kernel models with a varying number of anchor points and 5 replicas each, and calculated the average agreement of all data points. The results are plotted with the average accuracy in Fig. 26(a), and we see a strong correlation between the average accuracy and the average agreement. If we introduce the average RMS error as the learning energy scaled by the number of data points: , we see in Fig. 26(b) that the average RMS error correlates negatively with both the average accuracy and the average agreement.
In Fig. 27, we show the correlation between the averaged replica overlap and the averaged accuracy for different numbers of anchor points. The number of anchor points ranges from 10 to 500 while the number of replicas is kept fixed at 31. Similar to the Australian and other data sets that we analyzed, the replica overlap and the accuracy closely tracked each other (and further correlated with the value of the energy function of Eq. (4)). Here, these quantities were non-monotonic as a function of the number of anchor points. The “energy” curve in this figure corresponds to the average of Eq. (4) over 31 different replica realizations. Five-fold CV was employed in our tests for the accuracy of the predictions. Perusing this figure, we see that the averaged penalty function of Eq. (4) becomes minimal when the highest inter-replica overlap is achieved and when the predicted classifications are of the highest accuracy.
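The binning procedure described above can be sketched as follows. The definition of the agreement used here, the absolute value of the summed replica votes divided by the number of replicas, is an assumption made for illustration and may differ in normalization from the definition in the text. The predictions and labels are hypothetical.

```python
from collections import defaultdict

def agreement(predictions):
    """Per-point agreement: |sum of the +1/-1 votes| / number of replicas."""
    r = len(predictions)
    n = len(predictions[0])
    return [abs(sum(rep[i] for rep in predictions)) / r for i in range(n)]

def accuracy_by_agreement(predictions, labels):
    """Accuracy of the majority vote within each agreement bin.

    Bins are keyed by the agreement multiplied by the replica number,
    i.e. by |sum of votes|, which is an odd integer for odd replica numbers.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for i, label in enumerate(labels):
        s = sum(rep[i] for rep in predictions)
        vote = 1 if s > 0 else -1
        counts[abs(s)] += 1
        hits[abs(s)] += int(vote == label)
    return {key: hits[key] / counts[key] for key in sorted(counts)}

# Three hypothetical replicas, four test points, and their true labels.
preds = [[1, 1, -1, 1],
         [1, -1, -1, 1],
         [1, 1, -1, -1]]
labels = [1, -1, -1, 1]
per_point = agreement(preds)
binned = accuracy_by_agreement(preds, labels)
```

In this small example the unanimous points are all labeled correctly while the split points are right only half of the time, which is the kind of trend the binned plots are meant to expose.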
In summary, the single control parameter that governs our algorithm (the number of anchor points ) may be automatically optimized by an examination of the inter-replica overlaps as well as analysis of the energy costs of Eq. (4). In Section IV.6, we will discuss more efficient determination of the optimal parameters by recursively applying machine learning onto itself. We first, however, explicitly turn to a notable attribute of the SRVM method that has its origins in its stochastic nature.
IV.5 Class Imbalance and Alternative Performance Metrics
In Fig. (31), we depict the results of a principal component analysis of the LSVT data set (in the figure, we outline the two dominant principal components). This analysis enables us to visualize where our algorithm fails to find the correct answer for the LSVT data set. As seen in this figure, while there is no apparent pattern in principal component space to the cases that we classified incorrectly, the two classes are massively imbalanced (as is often the case in classification sets). Other metrics are therefore necessary to compare our results to those of SVM (and other algorithms). To that end, we briefly digress to the “accuracy paradox” AP . This colloquial “paradox” is simple to explain: if the given data set is heavily imbalanced, so that most of the provided data belong to one type, one might as well just guess the dominant answer every time and still attain a high accuracy while missing the subtle instances. This must be taken into consideration. To that end, in Table 3 and Fig. (32), we provide a confusion table and examine how well we perform by distinguishing between true positives and negatives and false positives and negatives. The sensitivity and specificity are the true positive and true negative rates, respectively. This information may be used to compute various metrics; these measures combine class imbalance, specificity, sensitivity, and performance. A particularly useful measure for assessing class imbalance (appearing in Table 4) is the so-called “Cohen’s κ”. In terms of these metrics, the superiority of SRVM over SVM is made apparent (by a very large statistically significant difference). Strikingly, the statistical measures in Table 4 indicate that, colloquially speaking, SVM does more “guessing” as compared to the SRVM algorithm and tends to become “lucky” by predicting the dominant class more often. This is to be expected since SVM segments feature space with a unique kernel and only considers points on a specific boundary that need to be carefully classified.
By contrast, the distinguishing attribute of SRVM is that numerous stochastic replicas are considered, a characteristic that tends to lead to less bias. Taken together, the Cohen’s κ values Cohen , along with the F-score (which does not take into account true negatives) and MCC (Matthews correlation coefficient) Matthews metrics, illustrate that SRVM exhibits a statistically significant advantage over the SVM algorithm insofar as the lack of inbuilt class imbalance bias is concerned.
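The metrics named above follow directly from the four entries of a binary confusion table. The sketch below uses the standard textbook definitions; the function name and the two sets of counts are hypothetical and are not the values of Tables 3 and 4. The first example is a classifier that always guesses the dominant class, which illustrates the accuracy paradox.

```python
from math import sqrt

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator vanishes."""
    return num / den if den else 0.0

def imbalance_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, F-score, Cohen's kappa, and MCC."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = safe_div(tp, tp + fn)   # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)
    f_score = safe_div(2 * precision * sensitivity, precision + sensitivity)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = safe_div(accuracy - expected, 1 - expected)
    mcc = safe_div(tp * tn - fp * fn,
                   sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f_score": f_score,
            "kappa": kappa, "mcc": mcc}

# Always predicting the dominant (positive) class on a 90/10 split.
guesser = imbalance_metrics(tp=90, fp=10, fn=0, tn=0)
# A classifier with slightly lower accuracy that does detect the minority class.
learner = imbalance_metrics(tp=80, fp=2, fn=10, tn=8)
```

The guesser attains 90% accuracy yet has zero Cohen's κ and zero MCC, whereas the learner has a lower accuracy but clearly positive κ and MCC.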
|Label||Fold 1||Fold 3||Fold 5|
|SRVM Right, SVM Right||18||20||22|
|SRVM Wrong, SVM Wrong||4||2||2|
|SRVM Right, SVM Wrong||2||1||0|
|SRVM Wrong, SVM Right||1||0||0|
|SRVM Mixed, SVM Right||1||2||0|
|SRVM Mixed, SVM Wrong||0||0||1|
IV.6 Layered Voting: Multiple Kernels and Recursive Learning
As we remarked earlier, there are many possible “interactions” between individual replicas (see the schematic of Fig. (2)). The equal weight average of Eq. (6) is merely one of the simplest choices for deciding how to fuse the results of different replicas into a collective prediction.
Following the conventional terminology of neural networks, we may add “hidden layers” to the SRVM by allowing different kernels to all vote. Each kernel predicts an outcome on its own for each instance. We can then combine the results of the different kernels by voting anew on the outcomes of the first (single kernel) votes; see Fig. (33). The advantage of such a modus operandi is that we can adjust weights for the different functions. Without adjusting for weights, instead of Eq. (6), one may use the more general average of
where the predicted value for replica is found using Eq. (1) with kernel belonging to the th entry of the list of Eq. (2) or other trivial extensions thereof (and the total number of functions in such lists). The weight of one function may be adjusted as the calculation proceeds to be higher or lower to increase the accuracy. Without adjusting for weights, we observe in Figs. (34, 35, 36) that the accuracy, run time, and coefficient of performance are similar to those that we obtained earlier (within a single layer voting model, i.e., the usual SRVM). A trivial extension of Eq. (14) is that of the weight adjusted voting,
with the weights satisfying, .
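A two-layer vote of this kind can be sketched as follows. The form used here is an assumption for illustration: the replicas of each kernel are first averaged with equal weight, the per-kernel averages are then combined with non-negative weights normalized to sum to one, and the sign of the result is the predicted label. The exact expressions of Eqs. (14) and (15), including the constraint on the weights, may differ. The kernel names and predictions are hypothetical.

```python
def layered_vote(kernel_predictions, weights=None):
    """Weighted vote over kernels, each of which averages its own replicas.

    kernel_predictions[k][a][i] is the +1/-1 label that replica a of
    kernel k assigns to data point i.
    """
    n_kernels = len(kernel_predictions)
    if weights is None:
        weights = [1.0] * n_kernels          # equal-weight vote
    total = sum(weights)
    weights = [w / total for w in weights]   # normalize to sum to one
    n_points = len(kernel_predictions[0][0])
    labels = []
    for i in range(n_points):
        score = 0.0
        for w, replicas in zip(weights, kernel_predictions):
            # First layer: equal-weight average over this kernel's replicas.
            score += w * sum(rep[i] for rep in replicas) / len(replicas)
        # Second layer: sign of the weighted combination of kernels.
        labels.append(1 if score > 0 else -1)
    return labels

# Two hypothetical kernels with three replicas each, three data points.
gaussian = [[1, 1, -1], [1, -1, -1], [1, 1, -1]]
polynomial = [[1, -1, 1], [1, -1, -1], [-1, -1, 1]]
equal = layered_vote([gaussian, polynomial])
tilted = layered_vote([gaussian, polynomial], weights=[0.9, 0.1])
```

With equal weights the two kernels outvote each other on the second point, while weighting the first kernel more heavily flips that label, showing how the weights act as adjustable parameters.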
As we explained in Section IV.4, one may aim to find the optimal parameters by noting when these lead to a maximal overlap between the replicas. Additionally, of course, one may see when these lead to accurate solutions; yet that either requires “cheating”, i.e., (1) adjusting the parameters to obtain the known answer, or (2) removing some of the known input data to use it as a CV test (the latter case is non-optimal since already known data are removed from the training set). At any rate, testing for overlaps and/or direct accuracies by brute force change of parameters can be taxing. An alternate approach for determining the optimal parameters and weights such as those of in Eq. (15) (and simple multi-layer generalizations thereof), somewhat similar to reinforcement learning RL , is to compute the overlap and/or accuracies for a set of parameters and then recursively use SRVM to extrapolate and decide on the optimal parameters. Since the accuracies/overlaps are continuous variables, this task lies in the domain of “regression” (the prediction of an outcome that is a continuous variable). That is, by successively applying SRVM to the accuracy or replica overlap results, we may home in on the region of the parameters where optimal performance may occur (similar to ternary search algorithms).
To illustrate the basic premise, we provide the results of such an analysis for suggesting the optimal number of anchor points for which the highest overlap and accuracy for the Heart data set appear. In Figs. (37, 38), we show the results of two regression analyses using (i) the Gaussian kernels of Eq. (2) (with 10 anchor points in each of the replicas used) and (ii) cubic splines. Similar to the well-known Runge phenomenon Runge for high order polynomial fits for equally spaced points, we observed that when no clamping was done at the endpoints of the domain, these one-dimensional regression curves performed well throughout apart from the regimes near the endpoints. When we fixed the values of the accuracies at the two endpoints and performed regression analysis with either (i) or (ii) given 15 training points for different values of , the resulting curves were relatively close to the actual data for all points that we tested earlier. Most importantly (see Figs. (37, 38)), the maxima of the regression curves were close to those found in the complete data set. This simple example illustrates the viability of applying machine learning recursively onto itself in order to find the optimal parameters that might maximize its accuracy. The parameters governing our algorithm may be automatically optimized by such a recursive scheme.
We must remark that in this one-dimensional problem of determining the optimal number of anchor points at which maximal accuracy may appear, one may readily forgo the use of recursive SRVM (or, similarly, any machine learning algorithm) and instead employ a simple ternary search to locate the parameters for which maximal viable accuracy may be achieved. We next briefly discuss regression more generally.
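The ternary search mentioned above can be sketched as follows. This is a minimal illustration rather than the procedure used to produce the figures: it assumes that the score (accuracy or replica overlap) is a unimodal function of the integer number of anchor points, and the names `ternary_search_max` and `toy_score` are hypothetical.

```python
from functools import lru_cache
from typing import Callable


def ternary_search_max(score: Callable[[int], float], lo: int, hi: int) -> int:
    """Return an integer in [lo, hi] that maximizes `score`.

    Assumption: `score` is unimodal on [lo, hi] (it rises to a single
    peak and then falls), e.g. accuracy versus number of anchor points.
    """
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if score(m1) < score(m2):
            lo = m1 + 1      # the peak lies to the right of m1
        else:
            hi = m2 - 1      # the peak lies to the left of m2
    # At most three candidates remain; compare them directly.
    return max(range(lo, hi + 1), key=score)


@lru_cache(maxsize=None)
def toy_score(n_anchor: int) -> float:
    """Made-up unimodal score with a peak at 37 anchor points.

    In practice this would train the replicas with `n_anchor` anchor
    points and return the measured accuracy or overlap.
    """
    return -float((n_anchor - 37) ** 2)


if __name__ == "__main__":
    best = ternary_search_max(toy_score, 1, 200)
    print(best, toy_score.cache_info().currsize)
```

Each iteration discards roughly one third of the remaining interval, so the number of expensive score evaluations grows only logarithmically with the size of the search range. The cache avoids retraining for a value that has already been evaluated.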
Thus, in conclusion, as we motivated above and illustrated for a simple example, we may bypass the need for an exhaustive parameter search and instead employ recursive machine learning to find the optimal parameters.
[Figure caption fragment: histograms of the residuals] Based on the histograms, the residuals appear to be roughly normally distributed with the exception of some slight skewing in the tails. This result indicates a strong performance of the model.
[Figure caption fragment: probability plot of the residuals, 250 anchor points] The quantiles of the residuals are plotted against the expected quantiles of the normal distribution. With the exception of minor skewing in the tails, the quantiles appear normal, suggesting strong model performance.
V Regression via replicas
The functions and the associated kernels that we presented in Section III are continuous. Thus, on an intuitive level, they are more naturally related to continuous value prediction (regression) than to discrete quantifiers (such as those associated with the classification problems that we considered in earlier Sections).
In this Section, we will explicitly examine whether SRVM is indeed a viable regression machine learning algorithm. As we will explain below, what is important for a regression solver is to have normal statistical properties. Indeed, for regression problems, there are no “benchmarks” that are as clearly defined for continuous regression data as they are for discrete classifiers (where an answer is clearly wrong or right). With this in mind, we examined the features of the LSVT data set (for which we developed a binary classifier) and tested to see whether our predictors (without the thresholding of Eq. (5)) comprise good (continuous) regression predictors to the binary data.
A natural route for SRVM is to expand in kernels of the full vectors (as in the kernels of Eq. (2) employed in the expansion of Eq. (1) that we have largely used in the current work, with the exception of the multinomial kernels of Eqs. (8, 9)). However, the manner in which a regression is usually performed in other existing machine learning approaches is different. Instead of expanding in functions of all components (features) of a vector, most researchers typically examine functions of individual features. That is, one typically posits that a function of a single feature (instead of functions of whole instance feature vectors) may be the optimal kernel to use. Underlying this common practice of single feature expansion is an assumption regarding multicollinearity, which may then be consistently tested for; this also enables a study of the individual significance of a given feature. Translated into our framework, such a regression is tantamount to an expansion in kernels that each depend on a single feature (component) of the instance vector rather than on the full vector.
In our regression studies, we performed regression in both ways. That is,
(A) We expanded in uniform kernels of the full feature vectors (all components), and
(B) Similar to Eq. (8), we also examined the system when expanding in kernels that are functions of the individual components (single features) of the feature vectors.
In our regression analysis, we mainly focused on method (A) (that of expanding in functions of the full feature vector as in Eq. (2)). When searching for optimal parameters, one has to compare the mean squared error with the generalization error, since the mean squared error (as is visible in Fig. (39)) will always decrease with more anchor points. However, the generalization error will increase dramatically when overfitting occurs beyond a threshold number of anchor points. Inspecting Fig. (39), we may ascertain the optimal number of anchor points.
Assessing the quality of a regression is notably more challenging than determining the accuracy in a classification problem. The predicted outcome is clear cut for the discrete variable in a classification problem; this is obviously not the case for regression outcomes, which are continuous functions. Instead of seeing whether the “exact” outcome is achieved (an impossible feat for continuous real numbers), additional, more detailed checks are necessary. The commonplace minimization of the sum of square errors is indeed how we found the optimal parameters (that are used in the plots). As is well appreciated, the raw sum of square errors is not a sufficiently illuminating metric for judging the quality of regression solvers.
For a regression to perform optimally, aside from predicting results that are close to the correct answers, its residuals (errors) should be random and normally distributed about the true population; the residuals should have no autocorrelation (no bias of one data point influencing another). In Figs. (40,45), we provide scatter plots and histograms of the residuals with the mean in red and the standard deviations in blue and green. The histograms are seen to be close to normal. One may also examine the probability plots of Fig. 42; apart from skewing at the tails, the residuals are normal, which indicates a high quality regression. Autocorrelation statistical tests further suggest that no significant autocorrelations are present. All of our tests indicate that no bias is present in our regression.
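The residual checks described above (zero-mean residuals, normal quantiles, and absence of autocorrelation) can be illustrated with a short sketch. It is not the analysis code behind the figures: the function names are hypothetical, the lag-1 autocorrelation and the quantile-quantile correlation are simple stand-ins for the statistical tests mentioned in the text, and the residuals in the demonstration are synthetic.

```python
import random
from statistics import NormalDist, mean
from typing import List, Sequence


def lag1_autocorrelation(residuals: Sequence[float]) -> float:
    """Sample autocorrelation of the residuals at lag 1 (near 0 if uncorrelated)."""
    mu = mean(residuals)
    num = sum((a - mu) * (b - mu) for a, b in zip(residuals, residuals[1:]))
    den = sum((r - mu) ** 2 for r in residuals)
    return num / den


def qq_correlation(residuals: Sequence[float]) -> float:
    """Pearson correlation between sorted residuals and normal quantiles.

    Values close to 1 mean that the probability (Q-Q) plot is close to a
    straight line, i.e. the residuals look normally distributed.
    """
    n = len(residuals)
    ordered: List[float] = sorted(residuals)
    theory = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = mean(theory), mean(ordered)
    cov = sum((x - mx) * (y - my) for x, y in zip(theory, ordered))
    sx = sum((x - mx) ** 2 for x in theory) ** 0.5
    sy = sum((y - my) ** 2 for y in ordered) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic residuals, for illustration only (not data from this work).
    residuals = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    print("mean:", mean(residuals))
    print("lag-1 autocorrelation:", lag1_autocorrelation(residuals))
    print("Q-Q correlation:", qq_correlation(residuals))
```

A residual sequence with a systematic trend would show a lag-1 autocorrelation close to 1, whereas heavy skewing in the tails would lower the Q-Q correlation.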
VI Possible algorithm independent bounds on the accuracy
As is well appreciated, different machine learning methods have their distinct virtues. Some methods (such as artificial neural networks) seem to work “magically” well on a large number of data sets for reasons that, to date, largely remain shrouded in mystery. In this Section, we wish to suggest that SRVM may lead to universal asymptotic limits on the accuracy of these and all algorithms. The logic underlying this speculation is as follows. In reality, any physical process exhibits an inherent error. That is, there is an underlying “theory” or “model independent” error to any physical process. To give a colloquial example: suppose that we knew all of the scores in various soccer matches. Even with much knowledge of the scores of the final games in all prior matches and information about individual players, it is still, of course, impossible to predict with certainty what the result of a new soccer match will be. There is, even in “classical” systems such as soccer games, an important element of stochasticity including pure “luck”. That is, any given finite set of features will be insufficient to provide error free predictions no matter how complex our algorithm may be. As in putative physical theories, the total error associated with a given theory vis a vis the measured data will generally be the simple sum (Eq. (17)) of an inherent error and a theory error.
Here, the total error is that of the prediction vis a vis the measured data; the inherent error is the underlying measurement error (including any external noise that is out of our control and the use of features that are incomplete and cannot allow an accurate prediction even if they were all known with absolute precision); and the theory error is the error in the theory or machine learning predictor that we use. The two diametrically opposite limits of Eq. (17) are intuitively clear. If one is given accurate, correctly measured features with which predictions may, in principle, be made with absolute certainty (i.e., the inherent error vanishes), then all error in the prediction is due to the use of an inaccurate theory (the total error equals the theory error). At the opposite extreme, if one has the correct theory (e.g., equations that correctly describe physical processes), then any error in the predictions will be due to either incomplete or inaccurate input data (the total error equals the inherent error). The latter errors are not limited to literal physical measurements alone. For instance, if there are numerous spin glass SG1 ground states that are consistent with any given assignment of a small finite number of spins, then one may not accurately predict the spin at each site due to the underlying degeneracy SG2 ; BinomialSG : the given features do not suffice for a complete prediction of the ground state that one is asked to predict. Now, if the errors in different theoretical models (“replicas” in the parlance of SRVM) are independent of each other, then the error in the average prediction of the collection of theories (each with its individual theory error), i.e., the SRVM prediction, will, by the central limit theorem, scale as in Eq. (18), with the theory contribution suppressed as the inverse square root of the number of replicas.
A consequence of Eq. (18) is that by increasing the number of replicas, the error rate decreases and converges to the model independent value (the inherent stochastic error underlying the systems and the features employed). Of course, some theories may be more powerful than others and lead to better predictions. However, if Eq. (18) is correct, then asymptotically, irrespective of which algorithm is employed for the individual replicas, a universal limiting error will be reached. In the context of the examples studied in this work, if we test theories with a different number of anchor points then some may outperform others for a given number of replicas, yet in the large replica limit all SRVM averages should coincide on the same answer and associated errors. We qualitatively tested whether this prediction might be consistent with our data. Specifically, in Fig. 43 and Fig. 45, we respectively fit the SRVM cross-validation errors (regarded here as the total error) of the Heart and Australian data sets to Eq. (18). A general monotonically decreasing trend of the average error is seen with increasing replica number (apart from situations in which the accuracy is already near optimal when the number of replicas is small). More notably, the extrapolated intercept in these figures seems to be uniform across all different anchor point solvers. If Eq. (18) is correct then these intercepts correspond to the limiting stochastic error of the system that no algorithm can surpass. One cannot, of course, assert that these results demonstrate the validity of our conjecture. However, the observed behaviors are consistent with Eq. (18).
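The extrapolation to the large replica limit can be sketched as a linear least squares fit. The functional form used here is an assumption made for illustration, namely a total error equal to a constant (the inherent error) plus a theory error divided by the square root of the number of replicas, which is the central limit theorem scaling described above; it is not a transcription of Eq. (18). The function name `fit_inverse_sqrt` is hypothetical and the demonstration data are synthetic.

```python
from math import sqrt
from typing import Sequence, Tuple


def fit_inverse_sqrt(replicas: Sequence[int],
                     errors: Sequence[float]) -> Tuple[float, float]:
    """Least squares fit of errors to  a + b / sqrt(R).

    Returns (a, b): `a` is the extrapolated error in the limit of
    infinitely many replicas (the intercept), and `b` is the assumed
    single-replica theory error (the slope in the variable 1/sqrt(R)).
    """
    xs = [1.0 / sqrt(r) for r in replicas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(errors) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, errors))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Synthetic cross-validation errors generated from the assumed form
    # with inherent error 0.15 and theory error 0.20 (illustrative only).
    replica_counts = [1, 4, 16, 64, 256]
    cv_errors = [0.15 + 0.20 / sqrt(r) for r in replica_counts]
    inherent, theory = fit_inverse_sqrt(replica_counts, cv_errors)
    print(inherent, theory)
```

If fits for solvers with different numbers of anchor points return different slopes but a common intercept, this would be consistent with the conjectured algorithm independent limiting error.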
In summary, we presented a new machine learning algorithm, the “Stochastic Replica Voting Machine” (SRVM), that largely drew its inspiration from Landau theories and the existence of various continuous real function fits to the same existing data sets, along with the “wisdom of the crowds” phenomenon. This method uses expansions of data via kernels as do many other algorithms, e.g., kernel-review , including SVM. However, unlike SVM, our method does not follow the error optimization and regularization scheme. The guiding principle underlying our approach is that of invariance and stability of the predicted results when different random functions are used. In the context of the results presented in the current work, known data are fitted to fix the kernel coefficients in multiple stochastic functions. Once these coefficients are fixed, predictions are made by the ensemble of these random functions (“replicas”) as to the correct classification/regression of new data. Each of the functions “votes” for the predicted outcome. The system then averages over all predictions by weighing these in a chosen manner. We tested the algorithm’s performance on multiple known benchmark problems. Overall, we found the accuracy of our algorithm to be comparable to that of standard, widely used techniques such as Support Vector Machines (SVM). In contrast to SVM, however, the optimal parameters in our model are set by using all of the given data (not tossing away a subset of these when using cross-validation). In our framework, the optimal parameters are ascertained by observing when the different stochastic functions (the “replicas”) have a high degree of overlap. That is, we see for which parameters there exists a consensus in the predictions of the replicas regarding the outcome for particular data. No less notably, due to the intrinsic stochastic character of the multiple functions used, the system is far superior to SVM in avoiding class imbalance bias. Similar to “Random Forest Decision Tree” algorithms decisions , SRVM invokes voting between different predictions. However, contrary to numerous random forest and neural network methods, SRVM does not introduce “decision trees”. Rather, in SRVM, actual real functions (such as those of Eq. (2)) are employed. If, from physics based or other considerations, information is known about the expected functional dependence of the results on the input features, then one may expand in a basis of stochastic functions of that expected form instead of employing the generic functions that we used in the current work.
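The overlap criterion for setting parameters can be illustrated with a small sketch. The overlap measure used here (the fraction of data points on which two replicas agree, averaged over all pairs of replicas) is an assumed stand-in for the overlap defined earlier in this work, and the names `mean_pairwise_overlap` and `select_by_overlap` as well as the toy predictions are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence


def mean_pairwise_overlap(predictions: Sequence[Sequence[int]]) -> float:
    """Average agreement between all pairs of replicas.

    `predictions[r][i]` is the class label that replica r assigns to
    data point i. The result is 1.0 when all replicas agree everywhere.
    """
    pairs = list(combinations(predictions, 2))
    if not pairs:
        raise ValueError("at least two replicas are required")
    total = 0.0
    for a, b in pairs:
        agree = sum(1 for x, y in zip(a, b) if x == y)
        total += agree / len(a)
    return total / len(pairs)


def select_by_overlap(candidates: Dict[int, Sequence[Sequence[int]]]) -> int:
    """Pick the parameter value whose replicas show the highest consensus."""
    return max(candidates, key=lambda p: mean_pairwise_overlap(candidates[p]))


if __name__ == "__main__":
    # Toy replica predictions (labels +1/-1) for two candidate numbers of
    # anchor points; these numbers are made up for illustration.
    toy = {
        5: [[1, -1, 1, 1], [1, -1, -1, 1], [-1, -1, 1, 1]],
        10: [[1, -1, 1, 1], [1, -1, 1, 1], [1, -1, 1, -1]],
    }
    print(select_by_overlap(toy))
```

No held-out labels enter this selection; only the mutual consistency of the replicas is used, which is why all of the given data can remain in the training set.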
We remark that the use of multiple classifiers (different from our stochastic functions) to enhance accuracy further appears in other machine learning approaches such as those of unweighted “bagging” bag or more sophisticated “boosting” boosting methods that have been prevalent in, e.g., neural networks; it is conceivable that our accuracy might be further improved by incorporating aspects of these schemes when combining the bare SRVM algorithm that we described in the current work with other known classifiers. Indeed, as we detail elsewhere us-materials , a function describing any particular neural network can be regarded as yet another member of the ensemble of functions used in an SRVM implementation. We stress that unlike Markov Chain Monte Carlo (MCMC) mcmc methods, the crux of our general approach hinges, in the absence of given special details, on the use of random stochastic functions of different types (not on sampling from a single distribution function). Contrary to MCMC, neural networks (including deep learning) ANN ; DL , and many other methods, the number of parameters in our simple approach is rather limited. Optimizing for these few parameters (in our case the number of anchor points) can be done in an automated way (see Section IV.6). That is, the correct parameters to be used do not need to be introduced by human training but are rather self-generated. Not much training is required in order to achieve high accuracy. For all of the examples that we studied, we did not find a significant difference in accuracy between instances in which (1) direct arithmetic averages of the individual replica predictions (Eq. (6)) were used and (2) individual replica predictions were weighted differently depending on, e.g., how close these are to the anticipated correct classification values (e.g., the individual replica predictions that are closer to the values ±1 for the binary classification problem may, in the final vote, be given higher weight over other predictions when those predictions have a modulus that is very different from unity). That is, we found that the results for the examples that we studied were largely insensitive to the particular choice of voting function of the individual replica predictions. The accuracies that we obtained when using disparate voting functions for the Heart benchmark are provided in Tables 5 and 6.
| Gaussian Weighting | = 1 | = 10 | = 100 | = 1000 |
| Gaussian Weighted Accuracy | 0.7606 | 0.7988 | 0.8051 | 0.8109 |
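The two kinds of voting compared above can be sketched as follows for a binary problem with target values ±1. The arithmetic average follows the description of Eq. (6) in the text; the Gaussian weight used here, centered on unit modulus with a hypothetical `width` parameter, is only an assumed example of a weighting that favors replica outputs close to ±1 and is not necessarily the weighting behind the table.

```python
from math import exp
from typing import Sequence


def sign(value: float) -> int:
    """Threshold a real-valued vote into a class label +1 or -1."""
    return 1 if value >= 0 else -1


def average_vote(outputs: Sequence[float]) -> int:
    """Unweighted vote: arithmetic mean of the replica outputs."""
    return sign(sum(outputs) / len(outputs))


def gaussian_weighted_vote(outputs: Sequence[float], width: float) -> int:
    """Weighted vote favoring replica outputs whose modulus is near 1.

    Assumed weight: exp(-(|f| - 1)^2 / (2 * width^2)). A very large
    `width` gives nearly equal weights and recovers the plain average.
    """
    weights = [exp(-((abs(f) - 1.0) ** 2) / (2.0 * width ** 2)) for f in outputs]
    return sign(sum(w * f for w, f in zip(weights, outputs)))


if __name__ == "__main__":
    # Made-up replica outputs: two replicas near +1 and one outlier at -3.
    outputs = [0.9, 1.1, -3.0]
    print(average_vote(outputs))
    print(gaussian_weighted_vote(outputs, width=0.5))
    print(gaussian_weighted_vote(outputs, width=1000.0))
```

The outlier in this contrived input dominates the plain average but is suppressed by a narrow Gaussian weight. For the benchmarks studied in this work, such differences between voting functions were found to be small.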
As we alluded to in Section IV.6, for more complex problems, one may envision applying machine learning onto itself; such a recursive modus operandi may enable one to potentially determine the optimal manner in which voting is to be taken from the individual replica results. Given the simplicity of our algorithm and its numerous natural extensions, much more work can be done to further streamline the algorithm and apply it to many different data sets. Aside from the numerous data set benchmarks tested in the current work, two additional materials oriented classification problems (both binary and ternary) were studied in us-materials . The current results of our supervised machine learning study augment those of an earlier replica type approach for unsupervised learning and the solution of combinatorial problems in which the notions of stability and (potentially recursive) voting or information theory correlations/inference were employed CD3 ; CD4 ; vision ; vision1 ; vision2 ; phase1 ; phase2 ; TSM ; mychapter . We may, indeed, very readily combine our supervised machine learning approach with clustering ideas for unsupervised machine learning. In this latter clustering approach, instead of minimizing errors between the predictions of random functions on known training data vis-à-vis the known outcomes (as is done for supervised machine learning), we may minimize an energy function that favors clustering of feature space points (or pixels in image segmentation applications) that share similar features (and maximize information theory correlations of candidate replica solutions).
Thus, instead of minimizing the cost function measuring the quality of fits relative to known training points, data may be classified via the added contributions of the trained random SRVM kernels augmented by additional weights that measure the correlation between different points in feature space that are classified as belonging to the same set (as in the unsupervised clustering approach of CD3 ; CD4 ; vision ; vision1 ; vision2 ; phase1 ; phase2 ; TSM ; mychapter ). There are many other natural extensions of the method presented here. For instance, instead of the linear expansions in the kernel functions (Eq. (1)), one may, of course, consider higher order expansions in the kernels. A notable advantage of the SRVM method is that it approximates the data by mathematical functions of the input features that may, hopefully, be rationalized (instead of by more abstract constructs). Another possible advantage of SRVM is that it may enable the generation of algorithm independent bounds on the accuracy (as suggested in Section VI).
We conclude with a brief speculation. As we repeatedly underscored (and do so once again now), the key notion behind our approach was that of stability. Stochastic functions were employed as individual predictors. If these stochastic predictors all consistently agreed on the same classification (or regression) of a given data point then, as our inter-replica overlap analysis demonstrated, these predictions were all likely to be correct. Given this correlation, one may turn this result on its head and ponder whether fundamental physical theories are, in effect, Landau type theories in disguise and whether spatio-temporal coordinates are not absolute (up to standard covariant transformations) but are rather assigned emergent features enabling the most consistent predictions concerning the behaviors of these systems. If a particular physical Lagrangian or effective Landau theory with generic low order polynomial coupling terms and gradients is assumed to describe a particular system, then one may assign coordinates to various data points such that the predictions using this action are the most stable (such a possibility is similar to the reassignment of fitness variables in chemical analysis permutation ). That is, we ask if the abstract features in unsupervised learning (in which the pertinent features are inferred) might also correspond to true physical coordinates such that the ensuing representation of the data concerning particle coordinates is smooth, thus enabling a low cost description of its behavior by simple (Lagrangian, energy, or other) and generically low order smooth functions of the collective descriptors and their gradients relative to the individual coordinates (features) qi .
We are grateful for the support of the National Science Foundation under grant number NSF DMR-1411229. ZN is also grateful to the Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611, where this work was completed.
- (1) L. Einav and J. Levin, “Economics in the age of big data”, Science 346, 6210 (2014).
- (2) https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
- (3) A. Yildermaz, “Using Big Data to Decode Private Sector Wage Growth”, arXiv:1609.09067 (2016).
- (4) C. L. Philip Chen and C. -Y. Zhang, “Data-intensive applications, challenges, techniques and technologies: A survey on Big Data”, Information Sciences 275, 314-347 (2014) .
- (5) A. Szalay and J. Gray, “Science in an exponential world”, Nature 440, 23-24 (2006).
- (6) M. Hilbert and P. Lopez, “The World’s Technological Capacity to Store, Communicate, and Compute Information”, Science, 332, Issue 6025, pp. 60-65 (2011).
- (7) E. Alpaydin, Introduction to Machine Learning, 3rd Ed., The MIT Press, ISBN: 978-0-262-02818-9 (2014).
- (8) S. Fortunato, “Community detection in graphs”, Physics Reports 486, 75-174 (2010).
- (9) M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks”, Phys. Rev. E 69, 026113 (2004).
- (10) V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks”, J. Stat. Mech. 10, 10008 (2008).
- (11) V. Gudkov, V. Montelaegre, S. Nussinov, and Z. Nussinov, “Community detection in complex networks by dynamical simplex evolution”, Phys. Rev. E 78, 016113 (2008).
- (12) M. Rosvall and C. T. Bergstrom, “Maps of random walks on complex networks reveal community structure”, Proc. Natl. Acad. Sci. U.S.A. 105, 1118-1123 (2008).
- (13) Peter Ronhovde and Zohar Nussinov, “An Improved Potts Model Applied to Community Detection”, Physical Review E 81, 046114 (2010).
- (14) P. Ronhovde and Z. Nussinov, “Multiresolution community detection for megascale networks by information-based replica correlations”, Phys. Rev. E 80, 016109 (2009).
- (15) D. Hu, P. Ronhovde, and Z. Nussinov, “A Replica Inference Approach to Unsupervised Multi-Scale Image Segmentation”, Phys. Rev. E 85, 016101 (2012).
- (16) Dandan Hu, Pinaki Sarder, Peter Ronhovde, Sharon Bloch, Samuel Achilefu, and Zohar Nussinov, “Automatic segmentation of fluorescence lifetime microscopy images of cells using multiresolution community detection: a first study”, Journal of microscopy 253 (1), 54-64 (2014).
- (17) D. Hu, P. Sarder, P. Ronhovde, S. Bloch, S. Achilefu, and Z. Nussinov, “Community detection for fluorescent lifetime microscopy image segmentation”, Proc. SPIE 8949, Three-Dimensional and Multidimensional Microscopy: Image Acquisition and Processing XXI, 89491K (2014); http://dx.doi.org/10.1117/12.2036875.
- (18) P. Ronhovde, S. Chakrabarty, M. Sahu, K. K. Sahu, K. F. Kelton, N. Mauro, and Z. Nussinov “Detection of hidden structures on all scales in amorphous materials and complex physical systems: basic notions and applications to networks, lattice systems, and glasses”, Scientific Reports 2, 329 (2012) DOI: 10.1038/srep00329.
- (19) P. Ronhovde, S. Chakrabarty, M. Sahu, K. F. Kelton, N. A. Mauro, K . K. Sahu, and Z. Nussinov, “Detecting hidden spatial and spatio-temporal structures in glasses and complex physical systems by multiresolution network clustering”, The European Physics Journal E 34, 105 (2011) DOI: 10.1140/epje/i2011-11105-9.
- (20) Bo Sun, Blake Leonard, Peter Ronhovde, and Zohar Nussinov, “An interacting replica approach applied to the traveling salesman problem”, INSPEC Accession Number: 16267791, http://ieeexplore.ieee.org/document/7556001/ IEEE Conference Publications (2016) special issue, SAI computing Conference (reviewed conference proceedings) Pages: 319 - 329, DOI: 10.1109/SAI.2016.7556001.
- (21) James Surowiecki, The Wisdom of Crowds, Anchor Books. pp. xv. ISBN 0-385- 72170-6 (2005); A. Spiegel, http://www.npr.org/sections/parallels/2014/04/02/297839429/-so-you-think-youre-smarter-than-a-cia-agent (2014).
- (22) Z. Nussinov, P. Ronhovde, Dandan Hu, S. Chakrabarty, M. Sahu, Bo Sun, N. A. Mauro, and K. K. Sahu, “Inference of hidden structures in complex physical systems by multi-scale clustering”, Chapter 6 in the book “Information Science for Materials Discovery and Design”, Springer Series Materials, vol. 225 Edited by Turab Lookman, Frank Alexander, and Krishna Rajan. 978-3-319-23870-8 (2016).
- (23) V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd Ed., Springer, New York (1999).
- (24) J.A.K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers”, Neural Processing Letters 9, 293-300 (1999).
- (25) D. J. Livingstone, Artificial Neural Networks: Methods and Applications, Humana Press (2011).
- (26) Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, (2016), MIT Press, Cambridge, MA and London, England
- (27) J. J. Hopfield, “Neural Networks and Physical Systems with Emergent Collective Computational Abilities”, PNAS 79, 2554-2558 (1982).
- (28) D. J. Amit, H. Gutfreund, and H. Sompolinsky, “Spin-glass models of neural networks”, Phys. Rev. A 32, 1007 (1985).
- (29) D. J. Amit, H. Gutfreund, and H. Sompolinsky, “Storing Infinite Numbers of Patterns in a Spin-Glass Model of Neural Networks”, Phys. Rev. Lett. 55, 1530 (1985)
- (30) J. J. Hopfield and D. W. Tank, “Computing with neural circuits: a model”, Science 233, 625 (1986).
- (31) H. Sompolinsky, “Statistical Mechanics of Neural Networks”, Physics Today 41 (12), 70 (1988).
- (32) D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A learning algorithm for Boltzmann machines”, Cognitive Science 9, 147-169 (1985).
- (33) A. Karpatne, G. Atluri, J. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, “Theory-guided Data Science: A new paradigm for scientific discovery”, arXiv:1612.08544 (2016).
- (34) K. Huang, Statistical Mechanics, John Wiley & Sons (New York, Chichester, Brisbane, Toronto, Singapore) (1987).
- (35) L. D. Landau, “On the Theory of Phase Transitions”, Zh. Eksp. Teor. Fiz. 7, 19-32 (1937).
- (36) Dandan Hu, Peter Ronhovde, and Zohar Nussinov, “Phase transition in the community detection problem: spin-glass type and dynamic perspectives”, Philosophical Magazine 92, 406 (2012); online- https://arxiv.org/pdf/1008.2699.pdf (2010).
- (37) A. Decelle, F. Krzakala, C. Moore, and L. Zdeborova, “Phase transition in the detection of modules in sparse networks”, Phys. Rev. Lett. 107, 065701 (2011); online- http://arxiv.org/abs/1102.1182 (2011).
- (38) D. M. W. Powers, “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation”, Journal of Machine Learning Technologies. 2 (1): 37-63 (2011).
- (39) C. de Boor, “Efficient computer manipulation of tensor products”, ACM Trans. Math. Software, 5 (2) pp. 173-182 (1979); C. de Boor, A Practical Guide to Splines Springer, New York (1978); E. Grosse, “Tensor spline approximation”, Linear Algebra and its Applications 34, 29-41 (1980).
- (40) Four-class source: Tin Kam Ho and Eugene M. Kleinberg, “Building projectable classifiers of arbitrary complexity”, in Proceedings of the 13th International Conference on Pattern Recognition, 880-885, Vienna, Austria (1996); svmguide1 source: Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, “A practical guide to support vector classification”, Technical report, Department of Computer Science, National Taiwan University (2003).
- (41) Statlog (Heart Disease) data set, M. Lichman (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/datasets/Statlog+(Heart)]. Irvine, CA: University of California, School of Information and Computer Science; R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, and V. Froelicher, “International application of a new probability algorithm for the diagnosis of coronary artery disease”, American Journal of Cardiology 64, 304 (1989); J. H. Gennari, P. Langley, and D. Fisher, “Models of incremental concept formation”, Artificial Intelligence 40, 11-61 (1989).
- (42) Statlog (Australian Credit Approval) data set, M. Lichman (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science; R. Quinlan, “Simplifying decision trees”, Int J Man-Machine Studies 27, 221 (1987).
- (43) A. Tsanas, M.A. Little, C. Fox, and L.O. Ramig, “Objective automatic assessment of rehabilitative speech treatment in Parkinson’s disease”, IEEE Transactions on Neural Systems and Rehabilitation Engineering, 22, pp. 181-190, (2014); data from: http://archive.ics.uci.edu/ml/datasets/LSVT+Voice+Rehabilitation.
- (44) https://archive.ics.uci.edu/ml/datasets/internet+advertisements
- (45) R. A. Fisher ,“The use of multiple measurements in taxonomic problems”, Annals of Eugenics 7, 179-188 (1936); https://archive.ics.uci.edu/ml/datasets/iris
- (46) O. L. Mangasarian and W. H. Wolberg, “Cancer diagnosis via linear programming”, SIAM News 23, 1-18 (1990); William H. Wolberg and O. L. Mangasarian, “Multisurface method of pattern separation for medical diagnosis applied to breast cytology”, Proceedings of the National Academy of Sciences, U.S.A 87, 9193-9196 (1990); O. L. Mangasarian, R. Setiono, and W. H. Wolberg, “Pattern recognition via linear programming: Theory and application to medical diagnosis” in “Large-scale numerical optimization”, Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1999, pp. 22-30; K. P. Bennett and O. L. Mangasarian, “Robust linear programming discrimination of two linearly inseparable sets”, Optimization Methods and Software 1, 23-34 (1992) (Gordon and Breach Science Publishers).
- (47) K. Lakshminarayan, S. A. Harp, R. Goldman, and T. Samad, “Imputation of missing data using machine learning techniques”, KDD-96 Proceedings (1996).
- (48) Jose M. Jereza, Ignacio Molinab, Pedro J. Garcia-Laencinac, Emilio Albad, Nuria Ribellesd, Miguel Martin, and Leonardo Franco, “Missing data imputation using statistical and machine learning methods in a real breast cancer problem”, Artificial Intelligence in Medicine 50, Issue 2, Pages 105-115 (2010).
- (49) W. Kruskal and W. A. Wallis, “Use of ranks in one-criterion variance analysis”, Journal of the American Statistical Association. 47, 583-621 (1952).
- (50) Student (pseudonym of William Sealy Gosset), “The Probable Error of a Mean”, Biometrika 6, 1-25 (1908).
- (51) H. Levene, “Robust tests for equality of variances”, in I. Olkin, S. S. Ghurye, W. Hoeffding, W. G. Madow, and H. B. Mann, “Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling”, Stanford University Press, pages 278-292 (1960).
- (52) https://en.wikipedia.org/wiki/Accuracy_paradox
- (53) J. Cohen, “A coefficient of agreement for nominal scales”. Educational and Psychological Measurement 20 (1), 37-46.(1960); N. C. Smeeton, “Early History of the Kappa Statistic”, Biometrics. 41, 795 (1985).
- (54) D. M. W. Powers, “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation”, Journal of Machine Learning Technologies 2 (1): 37-63 (2011).
- (55) B. W. Matthews,“Comparison of the predicted and observed secondary structure of T4 phage lysozyme”. Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2): 442-451 (1975).
- (56) R. Sutton and A. Barto,“Reinforcement Learning: An Introduction”, MIT Press (1998).
- (57) J. Epperson, “On the Runge example”, Amer. Math. Monthly. 94, 329-341 (1987).
- (58) In the “Heart” data set, as the number of anchor points increases the accuracy decreases due to overfitting. When the number of anchor points (i.e., the number of unknown coefficients in Eq. (1)) becomes equal to and then exceeds the number of training points in the data set, the system undergoes a transition to an under-determined system. This change to an under-determined “phase” is evinced by numerous metrics (including Fig. (27)).
- (59) D. L. Stein and C. M. Newman, “Spin Glasses and Complexity” (Princeton University Press, 2013).
- (60) J. E. Avron, G. Roepstorff, and L. S. Schulman, “Ground state degeneracy and ferromagnetism in a spin glass”, J. Stat. Phys. 26, 25 (1981).
- (61) M.-S. Vaezi, G. Ortiz, M. Weigel, and Z. Nussinov, “Binomial Spin Glass”, Phys. Rev. Lett. 121, 080601 (2018).
- (62) T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel Methods in Machine Learning”, The Annals of Statistics 36, 1171-1220 (2008).
- (63) Tin Kam Ho, “Random Decision Forests”, Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14-16, 278-282 (1995); Tin Kam Ho, “The Random Subspace Method for Constructing Decision Forests”, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8): 832-844 (1998); Ramón Díaz-Uriarte and Sara Alvarez de Andrés, “Gene selection and classification of microarray data using random forest”, BMC Bioinformatics 7, 3 (2006); Trevor Hastie, Robert Tibshirani, and Jerome Friedman, “The Elements of Statistical Learning” (2nd ed.), Springer (2008) (ISBN 0-387-95284-5).
- (64) Leo Breiman, “Bagging predictors”, Machine Learning 24, 123-140 (1996), doi:10.1007/BF00058655.
- (65) Robert Schapire, “The strength of weak learnability”, Machine Learning 5, 197-227 (1990), doi:10.1007/BF00116037; Yoav Freund and Robert E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”, Journal of Computer and System Sciences 55, 119-139 (1997), doi:10.1006/jcss.1997.1504.
- (66) Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan, “An Introduction to MCMC for Machine Learning”, Machine Learning 50, 5-43 (2003).
- (67) T. Mazaheri, Bo Sun, A. Thind, J. Scher-Zagier, D. Magee, P. Ronhovde, T. Lookman, R. Mishra, and Z. Nussinov, “Stochastic Replica Voting Machine prediction of stable Perovskite and binary alloys”, https://arxiv.org/pdf/1705.08491.pdf (2017).
- (68) K. W. Moore, A. Pechen, Xiao-Jiang Feng, J. Dominy, V. J. Beltrania, and H. Rabitz, “Why is chemical synthesis and property optimization easier than expected?” Phys. Chem. Chem. Phys. 13, 10048-10070 (2011).
- (69) After the initial appearance of our work (Patrick Chao, Tahereh Mazaheri, Bo Sun, Nicholas B. Weingartner, and Zohar Nussinov, “The Stochastic Replica Approach to Machine Learning: Stability and Parameter Optimization”, https://arxiv.org/pdf/1708.05715.pdf (2017)) and our speculation therein that spatial coordinates may emerge from machine learning, a slightly more recent preprint (Yi-Zhuang You, Zhao Yang, and Xiao-Liang Qi, “Machine Learning Spatial Geometry from Entanglement Features”, https://arxiv.org/pdf/1709.01223.pdf) similarly suggested, from a different perspective and in some illuminating detail, that geometry may emerge from machine learning.