1 Introduction
Machine learning algorithms require train and test instances to be represented using a set of features. For example, in supervised document classification [6], a document is often represented as a vector of its words, and the value of a feature is set to the number of times the corresponding word occurs in that document. However, different features occupy different value ranges, and often one must scale the feature values before any supervised classifier is trained. In our example of document classification, there are both highly frequent words (e.g. stop words) and extremely rare words. Often, the relative difference of a value of a feature is more informative than its absolute value. Therefore, feature scaling has been shown to improve performance in classification algorithms.
Typically, feature values are scaled to a standard range in a preprocessing step before using the scaled features in the subsequent learning task. However, this preprocessing approach to feature value scaling is problematic for several reasons. First, feature scaling is often done in an unsupervised manner without consulting the labels assigned to the training instances. Although this is the only option in unsupervised learning tasks such as document clustering, for supervised learning tasks such as document classification, where we do have access to the label information, we can also use the label information for feature scaling. Second, it is not possible to perform feature scaling as a preprocessing step in the one-pass online learning setting. In one-pass online learning we are allowed to traverse the set of training instances only once. Learning from extremely large datasets such as Twitter streams, or Web-scale learning, calls for algorithms that require only a single pass over the set of training instances. In such scenarios it is not possible to scale the feature values beforehand by using statistics from the entire training set. Third, even if we precompute scaling parameters for a feature, those values might become obsolete in an online learning setting in which the statistical properties of the training instances vary over time. For example, a Twitter text stream regarding a particular keyword might change over time, and the scaling factors computed using old data might not be appropriate for the new data.
We study the problem of dynamically scaling feature values at run time for online learning. The term dynamic feature scaling is used in this paper to refer to the practice of scaling feature values at run time, as opposed to performing feature scaling as a preprocessing step that happens prior to learning. We focus on binary classifiers as a specific example. However, we note that the proposed method can be easily extended to multi-class classifiers. We propose two main approaches for dynamic feature scaling in this paper: (a) Unsupervised Dynamic Feature Scaling (Section 3), in which we do not consider the label information assigned to the training instances for feature scaling, and (b) Supervised Dynamic Feature Scaling (Section 4), in which we consider the label information assigned to the training instances for feature scaling.
All algorithms we propose in this paper can be trained under the one-pass online learning setting, where only a single training instance is provided at a time and only the scale parameters and feature weights are stored in memory. This enables the proposed method to (a) efficiently adapt to the varying statistics in the data stream, (b) compute the optimal feature scales such that the likelihood of the training data under the trained model is maximized, and (c) train from large datasets where batch learning is impossible because of memory requirements. We evaluate the proposed methods in combination with different online learning algorithms using three benchmark datasets for binary classification. Our experimental results show that, interestingly, the much simpler unsupervised dynamic feature scaling method consistently improves all of the online binary classification algorithms we compare, including the state-of-the-art classifier of [6].
2 Related Work
Online learning has received much attention lately because of the necessity to learn from large training datasets such as query logs in a web search engine [22], web-scale document classification or clustering [19], and sentiment analysis on social media [15, 11]. Online learning toolkits that can efficiently learn from large datasets have been made available, such as Vowpal Wabbit (https://github.com/JohnLangford/vowpal_wabbit) and OLL (Online Learning Library, https://code.google.com/p/oll/). Online learning approaches are more attractive than their batch learning counterparts when the training data involved is massive, for two main reasons. First, the entire dataset might not fit into the main memory of a single computer to perform a batch optimization. Although there has been some recent progress in distributed learning algorithms [14, 13, 17] that can distribute the batch optimization process across a series of machines, setting up and debugging such a distributed learning environment remains a complex process. On the other hand, online learning algorithms consider only a small batch (often referred to as a mini-batch in the literature) or, in the extreme case, a single training instance. Therefore, the need for large memory spaces can be avoided with online learning. Second, a batch learning algorithm requires at least one iteration over the entire dataset to produce a classifier. This can be time consuming for large training datasets. On the other hand, online learning algorithms can produce a relatively accurate classifier even after observing a handful of training instances.
Online learning is a vast and active research field, and numerous algorithms have been proposed in prior work to learn classifiers [7, 8, 20, 21, 12, 18]. A detailed discussion of online classification algorithms is beyond the scope of this paper. Some notable algorithms are the passive-aggressive (PA) algorithms [6], confidence-weighted linear classifiers [10] and their multi-class variants [7, 8]. In passive-aggressive learning, the weight vector for the binary classifier is updated only when a misclassification occurs. If the current training instance can be correctly classified using the current weight vector, then the weight vector is not updated. In this regard, the algorithm is considered passive. On the other hand, if a misclassification occurs, then the weight vector is aggressively updated such that it can correctly classify the current training instance with a fixed margin. The passive-aggressive algorithm has consistently outperformed numerous other online learning algorithms across a wide range of tasks. Therefore, it is considered a state-of-the-art online binary classification algorithm. As we demonstrate later, the unsupervised dynamic feature scaling method proposed in this paper further improves the accuracy of the passive-aggressive algorithm. Moreover, active-learning [9, 24] approaches have also been proposed for online classifier learning.
One-Pass Online Learning (OPOL) (also known as stream learning) [11] is a special case of online learning in which the learning algorithm is allowed only a single pass over the set of train instances. Typically, an online learning algorithm requires multiple passes over a training dataset to reach a convergent point. OPOL can be considered as an extreme case where the train batch size is limited to only one instance. The OPOL setting is more restrictive than the classical online learning setting, where a learning algorithm is allowed to traverse the training dataset multiple times. However, OPOL becomes the only possible alternative in the following scenarios.

The number of instances in the training dataset is so large that it is impossible to traverse multiple times over the dataset.

The dataset is in fact a stream where we encounter new instances continuously. For example, consider the situation where we want to train a sentiment classifier from tweets.

The data stream changes over time. In this case, even if we can store old data instances they might not be much of a help to predict the latest trends in the data stream.
It must be noted that OPOL is not the only solution for the first scenario where we have a large training dataset. One alternative approach is to select a subset of examples from the dataset at each iteration and only use that subset for training in that iteration. One promising criterion for selecting examples for training is curriculum learning [1]. In curriculum learning, a learner is presented with easy examples first and gradually with the more difficult examples. However, determining the criteria for selecting easy examples is a difficult problem itself, and the criterion for selecting easy examples might be different from one task to another. Moreover, it is not clear whether we can select easy examples from the training dataset in a sequential manner as required by online learning without consulting the unseen training examples.
The need for OPOL keeps increasing with the large training datasets and data streams we encounter on the Web, such as social feeds. Most online learning algorithms require several passes over the training dataset to achieve convergence. For example, Passive-Aggressive algorithms [6] require at least iterations over the training dataset to converge, whereas for Confidence-Weighted algorithms [10] the number of iterations has been shown to be smaller (ca. ). Our focus in this paper is not to develop online learning algorithms that can classify instances with high accuracy by traversing the dataset only once, but to study the effect of feature scaling in the OPOL setting. To this end, we study both an unsupervised dynamic feature scaling method (Section 3) and several variants of a supervised dynamic feature scaling method (Section 4).
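The passive-aggressive update described earlier in this section can be sketched as follows. This is a minimal illustration that assumes the standard formulation of [6]: labels in {-1, +1}, a hinge loss with unit margin, and a step size equal to the loss divided by the squared norm of the instance. Note that in this formulation the update is triggered whenever the margin is violated, not only on outright misclassification. The function name `pa_update` and the optional cap `C` (the PA-I variant) are naming choices made for this example.

```python
def pa_update(w, x, y, C=None):
    """One passive-aggressive step for a label y in {-1, +1}.

    w and x are lists of floats of equal length. Returns a new weight list.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss with unit margin
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)                     # passive: margin already satisfied
    tau = loss / sq_norm                   # smallest step that fixes the margin
    if C is not None:
        tau = min(C, tau)                  # PA-I style cap on the step size
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

Calling `pa_update` once per instance while iterating over the stream a single time corresponds to the OPOL setting; the uncapped step makes the current instance satisfy the margin exactly.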
3 Unsupervised Dynamic Feature Scaling
In unsupervised dynamic feature scaling, given a feature $x_j$, we compute the mean $\mu_j$ and the standard deviation $\sigma_j$ of the feature and perform an affine transformation as follows:
$$x'_j = \frac{x_j - \mu_j}{\sigma_j} \tag{1}$$
This scaling operation corresponds to a linear shift of the feature values by the mean value of the feature, followed by a scaling by its standard deviation. From a geometric point of view, this transformation shifts the origin to the mean value and then scales the axis corresponding to the $j$-th feature to unit standard deviation. It is widely used in the batch learning setting, in which one can compute the mean and the standard deviation using all the training instances in the training dataset. This is not possible in OPOL, in which we encounter only one instance at a time. However, even in the OPOL setting, we can compute the mean and the standard deviation on the fly and constantly update our estimates of those values as new training instances (feature vectors) are observed. Writing $x_j^{(k)}$ for the value of the $j$-th feature in the $k$-th observed instance, the update equations for the mean and the standard deviation of the $j$-th feature are as follows [16, 4]:
$$\mu_j^{(k)} = \mu_j^{(k-1)} + \frac{x_j^{(k)} - \mu_j^{(k-1)}}{k} \tag{2}$$
$$s_j^{(k)} = s_j^{(k-1)} + \left(x_j^{(k)} - \mu_j^{(k-1)}\right)\left(x_j^{(k)} - \mu_j^{(k)}\right) \tag{3}$$
Here, $s_j^{(k)}$ accumulates the sum of squared deviations from the mean, and the standard deviation estimate after $k > 1$ instances is $\sigma_j^{(k)} = \sqrt{s_j^{(k)}/(k-1)}$.
We use these estimates for the mean and the standard deviation to scale features in Equation 1. The mean and standard deviation are updated throughout the training process.
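The running estimates described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it assumes the Welford-style recurrences discussed in [4], the sample (k - 1) normalization for the variance, and a zero output for features whose variance is still zero. The class name `OnlineStandardizer` and the `eps` threshold are hypothetical.

```python
import math


class OnlineStandardizer:
    """Keeps a running mean and sum of squared deviations per feature."""

    def __init__(self, n_features):
        self.k = 0                          # number of instances seen so far
        self.mean = [0.0] * n_features
        self.s = [0.0] * n_features         # running sum of squared deviations

    def update(self, x):
        """Fold one feature vector into the running statistics."""
        self.k += 1
        for j, v in enumerate(x):
            delta = v - self.mean[j]        # deviation from the old mean
            self.mean[j] += delta / self.k
            self.s[j] += delta * (v - self.mean[j])

    def transform(self, x, eps=1e-12):
        """Scale x with the current estimates (assumed: 0.0 if variance is 0)."""
        out = []
        for j, v in enumerate(x):
            std = math.sqrt(self.s[j] / (self.k - 1)) if self.k > 1 else 0.0
            out.append((v - self.mean[j]) / std if std > eps else 0.0)
        return out
```

In a one-pass loop one would call `update(x)` and then `transform(x)` for every incoming instance, so that only two numbers per feature need to be stored.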
4 Supervised Dynamic Feature Scaling
We define the task of supervised dynamic feature scaling for binary classification in the OPOL setting as follows. We are given a stream of labeled training instances $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, in which the class label $t_n$ of the $n$-th training instance, denoted by a feature vector $\mathbf{x}_n$, is assumed to be either $1$ (positive class) or $0$ (negative class). Furthermore, let us assume that the feature space is $M$-dimensional and that the value of the $j$-th feature of the $n$-th instance in the training data stream is denoted by $x_{n,j}$. In this paper, we consider only real-valued features (i.e. $x_{n,j} \in \mathbb{R}$) because feature scaling is particularly important for real-valued features.
We define the feature scaling function $f_j$ for the $j$-th feature as a function that maps $\mathbb{R}$ to the range $[0, 1]$ as follows:
$$f_j(x_{n,j}) = \frac{1}{1 + \exp\left(-(\alpha_j x_{n,j} + \beta_j)\right)} \tag{4}$$
Here, $\alpha_j$ and $\beta_j$ are the scaling parameters for the $j$-th dimension of the feature space. Several important properties of the feature scaling function defined by Equation 4 are noted. First, the feature transformation function maps all feature values to the range $[0, 1]$ irrespective of the original range of each feature value. For example, one feature might originally be limited to a bounded range, whereas another feature might have values in the full range of $\mathbb{R}$. By scaling each feature into a common range we can concentrate on the relative values of those features without being biased by their absolute values. Second, the scaling parameters $\alpha_j$ and $\beta_j$ are defined on a per-feature basis. This enables us to scale different features using scale parameters appropriate for their value ranges. Third, the linear transformation $\alpha_j x_{n,j} + \beta_j$ within the exponential term of the feature scaling function resembles the typical affine transformations performed in unsupervised feature scaling. For example, assuming the mean and the standard deviation of the $j$-th feature to be respectively $\mu_j$ and $\sigma_j$, in supervised classification, features are frequently scaled to $(x_{n,j} - \mu_j)/\sigma_j$ prior to training and testing. This standardization corresponds to the linear transformation within the exponential term in Equation 4 with values $\alpha_j = 1/\sigma_j$ and $\beta_j = -\mu_j/\sigma_j$.
Then, the posterior probability of $\mathbf{x}_n$ belonging to the positive class is given as follows according to the logistic regression model [3]:
$$P(t_n = 1 \mid \mathbf{x}_n; \mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{1 + \exp\left(-\left(\sum_{j=1}^{M} w_j f_j(x_{n,j}) + b\right)\right)} \tag{5}$$
Here, $w_j$ is the weight associated with the $j$-th feature, $f_j(x_{n,j})$ is its scaled value, and $b$ is the bias term. We arrange the weights $w_j$ and the scaling parameters $\alpha_j$ and $\beta_j$ respectively using vectors $\mathbf{w}$, $\boldsymbol{\alpha}$, and $\boldsymbol{\beta}$.
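The cross-entropy loss function per instance, to which the L2 regularization terms for the weight vector and the scale vectors are added below, can be written as follows:
$$L(\mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\beta}) = -t_n \log y_n - (1 - t_n) \log (1 - y_n) \tag{6}$$
Here, we used $y_n = P(t_n = 1 \mid \mathbf{x}_n; \mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\beta})$ to minimize the cluttering of symbols in Equation 6. To avoid overfitting to training instances and to minimize the distortion of the training instances, we impose L2 regularization on $\mathbf{w}$, $\boldsymbol{\alpha}$, and $\boldsymbol{\beta}$. Therefore, the final objective function that must be minimized with respect to $\mathbf{w}$, $b$, $\boldsymbol{\alpha}$, and $\boldsymbol{\beta}$ is given by
$$E(\mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\beta}) = L(\mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\beta}) + \lambda_w \lVert \mathbf{w} \rVert_2^2 + \lambda_\alpha \lVert \boldsymbol{\alpha} \rVert_2^2 + \lambda_\beta \lVert \boldsymbol{\beta} \rVert_2^2 \tag{7}$$
Here, $\lambda_w$, $\lambda_\alpha$, and $\lambda_\beta$ respectively are the L2 regularization coefficients corresponding to the weight vector $\mathbf{w}$ and the scale vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. Because we consider the minimization of Equation 7 on a per-instance basis, in our experiments we divide the regularization parameters $\lambda_w$, $\lambda_\alpha$, and $\lambda_\beta$ by the total number of training instances $N$ in the dataset, such that we can compare the values of those parameters across datasets of different sizes.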
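By computing the partial derivatives $\partial E/\partial w_j$, $\partial E/\partial b$, $\partial E/\partial \alpha_j$, and $\partial E/\partial \beta_j$ and applying the Stochastic Gradient Descent (SGD) update rule, the following updates can be derived:
$$w_j^{(k+1)} = w_j^{(k)} - \eta_k \left( (y_n - t_n)\, f_j(x_{n,j}) + 2\lambda_w w_j^{(k)} \right) \tag{8}$$
$$b^{(k+1)} = b^{(k)} - \eta_k\, (y_n - t_n) \tag{9}$$
$$\alpha_j^{(k+1)} = \alpha_j^{(k)} - \eta_k \left( (y_n - t_n)\, w_j^{(k)}\, f_j(x_{n,j}) \left(1 - f_j(x_{n,j})\right) x_{n,j} + 2\lambda_\alpha \alpha_j^{(k)} \right) \tag{10}$$
$$\beta_j^{(k+1)} = \beta_j^{(k)} - \eta_k \left( (y_n - t_n)\, w_j^{(k)}\, f_j(x_{n,j}) \left(1 - f_j(x_{n,j})\right) + 2\lambda_\beta \beta_j^{(k)} \right) \tag{11}$$
In Equations 8-11, the superscript $(k)$ denotes the $k$-th update and $\eta_k$ is the learning rate for the $k$-th update; $y_n$ is the posterior probability of Equation 5, $t_n$ is the label, and $f_j(x_{n,j})$ is the scaled feature value of Equation 4. We experimented with both linear and exponential decaying and found linear decaying to perform better for the proposed method. The linear decaying function for $\eta_k$ is defined as follows,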
(12) 
Here, $T$ is the total number of iterations for which the training dataset containing $N$ instances will be traversed. Because we are considering OPOL, we set $T = 1$. The initial learning rate $\eta_0$ is set to throughout the experiments described in the paper. This value of $\eta_0$ was found to produce the best results in our preliminary experiments using development data, which we selected randomly from the benchmark datasets described later in Section 5.
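The supervised update for a single training instance can be sketched as follows. The sketch rests on stated assumptions rather than on the authors' code: the scaling function is taken to be a sigmoid of `alpha * x + beta`, labels are in {0, 1}, the L2 penalties carry no factor of one half, and the learning rate `eta` is passed in by the caller instead of following a particular decay schedule. All function and argument names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fs_forward(x, w, b, alpha, beta):
    """Return the scaled features and the posterior of the positive class."""
    f = [sigmoid(a * xj + bt) for a, xj, bt in zip(alpha, x, beta)]
    y = sigmoid(sum(wj * fj for wj, fj in zip(w, f)) + b)
    return f, y


def fs_objective(x, t, w, b, alpha, beta, lam=(0.0, 0.0, 0.0)):
    """Per-instance cross-entropy plus L2 penalties on w, alpha and beta."""
    _, y = fs_forward(x, w, b, alpha, beta)
    loss = -t * math.log(y) - (1.0 - t) * math.log(1.0 - y)
    sq = lambda v: sum(c * c for c in v)
    return loss + lam[0] * sq(w) + lam[1] * sq(alpha) + lam[2] * sq(beta)


def fs_step(x, t, w, b, alpha, beta, eta, lam=(0.0, 0.0, 0.0)):
    """One SGD step on all parameters; the bias is not regularized."""
    f, y = fs_forward(x, w, b, alpha, beta)
    err = y - t                              # derivative of the loss w.r.t. the score
    new_w, new_alpha, new_beta = [], [], []
    for xj, wj, aj, bj, fj in zip(x, w, alpha, beta, f):
        slope = fj * (1.0 - fj)              # derivative of the sigmoid scaling
        new_w.append(wj - eta * (err * fj + 2.0 * lam[0] * wj))
        new_alpha.append(aj - eta * (err * wj * slope * xj + 2.0 * lam[1] * aj))
        new_beta.append(bj - eta * (err * wj * slope + 2.0 * lam[2] * bj))
    return new_w, b - eta * err, new_alpha, new_beta
```

Only the four parameter groups are kept between calls, which is what makes the procedure compatible with a single pass over the stream.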
Several observations are in order. First, note that the scaling factors $\alpha_j$ and $\beta_j$ distort the original value of the feature $x_{n,j}$. If this distortion is too large, then we might lose the information conveyed by the feature $x_{n,j}$. To minimize the distortion of $x_{n,j}$ because of scaling, we have imposed regularization on both $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. This treatment is similar to introducing slack variables, as is often done in non-separable classification tasks, and imposing a penalty on the total slackness. Of course, the regularization on $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ can be removed simply by setting the corresponding regularization coefficients $\lambda_\alpha$ and $\lambda_\beta$ to zero. Therefore, the introduction of regularization on $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ does not harm the generality of the proposed method. The total number of parameters to train in this model is $3M + 1$, corresponding to $\mathbf{w}$, $b$, $\boldsymbol{\alpha}$, and $\boldsymbol{\beta}$. Note that we must not regularize the bias term $b$ and must let it adjust arbitrarily. This can be seen as a dynamic scaling for the score (i.e. the inner product between $\mathbf{w}$ and the scaled feature vector), although this type of scaling is not feature specific. The sigmoid-based feature scaling function given by Equation 4 is by no means the only function that satisfies the requirement for a scaling function (i.e. maps all feature values to the same range such as $[0, 1]$). However, the sigmoid function has been widely used in various fields of machine learning such as neural networks [23], and has desirable properties such as differentiability and continuity.
Next, we introduce several important variants of Equation 4 and present the update equations for each of those variants. In Section 6, we empirically study the effect of the different variants discussed in the paper. For ease of reference, we name the original formulation given by Equation 4 the FS (Supervised Feature Scaling) method. The objective function given by Equation 7 is convex with respect to $\mathbf{w}$. This can be easily verified by computing the second derivative of the objective function with respect to $w_j$, which becomes
$$\frac{\partial^2 E}{\partial w_j^2} = f_j(x_{n,j})^2\, y_n (1 - y_n) + 2\lambda_w \tag{13}$$
Here, $f_j$ is the scaling function of Equation 4, $y_n$ is the posterior probability of Equation 5, and $\lambda_w$ is the regularization coefficient of $\mathbf{w}$. Because $f_j(x_{n,j})^2 \ge 0$, $0 < y_n < 1$, and $\lambda_w > 0$ hold, the second derivative is positive, which proves that the objective function is convex with respect to $\mathbf{w}$. Likewise, the objective function can be shown to be convex with respect to the bias term $b$. It is interesting to note that the convexity holds irrespective of the form of the scaling function for both $\mathbf{w}$ and $b$ as long as $f_j(x_{n,j}) \neq 0$ is satisfied. If $f_j(x_{n,j}) = 0$ for some value of $x_{n,j}$, then the convexity of $E$ also depends upon $\lambda_w$ not being equal to zero. In the case of sigmoid feature scaling functions, $f_j$ approaches zero only when $\alpha_j x_{n,j} + \beta_j \to -\infty$, so this is irrelevant because feature values are finite in practice. Unfortunately, the objective function is non-convex with respect to $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. Although SGD updates are empirically shown to work well even when the objective function is non-convex, there is no guarantee that the update Equations 8-11 will find the global minimum of the objective function.
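As a worked check of the convexity argument with respect to the weights, the second derivative can be obtained with the chain rule. The sketch below uses assumed notation introduced for illustration: $y_n$ for the logistic posterior, $t_n \in \{0, 1\}$ for the label, $f_j$ for the scaled value of the $j$-th feature, and an L2 penalty $\lambda_w \lVert \mathbf{w} \rVert_2^2$ without a factor of one half.

```latex
\begin{align*}
z_n &= \sum_{j} w_j f_j + b, \qquad
y_n = \frac{1}{1 + e^{-z_n}}, \qquad
\frac{\partial y_n}{\partial z_n} = y_n (1 - y_n) \\
\frac{\partial E}{\partial w_j} &= (y_n - t_n)\, f_j + 2 \lambda_w w_j \\
\frac{\partial^2 E}{\partial w_j^2} &= f_j\, \frac{\partial y_n}{\partial w_j} + 2 \lambda_w
 = f_j^2\, y_n (1 - y_n) + 2 \lambda_w \;\ge\; 2 \lambda_w
\end{align*}
```

The first term is a product of non-negative factors, so the bound holds for any scaling function, which is why the argument does not depend on the sigmoid form.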
4.1 FS1
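In this variant we fix the scaling factor $\alpha_j = 1$, thereby reducing the number of parameters to be tuned. However, this model cannot adjust for the different value ranges of features and can only learn the required shifts. We name this variant FS1; it is given by
$$f_j(x_{n,j}) = \frac{1}{1 + \exp\left(-(x_{n,j} + \beta_j)\right)} \tag{14}$$
The update equations for $w_j$, $b$, and $\beta_j$ are as follows:
$$w_j^{(k+1)} = w_j^{(k)} - \eta_k \left( (y_n - t_n)\, f_j(x_{n,j}) + 2\lambda_w w_j^{(k)} \right) \tag{15}$$
$$b^{(k+1)} = b^{(k)} - \eta_k\, (y_n - t_n) \tag{16}$$
$$\beta_j^{(k+1)} = \beta_j^{(k)} - \eta_k \left( (y_n - t_n)\, w_j^{(k)}\, f_j(x_{n,j}) \left(1 - f_j(x_{n,j})\right) + 2\lambda_\beta \beta_j^{(k)} \right) \tag{17}$$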
Note that although the update Equations 15, 16, and 17 appear to be similar in their form to Equations 8, 9, and 11, the transformation functions in the two sets of equations are different. As discussed earlier under FS, FS1 is also convex with respect to $\mathbf{w}$ and $b$, but non-convex with respect to $\boldsymbol{\beta}$.
4.2 FS2
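We design a convex form of the objective function with respect to all parameters by replacing the sigmoid feature scaling function with a linear combination as follows:
$$f_j(x_{n,j}) = \alpha_j x_{n,j} + \beta_j \tag{18}$$
The class conditional probability is computed using the logistic sigmoid model as
$$P(t_n = 1 \mid \mathbf{x}_n; \mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{1 + \exp\left(-\left(\sum_{j=1}^{M} w_j (\alpha_j x_{n,j} + \beta_j) + b\right)\right)} \tag{19}$$
Then the update equations for $w_j$, $b$, $\alpha_j$, and $\beta_j$ are given as follows:
$$w_j^{(k+1)} = w_j^{(k)} - \eta_k \left( (y_n - t_n)(\alpha_j^{(k)} x_{n,j} + \beta_j^{(k)}) + 2\lambda_w w_j^{(k)} \right) \tag{20}$$
$$b^{(k+1)} = b^{(k)} - \eta_k\, (y_n - t_n) \tag{21}$$
$$\alpha_j^{(k+1)} = \alpha_j^{(k)} - \eta_k \left( (y_n - t_n)\, w_j^{(k)} x_{n,j} + 2\lambda_\alpha \alpha_j^{(k)} \right) \tag{22}$$
$$\beta_j^{(k+1)} = \beta_j^{(k)} - \eta_k \left( (y_n - t_n)\, w_j^{(k)} + 2\lambda_\beta \beta_j^{(k)} \right) \tag{23}$$
Here, we used $y_n = P(t_n = 1 \mid \mathbf{x}_n; \mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\beta})$ to simplify the equations.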
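Moreover, the second-order partial derivatives of the objective function $E$ with respect to $w_j$, $b$, $\alpha_j$, and $\beta_j$ can be computed as follows:
$$\begin{aligned}
\frac{\partial^2 E}{\partial w_j^2} &= y_n (1 - y_n)(\alpha_j x_{n,j} + \beta_j)^2 + 2\lambda_w, &
\frac{\partial^2 E}{\partial b^2} &= y_n (1 - y_n), \\
\frac{\partial^2 E}{\partial \alpha_j^2} &= y_n (1 - y_n)\, w_j^2 x_{n,j}^2 + 2\lambda_\alpha, &
\frac{\partial^2 E}{\partial \beta_j^2} &= y_n (1 - y_n)\, w_j^2 + 2\lambda_\beta.
\end{aligned}$$
From $\lambda_w > 0$, $\lambda_\alpha > 0$, $\lambda_\beta > 0$, and $0 < y_n < 1$ it follows that all of the above-mentioned second-order derivatives are positive, which proves the convexity of the objective function with respect to each of these parameters. We name this convex formulation of the feature scaling method the FS2 method.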
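A single FS2 update can be sketched as below. It is an illustrative sketch under assumptions that are not confirmed by the text: labels in {0, 1}, a linear scaling `alpha * x + beta`, L2 penalties without a factor of one half, and a caller-supplied learning rate `eta`. The function name `fs2_step` and the tuple `lam` of three regularization coefficients are hypothetical.

```python
import math


def fs2_step(x, t, w, b, alpha, beta, eta, lam=(0.0, 0.0, 0.0)):
    """One SGD step for the linear-scaling variant; returns new parameters and y."""
    f = [a * xj + bt for a, xj, bt in zip(alpha, x, beta)]   # linear scaling
    score = sum(wj * fj for wj, fj in zip(w, f)) + b
    # Clamp the score so that math.exp cannot overflow on extreme inputs.
    score = max(-700.0, min(700.0, score))
    y = 1.0 / (1.0 + math.exp(-score))
    err = y - t
    new_w = [wj - eta * (err * fj + 2.0 * lam[0] * wj) for wj, fj in zip(w, f)]
    new_alpha = [aj - eta * (err * wj * xj + 2.0 * lam[1] * aj)
                 for aj, wj, xj in zip(alpha, w, x)]
    new_beta = [bj - eta * (err * wj + 2.0 * lam[2] * bj)
                for bj, wj in zip(beta, w)]
    return new_w, b - eta * err, new_alpha, new_beta, y
```

Because the scaling is linear, no sigmoid slope term appears in the gradients of `alpha` and `beta`, which keeps each coordinate-wise second derivative non-negative.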
4.3 FS3
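Although FS2 is convex, there is an issue regarding the determinability among $\mathbf{w}$, $\boldsymbol{\alpha}$, and $\boldsymbol{\beta}$ because the product between $w_j$ and $\alpha_j$, and the product between $w_j$ and $\beta_j$, appear inside the exponential term in Equation 19. This implies that the probability $P(t_n = 1 \mid \mathbf{x}_n)$ will be invariant under a constant scaling of $\mathbf{w}$, $\boldsymbol{\alpha}$, and $\boldsymbol{\beta}$ (for example, multiplying $\mathbf{w}$ by a constant $c \neq 0$ while dividing $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ by $c$). We can absorb the $w_j$ terms from the objective function into the corresponding $\alpha_j$ and $\beta_j$ terms, thereby both reducing the number of parameters to be trained and eliminating the issue regarding the determinability. We name this variant of the feature scaling method the FS3 method.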
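The invariance mentioned above is easy to confirm numerically. The following sketch assumes the FS2 posterior takes the form of a logistic function applied to the sum of `w_j * (alpha_j * x_j + beta_j)` plus a bias; the function name and the example values are made up for illustration.

```python
import math


def fs2_posterior(x, w, b, alpha, beta):
    """Posterior of the positive class under a linear per-feature scaling."""
    score = sum(wj * (aj * xj + bj) for wj, aj, xj, bj in zip(w, alpha, x, beta)) + b
    return 1.0 / (1.0 + math.exp(-score))


x = [0.3, -1.2, 4.0]
w = [0.5, -1.0, 0.25]
alpha = [2.0, 0.5, 1.0]
beta = [0.1, -0.3, 0.0]
b = 0.2
c = 7.0  # arbitrary non-zero rescaling constant

p_original = fs2_posterior(x, w, b, alpha, beta)
# Multiply the weights by c and divide both scale vectors by c.
p_rescaled = fs2_posterior(x, [c * wj for wj in w], b,
                           [aj / c for aj in alpha],
                           [bj / c for bj in beta])
```

Both calls return the same probability up to floating-point rounding, so the data alone cannot distinguish the two parameter settings; only the regularization terms can.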
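The class conditional probability for FS3 is given by
$$P(t_n = 1 \mid \mathbf{x}_n; b, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{1 + \exp\left(-\left(\sum_{j=1}^{M} (\alpha_j x_{n,j} + \beta_j) + b\right)\right)} \tag{24}$$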
This can be seen as a special case of FS2 where we set and .
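The update equations for FS3 can be derived as follows:
$$b^{(k+1)} = b^{(k)} - \eta_k\, (y_n - t_n) \tag{25}$$
$$\alpha_j^{(k+1)} = \alpha_j^{(k)} - \eta_k \left( (y_n - t_n)\, x_{n,j} + 2\lambda_\alpha \alpha_j^{(k)} \right) \tag{26}$$
$$\beta_j^{(k+1)} = \beta_j^{(k)} - \eta_k \left( (y_n - t_n) + 2\lambda_\beta \beta_j^{(k)} \right) \tag{27}$$
Here, we used $y_n = P(t_n = 1 \mid \mathbf{x}_n; b, \boldsymbol{\alpha}, \boldsymbol{\beta})$ to simplify the equations. Because FS2 is convex and FS3 is a special case of FS2, it follows that FS3 is also convex.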
5 Datasets
To evaluate the performance of the feature scaling methods introduced in Sections 3 and 4, we train and test those methods under the one-pass online learning setting. We use three datasets in our experiments: the heart dataset, the liver dataset, and the diabetes dataset. All three datasets are widely used as benchmark datasets to evaluate binary classification algorithms. Moreover, all three datasets contain real-valued and unscaled feature values, which are appropriate for the current evaluation purpose. All three datasets can be downloaded from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Details of the three datasets are summarized in Table 1.
Dataset  Attributes  Train instances  Test instances 

heart  
liver  
diabetes 
6 Experiments and Results
To compare the performance of the different dynamic feature scaling methods we proposed in the paper, we use those methods to scale features in the following online learning algorithms.
 SGD (Stochastic Gradient Descent):

This method implements logistic regression using stochastic gradient descent. It does not use any feature scaling and uses the original feature values as they are for training a binary classifier. This method demonstrates the lower baseline performance for this task.
 SGD+avg (Stochastic Gradient Descent with Model Averaging):

This method is the same as SGD described above, except that it uses the average weight vector during training and testing. Specifically, it computes the average of the weight vector over the updates and uses this average vector for prediction. By considering the average weight vector instead of the final weight vector we can avoid any bias toward the last few training instances encountered by the online learner. Moreover, it has been shown both theoretically and empirically that consideration of the average weight vector results in faster convergence in online learning [5].
 GN (Unsupervised Dynamic Scaling):

This is the unsupervised dynamic feature scaling method described in Section 3. It trains a binary logistic regression model by scaling the features using the unsupervised approach.
 GN+avg (Unsupervised Dynamic Scaling with Model Averaging):

This is the unsupervised feature scaling method described in Section 3 using the average weight vector for predicting instead of the final weight vector. It trains a binary logistic regression model by scaling the features using the unsupervised approach.
 FS (Supervised Dynamic Feature Scaling):

This is the supervised dynamic feature scaling method described in Section 4.
 FS+avg (Supervised Dynamic Feature Scaling with Model Averaging):

This is the FS method, where we use the average values for all parameters: the weights, the bias term, and the two sets of scaling parameters.
 FS1 (Supervised Dynamic Feature Scaling variant FS1):

This is the method described in Section 4.1.
 FS1+avg (Supervised Dynamic Feature Scaling variant FS1 with Model Averaging):

This is the method described in Section 4.1 with averaged parameter vectors.
 FS2 (Supervised Dynamic Feature Scaling variant FS2):

This is the method described in Section 4.2.
 FS2+avg (Supervised Dynamic Feature Scaling variant FS2 with Model Averaging):

This is the method described in Section 4.2 with averaged parameter vectors.
 FS3 (Supervised Dynamic Feature Scaling variant FS3):

This is the method described in Section 4.3.
 FS3+avg (Supervised Dynamic Feature Scaling variant FS3 with Model Averaging):

This is the method described in Section 4.3 with averaged parameter vectors.
 PA (Passive-Aggressive):

This is the Passive-Aggressive binary linear classification algorithm proposed by [6].
 PA+avg (Passive-Aggressive with Model Averaging):

This is the Passive-Aggressive binary linear classification algorithm proposed by [6] using the averaged weight vector to predict during both training and testing stages.
 PA1 (Passive-Aggressive variant 1):

This is the Passive-Aggressive PA-I version of the binary linear classification algorithm proposed by [6].
 PA1+avg (Passive-Aggressive variant 1 with Model Averaging):

This is the Passive-Aggressive PA-I version of the binary linear classification algorithm proposed by [6] using the averaged weight vector to predict during both training and testing stages.
 PA2 (Passive-Aggressive variant 2):

This is the Passive-Aggressive PA-II version of the binary linear classification algorithm proposed by [6].
 PA2+avg (Passive-Aggressive variant 2 with Model Averaging):

This is the Passive-Aggressive PA-II version of the binary linear classification algorithm proposed by [6] using the averaged weight vector to predict during both training and testing stages.
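The model averaging used by the +avg methods in the list above can be maintained incrementally, without storing past weight vectors. The sketch below is one plausible way to do this and is not taken from the paper: it keeps a uniform running mean of the parameter vector after every update. The class name `AveragedWeights` is hypothetical.

```python
class AveragedWeights:
    """Uniform running average of a parameter vector over updates."""

    def __init__(self, dim):
        self.count = 0                  # number of vectors folded in so far
        self.avg = [0.0] * dim

    def add(self, w):
        """Fold the current weight vector into the average and return it."""
        self.count += 1
        self.avg = [a + (wj - a) / self.count for a, wj in zip(self.avg, w)]
        return self.avg
```

After each SGD or passive-aggressive step one would call `add(w)` and use the returned vector for prediction, which dampens the influence of the last few instances in the stream.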
Algorithm  Train Accuracy  Test Accuracy  Best Parameters 

SGD  
SGD+avg  
GN  
GN+avg  
FS  
FS+avg  
FS1  
FS1+avg  
FS2  
FS2+avg  
FS3  
FS3+avg  
PA  
PA+avg  
PA1  
PA1+avg  
PA2  
PA2+avg 
Algorithm  Train Accuracy  Test Accuracy  Best Parameters 

SGD  
SGD+avg  
GN  
GN+avg  
FS  
FS+avg  
FS1  
FS1+avg  
FS2  
FS2+avg  
FS3  
FS3+avg  
PA  
PA+avg  
PA1  
PA1+avg  
PA2  
PA2+avg 
Algorithm  Train Accuracy  Test Accuracy  Best Parameters 

SGD  
SGD+avg  
GN  
GN+avg  
FS  
FS+avg  
FS1  
FS1+avg  
FS2  
FS2+avg  
FS3  
FS3+avg  
PA  
PA+avg  
PA1  
PA1+avg  
PA2  
PA2+avg 
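We measure train and test classification accuracy for each of the above-mentioned algorithms. Classification accuracy is defined as follows:
$$\text{Accuracy} = \frac{\text{number of correctly classified instances}}{\text{total number of instances}} \tag{28}$$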
Note that all three benchmark datasets described in Section 5 are balanced (i.e. contain equal numbers of positive and negative train/test instances). Therefore, a method that randomly classifies test instances would obtain an accuracy of $50\%$. The experimental results for the heart, liver, and diabetes datasets are shown respectively in Tables 2, 3, and 4.
We vary the values for the numerous parameters in a predefined set of values for each parameter and experiment with all possible combinations of those values. For the three L2 regularization coefficients (for the weight vector and the two scale vectors) we experiment with the values in the set . For the aggressiveness parameter $C$ in passive-aggressive algorithms we chose from the set . In each dataset, we randomly set aside th of all training data for validation purposes. We search for the parameter values for each algorithm that produce the highest accuracy on the validation dataset. Next, we fix those parameter values and evaluate on the test portion of the corresponding dataset. The best parameter values found through the search procedure are shown in the fourth column in Tables 2-4. Online learning algorithms have been shown to be sensitive to the order in which training examples are presented to them. Following the suggestions in prior work, we randomize the sequence of training data instances during training [2]. All results shown in the paper are the average of random initializations.
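The exhaustive search over parameter combinations described above can be sketched as follows. The helper is hypothetical: `grid` maps a parameter name to its candidate values, and `evaluate` stands for any function that trains a model with the given parameters and returns validation accuracy. The candidate values used in any call are placeholders, not the sets used in the experiments.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in grid and keep the best-scoring one."""
    names = sorted(grid)                       # fixed order for reproducibility
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)               # e.g. accuracy on held-out data
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The selected parameters would then be fixed and the model evaluated once on the test portion, so that the test data never influences the choice.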
As can be seen from Tables 2, 3, and 4, the unsupervised dynamic feature scaling methods (GN and GN+avg) consistently outperform the joint supervised feature scaling methods and the PA algorithms. The model-averaged version of the unsupervised dynamic feature scaling method (GN+avg) shows better performance than its counterpart that does not perform model averaging (GN) in two out of the three datasets. Compared to the unsupervised dynamic feature scaling methods (GN and GN+avg), the supervised dynamic feature scaling methods (FS, FS1, FS2, and FS3) report lower test accuracies. Compared to the unsupervised dynamic feature scaling methods, the number of parameters that must be estimated from labeled data is larger in the supervised dynamic feature scaling methods. Although the unsupervised dynamic feature scaling method requires us to estimate the mean and standard deviation from train data, those parameters can be estimated without using the label information in the training instances. Therefore, the unsupervised dynamic feature scaling is less likely to overfit to the train data, which results in better performance.
Recall that SGD and SGD+avg do not perform any dynamic feature scaling and demonstrate the level of accuracy that we would obtain if we had not performed feature scaling. In all datasets, the GN and GN+avg methods significantly outperform (according to a two-tailed paired t-test under confidence level) the SGD counterparts, showing the effectiveness of feature scaling when training binary classifiers.
Among the variants of the proposed FS methods, the FS2 method reports the best performance. We believe that this can be attributed to the convexity of the objective function. Because we are allowed only a single pass over the training dataset in the OPOL setting, convergence becomes a critical issue compared to the classical online learning setting, where the learning algorithm traverses the dataset multiple times. Convex functions can be optimized relatively easily using gradient methods compared to non-convex functions. The FS3 method, which constrains the parameters in the FS2 method, shows poor performance in our experiments. Specifically, FS3 absorbs the weight parameters into the scaling parameters in the FS2 method. However, the experimental results show that we should keep the two sets of parameters separate. In our future work, we plan to study other possible ways to reduce the number of parameters in the supervised dynamic feature scaling methods in order to reduce the effect of overfitting. Among the three datasets, the performance differences of the methods compared are least significant on the diabetes dataset. In fact, of the methods report the same test accuracy on this dataset and learn the same classification model. However, the model-averaged version of the unsupervised dynamic feature scaling method (GN+avg) outperforms all the methods compared even on the diabetes dataset, which shows its ability to perform well even in situations where other methods cannot.
To study the behavior of the different learning algorithms during train time, we compute the cumulative number of errors. The cumulative number of errors represents the total misclassification errors encountered up to the current train instance. In a one-pass online learning setting, we must continuously both train the classifier and apply it to classify new instances on the fly. Therefore, a method that obtains a lower number of cumulative errors is desirable. To compare the different methods described in the paper, we plot the cumulative number of errors against the total number of training instances encountered, as shown in Figures 1 and 2, respectively for the heart and liver datasets. During training, we use the weight vector (or the averaged weight vector for the +avg methods) to classify the current training instance, and if it is misclassified by the current model, then it is counted as an error. The 45 degree line in each plot corresponds to the situation where all instances encountered during training are misclassified. The curves of all algorithms must lie on or below this line. To avoid cluttering, we only show the cumulative error curves for the following six methods: FS2, FS2+avg, SGD, SGD+avg, GN, and GN+avg. Overall, we see that the unsupervised dynamic feature scaling methods GN and GN+avg stand out among the others and report lower numbers of cumulative errors.
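The cumulative error curve described above follows a predict-then-update protocol, which can be sketched as below. The function is a hypothetical helper: `predict` and `update` stand for whatever online learner is being evaluated, and the stream yields pairs of a feature vector and its label.

```python
def cumulative_errors(stream, predict, update):
    """Return the running count of mistakes made before each update."""
    errors, curve = 0, []
    for x, t in stream:
        if predict(x) != t:     # classify with the model trained so far
            errors += 1
        curve.append(errors)
        update(x, t)            # only then learn from the labeled instance
    return curve
```

Since at most one error is added per instance, the returned curve can never rise above the 45 degree line mentioned in the text.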
7 Conclusion
We studied the problem of feature scaling in one-pass online learning (OPOL) of binary linear classifiers. In OPOL, a learner is allowed to traverse a dataset only once. We presented both supervised and unsupervised approaches to dynamically scale features under the OPOL setting. We evaluated different learning methods using three popular datasets. Our experimental results show that the unsupervised approach significantly outperforms the supervised approaches and improves the classification accuracy of a state-of-the-art online learning algorithm proposed by [6]. Among the several variants of the supervised feature scaling approach we evaluated, the convex formulation performed best. In the future, we plan to explore other forms of feature scaling functions and their effectiveness in numerous online learning algorithms proposed for classification.
References
 [1] Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML’09, pages 41 – 48, 2009.
 [2] Dimitri P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Technical Report LIDS 2848, MIT, 2010.
 [3] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

 [4] Tony F. Chan, Gene H. Golub, and Randall J. LeVeque. Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician, 37:242 – 247, 1983.
 [5] Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP’02, pages 1 – 8, 2002.
 [6] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551 – 585, 2006.
 [7] Koby Crammer, Mark Dredze, and Alex Kulesza. Multi-class confidence weighted algorithms. In EMNLP’09, pages 496 – 504, 2009.
 [8] Koby Crammer, Mark Dredze, and Fernando Pereira. Exact convex confidence-weighted learning. In NIPS’08, 2008.
 [9] Mark Dredze and Koby Crammer. Active learning with confidence. In ACL’08, pages 233 – 236, 2008.
 [10] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In ICML’08, pages 264 – 271, 2008.
 [11] Mark Dredze, Tim Oates, and Christine Piatko. We’re not in Kansas anymore: Detecting domain changes in streams. In EMNLP’10, pages 585 – 595, 2010.
 [12] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT’10, 2010.
 [13] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Distributed dual averaging in networks. In NIPS’10, 2010.
 [14] Siddharth Gopal and Yiming Yang. Distributed training of large-scale logistic models. In ICML’13, 2013.
 [15] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-dependent Twitter sentiment classification. In ACL’11, pages 151 – 160, 2011.
 [16] Robert F. Ling. Comparison of several algorithms for computing sample means and variances. Journal of the American Statistical Association, 69(348):859 – 866, 1974.
 [17] Chao Liu, Hung-chih Yang, Jinliang Fan, Li-wei He, and Yi-Min Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In WWW’10, pages 681 – 690, 2010.
 [18] Justin Ma, Alex Kulesza, Mark Dredze, Koby Crammer, Lawrence K. Saul, and Fernando Pereira. Exploiting feature covariance in high-dimensional online learning. In AISTATS’10, 2010.
 [19] Omid Madani and Jian Huang. Large-scale many-class prediction via flat techniques. In Large-Scale Hierarchical Classification Workshop of ECIR, 2010.
 [20] Avihai Mejer and Koby Crammer. Confidence in structured-prediction using confidence-weighted models. In EMNLP’10, pages 971 – 981, 2010.
 [21] Avihai Mejer and Koby Crammer. Confidence estimation in structured prediction. Technical Report 798, Center for Communication and Information Technologies, 2011.
 [22] Patrick Pantel, Thomas Lin, and Michael Gamon. Mining entity types from query logs via user intent modeling. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 563–571, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
 [23] Guoqiang Peter Zhang. Neural networks for classification: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 30(4):451 – 462, November 2000.
 [24] Peilin Zhao and Steven C. H. Hoi. OTL: A framework of online transfer learning. In ICML’11, 2011.