1 Introduction
Global spending on the Internet of Things (IoT) is expected to reach 1 trillion US dollars by 2021, according to a recent spending report IoTSpendingGuide . Machine Learning (ML) for IoT is a pressing issue from both an economic perspective and an algorithm development perspective. Modern ML algorithms are expensive in terms of latency, memory and energy. While these resource demands are not limiting in a cloud computing environment, they pose serious problems for ML on mobile and edge devices. The current trend in IoT research is to minimize the cost, memory and energy associated with inference. Since training is substantially more costly than inference, models are usually trained in the cloud before being modified and deployed to edge devices.
However, there are distinct advantages to training on the edge. Since data is generally collected from edge and mobile devices, the data is inevitably transmitted to the cloud or some other compute-intensive source for training. Even with compression, data transfer is an energy-intensive procedure that consumes substantial resources. Data transfer also introduces privacy vulnerabilities. In IoTnetwork , the authors argue that it is significantly more efficient to do computations on the device, rather than transmitting data out of the edge. The computational capacity of mobile and edge devices has grown significantly, but memory is limited and large data transfer is prohibitive due to energy costs.
Foundations of Training ML Models on the Edge:
In this paper, we lay the foundation for training machine learning models directly on edge devices. There are few existing techniques because training at the edge involves several challenges. First, the data must be compressed to an extremely small size so that it can comfortably fit on the device. Due to resource limitations, it is impossible to explicitly store the entire dataset. The compression technique must be one-pass and computationally efficient. The compressed summary, or sketch, of the data must be updateable with new information collected by the device. This tiny sketch should be sufficient to train the ML model. We also anticipate that the sketches must be combined across multiple devices to train a larger, more powerful model. One can imagine a scenario where IoT devices propagate their sketches along the edges of a communication network, updating their models and passing the information forward. This type of system requires a mergeable summary rabkin2014aggregation .
Formal Setting (Sketches for Regression):
We focus on compressed and distributed empirical risk minimization. Given an $N$-point dataset $\mathcal{D}$ of $d$-dimensional examples observed in the streaming setting fiat1998online , the goal is to construct a small structure $\mathcal{S}$ that can estimate the regression loss over $\mathcal{D}$. The size of $\mathcal{S}$ must scale well with the number of features $d$ and the size $N$ of $\mathcal{D}$. Ideally, the sketch should occupy only a few megabytes (MB) of memory and be small enough to transmit over a network. $\mathcal{S}$ should also be a mergeable summary agarwal2013mergeable . If we construct a sketch $\mathcal{S}_1$ of a dataset $\mathcal{D}_1$ and $\mathcal{S}_2$ of $\mathcal{D}_2$, we should be able to merge $\mathcal{S}_1$ with $\mathcal{S}_2$ to get a sketch of the combined dataset $\mathcal{D}_1 \cup \mathcal{D}_2$. These properties ensure that the sketch is useful in distributed streaming settings.
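The mergeability requirement can be illustrated with a toy count-based summary. This is only a sketch under stated assumptions, not the construction used in the paper: `make_hash` and `build_sketch` are hypothetical helpers, the bucket function is a simple sign-of-random-projection hash, and both devices are assumed to share the same hash seed.

```python
import numpy as np

def make_hash(dim, bits, seed):
    """Hypothetical shared hash: maps a vector to a bucket in [0, 2**bits)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((bits, dim))
    weights = 1 << np.arange(bits)
    def h(x):
        sign_bits = ((planes @ x) > 0).astype(np.int64)
        return int(sign_bits @ weights)
    return h

def build_sketch(data, h, n_buckets):
    """Count how many examples fall into each bucket (one pass over the data)."""
    counts = np.zeros(n_buckets, dtype=np.int64)
    for x in data:
        counts[h(x)] += 1
    return counts

rng = np.random.default_rng(0)
d1 = rng.standard_normal((500, 8))   # data seen by device 1
d2 = rng.standard_normal((300, 8))   # data seen by device 2

h = make_hash(dim=8, bits=4, seed=42)   # the seed must be shared by all devices
s1 = build_sketch(d1, h, 16)
s2 = build_sketch(d2, h, 16)

merged = s1 + s2                                    # merge by addition
s_union = build_sketch(np.vstack([d1, d2]), h, 16)  # sketch of the combined data
```

Because the summary consists only of counts, adding two count arrays gives exactly the array that a single device would have built from the combined dataset.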
1.1 Related Work
Regression models are perhaps the most well-known embodiment of empirical risk minimization, having been independently studied by the machine learning, statistics, and computer science communities. Existing work on compressed regression is largely motivated by prohibitive runtime costs for large, high-dimensional linear models and prohibitive storage costs for large data matrices. The canonical least-squares solution requires computation and storage that scales quadratically with the number of dimensions $d$ and linearly with $N$, leading to a proliferation of approximate solutions. Approximations generally fall into two categories: sketching and sampling. One may either apply linear projections to reduce the dimensionality of the problem or apply sampling to reduce $N$. Regardless of the approach taken, there is a well-known information-theoretic lower bound on the space required to solve the problem to within a given approximation of the optimal loss clarkson2009numerical .
Sketch-based approaches to online compressed regression approximate the large data matrix with a smaller representation. Such algorithms are agnostic about the data in that they rely on subspace embeddings rather than structure within the data matrix to provide a reasonable size reduction. In clarkson2009numerical , the authors propose a sketch that attains the memory lower bound for compressed linear regression. The sketch stores a random linear projection that can recover an approximation to the regression model. Random Gaussian, Hadamard, Haar, and Bernoulli sign matrices have been used in this framework with varying degrees of theoretical and practical success dobriban2019asymptotics . While these sketches can only estimate the L2 loss, the “sketch-and-solve” technique has been extended to regression with the L1 objective sohler2011subspace and several other linear problems.
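The sketch-and-solve idea can be illustrated with a dense Gaussian projection. This is a generic illustration, not the specific sketch of clarkson2009numerical ; the problem sizes and the number of sketch rows `m` are assumed values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 2000, 5, 200          # m = number of sketch rows (assumed value)

# Synthetic regression problem: b = A @ theta_true + noise
A = rng.standard_normal((N, d))
theta_true = rng.standard_normal(d)
b = A @ theta_true + 0.1 * rng.standard_normal(N)

# Gaussian sketching matrix; the 1/sqrt(m) scaling preserves norms in expectation
S = rng.standard_normal((m, N)) / np.sqrt(m)
SA, Sb = S @ A, S @ b           # only these m-row summaries need to be stored

theta_full, *_ = np.linalg.lstsq(A, b, rcond=None)      # exact least squares
theta_sketch, *_ = np.linalg.lstsq(SA, Sb, rcond=None)  # solve on the sketch

def l2_loss(theta):
    return float(np.mean((A @ theta - b) ** 2))

# Ratio of the sketched solution's loss to the optimal loss (at least 1)
ratio = l2_loss(theta_sketch) / l2_loss(theta_full)
```

Since `S @ A` is linear in the rows of `A`, the summary can be accumulated one example at a time, which is what makes this family of sketches compatible with streaming data.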
Strategies based on sampling are attractive in that they can dramatically reduce the size of the data that is used with linear algebra routines. Sampling methods can also be applied to more loss functions than sketching methods. The simplest and fastest approach is random sampling. Unfortunately, random sampling has undesirable worst-case performance as it can easily miss important samples that contribute substantially to the model. As a result, adaptive sampling procedures based on leverage score sampling have been developed for the regression problem. Leverage scores can be approximated online cohen2016online , but are somewhat computationally expensive in practice.
Our Contribution:
In this work, we propose a sketch with a large set of desirable practical properties that can perform empirical risk minimization with a small memory footprint. The count-based nature of our Sketches Toward Online Risk Minimization (STORM) enables optimizations that are not possible for other methods. Specifically, our contributions are as follows:

We propose an online sketching algorithm (STORM) that can estimate a class of loss functions using only integer count values. Our sketch is appropriate for distributed data settings because it is small, embarrassingly parallel, mergeable via addition, and able to work with streaming data.

We characterize the set of loss functions that can be approximated with STORM. The price of efficient loss estimation is a restriction on the kinds of losses we can use. While the class of STORM-approximable losses does not include popular regression and classification losses, we derive STORM-approximable surrogate losses for linear regression and linear max-margin classification.

We show how to perform optimization over STORM sketches using derivative-free optimization and linear optimization. We provide experiments with the linear regression objective showing that STORM-approximable surrogates are a resource-efficient way to train models in distributed settings.
2 Background
2.1 Empirical Risk Minimization
In the standard statistical learning framework, we are given a training dataset $\mathcal{D} = \{(x_1, y_1), \dots, (x_N, y_N)\}$ of examples and asked to select a function that can predict $y$ given $x$. In this paper, we will present our algorithms for the data space $\mathbb{R}^d$ and output space $\mathbb{R}$, but our results also hold for other metric spaces. The learning problem is to select a hypothesis $h_\theta$ (a function parameterized by $\theta$) that yields good predictions, as measured by a loss function $\ell$. In empirical risk minimization (ERM), we select $\theta$ to minimize the average loss on the training set.
Linear Regression:
We will often use linear regression as an example. Linear regression is an embodiment of empirical risk minimization where $h_\theta(x) = \langle \theta, x \rangle$. For most applications, $\theta$ is found using the least-squares or L2 loss: $\ell(h_\theta(x), y) = (\langle \theta, x \rangle - y)^2$. The unconstrained L2 loss is smooth and strongly convex, with desirable convergence properties. The parameter $\theta$ is found using ERM, either using gradient descent or via a closed-form solution from the matrix formulation of the model. For our discussion, it will be important to express the loss in terms of the concatenated vector $[x, y]$:
$$(\langle \theta, x \rangle - y)^2 = \langle [\theta, -1], [x, y] \rangle^2$$
2.2 Locality Sensitive Hashing
Locality sensitive hashing (LSH) is a technique from computational geometry originally introduced for efficient approximate nearest neighbor search. An LSH family $\mathcal{H}$ is a family of functions $h$ with the following property: under $h$, similar points have a higher probability than dissimilar points of having the same hash value, or colliding (i.e. $h(x) = h(y)$). The notion of similarity is usually based on the distance measure of the metric space. For instance, there are LSH families for the Jaccard broder1997minhash , Euclidean datar2004locality ; dasgupta2011fast and angular distances charikar2002similarity . If we allow asymmetric hash constructions (i.e. $h_1(x) = h_2(y)$), $x$ and $y$ may collide based on other properties such as their inner product shrivastava2014mips . To accommodate asymmetric LSH, we use the following definition of LSH. Our definition is a strict generalization of the original indyk1998approximate , which can be recovered by setting probability thresholds.
Definition 1.
We say that a hash family $\mathcal{H}$ is locality-sensitive with collision probability $k(\cdot, \cdot)$ if for any two points $x$ and $y$ in $\mathbb{R}^d$, $h(x) = h(y)$ with probability $k(x, y)$ under a uniform random selection of $h$ from $\mathcal{H}$.
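One example of a symmetric LSH family is the signed random projection (SRP) family for the angular distance goemans1995improved ; charikar2002similarity . The SRP family is the set of functions $h(x) = \mathrm{sign}(w^\top x)$, where $w \sim \mathcal{N}(0, I_d)$. The SRP collision probability is
$$k(x, y) = 1 - \frac{1}{\pi} \cos^{-1}\left( \frac{\langle x, y \rangle}{\|x\|_2 \|y\|_2} \right)$$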
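The SRP collision probability can be checked empirically. The following is a minimal sketch, assuming the standard closed form $1 - \frac{1}{\pi}\cos^{-1}$ of the cosine similarity; the two vectors and the number of sampled hash functions are arbitrary choices for illustration.

```python
import numpy as np

def srp_collision_probability(x, y):
    """Closed-form SRP collision probability: 1 - angle(x, y) / pi."""
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, 1.0, 1.0])

# Each row of W is the Gaussian vector w of one independently drawn SRP function
W = rng.standard_normal((20000, 3))
collisions = np.sign(W @ x) == np.sign(W @ y)

empirical = collisions.mean()                # fraction of hash functions that collide
theory = srp_collision_probability(x, y)     # closed-form value
```

The empirical collision rate over many independently drawn hash functions should agree with the closed form up to sampling noise.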
The inner product hash shrivastava2014mips is an example of an asymmetric LSH. Suppose we replace $x$ with $[x, \sqrt{1 - \|x\|_2^2}, 0]$ and $y$ with $[y, 0, \sqrt{1 - \|y\|_2^2}]$ in the SRP function. This procedure essentially uses different hash functions for $x$ and $y$, but the collision probability is now a monotone function of the inner product $\langle x, y \rangle$. In particular, $k(x, y)$ is the same as for SRP but without the normalization by $\|x\|_2$ and $\|y\|_2$. Here, we implicitly assume that $x$ and $y$ are inside the unit sphere, so we often scale the dataset when using this inner product hash in practice.
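A small numerical check of an asymmetric inner product hash follows. It is a sketch under an explicit assumption: data vectors are augmented as $[x, \sqrt{1 - \|x\|_2^2}, 0]$ and query vectors as $[q, 0, \sqrt{1 - \|q\|_2^2}]$ before applying SRP, so that both augmented vectors have unit norm and their inner product equals $\langle x, q \rangle$. The helper names `augment_data` and `augment_query` are hypothetical.

```python
import numpy as np

def augment_data(x):
    """Assumed data-side transform: [x, sqrt(1 - ||x||^2), 0]."""
    return np.concatenate([x, [np.sqrt(1.0 - x @ x), 0.0]])

def augment_query(q):
    """Assumed query-side transform: [q, 0, sqrt(1 - ||q||^2)]."""
    return np.concatenate([q, [0.0, np.sqrt(1.0 - q @ q)]])

rng = np.random.default_rng(1)
x = np.array([0.3, -0.2, 0.5])   # both vectors lie inside the unit sphere
q = np.array([0.1, 0.4, 0.6])
ax, aq = augment_data(x), augment_query(q)

# Apply the same SRP functions to the two differently augmented vectors
W = rng.standard_normal((20000, ax.size))
empirical = np.mean(np.sign(W @ ax) == np.sign(W @ aq))

# Collision probability depends on the raw inner product, with no normalization
theory = 1.0 - np.arccos(x @ q) / np.pi
```

The transform only works when the inputs have norm at most 1, which is why the data is typically rescaled before hashing.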
Collision Probabilities and Sketching:
The collision probability $k(x, y)$ is a positive function of $x$ and $y$. For most LSH functions, $k$ is a function of the distance $d(x, y)$ and is a positive definite kernel function coleman2020race . For asymmetric LSH functions, $k$ can take on nontrivial shapes because $h_1 \neq h_2$. The RACE streaming algorithm luo2018ace ; coleman2020race provides an efficient sketch to estimate the sum
$$\sum_{x \in \mathcal{D}} k(x, q)$$
for a query $q$ when $k$ forms the collision probability of an LSH function. The sketch consists of a sparse 2D array of integers that are indexed with an LSH function. RACE sketches have error guarantees that relate the quality of the approximation to the memory and computation requirements. The error depends mostly on the value of the sum and does not directly depend on $N$.
The RACE sketch can also be released with differential privacy without substantially increasing the error bound coleman2020private . There are two directions toward a private sketch: private LSH functions and private RACE sketches. Private LSH functions can be constructed by adding Gaussian noise to the hash function value for projection-based hash functions Kenthapadi_Korolova_Mironov_Mishra_2013 . This strategy preserves differential privacy for the data attributes of each example in the dataset. Private RACE sketches can be constructed by adding Laplace noise to the count values for each cell in the sketch. This strategy preserves differential privacy at the example-level granularity.
In practice, RACE can estimate the sum within 1% error in a 4 MB sketch coleman2020race when $k$ is a positive definite kernel. To construct a RACE sketch, we create an empty integer array with $R$ rows and $B$ columns. We construct $R$ LSH functions $h_1, \dots, h_R$ with collision probability $k$. We increment column $h_r(x)$ in row $r$ when an element $x$ arrives from the stream. For a query $q$, we estimate the loss by returning the average count value at the indices $[r, h_r(q)]$.
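The construction above can be sketched in a few lines. This is an illustrative implementation under assumptions, not the reference RACE code: the class name `RaceSketch` is hypothetical, each row uses a hash made of `bits` concatenated SRP functions (so the collision probability is the SRP probability raised to the power `bits`), and the array is dense rather than sparse.

```python
import numpy as np

class RaceSketch:
    """Integer count array of shape (rows, 2**bits) indexed by SRP hashes."""

    def __init__(self, dim, rows, bits, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((rows, bits, dim))
        self.weights = 1 << np.arange(bits)
        self.counts = np.zeros((rows, 2 ** bits), dtype=np.int64)
        self.row_ids = np.arange(rows)

    def _hash(self, x):
        # One bucket index per row: the sign pattern of `bits` projections
        sign_bits = ((self.planes @ x) > 0).astype(np.int64)  # (rows, bits)
        return sign_bits @ self.weights

    def add(self, x):
        # Increment column h_r(x) in every row r
        self.counts[self.row_ids, self._hash(x)] += 1

    def query(self, q):
        # Average count at the indices [r, h_r(q)]
        return self.counts[self.row_ids, self._hash(q)].mean()

rng = np.random.default_rng(1)
data = rng.standard_normal((300, 5))
q = rng.standard_normal(5)

sketch = RaceSketch(dim=5, rows=1000, bits=2)
for x in data:                       # one pass over the stream
    sketch.add(x)

# Exact sum of collision probabilities for comparison
cos = data @ q / (np.linalg.norm(data, axis=1) * np.linalg.norm(q))
k = 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
exact = float(np.sum(k ** 2))        # two concatenated SRP bits give k**2
estimate = float(sketch.query(q))
```

The query never touches the original data: it only reads one counter per row, so its cost depends on the number of rows and not on the number of stream elements.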
3 STORM Sketches for Estimating Surrogate Losses
While collision probabilities have been well-studied in the context of improving near-neighbor search gionis1999similarity , we study them from a new perspective. Our goal is to compose useful loss functions from LSH collision probabilities and perform ERM on sketches. So far, RACE sketches have enabled new applications in metagenomics coleman2019diversified and compressed near neighbor search coleman2020neighbor because they can efficiently represent the kernel density estimate. RACE sketches reduce the computation and memory footprint because they replace complicated sampling and indexing procedures with a simple sketch coleman2020race .
We propose RACE-style sketches to approximate the empirical risk for distributed ERM problems. In this context, we query the sketch with the parameter $\theta$ to estimate the empirical risk. This requires some extensions to the RACE sketch. First, we use asymmetric LSH functions to hash the data and $\theta$ with different functions. We also propose methods that apply several LSH functions to $x$ and increment more than one index in each row, adding together collision probability functions. For instance, we propose Paired Random Projections (PRP) for an LSH surrogate linear regression loss. PRP inserts elements into the sketch with two signed random projection (SRP) hashes but queries using only one SRP. Our algorithm to generate Sketches Toward Online Risk Minimization (STORM) is presented in Figure 1 and Algorithm 1.
Intuition:
To understand why STORM sketches contain enough information to perform ERM, consider the task presented in Figure 2. By partitioning the space into random regions, we can find the approximate location of the input data by examining the overlap of densely-populated partitions. We can then identify a good regression model as one that mostly passes through dense regions. We can also solve classification problems by favoring hyperplanes that separate dense regions with different classes. STORM sketches extend this idea by optimizing the model $\theta$ over the counts to find a $\theta$ that collides with many data points. The main challenge is to design an LSH function that has a large (or small) count when $\theta$ is a good model.
[Figure 2: imgs/storm_intuition.png]
Optimization:
Once we have a sketch and an appropriate hash function, we want to optimize the model parameter $\theta$ to minimize the STORM estimate of the empirical risk. The standard ERM technique is to apply gradient descent to find the minimum. Due to the count-based nature of the sketch, we cannot analytically find the gradient. We will resort to derivative-free optimization techniques conn2009introduction , where black-box access to the loss function of interest (or its sharp approximation) is sufficient. Since our focus is not on derivative-free optimization, we employ a simple optimization algorithm that queries the sketch at random points in a sphere around $\theta$. Using only a few (10) cheap loss evaluations, we approximate the gradient and update $\theta$.
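One way to realize the random-query optimizer described above is a finite-difference gradient estimate from points on a sphere. This is a minimal sketch under assumptions, not the exact procedure of the paper: the function names, the step size `lr`, the `radius`, and the quadratic stand-in loss are all illustrative choices. Only black-box loss evaluations are used.

```python
import numpy as np

def estimate_gradient(loss, theta, radius, n_queries, rng):
    """Finite-difference gradient estimate from random points on a sphere."""
    d = theta.size
    base = loss(theta)
    grad = np.zeros(d)
    for _ in range(n_queries):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                    # random unit direction
        # Directional slope along u, pushed back onto the direction u
        grad += (loss(theta + radius * u) - base) / radius * u
    return d * grad / n_queries                   # rescale so E[u u^T] = I / d cancels

def derivative_free_descent(loss, theta0, lr=0.1, radius=0.1,
                            n_queries=10, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float).copy()
    for _ in range(n_iters):
        theta -= lr * estimate_gradient(loss, theta, radius, n_queries, rng)
    return theta

# Stand-in black-box loss; in STORM this would be a sketch query
target = np.array([1.0, -2.0, 3.0])
loss = lambda t: float(np.sum((t - target) ** 2))

theta0 = np.zeros(3)
theta_hat = derivative_free_descent(loss, theta0)
```

Each iteration costs `n_queries + 1` loss evaluations, which matches the regime where every evaluation is a cheap sketch lookup.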
For some STORM sketches, we obtain improvements over standard derivative-free methods with linear optimization. Such methods attempt to place $\theta$ into the optimal set of hash partitions. Linear optimization is possible when the hash function is a projection-based LSH in $\mathbb{R}^d$.
4 Theory: Sketchable Surrogate Loss with Same Minima
We begin with a formal discussion of the families of losses that STORM can approximate. Using the compositional properties of LSH, we can show that STORM can provide an unbiased estimator for a large class of functions. For instance, we can insert elements to the STORM sketch (or use a second sketch) to add collision probabilities. We can also concatenate LSH functions to estimate the product of collision probabilities. Theorem 1 describes the set of functions that STORM can approximate.
Theorem 1.
The set of STORM-approximable functions contains all LSH collision probabilities and is closed under addition, subtraction, and multiplication.
Theorem 1 says that we can approximate any sum and/or product of LSH collision probabilities using one or more STORM sketches. Given the number of known LSH families, the flexibility of asymmetric LSH, and the closure of this set under addition and multiplication, Theorem 1 suggests that the set of STORM-approximable functions is an expressive space of functions. In particular, we show that it contains useful losses for machine learning problems.
4.1 Constructing STORM Approximable Surrogate Loss for Linear Regression
Since we query the STORM sketch with $\theta$, we must be able to express the loss in terms of LSH collision probabilities between $\theta$ and the data $[x, y]$. This condition is not terribly limiting, but it does restrict our attention to hypothesis classes where $\theta$ directly interacts with $x$ and $y$ in a way that can be captured by an LSH function. Linear regression is the simplest example of such a model.
Designing the Loss:
In the non-regularized linear regression objective, $\theta$ interacts with the data via the inner product $\langle [\theta, -1], [x, y] \rangle$. The linear regression loss function is a monotone function of the absolute value of this inner product. The asymmetric hash function discussed in Section 2.2 seems promising, as it depends on the inner product between $[\theta, -1]$ and $[x, y]$. However, its collision probability is monotone in the inner product, not the absolute value. To obtain a surrogate loss with the correct dependence on $|\langle [\theta, -1], [x, y] \rangle|$, observe that the collision probability is monotone decreasing if we hash $-[x, y]$ instead of $[x, y]$. Thus, we can obtain a function that is monotone in $|\langle [\theta, -1], [x, y] \rangle|$ by adding together the collision probability for $[x, y]$ (which is monotone increasing) and for $-[x, y]$ (which is monotone decreasing). We refer to our construction as paired random projections (PRP) because the method can be implemented by hashing $[x, y]$ and $-[x, y]$ with the same SRP function. Rather than update a single location in the sketch, we update the pair of random projection locations. When we query with the vector $[\theta, -1]$, STORM estimates the following surrogate loss for linear regression. Here, $p$ is an integer power of 2 which determines the number of random projections used for the PRP hash function.
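The paired insertion can be sketched as follows. This is an illustration under explicit assumptions rather than the paper's Algorithm 1: the class name `PRPSketch` is hypothetical, plain (normalized) SRP with `p` concatenated bits is used so that the per-example quantity being estimated is $k^p + (1 - k)^p$ with $k$ the SRP collision probability between $[\theta, -1]$ and $[x, y]$, and the values of `rows` and `p` are arbitrary.

```python
import numpy as np

class PRPSketch:
    """Count array updated at the hash of [x, y] and of -[x, y]."""

    def __init__(self, dim, rows, p, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((rows, p, dim + 1))
        self.weights = 1 << np.arange(p)
        self.counts = np.zeros((rows, 2 ** p), dtype=np.int64)
        self.row_ids = np.arange(rows)
        self.n = 0

    def _hash(self, z):
        return ((self.planes @ z) > 0).astype(np.int64) @ self.weights

    def add(self, x, y):
        z = np.append(x, y)
        self.counts[self.row_ids, self._hash(z)] += 1    # location for [x, y]
        self.counts[self.row_ids, self._hash(-z)] += 1   # paired location for -[x, y]
        self.n += 1

    def risk(self, theta):
        q = np.append(theta, -1.0)                       # query with [theta, -1]
        return float(self.counts[self.row_ids, self._hash(q)].mean() / self.n)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 3))
theta_true = np.array([0.5, -0.3, 0.8])
Y = X @ theta_true + 0.05 * rng.standard_normal(200)

sketch = PRPSketch(dim=3, rows=2000, p=4)
for x, y in zip(X, Y):
    sketch.add(x, y)

def exact_surrogate(theta, p=4):
    """Assumed closed form: mean of k**p + (1 - k)**p over the dataset."""
    Z = np.column_stack([X, Y])
    q = np.append(theta, -1.0)
    cos = Z @ q / (np.linalg.norm(Z, axis=1) * np.linalg.norm(q))
    k = 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
    return float(np.mean(k ** p + (1 - k) ** p))

theta = np.zeros(3)
est, exact = sketch.risk(theta), exact_surrogate(theta)
```

Under these assumptions the estimated quantity is smallest when $[\theta, -1]$ is orthogonal to the data vectors $[x, y]$, that is, when the residuals $\langle \theta, x \rangle - y$ are zero.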
Theorem 2.
When , the PRP collision probability is a convex surrogate loss for the linear regression objective such that
The integer is a parameter that determines the number of hyperplanes used by the PRP hash functions. The hyperplanes split the data space into partitions. Intuition suggests that should be large, because we can obtain more information when we have many partitions. However, one can observe from Figure 3(a) that the surrogate loss landscape becomes very flat near the optimum for large , making the function hard to optimize. We find that results in the most strongly convex function in a localized region around the optimum. This can be visualized by examining the gradient, for example, at . Near the optimal regression line, the steepness of the loss basin varies with as shown in the figure 3 (b). This is of practical importance because fewer noisy estimates of are necessary to optimize a strongly convex function with derivativefree optimization methods.
[Figure 3: (a) surrogate loss landscape (imgs/surrogate.pdf); (b) steepness of the loss basin for different values of the PRP parameter (imgs/bestP.pdf)]
Once the STORM sketch with $R$ repetitions and $B$ bins is created, we proceed with optimization. We perform derivative-free optimization as discussed before, with an additional constraint. We initialize $\tilde{\theta}$ to zeros in $d + 1$ dimensions. The additional dimension is due to querying the sketch with $\tilde{\theta} = [\theta, -1]$ rather than $\theta$. We compute the approximate gradient by querying the sketch with equidistant points in a ball around $\tilde{\theta}$ and update the parameter. After each iteration, we project the last dimension of $\tilde{\theta}$ back onto the constraint $\tilde{\theta}_{d+1} = -1$. Please refer to Algorithm 2 for details.
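The constrained procedure can be sketched as below. This is not Algorithm 2 itself but a hedged illustration: the sketch query is replaced by a closed-form stand-in `surrogate_risk` (assuming normalized SRP and a per-example value of $k^p + (1 - k)^p$), the function name `train` and all hyperparameters are hypothetical, and the best iterate seen so far is returned, which is a design choice of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
theta_true = np.array([0.5, -0.3, 0.8])
Y = X @ theta_true                        # noise-free targets for illustration
Z = np.column_stack([X, Y])               # concatenated vectors [x, y]

def surrogate_risk(q, p=4):
    """Stand-in for a sketch query; q is the augmented vector [theta, -1]."""
    cos = Z @ q / (np.linalg.norm(Z, axis=1) * np.linalg.norm(q))
    k = 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
    return float(np.mean(k ** p + (1 - k) ** p))

def train(n_iters=300, lr=0.5, radius=0.05, n_queries=10, seed=1):
    step_rng = np.random.default_rng(seed)
    q = np.zeros(Z.shape[1])              # zeros in d + 1 dimensions
    q[-1] = -1.0                          # enforce the constraint before querying
    best_q, best_risk = q.copy(), surrogate_risk(q)
    for _ in range(n_iters):
        base = surrogate_risk(q)
        grad = np.zeros_like(q)
        for _ in range(n_queries):
            u = step_rng.standard_normal(q.size)
            u /= np.linalg.norm(u)        # query points all lie at distance `radius`
            grad += (surrogate_risk(q + radius * u) - base) / radius * u
        q -= lr * q.size * grad / n_queries
        q[-1] = -1.0                      # project the last dimension back to -1
        risk = surrogate_risk(q)
        if risk < best_risk:              # keep the best iterate seen so far
            best_q, best_risk = q.copy(), risk
    return best_q

q0 = np.append(np.zeros(3), -1.0)
q_hat = train()
theta_hat = q_hat[:-1]                    # first d coordinates are the model
```

Projecting only the last coordinate keeps the query in the form $[\theta, -1]$, so the first $d$ coordinates can be read off directly as the regression parameter.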
4.2 Constructing STORM Approximable Surrogate Loss for Max-Margin Linear Classification
Analogous to linear regression, we can solve other risk minimization problems using STORM. Consider linear hyperplane classifiers of the form $h_\theta(x) = \mathrm{sign}(\langle \theta, x \rangle)$. Most popular losses used to find $\theta$ are classification-calibrated margin losses. A loss function is classification-calibrated, or Bayes consistent, if the optimal hypothesis under the loss is the same as the Bayes optimal hypothesis bartlett2006convexity . We propose the following classification-calibrated loss function.
Theorem 3.
The loss function for the linear hyperplane classifier is a classification-calibrated margin loss and is STORM-approximable.
By optimizing in a similar fashion as with regression, we can train linear classifiers with STORM sketches. The LSH function that implements is the asymmetric inner product hash, but with the argument to the hash function multiplied by . See the appendix for details.
5 Experiments
Table 1: UCI datasets used in the experiments.
Dataset | N | d | Description
airfoil | 1.4k | 9 | Airfoil parameters to predict sound level
autos | 159 | 26 | Automobile prices and information to predict acquisition risk
parkinsons | 5.8k | 21 | Telemonitoring data from Parkinson's patients, with disease progression
Datasets:
We performed experiments on three UCI datasets, described in Table 1. We selected datasets with different dimensions, characteristics and sizes. We mainly consider higher-dimensional regression problems, though we provide qualitative results on simulated 2D regression data to provide intuition about the type of regression models found by STORM.
Baselines:
We compare our method against sampling baselines and sketch-based methods: random sampling, leverage score sampling, and the linear algebra sketch proposed by Clarkson and Woodruff clarkson2009numerical for compressed linear regression. We implement all baselines using the smallest standard data type and compare against a range of parameters.
Experiment setup:
For our regression sketches on STORM, we use PRP with to create the sketch and we vary the number of repetitions. We report results on the training risk, since our objective is to show that the parameters found using STORM are also minimizers of the empirical risk function. We average over 10 runs for our baselines and sketches, where each run has an independentlyconstructed sketch or random sample. Thus, our average is over the random LSH functions used to construct the sketch and the stochastic derivativefree gradient descent instances.
Results:
In Figure 4, we report the mean squared error for our method when compared with baselines at a variety of memory budgets. We observe a double descent phenomenon for our sampling baselines, explaining the peak near the intrinsic dimensionality of the problem. This sample-wise double descent behavior was recently proved for linear regression by nakkiran2019more . STORM does not experience the double descent curve in practice because the entire dataset (not just a subsample) is used to minimize the loss. We perform favorably against baselines in memory regimes affected by double descent and STORM performs competitively in other memory regimes. We also observe that the $\theta$ found using STORM converges to the optimal $\theta$ under least-squares ERM. This validates our theory that PRP provides a surrogate loss.
[Figure 4: mean squared error at a variety of memory budgets for the airfoil (imgs/airfoil.png), autos (imgs/autos.png), and parkinsons (imgs/parkinsons.png) datasets]
We also evaluate the linear regression and classification STORM losses on 2D synthetic data (Figure 5). We generated synthetic datasets and ran the derivativefree optimizer on a STORM sketch for 100 iterations. We used for both experiments, with for regression and for the classification loss.
[Figure 5: STORM regression (imgs/storm.pdf) and classification (imgs/class_qual.png) results on 2D synthetic data]
6 Discussion
STORM is a scalable way to solve ERM problems, with an efficient streaming implementation and quality risk approximation. While STORM provides a pointwise and non-smooth approximation to the empirical risk, we find that few estimators (small $R$) are necessary to obtain a model parameter similar to the one obtained by minimizing the full loss summation. Our sketch-based loss estimators are sufficiently sharp to compete with sampling and linear algebra baselines, while also naturally accommodating regularization, streaming settings and many useful surrogate losses. Given the simplicity and scalability of the sketch, we expect that STORM will enable distributed learning via ERM on edge devices.
7 Broader Impacts
The prospect of retaining and training useful models solely on the edge eliminates the opportunity for data interception or leakage, which is critical to keeping data secure and private. There are significantly reduced privacy concerns if no data is transmitted. This is key to the application and widespread adoption of machine learning in many industries, particularly in industries such as healthcare where there is a large demand for machine learning but similarly great data security concerns, with both moral and significant financial ramifications. An important implication is the prospective elimination of central data repositories. The danger of major data leaks is curbed without a central repository. Furthermore, data transmission is one of the most energy-consuming tasks that edge devices perform, especially if the data is in the form of video or audio. Due to the proliferation of edge devices, reducing the data transmission energy consumption of edge devices is one of the most pressing problems in machine learning. STORM allows models to be trained at the edge with minimal storage, eliminating the privacy risks and energy costs of data transmission and storage.
References
 [1] Why edge computing is critical for the IoT. NetworkWorld, https://www.networkworld.com/article/3234708/why-edge-computing-is-critical-for-the-iot.html.
 [2] Worldwide Internet of Things Spending Guide. https://www.idc.com/getdoc.jsp?containerId=IDC_P29475. Accessed: 2010-09-30.
 [3] Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff M Phillips, Zhewei Wei, and Ke Yi. Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):1–28, 2013.
 [4] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
 [5] Andrei Broder. On the resemblance and containment of documents. IEEE Compression and Complexity of Sequences, 1997.
 [6] Moses Charikar. Similarity estimation techniques from rounding algorithms. STOC, 2002.
 [7] Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 205–214, 2009.
 [8] Michael B Cohen, Cameron Musco, and Jakub Pachocki. Online row sampling. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 2016.
 [9] Benjamin Coleman, Richard G Baraniuk, and Anshumali Shrivastava. Sub-linear memory sketches for near neighbor search on streaming data with race. In International Conference on Machine Learning, 2020.
 [10] Benjamin Coleman, Benito Geordie, Li Chou, RA Leo Elworth, Todd J Treangen, and Anshumali Shrivastava. Diversified race sampling on data streams applied to metagenomic sequence analysis. bioRxiv, page 852889, 2019.
 [11] Benjamin Coleman and Anshumali Shrivastava. A one-pass private sketch for most machine learning tasks. arXiv preprint arXiv:2006.09352, 2020.
 [12] Benjamin Coleman and Anshumali Shrivastava. Sub-linear race sketches for approximate kernel density estimation on streaming data. In Proceedings of the 2020 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 2020.
 [13] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization, volume 8. Siam, 2009.
 [14] Anirban Dasgupta, Ravi Kumar, and Tamas Sarlos. Fast locality-sensitive hashing. KDD, 2011.
 [15] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni. Locality sensitive hashing scheme based on p-stable distributions. Symposium on Computational Geometry, 2004.
 [16] Edgar Dobriban and Sifan Liu. Asymptotics for sketching in least squares regression. In Advances in Neural Information Processing Systems, pages 3670–3680, 2019.
 [17] Amos Fiat. Online algorithms: The state of the art (lecture notes in computer science). 1998.
 [18] Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In Vldb, volume 99, pages 518–529, 1999.
 [19] Michael Goemans and David Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. ACM, 1995.
 [20] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998.
 [21] Krishnaram Kenthapadi, Aleksandra Korolova, Ilya Mironov, and Nina Mishra. Privacy via the johnson-lindenstrauss transform. Journal of Privacy and Confidentiality, 5(1), Aug. 2013.
 [22] Chen Luo and Anshumali Shrivastava. Arrays of (locality-sensitive) count estimators (ace): High-speed anomaly detection via cache lookups. WWW, 2018.
 [23] Preetum Nakkiran. More data can hurt for linear regression: Sample-wise double descent. arXiv preprint arXiv:1912.07242, 2019.
 [24] Ariel Rabkin, Matvey Arye, Siddhartha Sen, Vivek S Pai, and Michael J Freedman. Aggregation and degradation in jetstream: Streaming analytics in the wide area. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 275–288, 2014.
 [25] Anshumali Shrivastava and Ping Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In NIPS, 2014.
 [26] Christian Sohler and David P Woodruff. Subspace embeddings for the l1-norm with applications. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 755–764, 2011.
Appendix
We provide proofs for our theorems and further discussion of our classification surrogate losses.
Proof of Theorem 1
Theorem 1.
The set of STORM-approximable functions contains all LSH collision probabilities and is closed under addition, subtraction, and multiplication.
Proof.
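It is straightforward to see that STORM can approximate
$$\sum_{x \in \mathcal{D}} k(x, q)$$
as long as there is an LSH function with the collision probability $k(x, q)$. To prove the theorem, it is sufficient to show that given two LSH collision probabilities $k_1$ and $k_2$, STORM sketches can approximate the following two functions:
$$\sum_{x \in \mathcal{D}} \big( \alpha k_1(x, q) + \beta k_2(x, q) \big) \qquad \text{and} \qquad \sum_{x \in \mathcal{D}} k_1(x, q) \, k_2(x, q)$$
Note that one can always write a product of (weighted) sums as the sum of (weighted) products. Therefore, the previous two situations ensure that the set is closed under addition, subtraction and multiplication.
Addition and Subtraction: Because of the distributive property of addition,
$$\sum_{x \in \mathcal{D}} \big( \alpha k_1(x, q) + \beta k_2(x, q) \big) = \alpha \sum_{x \in \mathcal{D}} k_1(x, q) + \beta \sum_{x \in \mathcal{D}} k_2(x, q)$$
One can then construct a STORM sketch for the $k_1$ summation and a second STORM sketch for the $k_2$ summation. We can estimate any linear combination of $k_1$ and $k_2$ with a weighted sum of the two sketch estimates.
Multiplication: To approximate sums over the product $k_1(x, q) \, k_2(x, q)$, we rely on LSH hash function compositions. Suppose we have an LSH function $h_1$ with collision probability $k_1$ and $h_2$ with $k_2$. Consider the hash function $h(x) = g(h_1(x), h_2(x))$ where $g$ is an injective (or unique) mapping from $\mathbb{Z} \times \mathbb{Z}$ to $\mathbb{Z}$. An example of such a mapping, for non-negative hash values, is the function $g(a, b) = p_1^a p_2^b$ where $p_1$ and $p_2$ are coprime. Since the mapping is injective, this means that $h(x) = h(q)$ only when $h_1(x) = h_1(q)$ and $h_2(x) = h_2(q)$. Therefore,
$$\Pr[h(x) = h(q)] = \Pr[h_1(x) = h_1(q) \text{ and } h_2(x) = h_2(q)]$$
Make the choice of the LSH functions $h_1$ and $h_2$ independently, so that the probability factorizes:
$$\Pr[h(x) = h(q)] = k_1(x, q) \, k_2(x, q)$$
Therefore, one can construct a STORM sketch for the product using the LSH function $h$.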
∎
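The multiplication argument can be checked numerically. The sketch below is an illustration under assumptions: the two hash functions are independent SRP hashes with 1 and 2 concatenated bits (so their collision probabilities are $k$ and $k^2$ for the SRP probability $k$), and the injective mapping `compose` is a simple positional encoding chosen for this example rather than the mapping used in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.2, 0.3])
y = np.array([0.8, 0.1, 0.4, 0.5])
n_trials = 20000

def srp_bucket(W, v):
    """Bucket index from the sign pattern of each trial's hyperplanes applied to v."""
    sign_bits = ((W @ v) > 0).astype(np.int64)        # W has shape (trials, bits, dim)
    return sign_bits @ (1 << np.arange(W.shape[1]))

# Independently drawn hash functions, one pair per trial
W1 = rng.standard_normal((n_trials, 1, 4))   # h1 takes values in {0, 1}
W2 = rng.standard_normal((n_trials, 2, 4))   # h2 takes values in {0, 1, 2, 3}

def compose(a, b, range_b=4):
    """Injective mapping of the pair (a, b) to one integer, for 0 <= b < range_b."""
    return a * range_b + b

hx = compose(srp_bucket(W1, x), srp_bucket(W2, x))
hy = compose(srp_bucket(W1, y), srp_bucket(W2, y))
empirical = np.mean(hx == hy)                 # collision rate of the composed hash

cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
k = 1.0 - np.arccos(cos) / np.pi
theory = k * k ** 2                           # product of the two collision probabilities
```

Because the composed hash collides only when both component hashes collide, and the components are drawn independently, the empirical rate matches the product of the individual probabilities up to sampling noise.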
Proof of Theorem 2
Theorem 2.
When , the PRP collision probability is a convex surrogate loss for the linear regression objective such that
Proof.
For the surrogate ERM problem to have the same solution as the linear regression ERM problem, it is sufficient to show two things: that the surrogate loss is convex and that the global minima of the surrogate loss and the linear regression loss appear in the same location. The surrogate loss is
and the corresponding empirical risk minimization problem is
For the sake of notation, we will put , , and
We will use the fact that
Location of Minima:
The minimum of the surrogate loss is the same as the minimum for least squares linear regression. Using the chain rule, the gradient of the surrogate loss is
When , the gradient is always zero. When , the derivative is zero when because that is where . Thus, the surrogate loss has the same minimizer as the least squares loss.
Convexity: At index , the Hessian of the surrogate loss is
Thus, the Hessian
The gradient
Simplifying, we obtain
Which gives the following expression for the Hessian
where
It is easy to see that . Hence, the Hessian is positive semidefinite and the function is convex in . Also note that the function is convex in , since the restriction of the last dimension of to is the restriction of a convex function to a convex set.
∎
Proof of Theorem 3
[Figure 6: STORM surrogate classification loss compared against popular margin losses (imgs/class_losses.png)]
Theorem 3.
Consider labels $y \in \{-1, +1\}$. The loss function for the linear hyperplane classifier is a classification-calibrated margin loss and is STORM-approximable.
Proof.
First, we show that the loss is classificationcalibrated. Then, we show that the loss can be estimated using STORM.
Loss is Classification-Calibrated: A necessary and sufficient condition for a convex loss function $\phi$ to be classification-calibrated is for $\phi'(z) < 0$ at $z = 0$ (for nonconvex $\phi$, the sufficient conditions are more complicated). Here, $z = y f(x)$, where $f$ is the model. For a linear hyperplane classifier, $z = y \langle \theta, x \rangle$. The loss is therefore
is convex when for the same reasons discussed in the proof of Theorem 2. Note that the simple asymmetric LSH for the inner product that we have used throughout the paper^{2}^{2}2There are other asymmetric inner product LSH functions without this requirement, and in practice one usually scales the data to make the condition true. requires . The derivative is
At the origin, is . Therefore, the loss is classification calibrated. Figure 6 compares our STORM surrogate classification loss against popular margin losses.
Loss is STORMApproximable: Consider the asymmetric LSH function for the inner product where we premultiply by . The collision probability under this LSH function is
as desired. ∎