Tighter Bound Estimation of Sensitivity Analysis for Incremental and Decremental Data Modification

03/06/2020, by Rui Zhou et al. (University of Oxford; Tsinghua University)

In large-scale classification problems, the data set may be faced with frequent updates, e.g., a small ratio of data is added to or removed from the original data set. In this case, incremental learning, which updates an existing classifier by explicitly modeling the data modification, is more efficient than retraining a new classifier from scratch. Conventional incremental learning algorithms try to solve the problem exactly. However, for some tasks, we are only interested in the lower and upper bounds of some values relevant to the coefficient vector of the updated classifier, without actually solving for it, e.g., when deciding whether we should update the classifier or when performing sensitivity analysis tasks. To deal with such tasks, we propose an algorithm to make rational inferences about the updated classifier with low computational complexity. Specifically, we present a method to calculate tighter bounds of a general linear score for the updated classifier, so that the range of interest can be estimated more accurately than with existing methods. The proposed method can be applied to any linear classifier trained with a differentiable convex loss function and L2 regularization. Both theoretical analysis and experimental results show that the proposed approach is superior to existing methods.

I Introduction

For large-scale classification problems, dealing with modified data sets is an important research problem, since the data set may be updated under diversified conditions. For example, Zhang et al. [32] propose a novel unsupervised learning approach for clustering data sets with missing data, and Ristin et al. [19] present two variants of Random Forests (RF) to handle dynamically growing data sets. With a large-scale training data set, training a classifier may be very time-consuming. Thus, when a part of the data set is modified, directly retraining the classifier on the whole new training data set may not be a good choice, especially when the amount of modified data is limited. In this case, estimating the change caused by the data modification and studying the sensitivity of the classifier becomes very important.

So-called incremental learning is designed for the case in which we want to update the existing classifier, rather than retrain it from scratch, when the data set is modified [24, 3, 6]. In the literature, there are mainly two categories of incremental learning methods.

For the first category, the updated classifier can be explicitly derived based on an optimization technique called parametric programming. For example, Gu et al. [9] extend the online -support vector classification algorithm to the modified formulation and present an effective incremental support vector ordinal regression algorithm, which can handle a quadratic formulation with multiple constraints, where each constraint consists of both equality and inequality parts. Karasuyama et al. [11] develop an extension of the incremental and decremental algorithm which can update multiple data points simultaneously.

For the second category, some warm-start approaches that do not explicitly derive the updated formulation can also help reduce the cost of incremental learning. Tsai et al. [25] find that the warm-start setting is in general more effective at improving the primal initial solution than the dual one, and that it can speed up a high-order optimization method more effectively than a low-order one. Shilton et al. [22] propose a new incremental learning algorithm that uses a warm-start algorithm for training support vector machines, which takes advantage of the natural incremental properties of the standard active set approach to linearly constrained optimization problems.

However, conventional incremental learning algorithms aim to solve the primal or the dual problem of the optimization problem, which implies a high computational cost. Xu et al. [30] propose an incremental support vector machine based on Markov re-sampling (MR-ISVM) whose computational complexity is up to , where is the number of samples. Moreover, it is sometimes unnecessary to figure out the exact values of the coefficient vector of the updated classifier if our purpose is only to obtain a lower or upper bound of some values associated with the coefficient vector. Furthermore, most conventional incremental learning algorithms can only be applied to certain classification models, which largely limits their applicability. For example, Rüping et al. [20] propose an incremental learning approach specialized for support vector machines, and Ren et al. [18] put forward an incremental algorithm only for bidirectional principal component analysis used in pattern recognition and image analysis. For these reasons, a sensitivity analysis of the updated classifier with broad applicability is needed.

The aim of sensitivity analysis is not to obtain the exact values relevant to the updated classifier, but to estimate the range of the values of interest for the updated classifier. It can help researchers understand "how uncertainty in the output of a model (numerical or otherwise) can be apportioned to different sources of uncertainty in the model input" [21]. Since sensitivity analysis does not require calculating the exact values of the updated classifier, it can be executed efficiently. Recently, Okumura et al. [17] proposed a sensitivity analysis framework which can be used to estimate the bounds of a general linear score of the updated classifier.

The performance of sensitivity analysis highly relies on the tightness of the bounds it derives. Inspired by recent papers on feature screening in the L1 sparse learning framework [8, 14, 26, 29], we make use of a composite region test and propose a sensitivity analysis framework which can infer bounds on values relevant to the coefficient vector of the updated classifier. Our work aims to improve the algorithm proposed in [17], and our algorithm can estimate tighter lower and upper bounds of a general linear score of the form , where is the coefficient vector of the updated classifier and is a vector of the same dimension as .

Since the proposed algorithm does not have to solve any primal problem or any relevant dual problem of the optimization problem, it largely reduces the computational cost compared with conventional incremental algorithms, and its complexity depends only on the number of modified instances, i.e., the numbers of removed and added instances. In addition, the proposed algorithm is more robust than the algorithm proposed in [17]: even with a relatively large modification of the data set, it can still make accurate estimations. Moreover, since the proposed algorithm only requires that the estimated classifier be trained with a differentiable convex loss function and L2 regularization, it can be applied to a wide variety of algorithms.

Estimating the lower and upper bounds of a general linear score of the form has numerous applications in a wide range of fields. Firstly, with these bounds, the moment when the classifier should be updated can be easily estimated, which prevents us from performing unnecessary updates or omitting necessary ones. Secondly, when the bounds are sufficiently tight, the classification results of the updated classifier can also be estimated from the linear score, which ensures both low computational cost and high classification accuracy. Thirdly, the proposed algorithm can also be combined with model selection algorithms in order to reduce their computing time.

Our contributions can be listed as follows:

  • Firstly, the proposed sensitivity analysis algorithm can estimate tighter bounds on a general linear score of the coefficient vector of the updated classifier.

  • Secondly, the proposed algorithm has lower computing complexity and higher robustness compared to existing methods.

  • Thirdly, the proposed algorithm has better applicability than existing methods, as it only requires that the estimated classifier be trained with a differentiable convex loss function and L2 regularization.

  • Fourthly, the proposed algorithm has numerous applications which can improve efficiency when solving large-scale classification problems.

The rest of the paper is organized as follows. In Section II, we define the necessary mathematical notation and give a brief overview of the problem setup, and then describe two sensitivity analysis tasks that the proposed algorithm can deal with. In Section III, we give the proof of the proposed algorithm and apply it to the sensitivity analysis tasks proposed in Section II. In Section IV, we present the details of our simulations and analyze the simulation results. In Section V, we conclude our work and discuss future directions.

II Notation and Background

In this section, we first describe the background of our problem, including the mathematical notation, conventional methods, and our proposed algorithm. Then, we describe two sensitivity analysis tasks that the proposed algorithm can be applied to.

II-A Problem Setup

The problem we study in this paper is the following: a classifier has been trained on the original data set, and the data set is then modified by a small number of instances. Instead of retraining a new classifier from scratch at high computational cost, we propose an algorithm to make an inference, in an efficient manner, about the range of values relevant to the coefficient vector of the updated linear classifier.

We denote scalars in regular font ( or ), vectors in bold () and matrices in capital bold (). Specific entries in vectors or matrices follow the corresponding convention, , the dimension of vector is , and is a column vector all of whose elements are .

We use and to denote the original training data set and the updated training data set respectively, where and is a binary scalar. The numbers of training instances in the original training set and the updated training set are denoted as and . We consider the scenario where we have an existing classifier , and then a small number of instances are added to or removed from the original data set. We denote the sets of added and removed instances as and , and and are the numbers of instances in sets and respectively. Thus, we have

(1)

We define a new variable

(2)

which is used to describe the ratio of the modified data to the original data set.

Here we consider a class of L2-regularized linear classification problems with a convex loss function; hence, the original and updated classifiers, trained with the original data set and the modified data set respectively, are defined as

(3)
(4)

where and are the optimal solutions to problems (3) and (4) respectively; denotes the regularization constant and is a differentiable and convex loss function. When is the squared hinge loss, i.e.,

(5)

the problem in (3) or (4) corresponds to SVM with squared hinge loss (L2-SVM) [23, 12]. (Note that in this paper we do not include any bias term , since we can deal with this term by appending each instance with an additional dimension, i.e., .) When is the logistic regression loss, i.e.,

(6)

the problem in (3) or (4) corresponds to L2-regularized logistic regression [31, 7]. For any , we denote the individual loss and the gradient of the individual loss as

(7)
(8)

Our main interest in this paper is to avoid calculating explicitly as in conventional incremental learning algorithms, which is computationally expensive when the whole data set is large. Instead, we aim to bound the linear score relevant to the coefficient vector of the updated classifier

(9)

where can be any dimensional vector in ; and are the lower and upper bounds of . In addition, our framework can also be applied to nonlinear classification problems via the kernel trick, i.e., can be represented by the kernel function. A small illustrative sketch of this setup is given below, after which we introduce two applications used to verify the performance of the proposed algorithm.
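To make the problem setup concrete, the following minimal sketch (in Python) trains an original and an updated classifier and evaluates one linear score of the updated coefficient vector, i.e., exactly the quantity that the rest of the paper bounds without this retraining step. It assumes the common scaling (1/n)·Σℓ + (λ/2)‖θ‖² with the squared hinge loss of (5); the paper's exact notation and scaling may differ, and all names are illustrative.

import numpy as np
from scipy.optimize import minimize

def squared_hinge(margins):
    # l(m) = max(0, 1 - m)^2: differentiable and convex, as required
    return np.maximum(0.0, 1.0 - margins) ** 2

def objective(theta, X, y, lam):
    # assumed form: average individual loss plus L2 regularization
    margins = y * (X @ theta)
    return squared_hinge(margins).mean() + 0.5 * lam * (theta @ theta)

def train(X, y, lam=1.0):
    # numerical stand-in for solving (3) / (4); the experiments in Section IV use LIBLINEAR instead
    return minimize(objective, np.zeros(X.shape[1]), args=(X, y, lam), method="L-BFGS-B").x

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = np.where(rng.random(500) < 0.5, -1.0, 1.0)

theta_old = train(X, y)            # classifier trained on the original data set
X_new, y_new = X[5:], y[5:]        # e.g., five instances removed from the original set
theta_new = train(X_new, y_new)    # updated classifier (the retraining we want to avoid)

eta = rng.normal(size=20)          # any direction of interest
print("linear score of the updated classifier:", eta @ theta_new)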

II-B Sensitivity analysis tasks

II-B1 Task 1. Sensitivity of coefficients

When a small number of instances is added to or removed from the original data set, we care about the change in the coefficient vector of the updated classifier, which can help us decide whether we should update the classifier or not: unless the change is unacceptably large, we can keep using the original classifier. Let be a vector of all 0s except a 1 in the element; then we can bound the coefficient of the coefficient vector of the updated classifier using equation (9) as

(10)

In order to evaluate the tightness of the bound, we define the variable as follows

(11)
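Equations (10) and (11) are not reproduced here, so the short sketch below simply assumes that the tightness measure is the average gap between the upper and lower coefficient bounds, in line with the description in Section III-B1; the function name is illustrative.

import numpy as np

def average_bound_tightness(lower, upper):
    # lower[j] and upper[j] bound the j-th coefficient of the updated classifier;
    # a smaller average gap means tighter bounds.
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    return float(np.mean(upper - lower))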

II-B2 Task 2. Sensitivity of test instance labels

Using the same framework, we can also determine the label of any test instance with the updated classifier, even though we do not know the exact value of the coefficient vector of the updated linear classifier. Let be the predicted classification result of ; by setting in (9), we can calculate the lower and upper bounds of as

(12)

and by using the simple facts

(13)
(14)

we can obtain the classification result of without actually solving (4). If the ratio of the modified data is small, we can expect that will not differ greatly from , which means the bounds in (12) can be sufficiently tight.

In this task, we calculate the error ratio of the test instances whose lower bound and upper bound have different signs, as for these test instances we cannot determine the classification result of . The definition of this error ratio can be written as:

(15)

where is the number of instances with different signs for their estimated lower bound and upper bound; is the size of the whole test data set. By comparing error ratios for different sensitivity analysis algorithms, we can evaluate their performance.
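A minimal sketch of the error ratio in (15), under the assumption that it is simply the fraction of test instances whose score bounds straddle zero; names are illustrative.

import numpy as np

def error_ratio(lower, upper):
    # lower[i] and upper[i] bound the decision score of the i-th test instance;
    # an instance is undecidable when its bounds have different signs.
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    undecided = np.sum((lower < 0) & (upper > 0))
    return float(undecided) / len(lower)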

III Sensitivity Analysis via Segment Test

In this section, we demonstrate our method for the sensitivity analysis tasks. The main idea of our method is to restrict to a region, within which we can then calculate the lower and upper bounds of the linear score. We call this approach the region test, as in [28]. The results below tell us how to calculate the corresponding bounds. First, we briefly review the results proved in [17], and then we present the proposed algorithm.

III-A Region Test

III-A1 Sphere Test

In this part, we present the method proposed in [17], which serves as a lemma for this paper; our work builds on this method.

Lemma 1 (Sphere Region) Let and be the optimal solutions of problems (3) and (4) respectively. Given , is within the sphere region

(16)

where is the center of the sphere and is the radius of the sphere. They are defined as

(17)
(18)

where is

(19)

By restricting to the sphere , we can calculate the lower and upper bounds of ; here we introduce the following proposition from [17].

Proposition 1 (Sphere Test) The lower and the upper bounds of in the sphere region are respectively

(20)

and

(21)

From (20) and (21), we notice that the main computational cost of the sphere test depends only on the computation of in (19), which involves only the removed data set and the added data set . Therefore, the sphere test greatly reduces the computational cost.
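The sphere test itself rests on a standard fact: the extrema of a linear function over a Euclidean ball have a closed form. The sketch below takes the center and radius of (17)-(18) as given inputs and returns the bounds of Proposition 1; it illustrates the geometry rather than reproducing the paper's exact expressions.

import numpy as np

def sphere_test(eta, center, radius):
    # A linear function over the ball ||theta - center|| <= radius is extremized at
    # center -/+ radius * eta / ||eta||, giving eta^T center -/+ radius * ||eta||.
    base = float(eta @ center)
    spread = radius * float(np.linalg.norm(eta))
    return base - spread, base + spread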

III-A2 Half Space Test

Considering that can also be bounded in other ways, in this part we introduce the half space test.

Proposition 2 (Half Space Region) Given , the optimal solution is within the half space

(22)

where is a unit normal vector of the plane and is the distance of the plane from the origin as in [15]. They are defined as

(23)
(24)

where

(25)

Proof: The loss function is convex with respect to . According to [16], we have

(26)

As is the optimal solution of the loss function, we have

(27)

By adding (26) and (27), we find

(28)

Therefore,

(29)

 

III-A3 Sphere Test and Half Space Test

In this part, we study the relationship between the sphere region and the half space region. What we care about is whether the sphere region can be divided by the half space region into two parts or not.

Proposition 3 (Sphere Region and Half Space Region) Given , we can find that the optimal solution is included in a specific half space region where

(30)
(31)

and

(32)

Meanwhile, the optimal solution is also within a sphere region . The sphere region will be divided into two parts by this particular half space region, which means that we can restrict to a smaller region called the segment region.

Proof: According to Proposition 2 and the fact that is a feasible solution of the unconstrained problem (4), letting , we have

(33)

and

(34)

According to the definition of , we have

(35)

By using the second-order Taylor approximation as in [10], we have

(36)

Moreover, is the optimal solution of the convex loss function . Thus, satisfies the following equation:

(37)

which is equivalent to

(38)

By combining equations (35), (36) and (38), we have

(39)
Fig. 1: Segment Region .

After acquiring the half space region, we consider a segment region test based on the nonempty intersection of the closed sphere and the closed half space . As in [2], the distance between the sphere center and the plane can be written as

(40)

Thus, the coefficient and the intersection point between the plane and the radius of the sphere which is orthogonal to the plane can be defined as

(41)

where , , and are defined in (17), (18), (30) and (31). We have

(42)

So, the range of is , which means that the intersection point is located inside the sphere. Thus, the sphere region is divided by the half space region into two parts, and the resulting segment region is smaller than the sphere region. In this way, we restrict to a smaller segment region. Figure 1 illustrates the segment region .
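The geometric condition established above can also be checked numerically. The helper below assumes one common convention (a unit normal and the plane {theta : normal^T theta = offset}); it only verifies that the plane passes through the interior of the sphere, so that intersecting with the half space yields a strictly smaller segment region.

import numpy as np

def plane_cuts_sphere(center, radius, normal, offset):
    # distance from the sphere center to the plane, for a unit normal
    dist = abs(float(normal @ center) - offset)
    return dist < radius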

III-A4 Segment Test

Motivated by the method introduced in [13, 28], we combine the closed sphere region and the closed half space region to get tighter bounds for .

Proposition 4 (Segment Test) Using the definitions of , , and in (17), (18), (30) and (31), we can find the upper and lower bounds of , and the computational cost of this calculation depends only on the added and removed data sets and . The lower and upper bounds of in the segment region are respectively

(43)

and

(44)

where

Proof: We can obtain the lower bound of by solving the following optimization problem

(45)

The Lagrange function [1] of (45) can be written as

(46)

where and are Lagrange multipliers. As in [27], setting the derivative with respect to the primal variable to zero yields

(47)

Substituting into ,

(48)

Substituting into (46),

(49)

where is defined in (42), and . Hence, we can divide this problem into two cases:

(a)

In order to satisfy the complementary slackness condition in (46), we find

(50)

Then, substituting into (49) and setting the derivative of (49) with respect to equal to zero, we have

(51)

Based on (47), it’s simple to check out that

(52)

Therefore is equal to . Substituting and into (48), we obtain

(53)

(b)

Setting the derivatives of (49) with respect to and equal to zero, we find

(54)
(55)

Using (54) and (55) together, we can easily get

(56)
(57)

According to the Karush–Kuhn–Tucker conditions, we have

(58)

Thus,

(59)

It’s simple to prove that (59) is equal to . Substitute and into (48)

(60)

By combining cases (a) and (b), the lower bound in (43) is obtained; the upper bound in (44) can be derived similarly.
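The closed forms in (43) and (44) are not reproduced above, so the sketch below instead derives the same kind of segment bound from first principles under one assumed convention for the half space, {theta : normal^T theta <= offset} with a unit normal: if the sphere-only extremizer already lies in the half space it is kept, otherwise the extremum moves to the circle where the plane cuts the sphere. The paper's own sign conventions and notation may differ.

import numpy as np

def segment_test(eta, center, radius, normal, offset):
    # Bounds of eta^T theta over {||theta - center|| <= radius} intersected with
    # {normal^T theta <= offset}; assumes a unit normal and that the plane cuts
    # the ball (which Proposition 3 guarantees for the regions used here).
    eta = np.asarray(eta, dtype=float)
    eta_norm = float(np.linalg.norm(eta))
    gap = float(normal @ center) - offset                 # signed distance of the center to the plane
    c_plane = center - gap * normal                       # projection of the center onto the plane
    r_plane = np.sqrt(max(radius ** 2 - gap ** 2, 0.0))   # radius of the cut circle
    eta_par = eta - float(eta @ normal) * normal          # component of eta parallel to the plane
    spread_plane = r_plane * float(np.linalg.norm(eta_par))

    # Lower bound: sphere minimizer if it satisfies the half-space constraint,
    # otherwise the minimum over the cut circle.
    theta_min = center - radius * eta / max(eta_norm, 1e-12)
    lower = (float(eta @ center) - radius * eta_norm
             if float(normal @ theta_min) <= offset
             else float(eta @ c_plane) - spread_plane)

    # Upper bound: symmetric argument.
    theta_max = center + radius * eta / max(eta_norm, 1e-12)
    upper = (float(eta @ center) + radius * eta_norm
             if float(normal @ theta_max) <= offset
             else float(eta @ c_plane) + spread_plane)
    return lower, upper

By construction, these segment bounds are never looser than the sphere bounds of Proposition 1, which is the source of the tightness gains reported in Section IV.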

III-B Bound of sensitivity analysis tasks

III-B1 Bound of coefficients

By using the segment test, we can get the lower and upper bounds of the element of the coefficient vector of the updated linear classifier by substituting into (43) and (44).

Corollary 1 (Bound of coefficients) , the coefficient of the coefficient vector of the updated classifier satisfies

(61)

and

(62)

where are the coefficient of defined in (19); and are defined in (18) and (42) respectively; .

Given the lower and upper bounds of , we can obtain the bounds for . Here we use the average tightness of the bound over instances as the evaluation metric, which we call the bound tightness.

III-B2 Bound of test instance labels

Next, we use Proposition 4 for the sensitivity of test instance labels. Substituting into (43) and (44), we can get the lower and upper bounds of . If the signs of the lower and upper bounds are the same, we can infer the test instance label by calculating the bounds instead of calculating the coefficient vector of the updated linear classifier.

Corollary 2 (Bound of test instance labels) For any test instance , the classification result using the updated classifier is

(63)

which satisfies

(64)

where

(65)

and

(66)

, and are defined in (17), (18) and (42) respectively, .
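Reusing the segment_test helper sketched in Section III-A4, label inference per Corollary 2 reduces to a sign check on the bounds with the direction set to the test instance itself; names and conventions remain illustrative.

def infer_label(x_test, center, radius, normal, offset):
    # Try to decide the updated classifier's label for x_test from the segment
    # bounds alone (eta = x_test as in Corollary 2). Returns +1 or -1 when both
    # bounds share a sign, and None when the bounds straddle zero and the label
    # cannot be determined without retraining.
    lower, upper = segment_test(x_test, center, radius, normal, offset)
    if lower > 0:
        return 1
    if upper < 0:
        return -1
    return None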

IV Experimental Results

In this section, we conduct extensive experiments to evaluate our method on real-world data sets. We first provide a brief description of the data sets and the experimental setup, and then we present the results of four experiments and evaluate the performance of the different algorithms.

ID   Data set   #Training instances   #Features   #Test instances   Experiment
D1   w8a        49,749                300         14,951            I
D2   a9a        32,561                123         16,281            II
D3   cod-rna    59,535                8           271,617           III & IV
TABLE I: Data sets used for experiments
Data sets

In this part, we use data sets from the LIBSVM data set repository [4] for comparison; the data sets used for each test are presented in Table I. For each test, the regularization constant is set to {0.2, 0.5, 1}, while the ratio of the modified data set is set to {0.01%, 0.02%, 0.05%, 0.1%, 0.2%, 0.5%, 1%, 2%, 5%, 10%}.

Experiment setting

We use the LIBLINEAR package (version 2.21) proposed in [5], called from Matlab, to calculate the relevant bounds and the coefficient vectors of the linear classifiers with L2-SVM or L2-regularized logistic regression. In the experiments, we compare our method with the algorithm described in [17]; our method and the algorithm proposed in [17] are called the segment test and the sphere test respectively. In order to guarantee the reliability of the experiments, for each combination of modification ratio , regularization constant and data set, we repeat the experiment 30 times. All experiments were performed on a desktop with a 4-core Intel(R) Core(TM) processor at 2.80 GHz and 8 GB of RAM. Matlab 2017 is used to run the simulations and R is used for data visualization.
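For reference, the measurement protocol can be summarized by the skeleton below. It is a sketch only: train and compute_regions are hypothetical stand-ins for the solver and for the region quantities in (17)-(19) and (30)-(32), the removal-only modification is just one example of a data change, and segment_test is the helper sketched in Section III-A4.

import time
import numpy as np

def run_experiment(X, y, lam, ratios, train, compute_regions):
    # For each modification ratio, time full retraining versus bound computation
    # and record the average coefficient-bound tightness of the segment test.
    results = []
    theta_old = train(X, y, lam)
    for delta in ratios:
        k = max(1, int(delta * len(y)))                   # number of modified (here: removed) instances
        X_new, y_new = X[k:], y[k:]

        t0 = time.perf_counter()
        theta_new = train(X_new, y_new, lam)              # "New Model" baseline in Tables II and III
        retrain_time = time.perf_counter() - t0

        t0 = time.perf_counter()
        center, radius, normal, offset = compute_regions(theta_old, X, y, X_new, y_new, lam)
        gaps = []
        for e_j in np.eye(X.shape[1]):                    # coefficient bounds via eta = e_j
            lo, hi = segment_test(e_j, center, radius, normal, offset)
            gaps.append(hi - lo)
        bound_time = time.perf_counter() - t0

        results.append((delta, retrain_time, bound_time, float(np.mean(gaps))))
    return results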

IV-A Results on sensitivity analysis of coefficients

Fig. 2: Comparison of coefficient bounds for L2-SVM. Lower y values mean that the differences between the upper and lower bounds of the coefficients are smaller.
Ratio of modified data (%):  0.01  0.02  0.05  0.1  0.2  0.5  1  2  5  10
New Model Time(ms) 72.03 73.26 70.96 70.49 71.60 71.56 70.06 70.38 73.24 79.60
Sphere Test Time(ms) 0.67 0.58 0.46 0.43 0.41 0.52 0.62 0.87 1.62 2.78
Tightness 2.03e-4 3.74e-4 4.54e-4 7.40e-4 1.10e-3 2.69e-3 4.50e-3 8.66e-3 1.97e-2 3.50e-2
Segment Test Time(ms) 1.23 0.95 0.65 0.51 0.50 0.58 0.55 0.86 2.58 4.78
Tightness 2.03e-4 3.74e-4 4.54e-4 7.40e-4 1.10e-3 2.55e-3 3.73e-3 6.29e-3 9.46e-3 1.09e-2
New Model Time(ms) 66.33 67.92 66.13 64.50 65.23 64.26 67.29 67.01 69.77 74.01
Sphere Test Time(ms) 0.39 0.36 0.37 0.35 0.39 0.46 0.66 0.87 1.56 2.78
Tightness 6.45e-5 7.20e-5 2.23e-4 3.79e-4 7.26e-4 1.73e-3 3.44e-3 6.41e-3 1.52e-2 2.90e-2
Segment Test Time(ms) 0.42 0.35 0.37 0.36 0.38 0.39 0.60 0.71 2.23 4.43
Tightness 6.45e-5 7.15e-5 2.23e-4 3.61e-4 6.14e-4 1.19e-3 1.77e-3 2.36e-3 4.04e-3 5.33e-3
New Model Time(ms) 62.89 61.66 63.48 61.38 62.64 62.10 62.49 63.67 69.30 70.94
Sphere Test Time(ms) 0.31 0.28 0.29 0.31 0.33 0.41 0.55 0.81 1.48 2.71
Tightness 3.25e-5 7.92e-5 1.60e-4 2.76e-4 5.46e-4 1.31e-3 2.74e-3 5.36e-2 1.25e-2 2.41e-2
Segment Test Time(ms) 0.36 0.31 0.30 0.30 0.29 0.36 0.42 0.66 2.03 4.88
Tightness 3.25e-5 7.78e-5 1.60e-4 2.28e-4 3.91e-4 6.16e-4 1.05e-3 1.40e-3 1.95e-3 2.65e-3
TABLE II: EXPERIMENT I
Fig. 3: Comparison of coefficient bounds for L2-regularized logistic regression. Lower y values mean that the differences between the upper and lower bounds of the coefficients are smaller.
Ratio of modified data (%):  0.01  0.02  0.05  0.1  0.2  0.5  1  2  5  10
New Model Time(ms) 108.33 102.77 106.53 104.72 110.22 104.83 112.31 115.49 117.58 117.83
Sphere Test Time(ms) 0.80 0.74 0.54 0.53 0.79 0.49 0.58 0.82 1.73 2.60
Tightness 8.60e-4 1.81e-3 4.05e-3 6.94e-3 1.49e-2 3.43e-2 7.03e-2 1.35e-1 3.34e-1 6.20e-1
Segment Test Time(ms) 1.14 0.88 0.73 0.63 1.42 0.72 0.77 0.86 2.76 4.67
Tightness 5.63e-5 1.09e-4 2.82e-4 4.88e-4 1.01e-3 2.36e-3 4.88e-3 9.55e-3 2.30e-2 4.35e-2
New Model Time(ms) 100.94 104.56 104.64 99.75 106.38 101.31 98.30 100.66 108.43 113.54
Sphere Test Time(ms) 0.40 0.39 0.39 0.40 0.41 0.41 0.48 0.65 1.57 2.51
Tightness 3.66e-4 6.58e-4 1.92e-3 2.61e-3 5.08e-3 1.27e-2 2.68e-2 4.92e-2 1.21e-1 2.31e-1
Segment Test Time(ms) 0.50 0.43 0.45 0.49 0.55 0.52 0.65 0.67 2.44 4.47
Tightness 3.33e-5 6.24e-5 1.45e-4 2.40e-4 4.31e-4 1.11e-3 2.23e-3 4.20e-3 1.03e-2 1.98e-2
New Model Time(ms) 101.46 100.02 98.24 99.69 103.25 104.37 104.60 108.21 111.47 116.87
Sphere Test Time(ms) 0.39 0.43 0.43 0.40 0.44 0.43 0.61 0.75 1.70 2.79
Tightness 1.95e-4 3.51e-4 8.12e-4 1.37e-3 2.52e-3 5.44e-3 1.18e-2 2.29e-2 5.57e-2 1.05e-2
Segment Test Time(ms) 0.33 0.36 0.38 0.36 0.39 0.48 0.61 0.67 2.56 4.82
Tightness 1.96e-5 3.71e-5 7.47e-5 1.41e-4 2.57e-4 5.14e-4 1.07e-3 1.98e-3 4.77e-3 9.21e-3
TABLE III: EXPERIMENT II

IV-A1 L2-SVM

Here we show the results for the sensitivity of coefficients task described in Section II-B1. A comparison of the bound tightness and the computing time between the sphere test and the segment test with L2-SVM can be found in Figure 2, and detailed values can be found in Table II. In this test, we find:

As increases, the bound tightness of both tests decreases (the bounds become looser), while the difference between the bound tightness of the two tests becomes larger. When is equal to and is equal to , the bound tightness of the sphere test is almost times that of the segment test. When is equal to and is equal to , the bound tightness of the sphere test is almost times that of the segment test. When is equal to and is equal to , the bound tightness of the sphere test is almost times that of the segment test.

With the same regularization constant , when the ratio of the modified data set is below , the performance of the two tests shows almost no difference. However, the advantage of the segment test becomes more and more pronounced as continues to increase.

In general, the computing time of the segment test is longer than that of the sphere test, but less than double it. Moreover, compared with training a new classifier from scratch, the computing time of the segment test is still very short.

IV-A2 L2-Logistic Regression

A comparison of the bound tightness and the computing time between the sphere test and the segment test for the coefficient bounds with L2-regularized logistic regression can be found in Figure 3, and detailed values can be found in Table III. We find:

As increases, the bound tightness of both tests decreases faster than with L2-SVM, and the difference between the bound tightness of the two algorithms becomes smaller, which differs from the trend observed with L2-SVM. When is equal to and is equal to , the bound tightness of the sphere test is almost times that of the segment test. When is equal to and is equal to