# An Aggregate and Iterative Disaggregate Algorithm with Proven Optimality in Machine Learning

We propose a clustering-based iterative algorithm to solve certain optimization problems in machine learning. The algorithm starts by aggregating the original data and solving the problem on the aggregated data, and then gradually disaggregates the data in subsequent steps. We apply the algorithm to common machine learning problems such as the least absolute deviation regression problem, support vector machines, and semi-supervised support vector machines. We derive model-specific data aggregation and disaggregation procedures. We also show optimality and convergence, and bound the optimality gap of the approximate solution in each iteration. A computational study is provided.


## 1 Introduction

In this paper, we propose a clustering-based iterative algorithm to solve certain optimization problems in machine learning when data size is large and thus it becomes impractical to use out-of-the-box algorithms. We rely on the principle of data aggregation and then subsequent disaggregations. While it is standard practice to aggregate the data and then calibrate the machine learning algorithm on aggregated data, we embed this into an iterative framework where initial aggregations are gradually disaggregated to the extent that even an optimal solution is obtainable.

Early studies in data aggregation consider transportation problems [1, 10], where either demand or supply nodes are aggregated. Zipkin [31] studied data aggregation for linear programming (LP) and derived error bounds for the approximate solution. There are also studies on data aggregation for 0-1 integer programming [8, 13]. The reader is referred to Rogers et al [22] and Litvinchev and Tsurkov [16] for comprehensive reviews of aggregation techniques applied to optimization problems.

For support vector machines (SVM), there exist several works using the concept of clustering or data aggregation. Evgeniou and Pontil [11] proposed a clustering algorithm that creates large clusters for entries surrounded by entries of the same class and small clusters for entries in mixed-class areas; the clustering is used to preprocess the data, and the clustered data is used to solve the problem. The algorithm thus tends to create large clusters for entries far from the decision boundary and small clusters near it. Wang et al [26] developed screening rules for SVM to discard non-support vectors that do not affect the classifier. Nath et al [19] and Doppa et al [9] proposed a second order cone programming (SOCP) formulation for SVM based on chance constraints and clusters. The key idea of the SOCP formulations is to reduce the number of constraints (from the number of entries to the number of clusters) by defining chance constraints for clusters.

After obtaining an approximate solution by solving the optimization problem with aggregated data, a natural attempt is to use less-coarsely aggregated data, in order to obtain a finer approximation. In fact, we can do this iteratively: modify the aggregated data in each iteration based on the information at hand. This framework, which iteratively passes information between the original problem and the aggregated problem [22], is known as Iterative Aggregation Disaggregation (IAD). The IAD framework has been applied for several optimization problems such as LP [17, 24, 25] and network design [2]. In machine learning, Yu et al [28, 29] used hierarchical micro clustering and a clustering feature tree to obtain an approximate solution for support vector machines.

In this paper, we propose a general optimization algorithm based on clustering and data aggregation, and apply it to three common machine learning problems: least absolute deviation regression (LAD), SVM, and semi-supervised support vector machines (S3VM). The algorithm fits the IAD framework, but has additional properties shown for the selected problems in this paper. The ability to report the optimality gap and monotonic convergence to a global optimum are features of our algorithm for LAD and SVM, while our algorithm guarantees optimality for S3VM without monotonic convergence. Our work for SVM is distinguished from that of Yu et al [28, 29], as we iteratively solve weighted SVM and guarantee optimality, whereas they iteratively solve the standard unweighted SVM and thus find only an approximate solution. It is also distinguished from Evgeniou and Pontil [11], as our algorithm is iterative and guarantees a global optimum, whereas they used clustering to preprocess data and obtain an approximate optimum. Nath et al [19] and Doppa et al [9] are different because we use the typical SVM formulation within an iterative framework, whereas they propose an SOCP formulation based on chance constraints.

Our data disaggregation and cluster partitioning procedure is based on the optimality condition derived in this paper: the relative location of the observations to the hyperplane (for LAD, SVM, S3VM) and the labels of the observations (for SVM, S3VM). For example, in the SVM case, if the separating hyperplane divides a cluster, the cluster is split. The condition for S3VM is even more involved, since a single cluster can be split into four clusters. In the computational experiments, we show that our algorithm outperforms current state-of-the-art algorithms when the data size is large. The implementation of our algorithms is based on in-memory processing; however, the algorithms also work when the data does not fit entirely in memory and has to be read from disk in batches. The algorithms never require the entire data set to be processed at once. Our contributions are summarized as follows.

1. We propose a clustering-based iterative algorithm to solve certain optimization problems, where an optimality condition is derived for each problem. The proposed algorithmic framework can be applied to other problems with certain structural properties (even outside of machine learning). The algorithm is most beneficial when the time complexity of the original optimization problem is high.

2. We present model-specific disaggregation and cluster partitioning procedures based on the optimality condition, which is one of the keys to achieving optimality.

3. For the selected machine learning problems, i.e., LAD and SVM, we show that the algorithm monotonically converges to a global optimum, while providing the optimality gap in each iteration. For S3VM, we provide the optimality condition.

We present the algorithmic framework in Section 2 and apply it to LAD, SVM, and S3VM in Section 3. A computational study is provided in Section 4, followed by a discussion of the characteristics of the algorithm and of how to develop the algorithm for other problems in Section 5.

## 2 Algorithm: Aggregate and Iterative Disaggregate (AID)

We start by defining a few terms. A data matrix consists of entries (rows) and attributes (columns). A machine learning optimization problem needs to be solved over the data matrix. When the entries of the original data are partitioned into several sub-groups, we call the sub-groups clusters and we require every entry of the original data to belong to exactly one cluster. Based on the clusters, an aggregated entry is created for each cluster to represent the entries in the cluster. This aggregated entry (usually the centroid) represents one cluster, and all aggregated entries are considered in the same attribute space as the entries of the original data. The notion of the aggregated data refers to the collection of the aggregated entries. The aggregated problem is a similar optimization problem to the original optimization problem, based on the aggregated data instead of the original data. Declustering is the procedure of partitioning a cluster into two or more sub-clusters.

We consider optimization problems of the type

$$\begin{aligned}\min_{x,y}\quad & \sum_{i=1}^{n} f_i(x_i)+f(y) \\ \text{s.t.}\quad & g^1_i(x_i,y)\ge 0, \ \text{ for every } i=1,\dots,n, \\ & g^2_i(x_i)\ge 0, \ \text{ for every } i=1,\dots,n, \\ & g(y)\ge 0, \end{aligned}\tag{1}$$

where $n$ is the number of entries of $x$, $x_i$ is entry $i$ of $x$, and arbitrary functions $f_i$, $f$, $g^1_i$, $g^2_i$, and $g$ are defined for every $i=1,\dots,n$. One of the common features of such problems is that the data associated with the $x_i$ is aggregated in practice and an approximate solution can be easily obtained. Well-known problems such as LAD, SVM, and facility location fall into this category. The focus of our work is to design a computationally tractable algorithm that actually yields an optimal solution in a finite number of iterations.

Our algorithm needs four components tailored to a particular optimization problem or a machine learning model.

1. A definition of the aggregated data is needed to create aggregated entries.

2. Clustering and declustering procedures (and criteria) are needed to cluster the entries of the original data and to decluster the existing clusters.

3. An aggregated problem (usually a weighted version of the problem over the aggregated data) should be defined.

4. An optimality condition is needed to determine whether the current solution to the aggregated problem is optimal for the original problem.

The overall algorithm is initialized by defining clusters of the original entries and creating aggregated data. In each iteration, the algorithm solves the aggregated problem. If the obtained solution to the aggregated problem satisfies the optimality condition, then the algorithm terminates with an optimal solution to the original problem. Otherwise, the selected clusters are declustered based on the declustering criteria and new aggregated data is created. The algorithm continues until the optimality condition is satisfied. We refer to this algorithm, which is summarized in Algorithm 1, as Aggregate and Iterative Disaggregate (AID). Observe that the algorithm is finite as we must stop when each cluster is an entry of the original data. In the computational experiment section, we show that in practice the algorithm terminates much earlier.
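To make the loop concrete, the following sketch instantiates the four components on a deliberately tiny instance: the intercept-only LAD problem $\min_c \sum_i |y_i - c|$, whose aggregated problem (a weighted sum of absolute deviations of cluster means) reduces to a weighted median. The helper names and the sorted-chunk initial clustering are our own illustrative choices, not part of the paper's specification.

```python
import numpy as np

def weighted_median(values, weights):
    """Minimizer of sum_k weights[k] * |values[k] - c| over c."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # first point where cumulative weight reaches half of the total weight
    return v[np.searchsorted(cum, 0.5 * w.sum())]

def aid_intercept_lad(y, n_init_clusters=4, max_iter=100):
    """AID sketch for min_c sum_i |y_i - c| (intercept-only LAD)."""
    y = np.asarray(y, dtype=float)
    clusters = np.array_split(np.argsort(y), n_init_clusters)
    for _ in range(max_iter):
        means = np.array([y[c].mean() for c in clusters])
        sizes = np.array([len(c) for c in clusters], dtype=float)
        c_hat = weighted_median(means, sizes)   # solve the aggregated problem
        new_clusters, optimal = [], True
        for c in clusters:
            pos = (y[c] - c_hat) >= 0
            if pos.all() or (~pos).all():       # optimality condition holds
                new_clusters.append(c)
            else:                               # decluster by residual sign
                optimal = False
                new_clusters.append(c[pos])
                new_clusters.append(c[~pos])
        if optimal:
            return c_hat                        # optimal for the original problem
        clusters = new_clusters
    return c_hat
```

On termination the returned intercept attains the same objective value as a median of the data, which is the known optimum of this toy problem.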

In Figure 1, we illustrate the concept of the algorithm. In Figure 1(a), small circles represent the entries of the original data. They are partitioned into three clusters (large dotted circles), where the crosses represent the aggregated data (three aggregated entries). We solve the aggregated problem with the three aggregated entries in Figure 1(a). Suppose that the aggregated solution does not satisfy the optimality condition and that the declustering criteria decide to partition all three clusters. In Figure 1(b), each cluster in Figure 1(a) is split into two sub-clusters. Suppose that the optimality condition is satisfied after several iterations. Then, we terminate the algorithm with guaranteed optimality. Figure 1(c) represents possible final clusters after several iterations from Figure 1(b). Observe that some of the clusters in Figure 1(b) remain the same in Figure 1(c), due to the fact that we selectively decluster.

We use the following notation in subsequent sections.

1. $I=\{1,\dots,n\}$: index set of entries, where $n$ is the number of entries (observations)

2. $J=\{1,\dots,m\}$: index set of attributes, where $m$ is the number of attributes

3. $K^t$: index set of the clusters in iteration $t$

4. $\{C^t_k\}_{k\in K^t}$: set of clusters in iteration $t$, where $C^t_k$ is a subset of $I$ for any $k$ in $K^t$

5. $T$: last iteration of the algorithm, when the optimality condition is satisfied

## 3 AID for Machine Learning Problems

### 3.1 Least Absolute Deviation Regression

The multiple linear least absolute deviation regression problem (LAD) can be formulated as

$$E^*=\min_{\beta\in\mathbb{R}^m}\sum_{i\in I}\Big|y_i-\sum_{j\in J}x_{ij}\beta_j\Big|,\tag{2}$$

where $x\in\mathbb{R}^{n\times m}$ is the explanatory variable data, $y\in\mathbb{R}^n$ is the response variable data, and $\beta\in\mathbb{R}^m$ is the decision variable. Since the objective function of (2) is the summation of functions over all $i$ in $I$, LAD fits (1), and we can use AID.

Let us first define the clustering method. Given a target number of clusters $|K^1|$, any clustering algorithm can be used to partition the entries into initial clusters $\{C^1_k\}_{k\in K^1}$. Given $\{C^t_k\}_{k\in K^t}$ in iteration $t$, for each $k\in K^t$, we generate aggregated data by

$$x^t_{kj}=\frac{1}{|C^t_k|}\sum_{i\in C^t_k}x_{ij}, \ \text{ for all } j\in J, \quad\text{and}\quad y^t_k=\frac{1}{|C^t_k|}\sum_{i\in C^t_k}y_i,$$

where $x^t_k$ and $y^t_k$ are the aggregated explanatory and response entries of cluster $C^t_k$. To balance the clusters with different cardinalities, we give weight $|C^t_k|$ to the absolute error associated with cluster $k$. Hence, we solve the aggregated problem

$$F^t=\min_{\beta^t\in\mathbb{R}^m}\sum_{k\in K^t}|C^t_k|\Big|y^t_k-\sum_{j\in J}x^t_{kj}\beta^t_j\Big|.\tag{3}$$

Observe that any feasible solution to (3) is a feasible solution to (2). Let $\bar\beta^t$ be an optimal solution to (3). Then, the objective function value of $\bar\beta^t$ for (2) with the original data is

$$E^t=\sum_{i\in I}\Big|y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\Big|.\tag{4}$$

Next, we present the declustering criteria and the construction of $\{C^{t+1}_k\}_{k\in K^{t+1}}$. Given $\{C^t_k\}_{k\in K^t}$ and $\bar\beta^t$, we define the clusters for iteration $t+1$ as follows.

1. Step 1: Start with an empty cluster set for iteration $t+1$.

2. Step 2: For each $k\in K^t$,

   1. Step 2(a): If the residuals $y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j$ for all $i\in C^t_k$ have the same sign, then keep $C^t_k$ for iteration $t+1$.

   2. Step 2(b): Otherwise, decluster $C^t_k$ into two clusters, $\{i\in C^t_k \mid y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\ge 0\}$ and $\{i\in C^t_k \mid y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j<0\}$, and add both to the cluster set for iteration $t+1$.

The above procedure keeps cluster $C^t_k$ if all original entries in the cluster are on the same side of the regression hyperplane. Otherwise, the procedure splits $C^t_k$ into two clusters, which contain the original entries on one side and on the other side of the hyperplane, respectively. It is obvious that this rule implies a finite algorithm.

In Figure 2, we illustrate AID for LAD. In Figure 2(a), the small circles and crosses represent the original and aggregated entries, respectively, where the large dotted circles are the clusters associated with the aggregated entries. The straight line represents the regression line obtained from an optimal solution to (3). In Figure 2(b), the shaded and empty circles are the original entries below and above the regression line, respectively. Observe that two clusters have original entries below and above the regression line. Hence, we decluster the two clusters based on the declustering criteria and obtain new clusters and aggregated data for the next iteration in Figure 2(c).
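Steps 2(a)-2(b) translate directly into code. Below is a minimal NumPy sketch (the function name is ours), where each cluster is an array of row indices into the original data:

```python
import numpy as np

def decluster_lad(X, y, beta, clusters):
    """One LAD declustering pass: keep a cluster if all residuals
    y_i - x_i @ beta share a sign; otherwise split it into the entries
    above and below the regression hyperplane (Steps 2(a)-2(b))."""
    new_clusters, changed = [], False
    for idx in clusters:
        above = (y[idx] - X[idx] @ beta) >= 0
        if above.all() or (~above).all():
            new_clusters.append(idx)           # Step 2(a): keep the cluster
        else:
            changed = True                     # Step 2(b): split by residual sign
            new_clusters.append(idx[above])
            new_clusters.append(idx[~above])
    return new_clusters, changed
```

Here `changed` doubles as the negated optimality check of Proposition 1: if no cluster is split, every cluster already lies on one side of the hyperplane.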

Now we are ready to present the optimality condition and show that $\bar\beta^t$ is an optimal solution to (2) when the optimality condition is satisfied. The optimality condition presented in the following proposition is closely related to the declustering criteria.

###### Proposition 1.

If the residuals $y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j$ for all $i\in C^t_k$ have the same sign for all $k\in K^t$, then $\bar\beta^t$ is an optimal solution to (2). In other words, if all entries in $C^t_k$ are on the same side of the hyperplane defined by $\bar\beta^t$, for all $k\in K^t$, then $\bar\beta^t$ is an optimal solution to (2). Further, $F^t=E^t=E^*$.

###### Proof.

Let $\beta^*$ be an optimal solution to (2). Then, we derive

$$\begin{aligned}
E^* &= \sum_{i\in I}\Big|y_i-\sum_{j\in J}x_{ij}\beta^*_j\Big| = \sum_{k\in K^t}\sum_{i\in C^t_k}\Big|y_i-\sum_{j\in J}x_{ij}\beta^*_j\Big| \\
&\ge \sum_{k\in K^t}\Big|\sum_{i\in C^t_k}\Big(y_i-\sum_{j\in J}x_{ij}\beta^*_j\Big)\Big| = \sum_{k\in K^t}|C^t_k|\Big|y^t_k-\sum_{j\in J}x^t_{kj}\beta^*_j\Big| \\
&\ge \sum_{k\in K^t}|C^t_k|\Big|y^t_k-\sum_{j\in J}x^t_{kj}\bar\beta^t_j\Big| = \sum_{k\in K^t}\Big|\sum_{i\in C^t_k}\Big(y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\Big)\Big| \\
&= \sum_{k\in K^t}\sum_{i\in C^t_k}\Big|y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\Big| = \sum_{i\in I}\Big|y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\Big| = E^t,
\end{aligned}$$

where the third line holds since $\bar\beta^t$ is optimal to (3), and the fourth line is based on the condition that all observations in $C^t_k$ are on the same side of the hyperplane defined by $\bar\beta^t$, for all $k\in K^t$. Since $\bar\beta^t$ is feasible to (2), clearly $E^*\le E^t$, which shows $E^*=E^t$. This implies that $\bar\beta^t$ is an optimal solution to (2). Observe also that the first expression in the third line is equal to $F^t$. Hence, we also showed $F^t=E^t$. ∎

We also show the non-decreasing property of $F^t$ in $t$ and the convergence.

###### Proposition 2.

We have $F^{t-1}\le F^t$ for $t=2,\dots,T$. Further, $F^t\le E^*$ for all $t$.

###### Proof.

For simplicity, let us assume that $C^{t-1}_1$ is the only cluster in iteration $t-1$ whose entries have residuals of both positive and negative signs, and that $C^{t-1}_1$ is partitioned into $C^t_1$ and $C^t_2$ for iteration $t$. Then, we derive

$$\begin{aligned}
F^{t-1} &= |C^{t-1}_1|\Big|y^{t-1}_1-\sum_{j\in J}x^{t-1}_{1j}\bar\beta^{t-1}_j\Big|+\sum_{k\in K^{t-1}\setminus\{1\}}|C^{t-1}_k|\Big|y^{t-1}_k-\sum_{j\in J}x^{t-1}_{kj}\bar\beta^{t-1}_j\Big| \\
&\le |C^{t-1}_1|\Big|y^{t-1}_1-\sum_{j\in J}x^{t-1}_{1j}\bar\beta^{t}_j\Big|+\sum_{k\in K^{t-1}\setminus\{1\}}|C^{t-1}_k|\Big|y^{t-1}_k-\sum_{j\in J}x^{t-1}_{kj}\bar\beta^{t}_j\Big| \\
&= \Big|\sum_{i\in C^{t-1}_1}\Big(y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\Big)\Big|+\sum_{k\in K^{t-1}\setminus\{1\}}|C^{t-1}_k|\Big|y^{t-1}_k-\sum_{j\in J}x^{t-1}_{kj}\bar\beta^{t}_j\Big| \\
&= \Big|\sum_{i\in C^{t}_1}\Big(y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\Big)+\sum_{i\in C^{t}_2}\Big(y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\Big)\Big|+\sum_{k\in K^{t-1}\setminus\{1\}}|C^{t-1}_k|\Big|y^{t-1}_k-\sum_{j\in J}x^{t-1}_{kj}\bar\beta^{t}_j\Big| \\
&\le \Big|\sum_{i\in C^{t}_1}\Big(y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\Big)\Big|+\Big|\sum_{i\in C^{t}_2}\Big(y_i-\sum_{j\in J}x_{ij}\bar\beta^t_j\Big)\Big|+\sum_{k\in K^{t-1}\setminus\{1\}}|C^{t-1}_k|\Big|y^{t-1}_k-\sum_{j\in J}x^{t-1}_{kj}\bar\beta^{t}_j\Big| \\
&= |C^{t}_1|\Big|y^{t}_1-\sum_{j\in J}x^{t}_{1j}\bar\beta^{t}_j\Big|+|C^{t}_2|\Big|y^{t}_2-\sum_{j\in J}x^{t}_{2j}\bar\beta^{t}_j\Big|+\sum_{k\in K^{t-1}\setminus\{1\}}|C^{t-1}_k|\Big|y^{t-1}_k-\sum_{j\in J}x^{t-1}_{kj}\bar\beta^{t}_j\Big| \\
&= |C^{t}_1|\Big|y^{t}_1-\sum_{j\in J}x^{t}_{1j}\bar\beta^{t}_j\Big|+|C^{t}_2|\Big|y^{t}_2-\sum_{j\in J}x^{t}_{2j}\bar\beta^{t}_j\Big|+\sum_{k\in K^{t}\setminus\{1,2\}}|C^{t}_k|\Big|y^{t}_k-\sum_{j\in J}x^{t}_{kj}\bar\beta^{t}_j\Big| \\
&= F^t,
\end{aligned}$$

where the second line holds since $\bar\beta^{t-1}$ is an optimal solution to the aggregated problem in iteration $t-1$, the fifth line is the triangle inequality, and the seventh line follows from the fact that the clusters indexed by $K^{t-1}\setminus\{1\}$ are exactly the clusters indexed by $K^t\setminus\{1,2\}$. For the cases in which multiple clusters in $K^{t-1}$ are declustered, we can use a similar technique. This completes the proof. ∎

By Proposition 2, in any iteration, $F^t$ can be interpreted as a lower bound to (2). Further, the optimality gap $G^t=E^t_{\min}-F^t$ is non-increasing in $t$, where $E^t_{\min}=\min\{E^1,\dots,E^t\}$ is the best upper bound observed so far.
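The lower-bound behavior ultimately rests on the triangle inequality: within each cluster, $|C^t_k|\,\big|y^t_k-\sum_{j}x^t_{kj}\beta_j\big|=\big|\sum_{i\in C^t_k}(y_i-\sum_{j}x_{ij}\beta_j)\big|\le\sum_{i\in C^t_k}\big|y_i-\sum_{j}x_{ij}\beta_j\big|$, so aggregation can only shrink the LAD objective. A quick numerical check of this inequality on random residuals and an arbitrary partition:

```python
import numpy as np

rng = np.random.default_rng(0)
resid = rng.normal(size=12)                  # stand-ins for residuals y_i - x_i @ beta
clusters = np.array_split(np.arange(12), 3)  # an arbitrary partition into clusters

orig_obj = np.abs(resid).sum()                         # original objective, as in (4)
agg_obj = sum(abs(resid[c].sum()) for c in clusters)   # aggregated objective, as in (3)
assert agg_obj <= orig_obj + 1e-12
```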

### 3.2 Support Vector Machines

One of the most popular forms of support vector machines (SVM) includes a kernel satisfying Mercer's theorem [18] and a soft margin. Let $\phi$ be the mapping function from the $m$-dimensional original feature space to a $p$-dimensional new feature space. Then, the primal optimization problem for SVM is written as

$$E^*=\min_{w,b,\xi}\ \tfrac12\|w\|^2+M\|\xi\|_1\quad\text{s.t.}\quad y_i\big(w\,\phi(x_i)+b\big)\ge 1-\xi_i,\ \ \xi_i\ge 0,\ \ i\in I,\tag{5}$$

where $x$ is the feature data, $y\in\{-1,1\}^n$ is the class (label) data, and $w$, $b$, and $\xi$ are the decision variables. The corresponding dual optimization problem is written as

$$\max_{\alpha}\ \sum_{i\in I}\alpha_i-\tfrac12\sum_{i,j\in I}K(x_i,x_j)\,\alpha_i\alpha_j y_i y_j\quad\text{s.t.}\quad\sum_{i\in I}\alpha_i y_i=0,\ \ 0\le\alpha_i\le M,\ \ i\in I,\tag{6}$$

where $K(\cdot,\cdot)$ is the kernel function. In this case, $\phi(x_i)$ in (5) can be interpreted as new data in the $p$-dimensional feature space with the linear kernel. Hence, without loss of generality, we derive all of our findings in this section for (5) with the linear kernel, while all of the results hold for any kernel function satisfying Mercer's theorem. However, in Appendix B, we also describe AID with direct use of the kernel function.

By using the linear kernel, (5) is simplified as

$$E^*=\min_{w,b,\xi}\ \tfrac12\|w\|^2+M\|\xi\|_1\quad\text{s.t.}\quad y_i(w x_i+b)\ge 1-\xi_i,\ \ \xi_i\ge 0,\ \ i\in I,\tag{7}$$

where $x_i$ now plays the role of $\phi(x_i)$. Since $\|\xi\|_1$ in the objective function of (7) is the summation of absolute values over all $i$ in $I$ and the constraints are defined for each $i$ in $I$, SVM fits (1). Hence, we apply AID to solve (7).

Let us first define the clustering method. The algorithm maintains the invariant that observations with different labels cannot be in the same cluster. We first cluster all data with $y_i=1$, and then we cluster those with $y_i=-1$. Thus we run the clustering algorithm twice. This gives initial clusters $\{C^1_k\}_{k\in K^1}$. Given $\{C^t_k\}_{k\in K^t}$ in iteration $t$, for each $k\in K^t$, we generate aggregated data by

$$x^t_{kj}=\frac{1}{|C^t_k|}\sum_{i\in C^t_k}x_{ij}, \ \text{ for all } j\in J, \quad\text{and}\quad y^t_k=\frac{1}{|C^t_k|}\sum_{i\in C^t_k}y_i.$$

Note that, since each cluster contains observations with the same label, we have

$$y^t_k=y_i\quad\text{for all } i\in C^t_k.\tag{8}$$

By giving weight $|C^t_k|$ to the error term $\xi^t_k$, we obtain

$$F^t=\min_{w^t,b^t,\xi^t}\ \tfrac12\|w^t\|^2+M\sum_{k\in K^t}|C^t_k|\,\xi^t_k\quad\text{s.t.}\quad y^t_k\big(w^t x^t_k+b^t\big)\ge 1-\xi^t_k,\ \ \xi^t_k\ge 0,\ \ k\in K^t,\tag{9}$$

where $w^t$, $b^t$, and $\xi^t$ are the decision variables. Note that the original data in (7) has $n$ entries, whereas the aggregated data has $|K^t|$ entries. Note also that (9) is a weighted SVM [27], where weight $|C^t_k|$ is given to aggregated entry $k$.
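Since (9) is just a weighted soft-margin SVM, any solver that accepts per-entry weights can be used for the aggregated problem. As a self-contained illustration (not the solver used in the paper), here is a primal subgradient sketch of (9) with the linear kernel, where `weights[k]` plays the role of $|C^t_k|$:

```python
import numpy as np

def weighted_svm_subgradient(X, y, weights, M=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on 0.5*||w||^2 + M * sum_k weights[k] * hinge_k,
    with hinge_k = max(0, 1 - y_k * (w @ x_k + b)), cf. problem (9)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                 # aggregated entries with positive hinge loss
        gw = w - M * (weights[active, None] * y[active, None] * X[active]).sum(axis=0)
        gb = -M * (weights[active] * y[active]).sum()
        w -= lr * gw
        b -= lr * gb
    return w, b
```

In the AID iteration, `X` and `y` would hold the aggregated entries $(x^t_k, y^t_k)$ and `weights` the cluster sizes $|C^t_k|$.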

Next we present the declustering criteria and the construction of $\{C^{t+1}_k\}_{k\in K^{t+1}}$. Let $(w^*,b^*,\xi^*)$ and $(\bar w^t,\bar b^t,\bar\xi^t)$ be optimal solutions to (7) and (9), respectively. Given $\{C^t_k\}_{k\in K^t}$ and $(\bar w^t,\bar b^t)$, we define the clusters for iteration $t+1$ as follows.

1. Step 1: Start with an empty cluster set for iteration $t+1$.

2. Step 2: For each $k\in K^t$,

   1. Step 2(a): If (i) $y_i(\bar w^t x_i+\bar b^t)\ge 1$ for all $i\in C^t_k$ or (ii) $y_i(\bar w^t x_i+\bar b^t)\le 1$ for all $i\in C^t_k$, then keep $C^t_k$ for iteration $t+1$.

   2. Step 2(b): Otherwise, decluster $C^t_k$ into two clusters, $\{i\in C^t_k \mid y_i(\bar w^t x_i+\bar b^t)\ge 1\}$ and $\{i\in C^t_k \mid y_i(\bar w^t x_i+\bar b^t)<1\}$, and add both to the cluster set for iteration $t+1$.

In Figure 3, we illustrate AID for SVM. In Figure 3(a), the small white circles and crosses represent the original entries with labels 1 and -1, respectively. The small black circles and crosses represent the aggregated entries, where the large circles are clusters associated with the aggregated entries. The plain line represents the separating hyperplane obtained from an optimal solution to (9), where the margins are implied by the dotted lines. The shaded large circles represent the clusters violating the optimality condition in Proposition 3. In Figure 3(b), below the bottom dotted line is the area such that observations with label 1 (circles) have zero error and above the top dotted line is the area such that observations with label -1 (crosses) have zero error. Observe that two clusters have original entries below and above the corresponding dotted lines. Based on the declustering criteria, the two clusters are declustered and we obtain new clusters in Figure 3(c).
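The condition checked in Step 2 is whether all entries of a cluster lie on one side of the cluster's margin-shifted hyperplane, i.e. $y_i(\bar w^t x_i+\bar b^t)\ge 1$ for all $i$, or $\le 1$ for all $i$. A NumPy sketch of this check and the corresponding split (the function name is ours):

```python
import numpy as np

def decluster_svm(X, y, clusters, w, b):
    """One SVM declustering pass: keep a cluster if all of its margins
    y_i * (w @ x_i + b) lie on one side of 1; otherwise split it there."""
    new_clusters, optimal = [], True
    for idx in clusters:
        margins = y[idx] * (X[idx] @ w + b)
        if (margins >= 1).all() or (margins <= 1).all():
            new_clusters.append(idx)               # optimality condition holds
        else:
            optimal = False                        # split across the margin
            new_clusters.append(idx[margins >= 1])
            new_clusters.append(idx[margins < 1])
    return new_clusters, optimal
```

When `optimal` comes back true, Proposition 3 below guarantees that the converted solution is optimal for the original problem (7).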

Note that a feasible solution to (7) does not have the same dimension as a feasible solution to (9). In order to analyze the algorithm, we convert feasible solutions to (7) and (9) to feasible solutions to (9) and (7), respectively.

1. Conversion from (7) to (9): Given a feasible solution $(w,b,\xi)$ to (7), we define a feasible solution $(\hat w,\hat b,\hat\xi)$ to (9) as follows: $\hat w=w$, $\hat b=b$, and $\hat\xi_k=\frac{1}{|C^t_k|}\sum_{i\in C^t_k}\xi_i$ for $k\in K^t$.

2. Conversion from (9) to (7): Given a feasible solution $(w^t,b^t,\xi^t)$ to (9), we define a feasible solution $(\hat w,\hat b,\hat\xi)$ to (7) as follows: $\hat w=w^t$, $\hat b=b^t$, and $\hat\xi_i=\max\{0,\,1-y_i(w^t x_i+b^t)\}$ for $i\in I$.

Given an optimal solution $(w^*,b^*,\xi^*)$ to (7), by using the above mappings, we denote by $(\hat w^*,\hat b^*,\hat\xi^*)$ the corresponding feasible solution to (9). Likewise, given an optimal solution $(\bar w^t,\bar b^t,\bar\xi^t)$ to (9), we denote by $(\hat w^t,\hat b^t,\hat\xi^t)$ the corresponding feasible solution to (7). The objective function value of $(\hat w^t,\hat b^t,\hat\xi^t)$ to (7) is evaluated by

$$E^t=\tfrac12\|\hat w^t\|^2+M\|\hat\xi^t\|_1.\tag{10}$$

In Propositions 3 and 4, we present the optimality condition and monotone convergence property.

###### Proposition 3.

For all $k\in K^t$, if (i) $y_i(\bar w^t x_i+\bar b^t)\ge 1$ for all $i\in C^t_k$ or (ii) $y_i(\bar w^t x_i+\bar b^t)\le 1$ for all $i\in C^t_k$, then $(\hat w^t,\hat b^t,\hat\xi^t)$ is an optimal solution to (7). In other words, if all entries in $C^t_k$ are on the same side of the margin-shifted hyperplane of the separating hyperplane $\bar w^t x+\bar b^t=0$, then $(\hat w^t,\hat b^t,\hat\xi^t)$ is an optimal solution to (7). Further, $F^t=E^t=E^*$.

###### Proof.

We can derive

$$\begin{aligned}
\tfrac12\|w^*\|^2+M\sum_{i\in I}\xi^*_i &= \tfrac12\|w^*\|^2+M\sum_{k\in K^t}|C^t_k|\,\frac{\sum_{i\in C^t_k}\xi^*_i}{|C^t_k|} \\
&= \tfrac12\|\hat w^*\|^2+M\sum_{k\in K^t}|C^t_k|\,\hat\xi^*_k \\
&\ge \tfrac12\|\bar w^t\|^2+M\sum_{k\in K^t}|C^t_k|\,\bar\xi^t_k \\
&= \tfrac12\|\bar w^t\|^2+M\sum_{k\in K^t}\max\big\{0,\,|C^t_k|-|C^t_k|\,y^t_k(\bar w^t x^t_k+\bar b^t)\big\} \\
&= \tfrac12\|\bar w^t\|^2+M\sum_{k\in K^t}\max\Big\{0,\,|C^t_k|-y^t_k\,\bar w^t\sum_{i\in C^t_k}x_i-y^t_k\,|C^t_k|\,\bar b^t\Big\} \\
&= \tfrac12\|\bar w^t\|^2+M\sum_{k\in K^t}\max\Big\{0,\,\sum_{i\in C^t_k}\big[1-y_i(\bar w^t x_i+\bar b^t)\big]\Big\} \\
&= \tfrac12\|\bar w^t\|^2+M\sum_{k\in K^t}\sum_{i\in C^t_k}\max\big\{0,\,1-y_i(\bar w^t x_i+\bar b^t)\big\} \\
&= \tfrac12\|\hat w^t\|^2+M\sum_{k\in K^t}\sum_{i\in C^t_k}\hat\xi^t_i \\
&\ge \tfrac12\|w^*\|^2+M\sum_{i\in I}\xi^*_i,
\end{aligned}$$

where the second line follows from the definition of $\hat\xi^*$, the third line holds since $(\bar w^t,\bar b^t,\bar\xi^t)$ is an optimal solution to (9), the fourth line is by the definition of $\bar\xi^t$, the fifth line is by the definition of $x^t_k$, the sixth line uses (8), the seventh line is true because of the assumption that all observations in each cluster are on the same side of the margin-shifted hyperplane of the separating hyperplane (optimality condition), so the terms inside the maximum have the same sign, the eighth line is by the definition of $\hat\xi^t$, and the last line holds since $(w^*,b^*,\xi^*)$ is an optimal solution to (7). Observe that the inequalities above must hold at equality. This implies that $(\hat w^t,\hat b^t,\hat\xi^t)$ is an optimal solution to (7). ∎

Because $(\hat w^t,\hat b^t)$ defines an optimal hyperplane, we are also able to obtain a corresponding dual optimal solution $\alpha^*$ for (6). However, unlike the primal optimal solution, $\alpha^*$ cannot be directly constructed from the dual solution of the aggregated problem within the current setting. Within a modified setting presented later in this section, we can explain the relationship between the two dual solutions, the modified optimality condition, the declustering procedure, and the construction of $\alpha^*$.

###### Proposition 4.

We have $F^{t-1}\le F^t$ for $t=2,\dots,T$. Further, $F^t\le E^*$ for all $t$.

###### Proof.

Recall that $(\bar w^{t-1},\bar b^{t-1},\bar\xi^{t-1})$ is an optimal solution to (9) with the aggregated data of iteration $t-1$. Let $(\tilde w^{t-1},\tilde b^{t-1},\tilde\xi^{t-1})$ be a feasible solution to (9) with the aggregated data of iteration $t-1$ such that $\tilde w^{t-1}=\bar w^t$, $\tilde b^{t-1}=\bar b^t$, and $\tilde\xi^{t-1}_k=\max\{0,\,1-y^{t-1}_k(\bar w^t x^{t-1}_k+\bar b^t)\}$ for $k\in K^{t-1}$. In other words, $(\tilde w^{t-1},\tilde b^{t-1},\tilde\xi^{t-1})$ is a feasible solution to (9) with the aggregated data of iteration $t-1$, but generated based on $(\bar w^t,\bar b^t)$. For simplicity, let us assume that $C^{t-1}_1$ is the only cluster declustered in iteration $t-1$ and that it is partitioned into $C^t_1$ and $C^t_2$. The cases in which more than one cluster of $K^{t-1}$ is declustered can be derived using the same technique.

Observe that for each $q\in K^{t-1}\setminus\{1\}$ there exists $k\in K^t\setminus\{1,2\}$ such that

$$\tilde\xi^{t-1}_q=\bar\xi^t_k,\tag{11}$$

and the match between $K^{t-1}\setminus\{1\}$ and $K^t\setminus\{1,2\}$ is one-to-one. This is because the aggregated data for these clusters remains the same and the hyperplane used, $(\bar w^t,\bar b^t)$, is the same in both cases. Hence, we derive

$$\begin{aligned}
F^{t-1} &= \tfrac12\|\bar w^{t-1}\|^2+M|C^{t-1}_1|\,\bar\xi^{t-1}_1+M\sum_{k\in K^{t-1}\setminus\{1\}}|C^{t-1}_k|\,\bar\xi^{t-1}_k \\
&\le \tfrac12\|\tilde w^{t-1}\|^2+M|C^{t-1}_1|\,\tilde\xi^{t-1}_1+M\sum_{k\in K^{t-1}\setminus\{1\}}|C^{t-1}_k|\,\tilde\xi^{t-1}_k \\
F^{t-1} &\le \tfrac12\|\bar w^{t}\|^2+M|C^{t-1}_1|\,\tilde\xi^{t-1}_1
\end{aligned}$$