 # Algorithms for Generalized Cluster-wise Linear Regression

Cluster-wise linear regression (CLR), a clustering problem intertwined with regression, is to find clusters of entities such that the overall sum of squared errors from regressions performed over these clusters is minimized, where each cluster may have different variances. We generalize the CLR problem by allowing each entity to have more than one observation, and refer to it as generalized CLR. We propose an exact mathematical programming based approach relying on column generation, a column generation based heuristic algorithm that clusters predefined groups of entities, a metaheuristic genetic algorithm with adapted Lloyd's algorithm for K-means clustering, a two-stage approach, and a modified algorithm of Späth Spath1979 for solving generalized CLR. We examine the performance of our algorithms on a stock keeping unit (SKU) clustering problem employed in forecasting halo and cannibalization effects in promotions using real-world retail data from a large supermarket chain. In the SKU clustering problem, the retailer needs to cluster SKUs based on their seasonal effects in response to promotions. The seasonal effects are the results of regressions with predictors being promotion mechanisms and seasonal dummies performed over clusters generated. We compare the performance of all proposed algorithms for the SKU problem with real-world and synthetic data.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Clustering is a commonly encountered problem in many areas such as marketing, engineering, and biology, among others. In a typical clustering problem, the goal is to group entities together according to a certain similarity measure. Such a measure can be defined in many different ways, and it determines the complexity of solving the relevant clustering problem. Clustering problems with the similarity measure defined by regression errors is especially challenging because it is coupled with regression.

Consider a retailer that needs to forecast sales at the stock keeping unit (SKU) level for different promotional plans and mechanisms (e.g., 30% off the selling price) using a linear regression model. A SKU is a unique identifying number that refers to a specific item in inventory. Each SKU is often used to identify product, product size, product type, and the manufacturer. Seasonality is an important predictor and is modeled using an indicator dummy input variable for each season, with the length of one season being one week. The usable data for each SKU is limited compared to the possible number of parameters to estimate, among which the seasonality dummies compose a large proportion. More significant and useful statistical results can be obtained by clustering SKUs with similar seasonal effects from promotions together, and estimating seasonality dummies for a cluster instead of a single SKU. However, the seasonal effects of SKUs correspond to regression coefficients, which can only be obtained after grouping SKUs with similar seasonality.

A two-stage method can be used to solve such difficult clustering problems that are intertwined with regression. In the first stage, entities are clustered based on certain approximate measures of their regression coefficients. In the second stage, regressions are performed over the resultant clusters to obtain estimates for the regression coefficients for each cluster. However, good approximate measures are difficult to obtain a priori before carrying out the regressions. A better alternative is to perform clustering and regression simultaneously, which can be achieved through cluster-wise linear regression (CLR), which is also referred to as “regression clustering” in the literature. Other application areas of CLR include marketing, pavement condition prediction, and spatial modeling and analysis. More details about these other application areas can be found in Openshaw , DeSarbo and Cron , DeSarbo , and Luo and Chou .

The CLR problem bears connection to the minimum sum-of-squares clustering (MSSC) problem, the objective of which is to find clusters that minimize the sum of squared distances from each entity to the centroid of the cluster which it belongs to. Contrary to clustering entities directly based on distances, CLR generates clusters according to the effects that some independent variables have on the response variable of a preset regression model. Each entity is represented by a set of observations of a response variable and the associated predictors. CLR is to group entities with similar regression effects into a given number of clusters such that the overall sum of squared residuals within clusters is minimal. Although the MSSC problem has been extensively studied by researchers from various fields (e.g., statistics, optimization, and data mining), the work for the CLR problem is limited, most of which concerns adapting the Lloyd’s algorithm based heuristic algorithms of the MSSC problem to the CLR problem. The Lloyd’s algorithm starts randomly from some initial partition of clusters, then calculates the centroids of clusters, and assigns entities to their closest centroids until converging to a local minimum. Recently, several exact approaches have been proposed by Carbonneau et al

[5, 6, 7], which are discussed in detail in Sections 1.1 and 2.

We tackle the problem of clustering entities based on their regression coefficients by modeling it as a generalized CLR problem, in which we allow each entity to have more than one observation. We propose both a mixed integer quadratic program formulation and a set partitioning formulation for generalized CLR. Our mixed integer quadratic program formulation is more general than the one proposed by Bertsimas and Shioda , which cannot be directly applied to the SKU clustering problem since they assume each clustering entity to have only one observation and this assumption does not hold for the SKU clustering problem. We identify a connection between the generalized CLR and MSSC problems, through which we prove NP-hardness of the generalized CLR problem. Column generation is an algorithmic framework for solving large-scale linear and integer programs. Vanderbeck and Wolsey  and Barnhart et al.  overview column generation for solving large integer program. We design a column generation (CG) algorithm for the generalized CLR problem using its set partitioning formulation. The corresponding pricing problem is a mixed integer quadratic program, which we show to be NP-hard. To handle larger instances in the column generation framework, we also propose a heuristic algorithm, referred to as the CG Heuristic algorithm. This heuristic algorithm, inspired by Bertsimas and Shioda , first clusters entities to a small number of groups and then performs our column generation algorithm on these groups of entities. In addition, we propose a metaheuristic algorithm, named the GA-Lloyd algorithm, which uses an adapted Lloyd’s clustering algorithm to find locally optimal partitions and relies on the genetic algorithm (GA) to escape local optimums. Furthermore, we introduce a two-stage approach, used frequently in practice due to its simplicity, which performs clustering first and regression second. We test our algorithms using real-world data from a large retail chain. We compare the performance of the GA-Lloyd, the CG Heuristic, and the two-stage algorithms on two larger instances with 66 and 337 SKUs, corresponding to two representative subcategories under the retailer’s product hierarchy. We observe that the GA-Lloyd algorithm performs much better than the two-stage algorithm. The CG Heuristic algorithm is able to produce slightly better results than the GA-Lloyd algorithm for smaller instances, but at the cost of much longer running time. The GA-Lloyd algorithm performs the best and identifies distinctive and meaningful seasonal patterns for the tested subcategories. In addition, we find that the column generation algorithm is able to solve the SKU clustering problem with at most 20 SKUs to optimality within reasonable computation time. We benchmark the performance of the GA-Lloyd and CG Heuristic algorithms against the optimal solutions obtained by the column generation algorithm to find that both algorithms obtain close to optimal solutions.

The contributions of our work are as follows.

1. [noitemsep]

2. We are the first to model and solve the SKU clustering problem, commonly encountered in retail predictive modeling, through generalized CLR.

3. We propose four heuristic algorithms for the generalized CLR problem, including the CG Heuristic algorithm, the GA-Lloyd algorithm, the two-stage approach, and a variant of Späth algorithm.

4. We propose an exact column generation algorithm that enables us to evaluate the performance of the heuristic algorithms.

5. We prove NP-hardness of the generalized CLR problem and NP-completeness of the pricing problem of the column generation algorithm.

Note that the number of clusters is a parameter in the generalized CLR problem that needs to be decided by user beforehand or by enumeration. Although we provide comparison of models with different number of clusters for real-world data in Section 4.1.2, it is not straightforward to develop a universal rule for deciding the number of clusters. This is also a hard task for MSSC and CLR. The AIC or BIC criteria did not give a reasonable number of clusters for the data set we used for the experiment. They gave more than three times the number of hand-picked number of clusters that work in practice. Hence, in this paper, we assume that the target number of clusters is given in advance.

In Figure 1, we summarize and compare the terms in the CLR, generalized CLR and the SKU clustering problem. While CLR only has entities, generalized CLR allows multiple observations per entity. The CLR problem can be thought of as the generalized CLR with one observation per entity. Note that entity and observation in generalized CLR are the SKU and transactions in the SKU clustering problem, respectively.

The rest of the paper is organized as follows. In Section 2, we introduce both the mixed integer quadratic program and the set partitioning formulations of the generalized CLR problem. We draw the connection between the generalized CLR and MSSC problems, and prove NP-hardness of the former through this connection. In Section 3, we present the exact column generation algorithm, the CG Heuristic algorithm, the GA-Lloyd heuristic algorithm, the two-stage algorithm, and a variant of the Späth algorithm. The pricing problem of the column generation algorithm is shown to be NP-complete. In Section 4, we present numerical experiments to test the performance of all proposed algorithms. The literature review is discussed next.

### 1.1 Literature Review

To the best of authors’ knowledge, no previous work has been conducted that comprehensibly tackles the generalized CLR problem. However, an extensive collection has been proposed for the typical CLR problem, which can potentially be adapted to tackle the generalized CLR problem.

The algorithms proposed for the typical CLR problem are mainly heuristics bearing close similarity to the algorithms for the MSSC problem. For example, Späth  proposes an exchange algorithm which, starting from some initial clusters, exchanges two items between two clusters if a cost reduction is observed in the objective function. DeSarbo  presents a simulated annealing method to escape local minimums. Muruzabal et al. 

used a self organizing map to perform clusterwise regression.

On mathematical programming-based heuristics, Lau et al. 

propose a nonlinear programming formulation that it is solved approximately using commercial solvers with no guarantee to find a global optimum. Their algorithm’s performance depends heavily on the initial clusters. This initial-cluster dependency is overcome by the K-harmonic means clustering algorithm proposed by Zhang

. Moreover, Bertsimas and Shioda  introduce a compact mixed-integer linear formulation for a slight variation of the CLR problem with the sum of the absolute error as the objective. Their algorithm first divides entities into a small number of clusters, and then feeds these clusters into their mixed integer program.

For exact approaches to CLR, Carbonneau et al.  proposed a mixed logical-quadratic programming formulation by replacing big M constraints with the logical implication of the constraints. Carbonneau et al.  proposed an iterative algorithm based on sequencing the data and repetitive use of a branch and bound algorithm. Carbonneau et al.  proposed a column generation based algorithm based on  and .

There are two key differences between these works and the one we propose in this paper. First, we provide both a quadratic mixed-integer program formulation and a set partition formulation of the generalized CLR problem. The former is a generalization of the formulation in , and the latter is the set partitioning formulation for generalized CLR (recently, Carbonneau et al.  have proposed a set partitioning formulation for CLR). Second, we propose two new heuristics, namely the CG Heuristic algorithm and the GA-Lloyd algorithm for the generalized CLR problem.

There is another stream of research for the CLR problem that assumes a distribution function for regression errors where each entity is assigned to each cluster with a certain probability, i.e., using “soft” assignments. For example, DeSarbo and Cron



propose a finite conditional mixture maximum likelihood methodology, which assumes normal distribution for regression errors and is solved through the expectation maximization algorithm. Since then, a large number of mixture regression models have been developed, including probit and logit mixture regression models as examples. Lau

et al.  compare the performance of the expectation maximization algorithms with their nonlinear programming-based algorithm. Hennig  investigate idenfiability of model-based clusterwise linear regression for consistent estimate of parameters. D’Urso et al.  proposed to integrate fuzzy clustering and fuzzy regression. A recent work of Ingrassia et al.  uses linear cluster-weighted models for clustering regression. These model-based approaches allow residual variances to differ between clusters, which the least squares approaches do not allow. In the soft assignment setting, an entity can be assigned to the cluster of highest probability. We restrict the scope of our review and comparison to least squares approaches because the objective functions are different. The reader is referred to Wedel and DeSarbo  and Hennig  for reviews.

The algorithms for the MSSC problem are instructive to solving the CLR problem. There are abundant papers for solving the MSSC problem. Hansen and Jaumard  survey various forms of clustering problems and their solution methods, including MSSC, from a mathematical programming point of view. In their survey, solution methods for the MSSC problem include dynamic programming, branch-and-bound, cutting planes, and column generation methods. All these algorithms do not scale well to large size instances or in higher dimensional spaces. Heuristics are also considered, including Lloyd’s like algorithms (e.g., K-Means and H-Means) and metaheuristics such as simulated annealing, tabu search, genetic algorithms and variable neighborhood search. With respect to mathematical programming approaches, du Merle et al.  propose an interior point algorithm to exactly solve the MSSC problem. Aloise et al.  improve the algorithm of du Merle et al.  by exploiting the geometric characteristics of clusters, which enables them to solve much larger instances.

## 2 Problem Formulations

### 2.1 Mixed Integer Quadratic Program Formulation

We first provide a mixed integer quadratic formulation for the generalized CLR problem. This formulation reveals a close connection between the generalized CLR and MSSC problems, which enables us to show that the generalized CLR problem is NP-hard.

Consider set of entities. Each entity has observations of dependent variable , and independent variables with for any . In practice the number of entities depends on , but we do not show this dependency for improved readability. (For each integer we introduce .) Observation is associated with independent variables

. Note that vectors are represented in bold symbols. We want to divide these

entities into a partition of clusters where , for any , and . The minimum size of a cluster is , which is set by the user. This implies for any where denotes the cardinality of cluster . Note that the number of observations pertaining to a cluster is at least . The minimum size constraints are imposed to ensure that there are enough observations for each cluster. Further, in order to avoid regression models with zero error, we require . We also require such that there is always a feasible solution. The generalized CLR problem is formulated as follows:

 minI∑i=1L∑l=1t2il (1) til−(yil−J∑j=1βkjxijl)+M(1−zik) ≥0 i∈[I], k∈[K], l∈[L] (2) til+(yil−J∑j=1βkjxijl)+M(1−zik) ≥0 i∈[I], k∈[K], l∈[L] (3) K∑k=1zik =1 i∈[I] (4) I∑i=1zik ≥n k∈[K] (5) zik ∈{0,1} i∈[I], k∈[K] til ≥0 i∈[I], l∈[L] βkj unconstrained k∈[K], j∈[J],

where

is a binary variable, which is equal to one if and only if entity

is assigned to cluster . Value , referred to as big in the optimization literature, is a large positive constant. Due to constraints (2) and (3), is equal to the absolute error for the corresponding observation in the optimal solution, and are the regression coefficients for cluster , which are decision variables. The role of is to enforce constraints (2) and (3) only when they are needed (entity is assigned to cluster ). In detail, if , then we have , and , which implies because we are minimizing the sum of . If , constraints (2) and (3) require to be greater than a negative number, which holds trivially due to the existence of the nonnegativity constraint on . Constraint (4) requires that every entity is assigned to one cluster, and (5) imposes the limit on the cardinality of each cluster.

Unlike the CLR problem, the generalized CLR allows each entity to have more than one observation, which implies that

can be greater than one. The mixed integer linear program formulation for the CLR problem in Bertsimas and Shioda

 has equal to one, and does not have the cluster cardinality constraint (5). Besides, their objective function is the sum of the absolute errors while ours is the sum of squared errors.

Our SKU clustering problem based on the seasonal effects can be modeled as the generalized CLR problem. The entities to cluster are SKUs. The response variable corresponds to a vector of weekly sales for SKU . The independent variables ’s include promotional predictors such as promotion mechanisms, percentage discount, and seasonal dummies for SKU .

Aloise et al.  showed NP-hardness of the MSSC problem in a general dimension when the number of clusters is two. General dimension means that the size of the vectors to be clustered is not a constant but part of the input data. A similar statement can be made for the generalized CLR problem with the proof available in Appendix A.

###### Theorem 1

The generalized CLR problem with two clusters in a general dimension is NP-hard.

With the formulation presented by (1)–(5), we can solve the generalized CLR problem using any commercial optimization software that can handle quadratic mixed integer programs. However, this formulation suffers from two drawbacks, which makes it intractable for large instances. The first one relates to big . Optimality of the solution and efficiency of integer programming solvers depend on a tight value of . Unlike multiple linear regression, where obtaining a valid value of is possible , it is not trivial to calculate a valid value of in (2) and (3) for the generalized CLR or CLR. When , ’s are not from the cluster that entity belongs to, and the residual can be arbitrarily large. Carbonneau et al.  provide an empirical result that a big M based MIP formulation for CLR sometimes fails to guarantee optimality of CLR for the data sets they consider. The second one involves the symmetry of feasible solutions. Any permutation of clusters yields the same solution, yet it corresponds to different decision variables. Symmetry unnecessarily increases the search space, and renders the solution process inefficient. To overcome the symmetry problem, we propose a set partitioning formulation, which has already been used for the CLR problem in .

### 2.2 Set Partitioning Formulation

Let denote the set of all clusters of entities with the cardinality equal to or greater than , i.e., . Let equal to one if entity belongs to cluster , and equal to zero otherwise. Let denote the cost of cluster , which is equal to the sum of squared errors when performing the regression over cluster . Introducing binary variables

 zS={1 if cluster S is selected,0\ otherwise,

the generalized CLR problem can be formulated as:

 min∑S∈S cSzS (6) ∑S∈SzS =K (7) ∑S∈SaiSzS =1 i∈[I] (8) zS ∈{0,1} S∈S.

Constraint (7) ensures that the number of clusters in the partition is and constraint (8) guarantees that each entity occurs in only one cluster within the partition.

## 3 Algorithms

### 3.1 Column Generation (CG) Algorithm

The set partitioning formulation has an exponential number of binary variables. It is very challenging to solve even its linear programming relaxation because there are so many decision variables. To solve large-scale linear and integer programs, column generation algorithms have been used in the literature. The reader is referred to Vanderbeck and Wolsey  and Barnhart et al.  for reviews of column generation for solving large-scale integer programs. In our work, we employ column generation to handle its linear programming relaxation. At the high level, column generation can be understood as iteratively expanding set (a subset of ) in (6) - (8) by adding attractive candidate cluster to . The key challenge is how to select . The word column is used because adding cluster to is equivalent to adding a column in the matrix form of (6) - (8).

The column generation algorithm, referred to as the CG algorithm, starts by solving the restricted master problem which has the same formulation as the master problem (6)-(8), but with set replaced by , a smaller subset of columns. Recall that a column represents a cluster (subset of entities ). We start the algorithm with small candidate clusters rather than , the set of all possible subsets of . The algorithmic framework is presented in Algorithm 1, which follows the general column generation scheme. In Line 1, the initial subset of columns in are randomly generated. In detail, we start from empty clusters. Then, we randomly assign each entity to one of the clusters using a uniform random number. Hence, after Line 1, we have clusters and for the generation procedure. In Line 3, optimal dual variables are obtained by solving the restricted master problem and then serve as input to the pricing problem, which will be introduced hereafter, to calculate the smallest reduced cost column. In Line 4, the pricing problem returns a column with the smallest reduced cost. In Lines 5-10, if the reduced cost is nonnegative, then we conclude that the master problem is solved optimally. Otherwise, we add the column with the smallest reduced cost to the restricted master problem and repeat the process.

#### The pricing problem

The pricing problem can be stated as follows. Let be the dual variable for constraint (7), and ’s be the dual variables for constraint (8). The reduced cost for cluster is , and thus the pricing problem reads:

 min|S|≥n,β∑i∈SL∑l=1(yil−J∑j=1xijlβj)2−∑i∈Sπi. (9)

Note that we omit the subtraction of in the formulation because it is a constant which does not change the optimal solution.

###### Theorem 2

The pricing problem as stated in (9) is NP-complete.

The proof is available in Appendix A. Introducing binary variables

 zi={1 if i∈S,0\ otherwise,

the pricing problem can be formulated as a mixed integer quadratic program:

 minI∑i=1L∑l=1t2il−I∑i=1πizi (10) til−(yil−J∑j=1βjxijl)+M(1−zi) ≥0\ \ i∈[I] , l∈[L] (11) til+(yil−J∑j=1βjxijl)+M(1−zi) ≥0\ \ i∈[I] , l∈[L] (12) I∑i=1zi ≥n (13) til ≥0\ \ i∈[I] , l∈[L] zi ∈{0,1}\ \ i∈[I],

where is a large positive constant and is assumed to be valid (does not cut an optimal solution). We can use the same approach from Section 2.1 to set up a valid value for . A feasible solution’s SSE can be a valid value. By using similar arguments as those for constraints (2) and (3), is the absolute error for the corresponding observation in the optimal solution if , and it is zero otherwise. The difference from the pricing problem in  is that (10)–(13) is based on big M constraints and is for the generalized CLR, while Carbonneau et al.  used logical implications of the constraints for the CLR problem.

In the column generation algorithm, (10)–(13) are solved. Recall that reduced cost for cluster is . It is easy to see that value is equivalent to to the value of (10) with for and 0 otherwise. This follows from the fact that and is modeled by variables .

Example Let us consider a data set with 4 entities and suppose . Then, we have

,

where . In Algorithm 1, suppose we start with subset of , which is a set of initial candidate clusters. The master problem in Line 3 picks the best combination of the candidate clusters that has minimum total SSE. Suppose we obtain in Line 3 together with the associated dual solution. Here we assume that the solution is integral, albeit this might not always be the case. In Line 4, the pricing problem is solved to search if there exists a candidate cluster not in that can improve the current best solution . Suppose the pricing problem returns with a negative reduced cost. In Line 8, is updated to .

#### Column generation stabilization schemes

If the optimal solution obtained by CG is not integral, branching would have to be performed, i.e., a fractional variable needs to be selected and two new problems created, the first one would impose and the other one . However, the extensive evaluation conducted on Algorithm 2 revealed that no fractional solutions were provided by Algorithm 2. For this reason in the remainder we focus on column generation for solving the LP relaxation and not branching. Column generation is known to exhibit the tailing-off effect and for this reason we employ stabilized column generation of du Merle et al. .

The stabilized column generation algorithm for solving the CLR problem is illustrated in Algorithm 2. The algorithm takes input of stabilization parameters and , and maximum allowed iterations . In Line 1, we start with a set of initial clusters of entities. The generation procedure is identical to the one in Line 1 of Algorithm 1. For iteration , in Line 3, we solve the stabilized master problem and get the optimal solution and its corresponding dual solution , which provides input parameters for the pricing problem. The stabilized master problem additionally includes parameters , and variables , but is very similar to (6) - (8). See Appendix C for the actual formulation. By solving the pricing problem, we get a new cluster in Line 4. The reduced cost corresponding to this new cluster is equal to , where is the sum of squared residuals when performing regression over this cluster, and if and only if . In Lines 5-6, if the reduced cost is nonnegative and and are equal to zero, then the algorithm is complete with the optimal partition of clusters defined by . Otherwise, in Lines 8-12, we update and then if the reduced cost is nonnegative, we update the stabilization parameters , and .

### 3.2 CG Heuristic Algorithm

The numerical experiments introduced later reveal that the column generation algorithm does not scale well to problems with a large number of entities. To overcome the scalability problem, we propose a heuristic method called the CG Heuristic algorithm that relies on column generation.

The CG Heuristic algorithm first finds a partition with a large number of clusters by neglecting the cardinality constraint. In the second step, we combine the clusters by considering unions to obtain exactly clusters while obeying the cardinality constraint, which is a slight variant of column generation. We refer to the intermediate clusters from the first part, which are the input to the column generation algorithm in the second part, as groups.

The algorithmic framework is presented in Algorithm 3. We require that since the second step is to combine groups into clusters. Lines 1-5 represent the first step to create groups and Line 6 represents the second step to find a solution to the original problem. Line 1 follows the same procedure as Line 1 of Algorithm 1, except that we have R groups instead of K. Lines 2-5 are basic and do not need further explanations. It yields “low cost” groups. Since , in Line 6 we combine some groups so that we end up with exactly clusters, each one with cardinality at least . This regrouping of groups is performed in an optimal way by using the column generation framework.

For Line 6, we need to revise the master and pricing problems in the following way when we cluster a group of entities instead of single entities. Suppose at the end of Line 5 we clustered entities into groups , and then apply the column generation algorithm to the groups of entities. Let be the set of all subsets of such that , and let if , and otherwise. To obtain the new master problem, we need to replace with and with in the master problem (6)–(8). In addition, the range of constraints (8) changes to .

We denote the dual variables of constraints (8) in the master problem by , and introduce the binary decision variables for to indicate whether group is selected in the cluster with the minimum reduced cost. To obtain the new pricing problem, we need to replace ’s with ’s in the pricing problem (10)–(13). Constraint (13) is changed to , and the range in constraints (11) and (12) now becomes . The new pricing problem has the same number of constraints as the pricing problem (10)–(13), however, it has only binary variables, comparing to such variables in the pricing problem refEq:Pricing–(13).

### 3.3 GA-Lloyd Heuristic Algorithm

Scientific works as those presented by Maulik and Bandyopadhyay  and Chang et al. , effectively suggest to embed the concept of the Lloyd’s algorithm, into a genetic search metaheuristic framework to find proper clusters for the MSSC clustering problem. Here we discuss our proposed adaptation of the GA-based Lloyd’s clustering algorithms for solving the generalized CLR problem.

For the Lloyd’s algorithm part, a vector of regression coefficients is used to represent cluster , and an entity is recursively assigned to the cluster that gives the smallest sum of squared errors for this entity. The GA part helps escape local optimal solutions. The overall algorithmic framework is outlined in Algorithm 4.

In Line 1, we start by randomly generating partitions , each of which corresponds to clusters of entities with for . The generation procedure is based on a uniform random number and is the same as the one in Line 1 of Algorithm 1. Any randomly generated partition has to satisfy the constraint that for . A population consists of chromosomes, and chromosome is encoded as a vector of size . In any chromosome, the first genes represent the regression coefficients for the first cluster, and the next genes represent the regression coefficients for the second cluster, and so on. The encoding of chromosome is illustrated in Figure 2.

The regression coefficient is obtained by running regression over cluster . The fitness of the chromosome is defined to be

 γ(h)=1∑Kk=1∑i∈Ck(h)∑Ll=1(yil−∑Jj=1βkj(h)xilj)2. (14)

We continue by performing the following genetic operations on the population of chromosomes iteratively until the number of iterations without improvement reaches a specified maximum number maxIter. First, in Line 3, we randomly select two parent chromosomes and from population using roulette wheel selection. Chromosome is chosen with probability . Second, in Line 4, we perform crossover on chromosomes and . We select a gene position as a random integer in the range of . We require this random integer to be no more than so that there is at least one gene positioned to the right of it. The portions of the chromosome lying to the right of this gene position are exchanged to produce two child chromosomes and encoded by and . The crossover operation is illustrated in Figure 3.

Third, in Line 5, we perform mutation on these two child chromosomes. The mutation is performed on a child chromosome with a fixed probability , where is a parameter. A gene position with value is randomly picked from the child chromosome using a uniform random number. After mutation, it is changed to with equal probability if is not zero. Here

is a random number with uniform distribution between zero and one. Otherwise, when

is zero, it is changed to with equal probability. In this way, the regression coefficients can take any real values after sufficient number of iterations. Next, in Line 6, we need to decode these two mutated child chromosomes to get the partitions and of clusters they represent. To decode the child chromosome , we assign entity to cluster for

 k∗i=argminkL∑l=1(yil−J∑j=1βkj(ha)xilj)2

Then, we perform regression over each cluster of and , and update the encoding of these two child chromosomes and with the resultant regression coefficients. Fitness and are calculated for the child chromosomes using (14). In Lines 6-7, we replace the chromosome in population with the smallest fitness with the child chromosome with the smaller fitness if .

During the decoding step of the GA-Lloyd algorithm, we may need to adjust the clusters generated in order to satisfy the minimum size constraints. If cluster has size smaller than , then we sort the entities not in in the increasing order of the sum of squared regression errors, and then reassign these entities in the sorted order to cluster until the size of reaches . We also skip each entity that would reduce the size of its original cluster below . When there is more than one cluster with size smaller than , we perform this adjustment for the smallest cluster first.

### 3.4 Two-Stage Heuristic Algorithm for SKU Clustering Problem

Due to its simplicity, two-stage heuristic algorithms are frequently employed in practice for solving the CLR problem. In the first stage, entities are partitioned according to certain approximate measures of the regression coefficients. In the second stage, regression models are built over the resultant clusters. The clustering method for the first stage is usually problem specific.

• Stage 1 Partition entities (SKUs) into clusters. Let be the index set of entities in cluster , .

• Stage 2 For each , build a regression model using entities in .

In this section, we describe our two stage heuristic algorithm for the SKU clustering problem. Recall that we are given

yi weekly sales vector of SKU (entity) i independent variable vector of SKU i and dependent variable j

The first stage of our algorithm is based on hierarchical clustering and simple (one dimensional) regression. In detail,

• [noitemsep]

• We carry out one regression for each SKU to get sales without promotional effects. The dependent variable for the regression is the weekly sales, and the only independent variable is the price discount. Hence, we build the following simple regression

,

for each SKU in and price discount, where is the residual vector. Note that these regressions are one dimensional.

• We construct, for each SKU, a sales vector of dimension 52 (i.e. the number of weeks in a year). The element of the vector records the mean sales without promotional effects of the week of the year, averaged over all years. The mean sales without promotional effects are in fact , price discount, in the previous step.

• We calculate the correlations between any pair of sales vectors constructed in the previous step. The distance between any two SKUs is defined as one minus their corresponding correlation. That is, for any and in , we calculate correlation between and , price discount, and define distance as .

• We perform agglomerative hierarchical clustering over SKUs using distances generated in the previous step. We use the maximum distance between SKUs of each cluster as the distance between any two clusters as in the complete linkage clustering .

### 3.5 Späth Algorithm

The algorithm in Späth  is for clusterwise regression. The algorithm starts with an initial partition of the observations, then continues to move an observation to a different cluster while there is an improvement in the objective function value. The algorithm of Späth  can also be used for the generalized clusterwise regression with a trivial adjustment. For generalized clusterwise regression, the only change is to move an entity (collection of observations) instead of moving an observation. Let . The formal procedure is in the following.

• [noitemsep]

• Step 1 Start with some initial partition of entities such that , , and set .

• Step 2 Let , where is the index of the cluster that belongs to. If and if there exists with such that , then we pick an index and redefine and . In all other cases, set and return to Step 2.

• Step 3 Repeat Step 2 until there is no improvement in Step 2 for consecutive times.

In Step 1, each entity is assigned to a cluster based on a uniformly distributed random number. In Step 2, we move entity to the cluster that reduces the objective function the most. Step 2 is repeated until no entry can be moved with a reduction of the objective function value. In the rest of the paper, we denote this algorithm as Späth.

## 4 Numerical Experiments

In this section, we examine the performance of CG (Algorithm 2), CG Heuristic (Algorithm 3), GA-Lloyd (Algorithm 4), Späth and the two-stage algorithms on the SKU clustering problem according to its seasonal effects. The regression model for this problem has the following form:

 f0(weekly sales)=f1(promotional predictors)+f2(% seasonal predictors),

where the seasonal effects are modeled by 52 dummy variables, one for each week. There are three types of data sets we used in the experiment.

1. Real-world data: We use real-world data from a large retail chain. We omit the exact form of the regression model due to confidentiality. It includes more than two years of aggregated sales and promotional data of the entire chain. The products within this chain are grouped into subcategories for purchasing purposes. However, it is assumed that the products within the same subcategory have different seasonal patterns with regard to promotions. We tested our algorithms on two representative subcategories from the data, a smaller subcategory “Cream” and one of the largest subcategories “Medicines.” Both subcategories have more than one seasonal pattern. Each SKU in these subcategories has at least 52 weeks of data and at most 129 weeks of data from year 2006 to 2008.

2. Synthetic data type 1: We generate random instances that have similar patterns as the real-world data. The promotional predictor includes percentage of discount and the seasonal predictor captures week index. Therefore, the data set includes weekly sales, percentage of discount, and the week index. Each entity has a year of records (52 weeks). For each , we generate 10 random instances, which results into 7 different size data sets of total 70 instances. The detailed instance generation procedure is available in Appendix B.1 and the data set is available at a web site .

3. Synthetic data type 2: We generate random instances based on the procedure for synthetic data type 1. The difference is that we are given the target number of cluster and entities are assigned to a cluster. For this reason, we have a target solution for each instance that is expected to be good. The target solution can be used to evaluate heuristic algorithms. However, in our experiment, we observe that most of the proposed algorithms give better solutions than the target solution in terms of the objective function value. Hence, in this paper, we did not use the target solution for the evaluation of the algorithms. The structure and size of the data are the same as in the synthetic data type 1. The generation procedure is available in Appendix B.2 and the data set is available at a web site.

Note that the real-world data and the implementation of the two stage algorithm had to be destroyed before the publication. For this reason, the implementations used for real-world and synthetic data are slightly different. For the same reason, the computational environments used and performance measures are different for real-world and synthetic data. In Table 1, we summarize which algorithms are used for each data set.

The following computational environments were used for real-world and synthetic data sets.

1. Real-world data: All the algorithms except the two-stage algorithm were implemented in Java 1.6 with CPLEX 11 as the mathematical programming solver. The “lm” function in R 2.8 is used to perform regressions for the GA-Lloyd algorithm. The two-stage algorithm is implemented in R 2.8, and the “hclust” function for hierarchical clustering is employed to perform clustering. All numerical experiments were performed on a 64-bit server with a multi-core Intel Xeon 2 GHz CPU and 10 GB of RAM.

2. Synthetic data types 1 and 2: All the algorithms were implemented in Java 1.7 with CPLEX 12.5 and R 2.8. The experiments were performed on a 64-bit server with a multi-core Intel Xeon 2.8 GHz CPU and 32 GB of RAM.

Note also that, for all experiments in Section 4, specific values for the following parameters of the GA-Lloyd algorithm were selected based on a sensitivity analysis: (population size) and (permutation probability). Furthermore, as a termination criterion for the algorithm we specify that the number of consecutive iterations with no improvement has to reach 50. In the experiments, we use a very large big M relative to the residuals to ensure optimality. On average, is 3000 times larger than the average of absolute residuals and is 30 times the average of the response variable.

### 4.1 Real-World Data

#### 4.1.1 Comparison of the Heuristic Algorithms

Based on preliminary computational studies of the column generation algorithm and the results shown in Section 4.1.4, we observed that the algorithm does not scale well when applied to large size instances. Hence, the heuristic algorithms are employed to cluster SKUs for the “Cream” and “Medicine” subcategories.

In this section we compare the performance of the GA-Lloyd, CG Heuristic, and two-stage algorithms with and in terms of solution time and quality. As the initial important remark, we observe that the GA-Lloyd algorithm performs the best while providing a good balance between time and quality. In the experiments, an instance denoted by “” means that we divided SKUs into clusters. For example, an instance denoted by “66_2” implies that we divided 66 SKUs into two clusters. The stopping criteria for the CG Heuristic algorithm are: (a) an optimal solution is found by the column generation with groups of entities, and (b) a time limit of 10 hours is reached after the last pricing problem. For subcategory “Cream,” the CG Heuristic algorithm applied stopping criterion (a), whereas for the subcategory “Medicine,” it stopped due to (b).

In this section, for any two algorithms and , we use relative improvement (RI%) by from , which is defined by

RI() =

where the value of SSE(algorithm) is the sum of the regression errors across all clusters. Note that RI() 0 means that generates a better solution and RI() 0 implies superiority of .

First, we compare the GA-Lloyd algorithm with the two-stage algorithm. Figure 4 shows RI(Two Stage,GA-Lloyd), which is the relative improvement of the GA-Lloyd algorithm over the two-stage algorithm. We observe that a substantial improvement is achieved by the GA-Lloyd algorithm over the two-stage heuristic. Figure 4: Relative Improvement of GA-Lloyd over Two-Stage for Real-World Data

In terms of the running time, we observe that the two-stage algorithm outperforms the GA-Lloyd algorithm while converging within one and five minutes, respectively, for the “Cream” and “Medicine” subcategories, whereas the GA-Lloyd algorithm took roughly 20 minutes and one hour for the corresponding subcategories. In practice, however, run times of one-hour are acceptable for the SKU clustering problem according to our retail partner.

Figure 5 shows RI(CG Heuristic,GA-Lloyd) with for the CG Heuristic algorithm. We observe that the resulting CostDiff values between the GA-Lloyd and CG Heuristic algorithms are not significant, all within 6%. The CG Heuristic algorithm with generates better solutions than the GA-Lloyd algorithm for five out of eight instances. Among the five instances, in four of them the improvement is barely noticeable (“337_2” is the only case with more pronounced improvement). However, its running time, four to six hours for subcategory “Cream” and 10 hours for subcategory “Medicine”, is much longer than that of the GA-Lloyd algorithm, as illustrated in Figure 6. In summary, it is recommended to select largest within affordable time limit for solving the problem. Figure 5: Relative Improvement of GA-Lloyd over CG Heuristic for Real-World Data

From Figure 5, we also observe that the GA-Lloyd algorithm performs better than the CG Heuristic algorithm with for all but one instance. In addition, the GA-Lloyd algorithm also outperforms the CG Heuristic algorithm in terms of computational times for all these instances, as shown in Figure 6. When comparing the solutions of the CG Heuristic algorithm with and 8 for subcategory “Cream”, Figure 5 shows that a higher value of R does not necessarily imply a better solution.

#### 4.1.2 Seasonal Patterns Identified by GA-Lloyd

Based on the superiority of the GA-Lloyd algorithm over its counterparts shown in the previous section, in this section we present numerical results of the seasonal patterns identified by this algorithm when applied to the largest subcategory, “Medicine”.

Figure 7 shows the seasonal multipliers obtained from the GA-Lloyd algorithm for each cluster when dividing the subcategory “Medicine” into 2, 3, 4 and 5 clusters. From the figure, we observe distinct seasonal patterns: (1) U-shaped curve, (2) inverted V-shaped, and (3) flat. Observe that all of the three seasonal patterns have been found when dividing SKUs into three clusters. This observation indicates that it is not necessary to further divide them into four or five clusters since some seasonal patterns look similar. The three clusters of SKUs represent medicines that intuitively have such different seasonal patterns: one corresponding to medicines (such as for cold and flu) that sell more in the winter, one corresponding to medicines (such as for bug repellents and sunburns) with uplift in the summer, and one corresponding to medicines (such as for diarrhea and constipation) with stable sales year around.

#### 4.1.3 Optimality Gap of GA-Lloyd and CG Heuristic

In this section, we benchmark the performance of the GA-Lloyd and CG Heuristic algorithms by comparing against the CG algorithm, which is an optimal algorithm. In order to measure the performance, we calculate

 OptGap=[SSE({algo})−SSE(Column Generation)SSE(Column Generation)], (15)

where algo CG Heuristic, GA-Lloyd . For the real-world data, note that since the column generation algorithm did not cluster SKUs for the “Cream” and “Medicine” subcategories within a reasonable computational time due to their large sizes, we construct smaller instances with 15 and 20 SKUs that are randomly chosen from the large subcategory “Medicines.” For these instances, we test the algorithms for parameters and . We also studied the instance that divides the subcategory “Cream” with 66 SKUs into two clusters with minimum cluster size , which we refer to as instance in this section. The column generation exact algorithm were executed with 10 hours of time limit for the instance.

Figure 8(a) shows the optimality gap values of the GA-Lloyd algorithm. We observe that the GA-Lloyd algorithm achieves close to optimal solutions, with optimality gaps less than 2%. In addition, the GA-Lloyd algorithm finishes within five minutes for these smaller instances.

For the instance, the solution obtained from the column generation exact algorithm after 10 hours of running time is only 1.47% better than the GA-Lloyd solution, which is obtained in less than 20 minutes. This case is not shown in the figure.

Figure 8(b) shows the optimality gap values of the CG Heuristic algorithm for different values of . We observe that the CG Heuristic algorithm also finds close to optimal solutions, with optimality gaps less than 5%. In addition, the CG Heuristic algorithm finishes within five minutes for the instances of size 15, and within 20 minutes for the instances of size 20. By comparing the solutions for the instance “20_2” for equal 8 and 10, we again find that a larger does not necessary generate a better solution.

For the instance, the solution obtained from column generation exact algorithm after 10 hours of running time is only 0.13%, 1.86%, 0.75% better than the CG Heuristic solution with equal to 6, 8, and 10, respectively. The corresponding running times of the CG Heuristic algorithm are 54, 178, and 365 minutes, respectively. The performance for the 66 SKUs is not depicted graphically.

The average gap in Figure 8(a) is 1.14%, while in Figure 8(b) for it is 1.01%. This indicates that for smaller instances the CG Heuristic algorithm outperforms the GA-Lloyd algorithm if the objective value is the only performance indicator and the computational time is limited to 10 hours. On the contrary, Figure 5 indicates clearly that the GA-Lloyd algorithm suits better for larger instances. This implies that whenever a strict run time limit is imposed, the GA-Lloyd algorithm is very likely to outperform its counterparts for most of the instances.

#### 4.1.4 Time Study of the Column Generation Algorithm

The performance of the Column Generation algorithm (Algorithm 2) is assessed on a set of computational experiments conducted on instances with 15, 20 and 66 SKUs. These instances are chosen in a similar way as in the previous section. Figure 9(a) presents the running time of Algorithm 2 for specific numbers of SKUs and clusters to divide in. We can observe that it takes roughly 3 hours to divide 20 SKUs into two clusters.

For the instance with 66 SKUs and two clusters, we are unable to get an optimal solution after 10 hours. When comparing to the lower bound obtained by solving the linear relaxation of the mixed integer formulation (1)–(5), the solution gotten from column generation after 10 hours of running time is 38.23% larger than the lower bound. However, we suspect this solution is very close to the optimal one because we observe that the minimum reduced cost of the master problem is close to zero.

We also study another version of the generalized CLR problem with the sum of absolute errors as the objective to examine whether the difficulty in solving the pricing problem is due to the nonlinearity of the objective function in the pricing problem (10)-(13). More specifically, we change the objective function in the pricing problem to . Figure 9(b) presents the running time when the objective function for the CLR problem is the sum of absolute errors. From this figure, we observe that the running time also increases dramatically as the number of SKUs increases. It also takes hours to solve the instances with 20 SKUs. Therefore, we believe the nonlinear objective of the pricing problem is not the complicating factor that greatly drives up the computation time. These running times are higher due to a larger number of iterations resulting from degeneracy of LP solutions (the per-iteration time is lower).

### 4.2 Synthetic Data

#### 4.2.1 Optimality Gap of the Heuristic Algorithms

In this section, we test synthetic instances (types 1 and 2) with for parameters and . These instances are (except for type 2 with ) of the size that can be optimally solved within reasonable amount of time. We test the performance of the CG Heuristic, GA-Lloyd, and Späth algorithms by comparing against the solution of CG for the small synthetic instances with and , where the optimality gap defined in (15) is used. We only execute the CG Heuristic algorithm with . For type 2, we did not run CG algorithm for instances with due to excessive computation time. Instead, for instances with , we consider the best objective function value among all algorithms as optimal for calculating the gap. We present the result in Figures 10 and 11 for types 1 and 2 data, respectively. In Figures 10(b) and 11(b), we observe that the running times of the algorithms are of the same magnitude. We also observe that the running time of the CG Heuristic algorithm tends to decrease in and the running time of the GA-Lloyd algorithm tends to increase in . The optimality gap in Figure 10(a) shows that Späth performs best for but the optimality gap drastically increases in increasing . The GA-Lloyd algorithm gives less than 5% optimality gap for all data sizes. Figure 11(a) also shows the same pattern except for one data set with and . Hence, we conclude that the performance of GA-Lloyd is stable and good. The CG algorithm does not perform best for these small instances.

#### 4.2.2 Comparison of the Heuristic Algorithms

In this section, we compare the performance of the CG Heuristic, GA-Lloyd, and Späth algorithms with and for the synthetic instances with . Due to its excessive computational time, we did not run the CG algorithm. Instead, we use

 Gap=SSE({algo})−min{SSE(CG Heuristic),SSE(GA-Lloyd),SSE(Sp{\"{a}}th)}min{%SSE(CGHeuristic),SSE(GA-Lloyd),SSE(Sp{\"{a}}th)}. (16)

We execute the CG Heuristic algorithm with . The result is presented in Figures 12 and 13 for types 1 and 2 data, respectively. From Figures 12(a) and 13(a), we observe that Späth generally performs the best for small . However, the gap drastically increases as increases, and Späth is recommended to be used for small . The performance of the CG Heuristic is competitive for type 1 data and is the best among all for type 2 data. The performance of the GA-Lloyd is not competitive for both types of data sets. The computation times in Figures 12(b) and 13(b) show that the scalability of the algorithms is in the order of CG Heuristic, GA-Lloyd, and Späth. The running time of Späth drastically increases in problem size.

#### 4.2.3 Comparison of the Heuristic Algorithms Over Time

In this section, we compare the performance of the CG Heuristic, GA-Lloyd, and Späth algorithms over time with and for the type 1 synthetic instances with . We present how the best solution of each algorithm is updated over time, while the result in Section 4.2.2 is based on the solutions at the termination.

In Figure 14, we plot the relative gap over time from the best solution obtained in Section 4.2.2. The relative gap at time of an algorithm is obtained by plugging SSE of the algorithm’s current best solution at time in the first term of the denominator of (16). We plot the result for all 9 pairs of and in a 3 by 3 grid, where each subplot’s horizontal and vertical axes are time (in seconds) and relative gap, respectively. In each plot, dark gray, black, and light gray lines correspond to the CG Heuristic, GA-Lloyd, and Späth algorithms, respectively. The result is based on three instances of each pair and each line stops at the average execution time of the corresponding algorithm. Figure 14: Gap from the best objective over time for various K and I

Overall, the CG Heuristic and GA-Lloyd algorithms converge faster than the Späth algorithm. Between the CG Heuristic and GA-Lloyd, it is not easy to decide which algorithm converges faster. When , the CG Heuristic converges and terminates faster than GA-Lloyd. However, as and increase, there is no trend or significant difference in convergence between the CG Heuristic and GA-Lloyd, although GA-Lloyd terminates later. As increases, the performance of the Späth algorithm gets worse, although it performs better than GA-Lloyd at termination when .

#### 4.2.4 Time Study of the Column Generation Algorithm

We present the running time of the CG algorithm for the synthetic data. The running times are of a different magnitude from the one for the real-world data. This is because (1) the computational environments are different and (2) the data has smaller and the number of attributes . In Figure 15, we plot the running time of CG. Recall that we did not run CG algorithm for type 2 data with due to excessive running time. We were not able to get an optimal solution in 10 hours for some of the instances omitted. We observe that the running time drastically increases as increases. The running time decreases in with fixed .

## 5 Conclusions

We propose an exact column generation algorithm, the CG Heuristic algorithm, the GA-Lloyd metaheuristic, and the two-stage algorithm for the resolution of the generalized cluster-wise linear regression problem. We examine the performance of our algorithms on the SKU clustering problem according to seasonal effects using a real-world retail data set from a large retail chain. We find that the column generation exact algorithm can solve small instances to optimality. We use the column generation exact algorithm to benchmark the performance of the GA-Lloyd algorithm and the CG Heuristic algorithm, although the CG algorithm cannot scale well to instances of large sizes. The two-stage algorithm can produce SKU clusters very fast, but with higher objective values than the GA-Lloyd algorithm. The CG Heuristic algorithm performs slightly better than the GA-Lloyd algorithm for some instances, but with much longer running time. The GA-Lloyd algorithm provides a good balance between solution quality and time, and generates SKU clusters with distinctive seasonal patterns efficiently and effectively.

Acknowledgement: We are much obliged to Mr. Molham Aref, the CEO of Predictix Inc, for his assistance during the project. Mr. Aref allowed us to use the data and he approved a summer internship by Yan Jiang.

## References

•  Aloise, D., Deshpande, A., Hansen, P., and Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75:245–248.
•  Aloise, D., Hansan, P., and Liberti, L. (2012). An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming Series A, 131:195–220.
•  Barnhart, C., Johnson E. L., Nemhauser G. L., Savelsbergh, M.W.P., and Vance, P.H. (1998). Branch-and-price: column generation for solving huge integer programs. Operations Research, 46:316–329.
•  Bertsimas, D. and Shioda, R. (2007). Classification and regression via integer optimization. Operations Research, 55:252–271.
•  Carbonneau, R. A., Caporossi, G., and Hansen, P. (2011). Globally optimal clusterwise regression by mixed logical-quadratic programming. European Journal of Operational Research, 212:213–222.
•  Carbonneau, R. A., Caporossi, G., and Hansen, P. (2012). Extensions to the repetitive branch and bound algorithm for globally optimal clusterwise regression. Computers and Operations Research, 39:2748–2762.
•  Carbonneau, R. A., Caporossi, G., and Hansen, P. (2014). Globally optimal clusterwise regression by column generation enhanced with heuristics, sequencing and ending subset optimization. Journal of Classification, 31:219–241.
•  Chang, D., Zhang, X., and Zheng, C. (2009). A genetic algorithm with gene rearrangement for k-means clustering. Pattern Recognition, 42:1210–1222.
•  DeSarbo, W. S. (1989). A simulated annealing methodology for clusterwise linear regression. Psychometrik, 54:707–736.
•  DeSarbo, W. S. and Cron, W. L. (1988). Regression clustering. Journal of Classification, 5:249–282.
•  du Merle, O., Hansen, P., Jaumard, B., and Mladenovi, N. (2000). An interior point algorithm for minimum sum-of-squares clustering. SIAM Journal on Scientific Computing, 21:1485–1505.
•  du Merle, O., Villeneuve, D., Desrosiers, J., and Hansen, P. (1999). Stabilized column generation. Discrete Mathematics, 194:229–237.
•  D’Urso, P., Massari, R., and Santoro, A. (2010). A class of fuzzy clusterwise regression models. Information Sciences, 180:4737–4762.
•  Garey, M. R. and Johnson, D. S. (1979). Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman and Company.
•  Hansen, P. and Jaumard, B. (1997). Cluster analysis and mathematical programming. Mathematical Programming, 79:191–215.
•  C. Hennig (1999). Models and methods for clusterwise linear regression. Proceedings in Computational Statistics.
•  C. Hennig (2000). Idenfiability of models for clusterwise linear regression. Journal of Classification, 17:273-296.
•  Ingrassia, S., Minotti S.C., and Punzoa, A. (2014). Model-based clustering via linear cluster-weighted models. Computational Statistics and Data Analysis, 71:159-182.
•  Johnson, R. and Wichern, D. (2007). Applied multivariate statistical analysis . Pearson.
•  Lau, K., Leung, P., and Tse, K. (1999). A mathematical programming approach to clusterwise regression model and its extensions. European Journal of Operational Research, 116:640–652.
•  Luo, Z. and Chou, E. Y. (2006). Pavement condition prediction using clusterwise regression. In Transportation Research Board 2006 Annual Meeting.
•  Maulik, U. and Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33:1455–1465.
•  Muruzábal, J., Vidaurre, D., and Sánchez, J. (2012). SOMwise regression: a new clusterwise regression method. Neural Computing and Applications, 21:1229–1241.
•  Openshaw, S. (1977). A geographical solution to scale and aggregation problems in region-building, partitioning and spatial modeling. Transactions of the Institute of British Geographers, New Series, 2:459–472.
•  Park, Y.W. and Klabjan, D. (2013). Subset selection for multiple linear regression via optimization.
•  Späth, H. (1979). Clusterwise linear regression. Computing, 22:367–373.
•  Vanderbeck, F. and Wolsey, L. A. (1996). An exact algorithm for IP column generation. Operations Research Letters, 19:151–159.
•  Wedel, M. and DeSarbo, W. S. (1994). Advanced methods of marketing research, chapter A review of recent developments in latent structure regression models, pages 352–388. Blackwell Publishing, London.
•  Zhang, B. (2003). Regression clustering. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM 03), pages 451–458.

## Appendix A Proof of Theorems

### a.1 Proof of Theorem 1

Intuitively, the generalized CLR problem resembles the MSSC problem, which is known to be NP-hard. We conduct a polynomial transformation from MSSC to a special case of the CLR problem as follows.

Consider an instance of the MSSC problem with entities. Each entity has an associated vector . Let vector be the observations of the dependent variable for entity

, and let an identity matrix of size

be the observations of independent variables , which means

 xi=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣10…001…0⋮⋮⋱⋮00…1⎤⎥ ⎥ ⎥ ⎥ ⎥⎦. (17)

This yields an instance of the generalized CLR problem of dimension . The regression coefficient for cluster is the centroid of the entities assigned to cluster with

 βkl=∑i∈Ckyil|Ck|.

This proves NP-hardness of the generalized CLR problem.

### a.2 Proof of Theorem 2

In this section, we show that the pricing problem is NP-complete.

Let us consider a special case of the pricing problem with the observations of the independent variables being an identity matrix as in (17). Then the pricing problem becomes

 min|S|≥n,β∑i∈SL∑l=1(yil−βl)2−∑i∈Sπi.

Given any cluster such that , by equating the first order derivative of the pricing objective function to zero, we obtain the optimal as the centroid of vector :

 βl(S)=∑i∈Syil|S|.

The Huygen’s theorem states that for a given set of vectors , the sum of squared distances to the centroid is equal to the sum of squared distances between these vectors divided by two times the cardinality of the set, which mathematically stated reads

 ∑i∈S∑j∈S,j≠i||ui−uj||22=2|S|∑i∈S||ui−¯u(S)||22

where and . Based on Huygen’s theorem, this special case of the pricing problem can also be stated as:

 min|S|≥n∑i∈S∑j∈S,j≠i||yi−yj||22−2|S|∑i∈Sπi.

By using a transformation from the independent set problem, , we show that this special case of the pricing problem with this formulation is NP-complete, which implies the pricing problem is NP-complete.

Let us now formally prove the theorem. We first introduce the independent set problem (), a known NP-complete problem, which is used to prove that the pricing problem is NP-complete.
Instance: Graph , and a positive integer ;
Question: Does contain an independent set of size , i.e., a subset with such that no two vertices in are joined by an edge in ?

We next show, through a polynomial reduction from the independent set of size problem that a constrained version of the pricing problem, which we refer to as “the subset of size problem,” is NP-complete. The subset of size problem is as follows.
Instance: vectors of dimension (i.e., ), real numbers for , and another real number ;
Question: Is there a subset of vectors with cardinality such that

 ∑i∈S∑j∈S,j≠i∥ui−uj∥22−∑i∈Sπi≤K?
###### Lemma 3

The subset of size problem is NP-complete.

Proof: We show NP-completeness of the subset of size problem using its relationship with the independent set of size problem. Consider an instance of the independent set of size problem with graph . To each node , we assign a vector of size . For , we have

 uij=⎧⎪⎨⎪⎩1,if edge (i,j)∈E and i

Let be the degree of node . If node is connected to node by edge , then

 ||ui−uj||22 =(ki−1)+(kj−1)+(1−(−1))2 =ki+kj+2,

and otherwise,

 ||ui−uj||22=ki+kj.

Let and . We next show that to answer the question whether there is a subset with such that is equivalent to answering the question whether there is an independent subset of size .

If there is an independent subset with , then . If there does not exist an independent subset with , then . Here the first inequality is because there are at least two nodes that are connected by an edge belonging to the subset .

We now show NP-completeness of the pricing problem by polynomially transforming the subset of size problem to this problem. The decision version of our pricing problem is as follows.
Instance: vectors