Triangle Lasso for Simultaneous Clustering and Optimization in Graph Datasets

08/20/2018 · by Yawei Zhao, et al.

Recently, network lasso has drawn much attention due to its remarkable performance on simultaneous clustering and optimization. However, it usually suffers from imperfect data (noise, missing values, etc.) and yields sub-optimal solutions. The reason is that it finds similar instances according to their features directly, which are easily impacted by the imperfect data, and thus it returns sub-optimal results. In this paper, we propose triangle lasso to avoid this disadvantage. Triangle lasso finds similar instances according to their neighbours: if two instances have many common neighbours, they tend to become similar. Even when some instances are profiled with imperfect data, triangle lasso is still able to find their similar counterparts. Furthermore, we develop an efficient algorithm based on the Alternating Direction Method of Multipliers (ADMM) to obtain a moderately accurate solution. In addition, we present a dual method to obtain the accurate solution with low additional time consumption. We demonstrate through extensive numerical experiments that triangle lasso is robust to imperfect data. It usually yields better performance than the state-of-the-art methods when performing data analysis tasks in practical scenarios.


1 Introduction

Finding the similarity among instances while conducting data analysis via convex optimization has attracted much attention in recent years. Let us take an example to explain this kind of task. Consider the price prediction of houses in New York, and suppose that we use ridge regression to conduct the prediction. We need to learn the weights of features (cost, area, number of rooms, etc.) for each house. The prices of houses situated in the same district should be predicted using similar or identical weights because they share the same location-based factors, e.g. the school district. However, such location-based factors are usually difficult to quantify as additional features. It is therefore challenging to predict house prices under those location-based factors. Recently, network lasso was proposed to conduct this kind of task, and it yields remarkable performance [1].

However, it is worth noting that some features of an instance are usually missing, noisy or unreliable in practical scenarios; we collectively refer to these as imperfect data in this paper. For instance, the true cost of a house is usually a core secret of a company, which cannot be obtained in many cases. The market expectation is usually not stable, and fluctuates over time. Network lasso suffers from such imperfect data (noise, missing values, etc.) and yields sub-optimal solutions. One of the reasons is that it uses those features directly to learn the unknown weights, whose accuracy is therefore impaired by the imperfect data. It is thus challenging to learn the correct weights for a house and make a precise prediction. It is therefore valuable to develop a robust method that handles imperfect data.

Many excellent studies have been conducted and have obtained impressive results. There are pioneering works on convex clustering [2, 3]. [2] focuses on finding and removing outlier features. [3] finds and removes uninformative features in a high-dimensional clustering scenario. Assuming that the targeted features are sparse, these works successfully find and remove them via a sparsity-inducing regularization. However, their methods rely on an extra hyper-parameter for the regularization term, and the extra need-to-tune hyper-parameter limits their usefulness in practical tasks. As an extension of convex clustering, network lasso was proposed to conduct clustering and optimization simultaneously for large graphs [1]. It formulates the empirical loss of each vertex into the loss function and each edge into the regularization. If imperfect data exist, the formulations of the vertices and edges are inaccurate, and due to this inaccuracy network lasso returns sub-optimal solutions. Among the pioneering works, [4] investigates the conditions on the network topology under which network lasso yields an accurate clustering result. However, given a network topology, it is still not able to handle vertices with imperfect data. Additionally, the previous methods are not efficient, which impedes their use in practical scenarios. In a nutshell, it is important to propose a method that is robust to imperfect data and meanwhile yields the solution efficiently.

In this paper, we introduce triangle lasso to conduct data analysis and clustering simultaneously via convex optimization. Triangle lasso re-organizes a dataset as a graph or network (the terms graph and network are used interchangeably in this paper). Each instance is represented by a vertex. If two instances are closely related in a data analysis task, they are connected by an edge. Here, "related" has various meanings for specific tasks; for example, two vertices may be connected if one is among the nearest neighbours of the other. Our key idea is illustrated in Fig. 1. If a vertex and one of its direct neighbours share a common neighbour, a triangle exists, which implies that the two vertices may be similar. If two vertices appear in multiple triangles, they tend to be very similar because they have many shared neighbours. Benefiting from these triangles, triangle lasso is robust to imperfect values. Even if a vertex has some noisy values, we can still find its similar counterparts via their shared neighbours.

Fig. 1: The noisy vertex is more similar to the vertex with which it has two shared neighbours than to the other one.

It is worth noting that the neighbouring information of a vertex is formulated into a sum-of-norms regularization in triangle lasso. Solving triangle lasso efficiently is challenging for three reasons. First, the regularization is non-separable with respect to the weights of adjacent vertices; if a vertex has a large number of neighbours, it is time-consuming to obtain its optimal weights. Second, the objective function is non-smooth at the optimum when more than one vertex belongs to the same cluster. In triangle lasso, if two vertices belong to the same cluster, their weights are identical, but the sum-of-norms regularization implies that the objective function is non-differentiable in that case; there usually exist a large number of such non-smooth points for a specific task. Third, we have to optimize a large number of variables, proportional to the number of instances times the number of features. In this paper, we develop an efficient method based on ADMM to obtain a moderately accurate solution. After that, we transform triangle lasso into an easy-to-solve Second-Order Cone Programming (SOCP) problem in the dual space, and propose a dual method to obtain the accurate solution efficiently. Finally, we use the learned weights to conduct various data analysis tasks. Our contributions are outlined as follows:

  • We formulate the triangle lasso as a general robust optimization framework.

  • We provide an ADMM method to yield a moderately accurate solution, and develop a dual method to obtain the accurate solution efficiently.

  • We demonstrate that triangle lasso is robust to the imperfect data, and yields the solution efficiently according to empirical studies.

The rest of the paper is organized as follows. Section 2 outlines the related work. Section 3 presents the formulation of triangle lasso. Section 4 presents our ADMM method, which obtains a moderately accurate solution. Section 5 presents the dual method, which obtains the accurate solution. Section 6 discusses the time complexity of our proposed methods. Section 7 presents the evaluation studies. Section 8 concludes the paper.

2 Related Work

Recently, there has been a great deal of excellent research on simultaneous clustering and data analysis, with impressive results.

2.1 Convex clustering

As a special case of triangle lasso, convex clustering has drawn much attention [5, 6, 7, 8, 9, 2, 3]. [5] proposes a new stochastic incremental algorithm to conduct convex clustering. [6] proposes a splitting method to conduct convex clustering via ADMM. [7] proposes a reduction technique to conduct graph-based convex clustering. [8] investigates the statistical properties of convex clustering. [9] formulates a new variant of convex clustering, which conducts clustering on instances and features simultaneously. In contrast to our methods, those works focus on improving the efficiency of convex clustering and cannot handle imperfect data. [2] uses a sparsity-inducing regularization to pick out noisy features when conducting convex clustering. [3] removes sparse outlier or uninformative features when conducting convex clustering. However, both of them use more than one convex regularization term in the formulation, which requires tuning multiple hyper-parameters in practical scenarios. Specifically, the previous methods, including [2, 3], aim to be robust to imperfect data by adding a new regularization term, e.g. a sparsity-inducing norm penalty, to obtain a sparse solution. Although this is effective, the newly-added term requires tuning an extra hyper-parameter, which limits its use in practical scenarios.

2.2 Network lasso

As an extension of convex clustering, network lasso conducts clustering and optimization simultaneously [1, 10, 4]. As a general framework, network lasso yields remarkable performance in various machine learning tasks [1, 10]. However, its solution is easily impacted by imperfect data, which leads to sub-optimal results in practical tasks. [4] investigates the network topology in order to obtain an accurate solution. Triangle lasso aims to obtain a robust solution with inaccurate vertices for a known network topology, which is orthogonal to [4].

Fig. 2: The prediction of house prices in the Greater Sacramento area; panels (a) and (b) show the clusterings obtained for two values of the regularization coefficient. As the regularization coefficient increases, more houses are fused into a cluster. The houses located in the same cluster use an identical weight to predict their prices.

3 Problem formulation

In this section, we first present the formulation of triangle lasso. Then, we instantiate it in some applications and present the result on a demo example. After that, we present the workflow of triangle lasso. Finally, we show the symbols used in the paper and their notations.

3.1 Formulation

We formulate the triangle lasso as an unconstrained convex optimization problem:

Given a graph, the vertex set contains the vertices and the edge set contains the edges. Each instance has a response, and each edge has a weight, which can be defined specifically according to the task at hand; the regularization coefficient controls the strength of the regularization. Note that the response term vanishes in unsupervised learning tasks such as clustering, because there is no response for an instance in such tasks. Each vertex has a neighbour set and an empirical loss. The regularization can take various formulations; in this paper, we focus on the sum-of-norms regularization, i.e.,

Since the regularization is a sum of l2 norms over the rows, we refer to it as a sum-of-norms (group) penalty. Given the vertices and edges, define an auxiliary matrix with one row per edge: the only non-zero elements of a row correspond to the two endpoints of that edge, and their magnitude combines the regularization coefficient, the edge weight, and a term counting the common neighbours of the two endpoints,

where that count is a known integer for a known graph. Triangle lasso is finally formulated as:

(1)

Note that the regularization coefficient is a hyper-parameter of triangle lasso, and it can be varied in order to control the similarity between instances. Additionally, the global minimum of (1) is the optimal weight matrix, whose i-th row contains the optimal weights for the i-th instance. It is worth noting that the regularization encourages similar instances to use similar or even identical weights. If some rows of the solution are identical, the corresponding instances belong to the same cluster. As illustrated in Figure 2, we can obtain different clustering results by varying the regularization coefficient (the details are presented in the empirical studies). When it is very small, each vertex represents its own cluster; as it increases, more vertices are fused into a cluster.
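To make the construction of the auxiliary matrix concrete, the following sketch builds one row per edge and places the two non-zero entries at the columns of the edge's endpoints. The per-edge coefficient passed in is a stand-in: in triangle lasso it would combine the regularization coefficient, the edge weight and the common-neighbour term described above, and all names here are illustrative rather than the authors' notation.

import numpy as np

def edge_incidence_matrix(edges, coeffs, n_vertices):
    # One row per edge (j, k); its only non-zero entries sit in columns j and k.
    # `coeffs` holds a hypothetical per-edge coefficient (regularization coefficient
    # times edge weight times a common-neighbour factor, for instance).
    D = np.zeros((len(edges), n_vertices))
    for row, ((j, k), c) in enumerate(zip(edges, coeffs)):
        D[row, j] = c
        D[row, k] = -c
    return D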

We explain the model using an example illustrated in Figure 3, in which one vertex is profiled with noisy data. As shown above, we obtain the auxiliary matrix for this graph.

The corresponding regularization term is:

For simplicity, we consider the case of unweighted edges and further fix the regularization coefficient, so that the auxiliary matrix is

Fig. 3: If an edge appears in many triangles, its vertices are penalized more than other vertices. Thus, their weights tend to be more similar, or even identical, than others.

As illustrated in Figure 3, three of the vertices form a triangle. The differences between their weights are penalized more than the other differences. This larger penalization on the weight differences pulls the weight of the noisy vertex close to those of its two neighbours in the triangle; although the vertex is profiled with noisy data, we can still find its similar counterparts. More generally, if some instances have missing values, those values are usually filled by the mean value, the maximal value, or the minimal value of the corresponding features, or by a constant. Compared with the true values, those estimated values introduce noise. The noise impairs the performance of many classic methods that conduct data analysis on those values directly. Note that triangle lasso does not only use the values, but also uses the relations between different instances. If two vertices have many common neighbours in the graph, they tend to be similar even though they are represented by noisy values. That is the reason why triangle lasso is robust to imperfect data.

Triangle lasso is a general and robust framework to simultaneously conduct clustering and optimization for various tasks. The whole workflow is presented in Figure 4. First, a graph is constructed to represent the dataset. Second, we obtain a convex optimization problem by formulating a specific data analysis task in the triangle lasso form. Third, we solve the problem with one of the two methods we provide. Finally, we obtain the solution of triangle lasso and use it to complete the data analysis task. Note that graph or network datasets are the main target of triangle lasso; for a graph or network dataset, the graph is available directly. Otherwise, we represent the dataset as a graph as follows.

Case 1: If the dataset does not contain imperfect data (missing, noisy, or unreliable values), we run the k-Nearest Neighbours (KNN) method to find the k nearest neighbours of each instance. After that, we obtain the graph by the following rules.

  • Each instance is denoted by a vertex.

  • If an instance is one of the k nearest neighbours of the other instance, then the vertices corresponding to them are connected by an edge.

Case 2: If the dataset contains imperfect data, or contains redundant features in a high-dimensional scenario, we first run dimension-reduction methods such as Principal Component Analysis (PCA) or feature selection to improve the quality of the dataset. Then, as above, we use KNN to find the k nearest neighbours of each instance and obtain the graph. The procedure is suitable for both supervised and unsupervised learning. Additionally, the auxiliary matrix in (1) plays an essential role in triangle lasso. Each of its elements combines the regularization coefficient, the edge weight and a common-neighbour term. The regularization coefficient is a hyper-parameter which needs to be given before optimizing the formulation, while the edge weight and the common-neighbour term are determined by the graph; when a dataset is represented as a graph, we need to determine both of them in order to obtain the auxiliary matrix. The edge weight measures the importance of an edge. Some works recommend a Gaussian-kernel weight governed by a non-negative constant [6, 11, 9]: when the constant is zero, the weights are uniform, and otherwise the weight is a Gaussian kernel of the distance between the two endpoints. Besides, the common-neighbour term measures the similarity of two nodes due to their common adjacent nodes.
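As a rough illustration of this graph-construction step (a sketch, not the authors' code), the snippet below connects each instance to its k nearest neighbours, attaches a Gaussian-kernel edge weight governed by an assumed constant theta (theta = 0 gives uniform weights), and counts the common neighbours of each edge's endpoints; all names are illustrative.

import numpy as np

def build_knn_graph(X, k=5, theta=0.5):
    # Pairwise Euclidean distances between the instances (rows of X).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = X.shape[0]
    neighbours = [set(np.argsort(dist[i])[1:k + 1]) for i in range(n)]
    edges, weights, common = [], [], []
    for j in range(n):
        for l in range(j + 1, n):
            # Connect j and l if either is among the other's k nearest neighbours.
            if l in neighbours[j] or j in neighbours[l]:
                edges.append((j, l))
                weights.append(np.exp(-theta * dist[j, l] ** 2))  # theta = 0 -> uniform weights
                common.append(len(neighbours[j] & neighbours[l]))
    return edges, np.array(weights), np.array(common)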

Fig. 4: The illustration of the workflow of triangle lasso.

3.2 Applications

The previous approaches, including network lasso and convex clustering, are special cases of triangle lasso. If we drop the triangle (common-neighbour) information, triangle lasso degenerates to network lasso; with a further restriction on the vertex loss, it degenerates to convex clustering. Specifically, we take ridge regression and convex clustering as examples to illustrate triangle lasso in more detail.

Ridge regression. In a classic ridge regression task, the loss function is the squared prediction error over the instances plus a squared-norm penalty on the weights.

Here, each row of the data matrix is an instance, each instance has a response, and the weights are to be learned; the ridge coefficient that avoids overfitting is a hyper-parameter introduced by the formulation of ridge regression, not by triangle lasso. We thus instantiate (1) as follows:

the weight matrix stacks one weight vector per instance, so that its i-th row is the weight of the i-th instance, and the vertex loss of each instance is its ridge loss evaluated with its own weight vector.
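A minimal sketch of the instantiated objective under these assumptions (one weight row per instance, per-edge coefficients already folded into edge_coeffs; the variable names are hypothetical):

import numpy as np

def triangle_lasso_ridge_objective(W, X, y, mu, lam, edges, edge_coeffs):
    # Per-vertex ridge loss: squared prediction error plus an l2 penalty on each row of W.
    loss = sum((y[i] - X[i] @ W[i]) ** 2 + mu * (W[i] @ W[i]) for i in range(X.shape[0]))
    # Sum-of-norms coupling over the graph edges, with a per-edge coefficient.
    reg = sum(c * np.linalg.norm(W[j] - W[k]) for (j, k), c in zip(edges, edge_coeffs))
    return loss + lam * reg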

Convex clustering. In a convex clustering task, the loss function measures the distance between each instance and its need-to-learn weight vector, and the regularization coefficient controls the number of clusters. Note that this coefficient is a parameter of the convex clustering formulation itself, which can be varied to control the number of clusters. Different from the ridge regression case, it is usually increased heuristically in order to obtain a cluster path. Additionally, the i-th row of the optimal weight matrix serves as the label of the i-th instance: if two instances have identical labels, they belong to the same cluster. Plugging this loss into (1), the final triangle lasso formulation of convex clustering is:

Note that triangle lasso generally outperforms classic convex clustering in recovering the correct cluster membership. We explain this from two views.

  • Intuitively, triangle lasso uses the sum-of-norms regularization to obtain the clustering result, which is similar to convex clustering. On the other hand, triangle lasso considers the neighbouring information of the vertices and uses it in the regularization. Since network science has shown that the neighbours of a vertex are essential to measuring its importance in a graph [12], triangle lasso has an advantage over classic convex clustering in finding the similarity among instances.

  • Mathematically, triangle lasso gives a larger weight to a regularization term (see equation (3.1)) when the corresponding vertices have many common neighbours. That is, such a term is penalized more than the other terms during the optimization procedure, which pushes those vertices to be similar or even identical. This is different from convex clustering, which treats every regularization term equally and thus ignores the neighbouring relationship.

Demo example. To make this clearer, we take house price prediction as a demo example to explain triangle lasso; this example is one of the empirical studies in Section 7. We need to predict the price of houses in the Greater Sacramento area by using a ridge regression model, and our target is to learn the weight for each house. Generally, houses located in the same district should use similar or identical weights, while those located in different districts should use different weights. As illustrated in Figure 2, triangle lasso yields a weight for each house, and those weights can be used as labels to obtain multiple clusters. The houses belonging to the same cluster use an identical weight. We can adjust the regularization coefficient to obtain different numbers of clusters; as it increases, more houses are fused into a cluster.
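The clustering step in the demo can be read directly off the solution: rows of the weight matrix that coincide (numerically) are assigned to the same cluster. A simple sketch with a user-chosen tolerance and hypothetical names:

import numpy as np

def rows_to_clusters(W, tol=1e-4):
    # Two instances join the same cluster when their weight rows agree up to `tol`.
    labels = -np.ones(W.shape[0], dtype=int)
    centers = []
    for i, w in enumerate(W):
        for c, center in enumerate(centers):
            if np.linalg.norm(w - center) <= tol:
                labels[i] = c
                break
        else:
            centers.append(w.copy())
            labels[i] = len(centers) - 1
    return labels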

3.3 Symbols and their notations

To make the paper easy to read, we present the symbols and their notations in Table I. Since vector operations are usually easier to understand and perform than matrix operations, we tend to use the vector form in place of the equivalent matrix form throughout the paper. In other words, when we need to handle a matrix, we usually use its column-stacking vectorization in place of the matrix itself; for example, when we need the gradient with respect to a matrix, we usually take the gradient with respect to its vectorization for simplicity. In this paper, a matrix is viewed as equivalent to its vectorization, because we can transform between them without any ambiguity. Finally, we use the same notation in both supervised and unsupervised learning tasks for mathematical brevity.

Symbols Notations
The vertex set containing vertices
The th vertex
The edge set containing edges
The edge connecting and
The data matrix
The -th instance
The response of
The weight matrix
The weights of the
The neighbours of
The weight corresponding to the edge
The convex conjugate of
The column stacking vectorization
The kronecker product
The element-wise product
The norm of a matrix (by default)
The Frobenius norm of a matrix
The dual norm
The matrix whose elements are
The unit matrix
The regularization coefficient
The th iteration of ADMM
The step length of ADMM
The sub-gradient operator
The dual variable
The th row of
Prox() The proximal operator
TABLE I: The symbols and their notations

4 ADMM method for the moderately accurate solution

In this section, we present our ADMM method to solve triangle lasso. First, we present the details of our ADMM method as a general framework. Second, we present an example to make our method easy to understand. Finally, we discuss the convergence and the stopping criterion of our method.

4.1 Details

Before presenting our method, we re-formulate the unconstrained optimization (1) as an equivalent constrained problem:

(3)

subject to:

where the constraint links the weight matrix to the auxiliary variable. Suppose a Lagrangian dual variable of matching size is introduced for this constraint. The augmented Lagrangian is:

where the penalty parameter is a positive number.

Update of the weight matrix. The basic update of the weight matrix is:

where the superscript denotes the iteration index. Discarding constant terms, we obtain

(4)

The quadratic penalty term in (4) is strongly convex and smooth. Therefore, the hardness of this update is dominated by the vertex loss.

Convex case. If the vertex loss is convex, the sub-problem (4) is convex too. Thus, it is not difficult to obtain the new weight matrix by solving the convex optimization (4). For a general convex case, we can update it by solving the following equality:

where the sub-gradient operator appears because (4) may be non-smooth.

Non-convex case. If the vertex loss is non-convex, (4) may not be convex. Thus, the global minimum of (4) is not guaranteed, and we may have to settle for a local minimum. Considering that non-convex optimization may be much more difficult than the convex case, this update may be time-consuming. Since the Lagrangian dual of (4) is always convex, we instead update the weight matrix via the dual problem of (4).

Before presenting the method, we transform (4) into an equivalent constrained problem:

subject to:

Its Lagrangian is:

where the column-stacking vectorization of a matrix is used. Therefore, the Lagrangian dual is:

which involves the convex conjugates of the two parts of the objective. Generally, the convex conjugate of a function f is defined as f*(s) = sup_x ( <s, x> - f(x) ). Thus, the dual problem is:

Since the dual problem is always convex, it is easy to obtain its global minimum. According to the KKT conditions, we then recover the primal update by solving the corresponding stationarity condition.

In some non-convex cases of the vertex loss, we can still obtain the global minimum when there is no duality gap, i.e. under strong duality. There are various methods to verify whether there is a duality gap; this is out of the scope of the paper, and we refer readers to the related books [13].

Update of the auxiliary variable. The regularization is a sum-of-norms function, which is convex but not smooth: it is not differentiable when any two rows of the weight matrix are identical. Unfortunately, we encourage rows to become identical in order to find similar instances. Therefore, it is non-trivial to obtain the global minimum in triangle lasso. In this paper, we obtain a closed-form update via the proximal operator of a sum-of-norms function.

The basic update of the auxiliary variable is

Discarding the constant terms, we obtain

the proximal operator of the regularization with an appropriate coefficient, where the proximal operator of a function f with coefficient t is defined as prox_{tf}(a) = argmin_x f(x) + (1/(2t)) ||x - a||^2. Since the regularization is a sum-of-norms function, its proximal operator has a closed form [14], that is:

each row of the input is rescaled by the non-negative part of a shrinkage factor. Here, the subscript '+' denotes the non-negative part, and a row subscript denotes the corresponding row of a matrix. If a shrinkage factor is negative, it is set to zero (so the row collapses to zero); otherwise, the positive value is kept.
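A minimal sketch of this closed form (row-wise block soft-thresholding), assuming the rows of the input matrix are the groups and t is the coefficient in front of the sum-of-norms term:

import numpy as np

def prox_sum_of_norms(A, t):
    # Shrink each row of A towards zero by the factor (1 - t / ||row||)_+ .
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return scale * A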

Update of the dual variable. The dual variable is updated by the following rule:

(5)

4.2 Examples

To make our ADMM easy to understand, we take ridge regression as an example to show the details. As shown in Section 3, the optimization objective function is:

The vertex loss is convex and smooth. Therefore, the update of the weight matrix amounts to solving the following equalities, obtained by setting the gradient to zero:

The updates of the auxiliary variable and the dual variable do not depend on the specific vertex loss and are easy to follow, so we do not repeat them here.

4.3 Convergence and stopping criterion

When both the vertex loss and the regularization are convex, the ADMM method is convergent [15]. Recently, many studies have investigated the convergence of ADMM [16, 17], but it is non-trivial to obtain the convergence rate for a general loss and regularization. In triangle lasso, the regularization is convex but not smooth, and the convergence rate is affected by the convexity of the loss and by the auxiliary matrix. Previous work has shown that if the loss is smooth and the auxiliary matrix has full row rank, ADMM attains a linear convergence rate [18].

The basic ADMM has its own stopping criterion [15], but we can re-define the stopping criterion of ADMM in triangle lasso for specific tasks to gain efficiency. Taking convex clustering as an example, we do not care about the specific value of the weights; all we want is the clustering result. If two rows of the weight matrix are identical, the corresponding instances belong to the same cluster. If the clustering result stays the same between two iterations, we can stop the method once the objective is close to the minimum. Finally, our ADMM method is illustrated in Algorithm 1.

1: Input: the data matrix, the auxiliary matrix, and a positive penalty parameter.
2: Initialize the weight matrix, the auxiliary variable, and the dual variable.
3: while the stopping criterion is not satisfied do
4:     if the vertex loss is convex then
5:         Update the weight matrix by solving (4).
6:     if the vertex loss is non-convex then
7:         Solve the dual problem of (4).
8:         Recover the weight matrix from the dual solution via the KKT conditions.
9:     Update the auxiliary variable via the proximal operator of the sum-of-norms function.
10:     Update the dual variable by (5).
11:     Increase the iteration index.
12: return the final weight matrix.
Algorithm 1 ADMM for the triangle lasso
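To tie the updates together, the sketch below runs the whole loop for the convex clustering instance of the framework, where the vertex loss is half the squared distance between each instance and its weight row, so the weight-matrix update reduces to a linear solve. The penalty parameter rho, the stopping rule and all names are assumptions for illustration, not the authors' implementation.

import numpy as np

def triangle_lasso_admm(X, D, lam, rho=1.0, max_iter=300, tol=1e-6):
    # ADMM for  min_W 0.5 * ||X - W||_F^2 + lam * sum_l ||(D W)_l||_2 ,
    # with the splitting V = D W and scaled dual variable U.
    n, d = X.shape
    W = X.copy()
    V = D @ W
    U = np.zeros_like(V)
    A = np.eye(n) + rho * (D.T @ D)  # system matrix of the W-update
    for _ in range(max_iter):
        # W-update: (I + rho * D^T D) W = X + rho * D^T (V - U)
        W = np.linalg.solve(A, X + rho * D.T @ (V - U))
        # V-update: proximal operator of the sum-of-norms (block soft-thresholding)
        B = D @ W + U
        norms = np.linalg.norm(B, axis=1, keepdims=True)
        V_new = np.maximum(0.0, 1.0 - (lam / rho) / np.maximum(norms, 1e-12)) * B
        # U-update: scaled dual ascent on the constraint D W - V = 0
        U = U + D @ W - V_new
        if np.linalg.norm(V_new - V) <= tol * max(1.0, np.linalg.norm(V)):
            break
        V = V_new
    return W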

5 Dual method for the accurate solution

Although our ADMM efficiently yields a moderately accurate solution, some applications require an efficient method for obtaining the accurate solution. In this section, we transform (1) into a second-order cone programming problem and develop a method to solve it in the dual space. We first present the details of our dual method, and then use an example to explain it.

5.1 Details

We first re-formulate (1) as an equivalent constrained optimization problem:

subject to:

Its Lagrangian is:

Thus, the dual optimization objective function is:

Here, the rows of the dual variable correspond to the rows of the constraint, and the convex conjugate of the vertex loss appears in the dual objective. Thus, we obtain

where the dual norm constrains the rows of the dual variable. Since the dual norm of the l2 norm is still the l2 norm, the dual problem is:

(6)

subject to:

After that, we can obtain the optimal weight matrix by solving

(7)

Here, the dual solution is the minimizer of (6). Since the conjugate function is always convex, regardless of whether the vertex loss is convex, the dual problem (6) is easier to solve than the primal problem. If there is no duality gap between (1) and (6), the global minimum of the primal problem (1) can be obtained from the solution of the dual problem (6) according to (7).

Theorem 1.

The conjugate of the sum of independent convex functions is the sum of their conjugates. Here, "independent" means that they have different variables [19].

According to Theorem 1, if the vertex loss is separable across instances, its conjugate is the sum of the per-instance conjugates, and we can obtain the solution of (6) by solving each component independently. But when the loss is not separable, we have to solve (6) as a single problem. Unfortunately, it may be time-consuming to solve (6) for a large dense graph, because we have to optimize a large number of variables. In that case, we can divide the graph into multiple sub-graphs and solve (6) on each sub-graph; repeating these steps for different graph partitions refines the final solution of (6). Finally, the details of our dual method are illustrated in Algorithm 2.

1: Input: the data matrix, the graph, and a positive regularization coefficient.
2: Solve (6) and obtain the optimal dual variable.
3: Obtain the optimal weight matrix by solving (7).
Algorithm 2 Dual method for the triangle lasso
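For checking the accuracy of either method, the primal problem can also be handed to an off-the-shelf conic solver, which turns the sum-of-norms term into second-order cone constraints internally (the experiments in Section 7 use CVX in Matlab in the same spirit). Below is a sketch with CVXPY for the convex clustering instance; it is only a reference baseline, not the dual algorithm described above, and the names are illustrative.

import cvxpy as cp
import numpy as np

def solve_reference(X, D, lam):
    # Reference solve of  min_W 0.5 * ||X - W||_F^2 + lam * sum_l ||(D W)_l||_2 .
    n, d = X.shape
    W = cp.Variable((n, d))
    DW = D @ W
    reg = sum(cp.norm(DW[l, :]) for l in range(D.shape[0]))
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(X - W) + lam * reg))
    problem.solve()
    return W.value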

5.2 Example

To make it easy to understand, we take ridge regression as an example to show the details of the method. As shown in Section 3, the optimization objective function is:

We thus obtain

Here, a diagonal matrix is built from the per-instance data. Discarding the constant terms, we obtain

Substituting variables, we obtain the equivalent formulation:

subject to

After solving this equivalent optimization problem, we obtain the optimal dual variable. Finally, the optimal weight matrix is

6 Complexity analysis

In this section, we analyze the time complexity of the proposed methods, i.e., the ADMM method and the dual method, for the case of a convex vertex loss.

6.1 Time complexity of the ADMM method

Consider the ADMM method. Its per-iteration cost is dominated by the calculation of the gradient rather than by the matrix multiplications, and the number of iterations dominates the total time complexity: a larger number of iterations yields a more accurate solution at a higher cost. Before presenting the time complexity formally, we introduce some new notation. The iterate yielded by ADMM at a given iteration is defined as

Given a vector , is defined as

where is defined by

Thus, when the vertex loss is convex, the total time complexity of our ADMM method is given by the following theorem.

Theorem 2.

When our ADMM is convergent satisfying , the total time complexity of our ADMM is .

Proof.

[20] proves that the stated bound holds when the Douglas-Rachford ADMM is performed for the corresponding number of iterations (Theorem in [20]). Our ADMM is a special case of it when the vertex loss is convex. Thus, given a target accuracy, our ADMM needs to be run for the corresponding number of iterations. Since the time complexity per iteration is fixed, the total time complexity follows.

6.2 Time complexity of the dual method

Consider the dual method. Before presenting the details of the complexity analysis, let us present some basic definitions, which are widely used to analyze the performance of optimization methods theoretically [21, 22, 23, 24].

Definition 1 (L-smooth).

A function f is L-smooth (L > 0) if and only if, for any vectors x and y, we have ||∇f(x) − ∇f(y)|| ≤ L ||x − y||.

Definition 2 (μ-strongly convex).

A function f is μ-strongly convex (μ > 0) if and only if, for any vectors x and y, we have f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (μ/2) ||y − x||².

Definition 3.

If a function f is L-smooth and μ-strongly convex, its condition number is defined by κ = L/μ.

There are many tasks whose optimization objective function is smooth and strongly convex, including convex clustering, ridge regression, and squared-norm-regularized logistic regression; we refer to [22] for more details. When a function is L-smooth and μ-strongly convex, its convex conjugate is (1/μ)-smooth and (1/L)-strongly convex (Lemma in [23]), so the condition number of the conjugate is also κ = L/μ. Additionally, there are various optimization methods for solving the dual problem (6). Since the Nesterov optimal method [25] is one of the most widely used, we adopt it to solve the dual problem.

Theorem 3.

When the Nesterov optimal method is used to solve the dual problem (6), and obtains for a given positive , then the total time complexity is .

Proof.

When we use the Nesterov optimal method to solve the dual problem, the required number of iterations grows with the square root of the condition number and logarithmically with the inverse of the target accuracy (Corollary in [26]). Furthermore, the Nesterov optimal method performs one gradient computation per iteration, which determines the per-iteration cost. Multiplying the two quantities gives the total time complexity.

7 Empirical studies

In this section, we conduct empirical studies to evaluate triangle lasso in terms of robustness and efficiency. First, we present the settings of the experiments. Second, we evaluate the robustness and efficiency of triangle lasso on prediction tasks. Third, we evaluate the quality of the cluster path by conducting convex clustering with triangle lasso. After that, we evaluate the efficiency of our methods on various network topologies. Finally, we use triangle lasso to conduct community detection in order to show that it is able to perform a general data analysis task.

7.1 Settings

Model and algorithms. As shown in the previous sections, we conduct the empirical studies on ridge regression and convex clustering tasks. The edge weights are set to be negatively proportional to the distance between the vertices. All algorithms are implemented in Matlab 2015b with the CVX solver [27]. The hardware is a server equipped with an i7-4790 CPU and GB memory.

The total compared algorithms are:

  • Network lasso [1]. This is the state-of-the-art method to conduct data analysis and clustering simultaneously. Both network lasso and triangle lasso can be used as a general framework. Thus, we compare the triangle lasso with it in the prediction tasks.

  • AMA [6]. This is the state-of-the-art method to conduct convex clustering. Convex clustering is a special case of the network lasso and triangle lasso. We compare triangle lasso with it in the convex clustering task.

  • Triangle lasso-basic ADMM. This is the basic version of ADMM which is used to solve triangle lasso in the evaluations. We use it as the baseline to compare our algorithms and other state-of-the-art methods.

  • Triangle lasso-ADMM. This is our proposed ADMM method to solve triangle lasso. Since it is very fast, we use it to conduct each evaluation by default.

  • Triangle lasso-Dual. This is our proposed dual method to solve triangle lasso. It is not efficient when the dataset or the graph is large, so we use it to conduct evaluations on some moderately sized graphs. As illustrated above, triangle lasso is implemented with our ADMM method by default.

Fig. 5 (panels: (a) best parameter; (b) robustness: low MSE; (c) robustness: visualization; (d) robustness: statistics): The illustration of the best parameter and the comparison of the prediction accuracy with the best parameter. Triangle lasso is more robust than network lasso because of its lower MSE.

Graph construction and metrics. If the dataset is a graph dataset, triangle lasso uses the graph directly. If the dataset is not a graph, the graph is generated by the following rules by default:

  • If the dataset has missing values, those missing values are filled by the mean values of the corresponding features.

  • Each instance is represented by a vertex.

  • Given any two vertices, if one of them is among the k nearest neighbours of the other, there is an edge between them.

Additionally, we evaluate the prediction accuracy by using the Mean Square Error (MSE); a smaller MSE indicates a more accurate prediction. Given a dataset with imperfect data, if an algorithm yields a smaller MSE than the other algorithms, its prediction is more accurate, and it is therefore more robust to the imperfect data. We record the run time (in seconds) to evaluate the efficiency.

7.2 Prediction tasks

Datasets Data size Dimensions Missing values
RET
AOM
wiki4HE
DJI
cpusmall
TABLE II: Statistics of the datasets.

Datasets. The empirical studies are mainly conducted on the following datasets. The statistics of those datasets are presented in Table II. It is worth noting that all of them contain many missing values, which are filled with zeros in the raw datasets. In all experiments, the values of each feature are standardized to zero mean and unit variance. We use k-fold cross-validation to evaluate the robustness of triangle lasso: for each instance in the validation set, we find its nearest neighbour in the training set, and then use the weight of that neighbour to conduct the prediction and evaluate the robustness of the solutions (a minimal sketch of this protocol is given after the dataset list).

  • Real estate transactions (RET). This dataset contains the real estate transactions over a week period in May 2008 in the Greater Sacramento area (https://support.spatialkey.com/spatialkey-sample-csv-data). The latitude and longitude features of each house are used to construct the graph. Each house is profiled by the features: number of beds, number of baths and square feet. The response is the sales price, and the task is to predict the price of a house. A portion of the house sales are missing at least one of the features.

  • AusOpen-men-2013 (AOM). A collection containing the match statistics for men at the Australian Open tennis tournament of 2013 (http://archive.ics.uci.edu/ml/datasets/Tennis+Major+Tournament+Match+Statistics). Each instance has a number of features, and the response is Result. The task is to predict the winner for two tennis players. The data matrix contains missing values.

  • wiki4HE. A survey of faculty members from two Spanish universities on teaching uses of Wikipedia (http://archive.ics.uci.edu/ml/datasets/wiki4HE) [28]. We pick the first question and its answer from each module, and finally obtain the features PU1, PEU1, ENJ1, QU1, VIS1, IM1, SA1, USE1, PF1, JR1, BI1, INC1, and EXP1. The response is USERWIKI. The task is to predict whether a teacher has registered an account on the Wikipedia site. The data matrix contains missing values.

  • Dow Jones Index (DJI). This dataset contains weekly data for the Dow Jones Industrial Index (http://archive.ics.uci.edu/ml/datasets/Dow+Jones+Index). Each instance is profiled by the features: open (price), close (price) and volume. The response is next_week_open (price), and the task is to predict the open price in the next week. The raw dataset does not contain imperfect data, so we randomly pick values in the data matrix and set them to zero.

  • cpusmall. This dataset is a collection of computer system activity measures, obtained from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#cpusmall). Each instance is profiled by several features. The task is to predict the portion of time (%) that the CPUs run in user mode. In the experiment, we randomly pick values in the data matrix and set them to zero as imperfect data.
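A minimal sketch of the validation protocol described above (hypothetical names): each validation instance borrows the learned weight row of its nearest training neighbour and predicts with it in the ridge-regression style.

import numpy as np

def predict_with_nearest_weight(X_train, W_train, X_val):
    preds = np.empty(X_val.shape[0])
    for i, x in enumerate(X_val):
        # Nearest training instance in Euclidean distance.
        j = np.argmin(np.linalg.norm(X_train - x, axis=1))
        preds[i] = x @ W_train[j]  # ridge-regression-style prediction x^T w_j
    return preds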

Fig. 6 (panels (a) and (b): efficiency): The comparison of the efficiency. The dual method is more efficient than the ADMM method on a sparse graph but less efficient on a dense graph.
Fig. 7 (panels: (a) AOM; (b) wiki4HE; (c) DJI; (d) cpusmall; robustness: low MSE): The comparison of the MSE when varying the amount of imperfect data. It shows that triangle lasso is more robust than network lasso because of its lower MSE.
TABLE: Comparison of the algorithms (basic ADMM, Network lasso, Triangle lasso-ADMM, Triangle lasso-Dual) on the AOM and wiki4HE datasets.