# Fast Large-Scale Discrete Optimization Based on Principal Coordinate Descent

Binary optimization, a representative subclass of discrete optimization, plays an important role in mathematical optimization and has various applications in computer vision and machine learning. Usually, binary optimization problems are NP-hard and difficult to solve due to the binary constraints, especially when the number of variables is very large. Existing methods often suffer from high computational costs or large accumulated quantization errors, or are only designed for specific tasks. In this paper, we propose a fast algorithm to find effective approximate solutions for general binary optimization problems. The proposed algorithm iteratively solves minimization problems related to the linear surrogates of loss functions, which leads to the updating of some binary variables most impacting the value of loss functions in each step. Our method supports a wide class of empirical objective functions with/without restrictions on the numbers of 1s and -1s in the binary variables. Furthermore, the theoretical convergence of our algorithm is proven, and the explicit convergence rates are derived, for objective functions with Lipschitz continuous gradients, which are commonly adopted in practice. Extensive experiments on several binary optimization tasks and large-scale datasets demonstrate the superiority of the proposed algorithm over several state-of-the-art methods in terms of both effectiveness and efficiency.

• 8 publications
• 5 publications
• 113 publications
• 31 publications
• 41 publications
• 193 publications
10/22/2020

### Model identification and local linear convergence of coordinate descent

For composite nonsmooth optimization problems, Forward-Backward algorith...
10/03/2019

### Best-first Search Algorithm for Non-convex Sparse Minimization

Non-convex sparse minimization (NSM), or ℓ_0-constrained minimization of...
11/17/2017

### A generic and fast C++ optimization framework

The development of the mlpack C++ machine learning library (http://www.m...
08/27/2019

### New Loss Functions for Fast Maximum Inner Product Search

Quantization based methods are popular for solving large scale maximum i...
08/30/2022

### Using Taylor-Approximated Gradients to Improve the Frank-Wolfe Method for Empirical Risk Minimization

The Frank-Wolfe method has become increasingly useful in statistical and...
08/30/2010

### Fixed-point and coordinate descent algorithms for regularized kernel methods

In this paper, we study two general classes of optimization algorithms f...
08/28/2019

### Machine-learning techniques for the optimal design of acoustic metamaterials

Recently, an increasing research effort has been dedicated to analyse th...

## 1 Introduction

Binary optimization problems are generally formulated as follows:

 minx f(x),s.t.  x∈{±1}n. (1)

Problem (1) appears naturally in several fields of computer vision and machine learning, including clustering wang2011information , graph bisection wang2017large ; yuan2016binary , image denoising bi2014exact , dense subgraph discovery ames2015guaranteed ; balalau2015finding ; yuan2016binary ; yuan2013truncated , multi-target tracking shi2013multi , and community discovery he2016joint . In many application scenarios, such as binary hashing gui2018fast ; liu2014discrete ; shen2015supervised ; shen2018unsupervised ; wang2018survey ; weiss2009spectral , Problem (1) needs to be solved for millions of binary variables, which makes the size of the feasible set very large (far larger than the number of atoms in the universe). Usually, it is difficult to find the optimal solution. Therefore, providing a fast algorithm to approximately solve Problem (1) is very important in practice.

Furthermore, additional constraints on the numbers of s and s in the binary variables are adopted in many cases. For example, in binary hashing shen2015supervised ; shen2018unsupervised and graph bisection wang2017large ; yuan2016binary , the balance condition is often required, which means that the numbers of s and s are equal to each other. On the other hand, dense subgraph discovery ames2015guaranteed ; balalau2015finding ; yuan2016binary ; yuan2013truncated and information theoretic clustering wang2011information require that the numbers of s and s in are some fixed integers.

To handle these previously mentioned constraints, in this paper, we focus on the following binary optimization problem:

 minx f(x),s.t.  x∈\slΩr, (2)

where

is a binary vector of length

, is a differentiable objective function (which may be nonconvex), is a given integer, and the restriction on is defined as

 Ωr={{±1}n,   ifr=−1;{x∈{±1}n:1⊤x=2r−n},   % ifr∈N≥0, (3)

where denote the set of nonnegative integers. When , Problem (2) is a binary optimization problem without further constraints. When , Problem (2) becomes an optimization problem with the restriction that there are exactly s in the binary vector . For instance, when , the constraint implies that the number of s is equal to the number of s in .

In general, Problem (2) is NP-hard due to the binary constraints johnson1979computers . Many algorithms, such as continuous relaxation, equivalent optimization, signed gradient optimization and direct discrete optimization, have been proposed to solve it approximately (for details, please refer to Section 2). However, they usually suffer from high computational costs or large accumulated quantization errors, or are only designed for specific tasks.

To overcome these difficulties, in this paper, we propose a novel and fast optimization algorithm, termed Discrete Principal Coordinate Descent (DPCD), to approximately solve Problem (2). The time complexity for the binary optimization problem is relatively high when directly applying signed gradient methods (updating all variables at each time based on the gradients). On the contrary, the proposed DPCD focuses on the principal coordinates most impacting the value of the loss function. At each iteration, DPCD can adaptively decide the number of principal coordinates that need to be optimized, which can be regarded as analogous to the adaptive learning rate in the normal gradient descent methods. Different from other binary optimization algorithms in the literature, our DPCD method supports a large family of empirical objective functions with/without restrictions on the numbers of s and s in the binary variables. Furthermore, we prove theoretical convergence of DPCD for loss functions with Lipschitz continuous gradients, which cover almost every loss function in practice. Explicit convergence rates are also derived. Extensive experiments on two binary optimization tasks: dense subgraph discovery and binary hashing, show the superiority of our method over state-of-the-art methods in terms of both efficiency and effectiveness.

## 2 Related work

A very rich literature and a wide range of promising methods exist in binary optimization. We briefly review three classes of representative and related methods.

#### Continuous relaxation methods.

An intuitive method to approximately solve Problem (1

) is to relax the binary constraints to continuous variables, then threshold the continuous solutions to binary vectors. For instance, in the Linear Programming (LP) relaxation

hsieh2015pu ; komodakis2007approximate , the binary constraint is substituted with the box constraint, i.e., , which can be approximately solved by continuous optimization methods such as the interior-point method mehrotra1992implementation . On the other hand, the Semi-Definite Programming (SDP) relaxation wang2017large replaces the binary constraint with some positive semi-definite matrix constraint. In Spectral relaxation lin2013general ; olsson2007solving , the binary constraint is relaxed to some -ball, which is non-convex. One of the advantages of such continuous relaxation methods is that the relaxed problems can be approximately solved efficiently by existing continuous optimization solvers. However, the relaxation is usually too loose, and the thresholding often yields large quantization errors.

#### Equivalent optimization methods.

Unlike relaxation methods, equivalent optimization methods replace the binary constraint with some equivalent forms, which are much easier to handle. For example, motivated by linear and spectral relaxations, Wu and Ghanem wu2018lp replaced the binary constraint with the intersection of the box and the sphere , and then applied the Alternating Direction Method of Multipliers (ADMM) boyd2011distributed ; li2015global ; wang2015global to solve the optimization problem iteratively. Other methods in this direction include the MPEC-ADM and MPEC-EPM methods (yuan2016binary ; yuan2017exact ), the norm reformulation lu2013sparse ; yuan2016sparsity , the box non-separable reformulation murray2010algorithm , and the piecewise separable reformulation zhang2007binary . Usually, these equivalent optimization methods guarantee the convergence to some stationary and feasible points, but the convergence speed is often too slow, resulting in high computational costs for large-scale optimization problems.

In the Signed Gradient Method (SGM) liu2014discrete , a linear surrogate of the objective function is given at each iteration. Then, the minimization (actually a maximization problem was studied in the original paper liu2014discrete , we state an equivalent form here) of this surrogate function gives the updating rule for Problem (1) as: . The sequence obtained by this updating rule is guaranteed to converge if the objective function is concave. However, even for a very simple non-concave function, SGM may generate a divergent sequence and never converge (please refer to Lemma 1). Furthermore, SGM cannot handle binary problems with restriction on the number of s since the number of s may change during each iteration. A stochastic version of this method was given in Adaptive Discrete Minimization (ADM) liu2017discretely , in which an adaptive ratio was selected at each iteration, then some random entries of were updated by . Although ADM works well for certain loss functions, it fails when the value of the loss function depends largely on only a few variables, since the random selecting procedure may skip such important variables.

#### Discrete optimization methods.

In the field of image hashing, many direct discrete optimization methods, such as DCC shen2015supervised , SADH shen2018unsupervised , ARE hu2018hashing , ITQ gong2013iterative and FastHash lin2014fast , have been proposed, which aim to optimize binary variables directly (SGM can also be seen as a discrete optimization method). For example, the Coordinate Descent (CD) method wright2015coordinate is widely used for solving optimization problems with smooth and convex constraints. Motivated by this method, several discrete cyclic coordinate descent (DCC) methods (e.g., RDCM luo2018robust , FSDH gui2018fast , and SDH shen2015supervised ) have been proposed to handle the binary constraint directly. The main idea is that, at each iteration, a subproblem with most entries of the binary variables fixed is considered, and the loss function is minimized with respect to the remaining entries. Although such methods can work well for specific loss functions, most of them are usually difficult to be extended to handle general binary optimization problems. Furthermore, they often suffer from expensive computational costs.

## 3 Proposed algorithm

In this section, we present in detail the DPCD algorithm for solving Problem (2), which is a general binary optimization problem with/without restrictions on the numbers of s and s. We also provide a theoretical convergence analysis of the proposed method.

### 3.1 Notation and preliminaries

We first introduce some notation and preliminaries. A vector is represented by some lowercase bold character, while a matrix is represented by some uppercase bold character. Let and denote the -th and -th entries of a vector and a matrix , respectively. The transpose of a matrix is represented by . We use to denote the Euclidean inner product. Let and be the Frobenius norm and -norm of a matrix , respectively. The gradient of a differentiable function is denoted by . Let denote the element-wise sign function where for and otherwise. The Hamming distance between two binary vectors and of equal length is defined by , which is the number of positions at which the corresponding entries are different. For a set , let denote the number of elements in .

### 3.2 Main algorithm

The proposed DPCD algorithm runs iteratively between principal coordinate update and neighborhood search.

#### Principal coordinate update.

The basic idea is that, in the -th iteration, we change the sign of some adaptively chosen entries of the binary vector , such that the value of the loss function should decrease steeply after each change. To achieve this goal, we focus on -principal coordinates (see the following definition), which have major influences on the value change of the loss function.

###### Definition 1.

Let be a differentiable function and be a positive constant. A coordinate index is called an -principal coordinate of if the product  .

One motivation of our method is SGM liu2014discrete . In the -th iteration of SGM, the linear surrogate of the objective function is given as:

 (4)

Then is obtained by minimizing this surrogate function:

 xk+1=argminx∈{±1}n^fk(x)=−sgn(∇f(xk)). (5)

The sequence obtained by Eq. (5) is guaranteed to converge if is concave. However, from Lemma 1 in Subsection 3.3 we know, for non-concave functions, SGM liu2014discrete may generate divergent sequences and never converge, since changing too many entries of at a time may increase the value of the loss function. To overcome this difficulty, the proposed DPCD method only changes the signs of entries whose absolute values of directional derivatives are large (entries with principal coordinates). This yields convergence for a wide class of loss functions (please refer to Lemma 1 and Theorem 1). Also, in the proposed algorithm, the constraint is always satisfied after each iteration.

To be more specific, given at the -th iteration, we first calculate the gradient for the differentiable loss function . Next, we derive some proper thresholds based on . When is -Lipschitz continuous on , where is easy to calculate, we simply set

 L1=L2=L0+ϵ (6)

for some sufficiently small positive constant , i.e., we consider -principal coordinates. For example, in Lemma 1, it is easy to see that , then we take for some small in the algorithm. When does not exist or is difficult to compute, we let and be the averages of the absolute values of the positive and negative entries in the gradient , respectively, i.e.,

 L1=1n1∑∇if(xk)>0∇if(xk)andL2=−1n2∑∇if(xk)<0∇if(xk), (7)

where and are the numbers of positive and negative entries in , respectively. With the given thresholds and , we set and to be the sets of -principal coordinates with positive partial derivatives and -principal coordinate with negative partial derivatives, respectively, where and are some parameters in which will be learned depending on the tasks and datasets. If the restriction condition is , we update by solving in Eq. (5) with respect to (other entries of are fixed) and derive:

 xk+1i=−sgn(∇if(xk))=−xki. (8)

In other words, we change the sign of for with -principal coordinates, and for with -principal coordinates. If the number of s in is required to be fixed (i.e., the restriction condition is for some ), we update Eq. (8) with respect to the largest absolute values in and , respectively, where (such procedure guarantees that implies ). When the complexity of gradient calculations is low, the updating is very fast. A complexity analysis of the proposed DPCD is given in Subsection 4.2, which shows that the algorithm complexity of DPCD for supervised discrete hashing is linear and thus the algorithm runs very fast for large-scale datasets.

#### Neighborhood search.

We add an optional heuristic neighborhood search after the principal coordinate update to avoid saddle points. In practice, we run one neighborhood search after

principal coordinate updates, where is between 10 to 20. First, we define the concept of -neighbors for a point where is some small positive integers. When , the set of -neighbors for is denoted by , which is the set of points with a Hamming distance at most from . When , the set of -neighbors for is denoted by , which is the set of points obtained by interchanging at most pairs of entries and in . For instance, when , we have and .

In a neighbor search for some (or with ), the aim is to find the point (or ) with the minimal function value (or equivalently, ). In practice, we can sample from (or ) instead of iterating over all points, if is large or calculating is slow. This neighborhood search step is helpful for finding a local minimum point. The proposed algorithm is summarized in Algorithm 1.

### 3.3 Convergence comparison: DPCD vs. SGM

One of the differences between our DPCD and SGM liu2014discrete is the choice of that should be updated by Eq. (8) at the -th iteration. SGM updates Eq. (8) for each , while the proposed DPCD only changes the sign of when the coordinate indexes is -principal for some . This difference is crucial to the convergences of the algorithms. More specifically, SGM can only guarantee convergence for the minimization of concave functions, while DPCD converges in finite steps for any functions with Lipschitz continuous gradients. We show the superiority of the proposed DPCD by the following example. The case is presented in Figure 1.

###### Lemma 1.

Consider the problem

 minx∈{±1}nf(x):=min(x1,x2,⋯,xn)∈{±1}n 12n∑i=1(xi+βi)2,

where . Let . It holds that: (1) SGM generates a divergent sequence for any initial point; (2) With parameters and , the proposed DPCD method always converges to the optimal solution.

###### Proof.

First, we apply the method SGM to this problem. The gradient of is straightforward: . Since , we have for any . Then, by the updating rule fo SGM, . Therefore, starting from any point , SGM always generates a divergent sequence .

On the other hand, since for any , is a function with -Lipschitz continuous gradient. Therefore, in DPCD, we set . Then by definition,

 Sk+ ={1≤i≤n:xki+βi>1+ϵ,xki=1} ={1≤i≤n:xki=1},

and

 Sk− ={1≤i≤n:∇if(xk)<−1−ϵ,xki=−1} ={1≤i≤n:xki+βi<−1−ϵ,xki=−1} =∅.

This implies that

 xk+1i={−sgn(∇if(xk))=−xki=−1,ifi∈Sk+=Sk+∪Sk−xki=−1,ifi∉Sk+. (9)

Then, we have for any and , and thus for . It is easy to check that is indeed the optimal point for the problem in Lemma . Therefore, the proposed DPCD converges to the optimal solution in only one updating step for any initial point. ∎

### 3.4 Theoretical convergence results

For simplicity, we ignore the neighborhood search part in the convergence analysis. Actually, the neighborhood search does not have any influence on the convergence since it always generates some binary vectors with non-increasing values of the loss function. Thus, we can only focus on the principal coordinate update part of the proposed DPCD method.

We derive the following convergence results. When we say the algorithm converges in steps, we mean that the binary vector obtained in -th iteration equals in -th iteration (then the algorithm can stop here). Moreover, it is easy to see that our algorithm can converge to some local optimum with the help of neighborhood search (without neighborhood search, it only converges to some fixed binary vectors).

###### Theorem 1.

Let be a differentiable function such that is -Lipschitz continuous on . Setting the thresholds where , and ignoring the neighborhood search, Algorithm 1 always converges in steps at most, where and .

###### Proof.

Let . Since is -Lipschitz, we have for any ,

 ∥∇f(x)−∇f(y)∥≤L0∥x−y∥.

Then, by Cauchy-Schwarz inequality,

 (∇f(x)−∇f(y)⊤(x−y))≤∥∇f(x)−∇f(y)∥⋅∥x−y∥≤L0∥x−y∥2.

By the gradient monotonicity equivalence of convexity (borwein2010convex, , Page 40), this yields that is a convex function on . Therefore, due to the first-order equivalence of convexity (boyd2004convex, , Page 69),

 L02∥y∥2−f(y)≥L02∥x∥2−f(x)+(L0x−∇f(x))⊤(y−x),

which implies that

Let and in the above inequality, where is determined by and (8) in the paper. We have

 f(xk+1)≤f(xk)+∇f(xk)⊤(xk+1−xk)+L02∥xk+1−xk∥2. (10)

If , we obtain for each due to the updating rule.

If for some , by the updating rule we have, and . Then,

 ∇if(xk)(xk+1i−xki)=−2∇if(xk)sgn(∇if(xk))=−2∥∇if(xk)∥<−2L

and Therefore,

 ∇if(xk)⊤(xk+1i−xki)+L02∥xk+1i−xki∥2<−2(L−L0). (11)

If , we know is not empty. Then, by Eq. (10) and Eq. (11) we have,

 f(xk+1)−f(xk)≤ ∇f(xk)⊤(xk+1−xk)+L02∥xk+1−xk∥2 = n∑j=1(∇jf(xk)(xk+1j−xkj)+L02∥xk+1j−xkj∥2) = ∑j:xk+1j=xkj(∇jf(xk)(xk+1j−xkj)+L02∥xk+1j−xkj∥2) +∑j:xk+1j≠xkj(∇jf(xk)(xk+1j−xkj)+L02∥xk+1j−xkj∥2) = 0+∑j:xk+1j≠xkj(∇jf(xk)⊤(xk+1j−xkj)+L02∥xk+1j−xkj∥2) < −2(L−L0) = −2ϵ. (12)

This means that, in each updating step, the function value decreases at least , which completes the proof since the feasible set of Problem (2) in the main paper is finite. ∎

The above theorem is very versatile since most loss functions in practice are -Lipschitz continuous on for some . Furthermore, since is finite, is a finite set, thus the above and always exist and are finite. When is quadratic, we obtain the following direct corollary.

###### Corollary 1.

Let be some quadratic function where , , and , in Theorem 1. Then Algorithm 1 always converges in steps at most.

###### Proof.

Since , we have . Then . By Theorem we complete the proof. ∎

## 4 Experiments

In this section, we compare the proposed DPCD algorithm with several state-of-the-art methods on two binary optimization tasks: dense subgraph discovery and binary hashing. All codes are implemented in MATLAB using a workstation with an Intel 8-core 2.6GHz CPU and 32GB RAM.

### 4.1 Dense subgraph discovery

#### Optimization problem.

Dense subgraph discovery ravi1994heuristic ; rozenshtein2018finding ; yuan2016binary ; yuan2013truncated has many applications in graph mining, such as real-time story identification angel2012dense , finding correlated genes zhang2005general and graph visualization alvarez2006large . Let be a given undirected weighted graph with nodes, and be a given positive integer such that . The aim is to find the maximum density subgraph (the subgraph with the maximal sum of edge weights) with cardinality . Let be the symmetric adjacency matrix of the graph , where denotes the weight of the edge between vertices and . Then the problem can be formulated as the following optimization problem

 maxx∈{0,1}nx⊤Wx,s.t.  x⊤1=k. (13)

Note that, in this case, the variables are or instead of or . In order to translate the problem to the form in Problem (2), the substitution is adopted. Therefore, Problem (13) is equivalent to:

 miny∈{−1,1}n−y⊤Wy−2y⊤W1−1⊤W1,   s.t.  y⊤1=2k−n. (14)

In this way, the above problem can be approximately solved using our Algorithm 1.

#### Graph datasets.

The experiments for dense subgraph discovery are conducted on four large-scale graphs uk-2007-05, dblp-2010, eswiki-2013, and hollywood-2009 from The Laboratory for Web Algorithmics. Table 1 gives a brief description of each graph. For example, hollywood-2009 contains roughly 1.13 million nodes and 113 million edges.

#### DPCD vs. state-of-the-art methods.

The proposed DPCD is compared with the methods LP hsieh2015pu , RAVI ravi1994heuristic , L2box-ADMM wu2018lp , MPEC-EPM and MPEC-ADM yuan2016binary , using the same objective function (14). The cardinality is in the set . For the DPCD method, we run one -neighborhood search ( in the neighborhood search setting) after principal coordinate updates, and set the maximum iteration number for the principal coordinate update part to . and are updated by Eq. (7). We tune the parameters and from by cross-validation according to datasets. For other methods, we adopt the implementations and parameters suggested by the authors. The experimental results in Figure 2 are reported in terms of , which is the density of the subgraph with vertices. We can see that DPCD finds a denser subgraph than all compared methods in each case. Among the other methods, MPEC-EPM consistently outperforms LP in all the experiments. RAVI generally leads to solutions with low density. Furthermore, we provide CPU time comparisons for the six methods on the four graphs. From Table 2 we can see that, the proposed DPCD achieves the fastest runtime on all graphs. This is due to the fast updates in Algorithm 1 (the main complexity is to calculate the gradient of the loss function, which can be done quickly). Also, the efficiency of our method becomes more obvious as the graph size increases. LP and RAVI are faster than the other three methods, since MPEC-EPM needs to run the LP procedure multiple times, and L2-box ADMM and MPEC-ADM usually need more iterations to converge.

#### With and without the neighborhood search.

We conduct the comparison of the proposed method with its variant without neighborhood search technique, on the task of dense subgraph discovery with cardinality and in terms of loss functions and run time. DPCD-0 refers to our algorithm without neighborhood search. From Table 3 we can see that DPCD-0 is slightly faster, while DPCD usually achieves better results:

### 4.2 Binary hashing

Binary hashing aims to encode high-dimensional data points, such as images and videos, into compact binary hash codes such that the similarities between the original data points and hash codes are preserved. This can be used to provide a constant or sub-linear search time and reduce the storage cost dramatically for such data points. The efficiency and effectiveness of binary hashing make it a popular technique in machine learning, information retrieval and computer vision

gui2018fast ; hu2018hashing ; liu2014discrete ; shen2018unsupervised ; wang2018survey . In a typical binary hashing task, denotes a matrix of the original data points, where is the -th sample point, is the number of samples, and is the dimension of each sample. In the supervised setting, we let be the label matrix, i.e., if belongs to the -th class and otherwise. The aim is to map to some , i.e., map each to a binary code for some small integer  and preserve some similarities between the original data points in and hash codes in .

#### Image datasets.

Three large-scale image datasets, CIFAR-10

, and NUS-WIDE, are used in the binary hashing experiments. CIFAR-10 has 60k images, which are divided into 10 classes with 6k images each. We use a 384-dimensional GIST feature vector oliva2001modeling to represent each image. 59k images are selected as the training set and the test set contains the remaining 1k images. A subset of the ImageNet, ILSVRC 2012, contains about 1.2 million images with 1k categories. As in hu2018hashing ; shen2015supervised

, we use 4096-dimensional deep feature vectors for each image, take 127K training images from the 100 largest classes, and 50K images from the validation set as the test set. NUS-WIDE contains about 270K images with 81 labels. The images may have multiple labels. The 500-dimensional Bag-of-Words features are used here

nus-wide-civr09 . We adopt the 21 most frequent labels with the corresponding 193K images. For each label, 100 images are randomly selected as the test set and the remaining as the training set.

#### DPCD vs. general binary optimization methods.

To illustrate the efficiency and effectiveness of our algorithm, we compare DPCD with several general state-of-the-art binary optimization methods DCC shen2015supervised , SGM liu2014discrete , LP hsieh2015pu , L2box-ADMM wu2018lp , and MPEC-EPM yuan2016binary , on the dataset CIFAR-10. Various loss functions are designed for binary hashing (for examples, see Tables 5 and 6). For fair comparison, we adopt the widely used supervised objective function gui2018fast ; shen2015supervised

 f(B,W)=12∥Y−BW∥22+δ2∥W∥22 (15)

for each method. Thus the optimization problem becomes:

 minB,W 12∥Y−BW∥22+δ2∥W∥22      s.t.   B∈{±1}n×r,W∈Rr×c, (16)

where is a regularization parameter, and is the projection matrix (see shen2015supervised ) which will be learned jointly with . The whole optimization runs iteratively over and . When is fixed, we apply DPCD algorithm to . The key step is to calculate the gradient of as:

 ∇Bf(B,W)=(BW−Y)W⊤. (17)

Then can be obtained by Eq. (7). After deriving and , we update by Eq. (8). When is fixed, can be updated by

 W=argminW∗∈Rr×cf(B,W∗)=(B⊤B+βIr)−1B⊤Y. (18)

Finally, we adopt the linear hash function to encode onto binary codes, where can be derived by:

 P=argminP∗∈Rd×r∥XP∗−B∥2=(X⊤X)−1X⊤B. (19)

During the test phase, for a query item, first we use the above linear hash function to derive the hash code, then adopt nearest neighbor search under Hamming distance to find its similar items. For the DPCD method, we run one -neighborhood search after principal coordinate updates, and tune the parameters and from by cross-validation according to datasets and binary code lengths, and set the maximum iteration number for to each time when is fixed. We run at most five iterations for updating and iteratively. For other methods, we adopt the implementations and parameters suggested by the authors. Ground truths are defined by the label information from the datasets. The experimental results are reported in terms of mean average precision (MAP), Precision@500 (Precision@500 refers to the ratio of the number of retrieved true positive items among 500 nearest neighbors to 500.) and training time efficiency (we ignore the comparison of test time here since the test parts are similar for each algorithm). The experiments are conducted on the CIFAR-10 dataset. From Table 4 we can see that the proposed DPCD outperforms all other methods in MAP and Precision@500. For instance, on CIFAR-10 with 96 bits, DPCD outperforms DCC by 8.1, and MPEC-EPM by 8.5 in terms of MAP. It is clear that increasing the number of bits yields better performance for all methods. Also, the training time for the proposed DPCD method is always faster than other compared methods. For example, DPCD runs 20 times faster than MPEC-EPM on bits, which verifies that one major advantage of our method is the fast optimization process.

#### Algorithm complexity analysis.

Now we discuss the complexity of the above supervised DPCD algorithm. The calculation of the gradient is obtained by Eq. (17), thus the complexity is . Then and are derived by Eq. (7), which has complexity since there are additions in Eq. (7). Furthermore, and can be determined by running through all where , which also has complexity . Similarly, the update for  has complexity . Let be the maximum iteration number during the updating step (the number of principal coordinate update parts of DPCD). Then, the total time complexity of updating is . When is given, the complexity for updating in Eq. (18) is . The complexity for calculating the matrix in Eq. (19) is . Suppose that there are at most iterations for updating and iteratively. Since , the total complexity for supervised DPCD is , which is a linear function of .

#### DPCD vs. specific binary hashing methods.

The proposed method is not only fast, but also very versatile, and can thus be used to handle many different loss functions. To demonstrate this, we apply DPCD to several widely used loss functions, and compare the results with the corresponding hashing methods that were designed for each specific loss function. The optimization process is similar to the supervised case. Table 5 illustrates a comparison of DPCD with three unsupervised methods, SADH-L shen2018unsupervised , ARE hu2018hashing , and ITQ gong2013iterative , on the ImageNet dataset, while Table 6 shows a comparison with supervised hashing methods, SDH shen2015supervised , FSDH gui2018fast , and FastHash lin2014fast , on NUS-WIDE, using their specific loss functions. The proposed DPCD shows increased performance over the original methods and achieves higher MAP and Precision@500 in most cases, especially for the supervised loss functions. In terms of the training time, DPCD outperforms all methods except FSDH. The proposed method can significantly decrease the training time for unsupervised loss functions due to the fast updating of the binary codes

. Finally, we conclude that DPCD is a fast and effective optimization method for large-scale image retrieval tasks.

## 5 Conclusion and future work

This paper presents a novel fast optimization method, called Discrete Principal Coordinate Descent (DPCD), to approximately solve binary optimization problems with/without restrictions on the numbers of s and s in the variables. We derive several theoretical results on the convergence of the proposed algorithm. Experiments on dense subgraph discovery and binary hashing demonstrate that our method generally outperforms state-of-the-art methods in terms of both solution quality and optimization efficiency.

In the future, we plan to extend our algorithm to a more general framework. Our methods can be seen as a discrete version of the normal gradient descent methods. Since the gradient descent has several useful variants such as momentum, Adam and Adagrad methods qian1999momentum ; ruder2016overview , it would be possible to combine our methods with these gradient-based methods and propose some discrete versions of them. We would like to explore more on this direction in the future.

## References

• [1] J Ignacio Alvarez-Hamelin, Luca Dall’Asta, Alain Barrat, and Alessandro Vespignani. Large scale networks fingerprinting and visualization using the k-core decomposition. In Advances in Neural Information Processing Systems (NIPS), pages 41–50, 2006.
• [2] Brendan PW Ames. Guaranteed recovery of planted cliques and dense subgraphs by convex relaxation. Journal of Optimization Theory and Applications, 167(2):653–675, 2015.
• [3] Albert Angel, Nikos Sarkas, Nick Koudas, and Divesh Srivastava. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. Proceedings of the VLDB Endowment, 5(6):574–585, 2012.
• [4] Oana Denisa Balalau, Francesco Bonchi, TH Chan, Francesco Gullo, and Mauro Sozio. Finding subgraphs with maximum total density and limited overlap. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 379–388. ACM, 2015.
• [5] Shujun Bi, Xiaolan Liu, and Shaohua Pan. Exact penalty decomposition method for zero-norm minimization based on mpec formulation. SIAM Journal on Scientific Computing, 36(4):A1451–A1477, 2014.
• [6] Jonathan Borwein and Adrian S Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer Science & Business Media, 2010.
• [7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
• [8] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
• [9] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. Nus-wide: A real-world web image database from national university of singapore. In Proc. of ACM Conf. on Image and Video Retrieval (CIVR’09), Santorini, Greece., July 8-10, 2009.
• [10] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
• [11] Jie Gui, Tongliang Liu, Zhenan Sun, Dacheng Tao, and Tieniu Tan. Fast supervised discrete hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):490–496, 2018.
• [12] Lifang He, Chun-Ta Lu, Jiaqi Ma, Jianping Cao, Linlin Shen, and Philip S Yu. Joint community and structural hole spanner detection via harmonic modularity. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 875–884. ACM, 2016.
• [13] Cho-Jui Hsieh, Nagarajan Natarajan, and Inderjit S Dhillon. Pu learning for matrix completion. In ICML, pages 2445–2453, 2015.
• [14] Mengqiu Hu, Yang Yang, Fumin Shen, Ning Xie, and Heng Tao Shen. Hashing with angular reconstructive embeddings. IEEE Transactions on Image Processing, 27(2):545–555, 2018.
• [15] David S Johnson and Michael R Garey. Computers and intractability: A guide to the theory of NP-completeness, volume 1. WH Freeman San Francisco, 1979.
• [16] Nikos Komodakis and Georgios Tziritas. Approximate labeling via graph cuts based on linear programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1436–1453, 2007.
• [17] Guoyin Li and Ting Kei Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015.
• [18] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton Van den Hengel, and David Suter.

Fast supervised hashing with decision trees for high-dimensional data.

In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

, pages 1963–1970, 2014.
• [19] Guosheng Lin, Chunhua Shen, David Suter, and Anton Van Den Hengel. A general two-step approach to learning-based hashing. In Proceedings of the IEEE international conference on computer vision, pages 2552–2559, 2013.
• [20] Li Liu, Ling Shao, Fumin Shen, and Mengyang Yu. Discretely coding semantic rank orders for supervised image hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1425–1434, 2017.
• [21] Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang. Discrete graph hashing. In Advances in Neural Information Processing Systems (NIPS), pages 3419–3427, 2014.
• [22] Zhaosong Lu and Yong Zhang. Sparse approximation via penalty decomposition methods. SIAM Journal on Optimization, 23(4):2448–2478, 2013.
• [23] Yadan Luo, Yang Yang, Fumin Shen, Zi Huang, Pan Zhou, and Heng Tao Shen. Robust discrete code modeling for supervised hashing. Pattern Recognition, 75:128–135, 2018.
• [24] Sanjay Mehrotra. On the implementation of a primal-dual interior point method. SIAM Journal on Optimization, 2(4):575–601, 1992.
• [25] Walter Murray and Kien-Ming Ng. An algorithm for nonlinear optimization problems with binary variables. Computational Optimization and Applications, 47(2):257–288, 2010.
• [26] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
• [27] Carl Olsson, Anders P Eriksson, and Fredrik Kahl. Solving large scale binary quadratic problems: Spectral methods vs. semidefinite programming. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
• [28] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
• [29] Sekharipuram S Ravi, Daniel J Rosenkrantz, and Giri Kumar Tayi. Heuristic and special case algorithms for dispersion problems. Operations Research, 42(2):299–310, 1994.
• [30] Polina Rozenshtein, Francesco Bonchi, Aristides Gionis, Mauro Sozio, and Nikolaj Tatti. Finding events in temporal networks: Segmentation meets densest-subgraph discovery. In 2018 IEEE International Conference on Data Mining (ICDM), pages 397–406. IEEE, 2018.
• [31] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
• [32] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. Supervised discrete hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 37–45, 2015.
• [33] Fumin Shen, Yan Xu, Li Liu, Yang Yang, Zi Huang, and Heng Tao Shen. Unsupervised deep hashing with similarity-adaptive and discrete optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
• [34] Xinchu Shi, Haibin Ling, Junling Xing, and Weiming Hu.

Multi-target tracking by rank-1 tensor approximation.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2394, 2013.
• [35] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):769–790, 2018.
• [36] Meihong Wang and Fei Sha. Information theoretical clustering via semidefinite programming. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 761–769, 2011.
• [37] Peng Wang, Chunhua Shen, Anton van den Hengel, and Philip HS Torr. Large-scale binary quadratic optimization using semidefinite relaxation and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):470–485, 2017.
• [38] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing, pages 1–35, 2015.
• [39] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances in Neural Information Processing Systems (NIPS), pages 1753–1760, 2009.
• [40] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
• [41] Baoyuan Wu and Bernard Ghanem. lp-box admm: A versatile framework for integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
• [42] Ganzhao Yuan and Bernard Ghanem. Binary optimization via mathematical programming with equilibrium constraints. arXiv preprint arXiv:1608.04425, 2016.
• [43] Ganzhao Yuan and Bernard Ghanem. Sparsity constrained minimization via mathematical programming with equilibrium constraints. arXiv preprint arXiv:1608.04430, 2016.
• [44] Ganzhao Yuan and Bernard Ghanem. An exact penalty method for binary optimization based on mpec formulation. In AAAI, pages 2867–2875, 2017.
• [45] Xiao-Tong Yuan and Tong Zhang.

Truncated power method for sparse eigenvalue problems.

Journal of Machine Learning Research, 14(Apr):899–925, 2013.
• [46] Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. Statistical applications in genetics and molecular biology, 4(1), 2005.
• [47] Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. Binary matrix factorization with applications. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 391–400. IEEE, 2007.