1 Introduction
Binary optimization problems are generally formulated as follows:
(1) 
Problem (1) appears naturally in several fields of computer vision and machine learning, including clustering wang2011information , graph bisection wang2017large ; yuan2016binary , image denoising bi2014exact , dense subgraph discovery ames2015guaranteed ; balalau2015finding ; yuan2016binary ; yuan2013truncated , multitarget tracking shi2013multi , and community discovery he2016joint . In many application scenarios, such as binary hashing gui2018fast ; liu2014discrete ; shen2015supervised ; shen2018unsupervised ; wang2018survey ; weiss2009spectral , Problem (1) needs to be solved for millions of binary variables, which makes the size of the feasible set very large (far larger than the number of atoms in the universe). Usually, it is difficult to find the optimal solution. Therefore, providing a fast algorithm to approximately solve Problem (1) is very important in practice.
Furthermore, additional constraints on the numbers of s and s in the binary variables are adopted in many cases. For example, in binary hashing shen2015supervised ; shen2018unsupervised and graph bisection wang2017large ; yuan2016binary , the balance condition is often required, which means that the numbers of s and s are equal to each other. On the other hand, dense subgraph discovery ames2015guaranteed ; balalau2015finding ; yuan2016binary ; yuan2013truncated and information theoretic clustering wang2011information require that the numbers of s and s in are some fixed integers.
To handle these previously mentioned constraints, in this paper, we focus on the following binary optimization problem:
(2) 
where
is a binary vector of length
, is a differentiable objective function (which may be nonconvex), is a given integer, and the restriction on is defined as(3) 
where denote the set of nonnegative integers. When , Problem (2) is a binary optimization problem without further constraints. When , Problem (2) becomes an optimization problem with the restriction that there are exactly s in the binary vector . For instance, when , the constraint implies that the number of s is equal to the number of s in .
In general, Problem (2) is NPhard due to the binary constraints johnson1979computers . Many algorithms, such as continuous relaxation, equivalent optimization, signed gradient optimization and direct discrete optimization, have been proposed to solve it approximately (for details, please refer to Section 2). However, they usually suffer from high computational costs or large accumulated quantization errors, or are only designed for specific tasks.
To overcome these difficulties, in this paper, we propose a novel and fast optimization algorithm, termed Discrete Principal Coordinate Descent (DPCD), to approximately solve Problem (2). The time complexity for the binary optimization problem is relatively high when directly applying signed gradient methods (updating all variables at each time based on the gradients). On the contrary, the proposed DPCD focuses on the principal coordinates most impacting the value of the loss function. At each iteration, DPCD can adaptively decide the number of principal coordinates that need to be optimized, which can be regarded as analogous to the adaptive learning rate in the normal gradient descent methods. Different from other binary optimization algorithms in the literature, our DPCD method supports a large family of empirical objective functions with/without restrictions on the numbers of s and s in the binary variables. Furthermore, we prove theoretical convergence of DPCD for loss functions with Lipschitz continuous gradients, which cover almost every loss function in practice. Explicit convergence rates are also derived. Extensive experiments on two binary optimization tasks: dense subgraph discovery and binary hashing, show the superiority of our method over stateoftheart methods in terms of both efficiency and effectiveness.
2 Related work
A very rich literature and a wide range of promising methods exist in binary optimization. We briefly review three classes of representative and related methods.
Continuous relaxation methods.
An intuitive method to approximately solve Problem (1
) is to relax the binary constraints to continuous variables, then threshold the continuous solutions to binary vectors. For instance, in the Linear Programming (LP) relaxation
hsieh2015pu ; komodakis2007approximate , the binary constraint is substituted with the box constraint, i.e., , which can be approximately solved by continuous optimization methods such as the interiorpoint method mehrotra1992implementation . On the other hand, the SemiDefinite Programming (SDP) relaxation wang2017large replaces the binary constraint with some positive semidefinite matrix constraint. In Spectral relaxation lin2013general ; olsson2007solving , the binary constraint is relaxed to some ball, which is nonconvex. One of the advantages of such continuous relaxation methods is that the relaxed problems can be approximately solved efficiently by existing continuous optimization solvers. However, the relaxation is usually too loose, and the thresholding often yields large quantization errors.Equivalent optimization methods.
Unlike relaxation methods, equivalent optimization methods replace the binary constraint with some equivalent forms, which are much easier to handle. For example, motivated by linear and spectral relaxations, Wu and Ghanem wu2018lp replaced the binary constraint with the intersection of the box and the sphere , and then applied the Alternating Direction Method of Multipliers (ADMM) boyd2011distributed ; li2015global ; wang2015global to solve the optimization problem iteratively. Other methods in this direction include the MPECADM and MPECEPM methods (yuan2016binary ; yuan2017exact ), the norm reformulation lu2013sparse ; yuan2016sparsity , the box nonseparable reformulation murray2010algorithm , and the piecewise separable reformulation zhang2007binary . Usually, these equivalent optimization methods guarantee the convergence to some stationary and feasible points, but the convergence speed is often too slow, resulting in high computational costs for largescale optimization problems.
Signed gradient methods.
In the Signed Gradient Method (SGM) liu2014discrete , a linear surrogate of the objective function is given at each iteration. Then, the minimization (actually a maximization problem was studied in the original paper liu2014discrete , we state an equivalent form here) of this surrogate function gives the updating rule for Problem (1) as: . The sequence obtained by this updating rule is guaranteed to converge if the objective function is concave. However, even for a very simple nonconcave function, SGM may generate a divergent sequence and never converge (please refer to Lemma 1). Furthermore, SGM cannot handle binary problems with restriction on the number of s since the number of s may change during each iteration. A stochastic version of this method was given in Adaptive Discrete Minimization (ADM) liu2017discretely , in which an adaptive ratio was selected at each iteration, then some random entries of were updated by . Although ADM works well for certain loss functions, it fails when the value of the loss function depends largely on only a few variables, since the random selecting procedure may skip such important variables.
Discrete optimization methods.
In the field of image hashing, many direct discrete optimization methods, such as DCC shen2015supervised , SADH shen2018unsupervised , ARE hu2018hashing , ITQ gong2013iterative and FastHash lin2014fast , have been proposed, which aim to optimize binary variables directly (SGM can also be seen as a discrete optimization method). For example, the Coordinate Descent (CD) method wright2015coordinate is widely used for solving optimization problems with smooth and convex constraints. Motivated by this method, several discrete cyclic coordinate descent (DCC) methods (e.g., RDCM luo2018robust , FSDH gui2018fast , and SDH shen2015supervised ) have been proposed to handle the binary constraint directly. The main idea is that, at each iteration, a subproblem with most entries of the binary variables fixed is considered, and the loss function is minimized with respect to the remaining entries. Although such methods can work well for specific loss functions, most of them are usually difficult to be extended to handle general binary optimization problems. Furthermore, they often suffer from expensive computational costs.
3 Proposed algorithm
In this section, we present in detail the DPCD algorithm for solving Problem (2), which is a general binary optimization problem with/without restrictions on the numbers of s and s. We also provide a theoretical convergence analysis of the proposed method.
3.1 Notation and preliminaries
We first introduce some notation and preliminaries. A vector is represented by some lowercase bold character, while a matrix is represented by some uppercase bold character. Let and denote the th and th entries of a vector and a matrix , respectively. The transpose of a matrix is represented by . We use to denote the Euclidean inner product. Let and be the Frobenius norm and norm of a matrix , respectively. The gradient of a differentiable function is denoted by . Let denote the elementwise sign function where for and otherwise. The Hamming distance between two binary vectors and of equal length is defined by , which is the number of positions at which the corresponding entries are different. For a set , let denote the number of elements in .
3.2 Main algorithm
The proposed DPCD algorithm runs iteratively between principal coordinate update and neighborhood search.
Principal coordinate update.
The basic idea is that, in the th iteration, we change the sign of some adaptively chosen entries of the binary vector , such that the value of the loss function should decrease steeply after each change. To achieve this goal, we focus on principal coordinates (see the following definition), which have major influences on the value change of the loss function.
Definition 1.
Let be a differentiable function and be a positive constant. A coordinate index is called an principal coordinate of if the product .
One motivation of our method is SGM liu2014discrete . In the th iteration of SGM, the linear surrogate of the objective function is given as:
(4) 
Then is obtained by minimizing this surrogate function:
(5) 
The sequence obtained by Eq. (5) is guaranteed to converge if is concave. However, from Lemma 1 in Subsection 3.3 we know, for nonconcave functions, SGM liu2014discrete may generate divergent sequences and never converge, since changing too many entries of at a time may increase the value of the loss function. To overcome this difficulty, the proposed DPCD method only changes the signs of entries whose absolute values of directional derivatives are large (entries with principal coordinates). This yields convergence for a wide class of loss functions (please refer to Lemma 1 and Theorem 1). Also, in the proposed algorithm, the constraint is always satisfied after each iteration.
To be more specific, given at the th iteration, we first calculate the gradient for the differentiable loss function . Next, we derive some proper thresholds based on . When is Lipschitz continuous on , where is easy to calculate, we simply set
(6) 
for some sufficiently small positive constant , i.e., we consider principal coordinates. For example, in Lemma 1, it is easy to see that , then we take for some small in the algorithm. When does not exist or is difficult to compute, we let and be the averages of the absolute values of the positive and negative entries in the gradient , respectively, i.e.,
(7) 
where and are the numbers of positive and negative entries in , respectively. With the given thresholds and , we set and to be the sets of principal coordinates with positive partial derivatives and principal coordinate with negative partial derivatives, respectively, where and are some parameters in which will be learned depending on the tasks and datasets. If the restriction condition is , we update by solving in Eq. (5) with respect to (other entries of are fixed) and derive:
(8) 
In other words, we change the sign of for with principal coordinates, and for with principal coordinates. If the number of s in is required to be fixed (i.e., the restriction condition is for some ), we update Eq. (8) with respect to the largest absolute values in and , respectively, where (such procedure guarantees that implies ). When the complexity of gradient calculations is low, the updating is very fast. A complexity analysis of the proposed DPCD is given in Subsection 4.2, which shows that the algorithm complexity of DPCD for supervised discrete hashing is linear and thus the algorithm runs very fast for largescale datasets.
Neighborhood search.
We add an optional heuristic neighborhood search after the principal coordinate update to avoid saddle points. In practice, we run one neighborhood search after
principal coordinate updates, where is between 10 to 20. First, we define the concept of neighbors for a point where is some small positive integers. When , the set of neighbors for is denoted by , which is the set of points with a Hamming distance at most from . When , the set of neighbors for is denoted by , which is the set of points obtained by interchanging at most pairs of entries and in . For instance, when , we have and .In a neighbor search for some (or with ), the aim is to find the point (or ) with the minimal function value (or equivalently, ). In practice, we can sample from (or ) instead of iterating over all points, if is large or calculating is slow. This neighborhood search step is helpful for finding a local minimum point. The proposed algorithm is summarized in Algorithm 1.
3.3 Convergence comparison: DPCD vs. SGM
One of the differences between our DPCD and SGM liu2014discrete is the choice of that should be updated by Eq. (8) at the th iteration. SGM updates Eq. (8) for each , while the proposed DPCD only changes the sign of when the coordinate indexes is principal for some . This difference is crucial to the convergences of the algorithms. More specifically, SGM can only guarantee convergence for the minimization of concave functions, while DPCD converges in finite steps for any functions with Lipschitz continuous gradients. We show the superiority of the proposed DPCD by the following example. The case is presented in Figure 1.
Lemma 1.
Consider the problem
where . Let . It holds that: (1) SGM generates a divergent sequence for any initial point; (2) With parameters and , the proposed DPCD method always converges to the optimal solution.
Proof.
First, we apply the method SGM to this problem. The gradient of is straightforward: . Since , we have for any . Then, by the updating rule fo SGM, . Therefore, starting from any point , SGM always generates a divergent sequence .
On the other hand, since for any , is a function with Lipschitz continuous gradient. Therefore, in DPCD, we set . Then by definition,
and
This implies that
(9) 
Then, we have for any and , and thus for . It is easy to check that is indeed the optimal point for the problem in Lemma . Therefore, the proposed DPCD converges to the optimal solution in only one updating step for any initial point. ∎
3.4 Theoretical convergence results
For simplicity, we ignore the neighborhood search part in the convergence analysis. Actually, the neighborhood search does not have any influence on the convergence since it always generates some binary vectors with nonincreasing values of the loss function. Thus, we can only focus on the principal coordinate update part of the proposed DPCD method.
We derive the following convergence results. When we say the algorithm converges in steps, we mean that the binary vector obtained in th iteration equals in th iteration (then the algorithm can stop here). Moreover, it is easy to see that our algorithm can converge to some local optimum with the help of neighborhood search (without neighborhood search, it only converges to some fixed binary vectors).
Theorem 1.
Let be a differentiable function such that is Lipschitz continuous on . Setting the thresholds where , and ignoring the neighborhood search, Algorithm 1 always converges in steps at most, where and .
Proof.
Let . Since is Lipschitz, we have for any ,
Then, by CauchySchwarz inequality,
By the gradient monotonicity equivalence of convexity (borwein2010convex, , Page 40), this yields that is a convex function on . Therefore, due to the firstorder equivalence of convexity (boyd2004convex, , Page 69),
which implies that
Let and in the above inequality, where is determined by and (8) in the paper. We have
(10) 
If , we obtain for each due to the updating rule.
If for some , by the updating rule we have, and . Then,
and Therefore,
(11) 
If , we know is not empty. Then, by Eq. (10) and Eq. (11) we have,
(12) 
This means that, in each updating step, the function value decreases at least , which completes the proof since the feasible set of Problem (2) in the main paper is finite. ∎
The above theorem is very versatile since most loss functions in practice are Lipschitz continuous on for some . Furthermore, since is finite, is a finite set, thus the above and always exist and are finite. When is quadratic, we obtain the following direct corollary.
Corollary 1.
Proof.
Since , we have . Then . By Theorem we complete the proof. ∎
4 Experiments
In this section, we compare the proposed DPCD algorithm with several stateoftheart methods on two binary optimization tasks: dense subgraph discovery and binary hashing. All codes are implemented in MATLAB using a workstation with an Intel 8core 2.6GHz CPU and 32GB RAM.
4.1 Dense subgraph discovery
Graph  # Nodes  # Arcs  # Arcs/# Nodes 

uk200705  100000  3050615  30.506 
dblp2010  326186  1615400  4.952 
eswiki2013  972933  23041488  23.683 
hollywood2009  1139905  113891327  99.913 
Optimization problem.
Dense subgraph discovery ravi1994heuristic ; rozenshtein2018finding ; yuan2016binary ; yuan2013truncated has many applications in graph mining, such as realtime story identification angel2012dense , finding correlated genes zhang2005general and graph visualization alvarez2006large . Let be a given undirected weighted graph with nodes, and be a given positive integer such that . The aim is to find the maximum density subgraph (the subgraph with the maximal sum of edge weights) with cardinality . Let be the symmetric adjacency matrix of the graph , where denotes the weight of the edge between vertices and . Then the problem can be formulated as the following optimization problem
(13) 
Note that, in this case, the variables are or instead of or . In order to translate the problem to the form in Problem (2), the substitution is adopted. Therefore, Problem (13) is equivalent to:
(14) 
In this way, the above problem can be approximately solved using our Algorithm 1.
Graph datasets.
The experiments for dense subgraph discovery are conducted on four largescale graphs uk200705, dblp2010, eswiki2013, and hollywood2009 from The Laboratory for Web Algorithmics^{1}^{1}1http://law.di.unimi.it/datasets.php. Table 1 gives a brief description of each graph. For example, hollywood2009 contains roughly 1.13 million nodes and 113 million edges.
Method  uk200705  dblp2010  eswiki2013  hollywood2009 

DPCD  1.77  6.19  14.17  21.80 
LP  5.02  7.27  58.96  115.38 
RAVI  2.28  6.40  104.94  77.95 
L2boxADMM  35.84  74.54  335.83  610.72 
MPECEPM  45.21  158.11  1454.07  2114.20 
MPECADM  36.76  81.64  421.35  720.66 
DPCD vs. stateoftheart methods.
The proposed DPCD is compared with the methods LP hsieh2015pu , RAVI ravi1994heuristic , L2boxADMM wu2018lp , MPECEPM and MPECADM yuan2016binary , using the same objective function (14). The cardinality is in the set . For the DPCD method, we run one neighborhood search ( in the neighborhood search setting) after principal coordinate updates, and set the maximum iteration number for the principal coordinate update part to . and are updated by Eq. (7). We tune the parameters and from by crossvalidation according to datasets. For other methods, we adopt the implementations and parameters suggested by the authors. The experimental results in Figure 2 are reported in terms of , which is the density of the subgraph with vertices. We can see that DPCD finds a denser subgraph than all compared methods in each case. Among the other methods, MPECEPM consistently outperforms LP in all the experiments. RAVI generally leads to solutions with low density. Furthermore, we provide CPU time comparisons for the six methods on the four graphs. From Table 2 we can see that, the proposed DPCD achieves the fastest runtime on all graphs. This is due to the fast updates in Algorithm 1 (the main complexity is to calculate the gradient of the loss function, which can be done quickly). Also, the efficiency of our method becomes more obvious as the graph size increases. LP and RAVI are faster than the other three methods, since MPECEPM needs to run the LP procedure multiple times, and L2box ADMM and MPECADM usually need more iterations to converge.
Method (Subgraph)  uk200705  dblp2010  eswiki2013  hollywood2009 

DPCD (loss function)  96.573  45.238  158.340  1944.921 
DPCD0 (loss function)  93.5585  44.2512  147.815  1866.240 
DPCD (run time)  1.7689  6.1092  14.1745  21.8038 
DPCD0 (run time)  1.1522  5.5186  9.9596  12.8730 
With and without the neighborhood search.
We conduct the comparison of the proposed method with its variant without neighborhood search technique, on the task of dense subgraph discovery with cardinality and in terms of loss functions and run time. DPCD0 refers to our algorithm without neighborhood search. From Table 3 we can see that DPCD0 is slightly faster, while DPCD usually achieves better results:
4.2 Binary hashing
Method 
MAP  Precision@500  Training time (seconds)  


bits  bits  bits  bits  bits  bits  bits  bits  bits 
DPCD  0.7019  0.7088  0.7126  0.6337  0.6353  0.6370  3.76  6.01  9.39 
DCC  0.5941  0.6193  0.6314  0.5486  0.5766  0.5894  11.18  36.03  158.87 
SGM  0.6856  0.6986  0.7013  0.6177  0.6308  0.6360  7.87  10.83  16.30 
LP  0.5237  0.5468  0.5459  0.4704  0.4972  0.4866  4.52  7.96  13.64 
L2boxADMM  0.6399  0.6724  0.6830  0.5929  0.6095  0.6162  43.07  90.10  200.94 
MPECEPM  0.5823  0.6253  0.6276  0.5385  0.5738  0.5790  36.36  124.54  260.22 
Binary hashing aims to encode highdimensional data points, such as images and videos, into compact binary hash codes such that the similarities between the original data points and hash codes are preserved. This can be used to provide a constant or sublinear search time and reduce the storage cost dramatically for such data points. The efficiency and effectiveness of binary hashing make it a popular technique in machine learning, information retrieval and computer vision
gui2018fast ; hu2018hashing ; liu2014discrete ; shen2018unsupervised ; wang2018survey . In a typical binary hashing task, denotes a matrix of the original data points, where is the th sample point, is the number of samples, and is the dimension of each sample. In the supervised setting, we let be the label matrix, i.e., if belongs to the th class and otherwise. The aim is to map to some , i.e., map each to a binary code for some small integer and preserve some similarities between the original data points in and hash codes in .Image datasets.
Three largescale image datasets, CIFAR10^{2}^{2}2http://www.cs.toronto.edu/kriz/cifar.html.
, ImageNet
^{3}^{3}3http://www.imagenet.org/., and NUSWIDE^{4}^{4}4http://lms.comp.nus.edu.sg/research/NUSWIDE.htm., are used in the binary hashing experiments. CIFAR10 has 60k images, which are divided into 10 classes with 6k images each. We use a 384dimensional GIST feature vector oliva2001modeling to represent each image. 59k images are selected as the training set and the test set contains the remaining 1k images. A subset of the ImageNet, ILSVRC 2012, contains about 1.2 million images with 1k categories. As in hu2018hashing ; shen2015supervised, we use 4096dimensional deep feature vectors for each image, take 127K training images from the 100 largest classes, and 50K images from the validation set as the test set. NUSWIDE contains about 270K images with 81 labels. The images may have multiple labels. The 500dimensional BagofWords features are used here
nuswidecivr09 . We adopt the 21 most frequent labels with the corresponding 193K images. For each label, 100 images are randomly selected as the test set and the remaining as the training set.DPCD vs. general binary optimization methods.
To illustrate the efficiency and effectiveness of our algorithm, we compare DPCD with several general stateoftheart binary optimization methods DCC shen2015supervised , SGM liu2014discrete , LP hsieh2015pu , L2boxADMM wu2018lp , and MPECEPM yuan2016binary , on the dataset CIFAR10. Various loss functions are designed for binary hashing (for examples, see Tables 5 and 6). For fair comparison, we adopt the widely used supervised objective function gui2018fast ; shen2015supervised
(15) 
for each method. Thus the optimization problem becomes:
(16) 
where is a regularization parameter, and is the projection matrix (see shen2015supervised ) which will be learned jointly with . The whole optimization runs iteratively over and . When is fixed, we apply DPCD algorithm to . The key step is to calculate the gradient of as:
(17) 
Then can be obtained by Eq. (7). After deriving and , we update by Eq. (8). When is fixed, can be updated by
(18) 
Finally, we adopt the linear hash function to encode onto binary codes, where can be derived by:
(19) 
During the test phase, for a query item, first we use the above linear hash function to derive the hash code, then adopt nearest neighbor search under Hamming distance to find its similar items. For the DPCD method, we run one neighborhood search after principal coordinate updates, and tune the parameters and from by crossvalidation according to datasets and binary code lengths, and set the maximum iteration number for to each time when is fixed. We run at most five iterations for updating and iteratively. For other methods, we adopt the implementations and parameters suggested by the authors. Ground truths are defined by the label information from the datasets. The experimental results are reported in terms of mean average precision (MAP), Precision@500 (Precision@500 refers to the ratio of the number of retrieved true positive items among 500 nearest neighbors to 500.) and training time efficiency (we ignore the comparison of test time here since the test parts are similar for each algorithm). The experiments are conducted on the CIFAR10 dataset. From Table 4 we can see that the proposed DPCD outperforms all other methods in MAP and Precision@500. For instance, on CIFAR10 with 96 bits, DPCD outperforms DCC by 8.1, and MPECEPM by 8.5 in terms of MAP. It is clear that increasing the number of bits yields better performance for all methods. Also, the training time for the proposed DPCD method is always faster than other compared methods. For example, DPCD runs 20 times faster than MPECEPM on bits, which verifies that one major advantage of our method is the fast optimization process.
Algorithm complexity analysis.
Now we discuss the complexity of the above supervised DPCD algorithm. The calculation of the gradient is obtained by Eq. (17), thus the complexity is . Then and are derived by Eq. (7), which has complexity since there are additions in Eq. (7). Furthermore, and can be determined by running through all where , which also has complexity . Similarly, the update for has complexity . Let be the maximum iteration number during the updating step (the number of principal coordinate update parts of DPCD). Then, the total time complexity of updating is . When is given, the complexity for updating in Eq. (18) is . The complexity for calculating the matrix in Eq. (19) is . Suppose that there are at most iterations for updating and iteratively. Since , the total complexity for supervised DPCD is , which is a linear function of .
DPCD vs. specific binary hashing methods.
The proposed method is not only fast, but also very versatile, and can thus be used to handle many different loss functions. To demonstrate this, we apply DPCD to several widely used loss functions, and compare the results with the corresponding hashing methods that were designed for each specific loss function. The optimization process is similar to the supervised case. Table 5 illustrates a comparison of DPCD with three unsupervised methods, SADHL shen2018unsupervised , ARE hu2018hashing , and ITQ gong2013iterative , on the ImageNet dataset, while Table 6 shows a comparison with supervised hashing methods, SDH shen2015supervised , FSDH gui2018fast , and FastHash lin2014fast , on NUSWIDE, using their specific loss functions. The proposed DPCD shows increased performance over the original methods and achieves higher MAP and Precision@500 in most cases, especially for the supervised loss functions. In terms of the training time, DPCD outperforms all methods except FSDH. The proposed method can significantly decrease the training time for unsupervised loss functions due to the fast updating of the binary codes
. Finally, we conclude that DPCD is a fast and effective optimization method for largescale image retrieval tasks.
Method 
Loss Function  MAP  Precision@500  Training time (s)  


bits  bits  bits  bits  bits  bits  bits  bits  bits  
SADHL  0.2448  0.3104  0.3294  0.3692  0.4588  0.4835  121.90  384.01  738.33  
DPCD  0.2612  0.3085  0.3441  0.3945  0.4508  0.4992  19.03  33.50  48.29  
ARE  0.2509  0.2997  0.3276  0.3626  0.4478  0.4724  244.46  287.19  332.95  
DPCD  0.2808  0.3316  0.3607  0.3970  0.4856  0.5112  6.59  8.37  11.24  
ITQ  0.3209  0.4075  0.4388  0.4269  0.5208  0.5581  26.01  32.97  33.39  
DPCD  0.3093  0.4155  0.4467  0.4148  0.5283  0.5654  3.35  7.23  8.02 
Method 
Loss Function  MAP  Precision@500  Training time (s)  


bits  bits  bits  bits  bits  bits  bits  bits  bits  
SDH  0.5716  0.5827  0.5920  0.6041  0.6056  0.6189  24.53  107.50  486.95  
DPCD  0.6124  0.6230  0.6392  0.6287  0.6357  0.6502  6.91  16.39  22.53  
FSDH  0.5690  0.5639  0.5676  0.5918  0.5860  0.5988  2.76  3.07  5.95  
DPCD  0.6159  0.6170  0.6253  0.6260  0.6297  0.6335  5.42  8.96  10.99  
FastHash  0.5174  0.5398  0.5403  0.5867  0.6014  0.6180  1381.75  4668.13  10605.82  
DPCD  0.5621  0.5579  0.5767  0.5991  0.6145  0.6226  6.42  7.88  11.79 
5 Conclusion and future work
This paper presents a novel fast optimization method, called Discrete Principal Coordinate Descent (DPCD), to approximately solve binary optimization problems with/without restrictions on the numbers of s and s in the variables. We derive several theoretical results on the convergence of the proposed algorithm. Experiments on dense subgraph discovery and binary hashing demonstrate that our method generally outperforms stateoftheart methods in terms of both solution quality and optimization efficiency.
In the future, we plan to extend our algorithm to a more general framework. Our methods can be seen as a discrete version of the normal gradient descent methods. Since the gradient descent has several useful variants such as momentum, Adam and Adagrad methods qian1999momentum ; ruder2016overview , it would be possible to combine our methods with these gradientbased methods and propose some discrete versions of them. We would like to explore more on this direction in the future.
References
 [1] J Ignacio AlvarezHamelin, Luca Dall’Asta, Alain Barrat, and Alessandro Vespignani. Large scale networks fingerprinting and visualization using the kcore decomposition. In Advances in Neural Information Processing Systems (NIPS), pages 41–50, 2006.
 [2] Brendan PW Ames. Guaranteed recovery of planted cliques and dense subgraphs by convex relaxation. Journal of Optimization Theory and Applications, 167(2):653–675, 2015.
 [3] Albert Angel, Nikos Sarkas, Nick Koudas, and Divesh Srivastava. Dense subgraph maintenance under streaming edge weight updates for realtime story identification. Proceedings of the VLDB Endowment, 5(6):574–585, 2012.
 [4] Oana Denisa Balalau, Francesco Bonchi, TH Chan, Francesco Gullo, and Mauro Sozio. Finding subgraphs with maximum total density and limited overlap. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 379–388. ACM, 2015.
 [5] Shujun Bi, Xiaolan Liu, and Shaohua Pan. Exact penalty decomposition method for zeronorm minimization based on mpec formulation. SIAM Journal on Scientific Computing, 36(4):A1451–A1477, 2014.
 [6] Jonathan Borwein and Adrian S Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer Science & Business Media, 2010.
 [7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
 [8] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
 [9] TatSeng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and YanTao Zheng. Nuswide: A realworld web image database from national university of singapore. In Proc. of ACM Conf. on Image and Video Retrieval (CIVR’09), Santorini, Greece., July 810, 2009.
 [10] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for largescale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
 [11] Jie Gui, Tongliang Liu, Zhenan Sun, Dacheng Tao, and Tieniu Tan. Fast supervised discrete hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):490–496, 2018.
 [12] Lifang He, ChunTa Lu, Jiaqi Ma, Jianping Cao, Linlin Shen, and Philip S Yu. Joint community and structural hole spanner detection via harmonic modularity. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 875–884. ACM, 2016.
 [13] ChoJui Hsieh, Nagarajan Natarajan, and Inderjit S Dhillon. Pu learning for matrix completion. In ICML, pages 2445–2453, 2015.
 [14] Mengqiu Hu, Yang Yang, Fumin Shen, Ning Xie, and Heng Tao Shen. Hashing with angular reconstructive embeddings. IEEE Transactions on Image Processing, 27(2):545–555, 2018.
 [15] David S Johnson and Michael R Garey. Computers and intractability: A guide to the theory of NPcompleteness, volume 1. WH Freeman San Francisco, 1979.
 [16] Nikos Komodakis and Georgios Tziritas. Approximate labeling via graph cuts based on linear programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1436–1453, 2007.
 [17] Guoyin Li and Ting Kei Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015.

[18]
Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton Van den Hengel, and David Suter.
Fast supervised hashing with decision trees for highdimensional data.
InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 1963–1970, 2014.  [19] Guosheng Lin, Chunhua Shen, David Suter, and Anton Van Den Hengel. A general twostep approach to learningbased hashing. In Proceedings of the IEEE international conference on computer vision, pages 2552–2559, 2013.
 [20] Li Liu, Ling Shao, Fumin Shen, and Mengyang Yu. Discretely coding semantic rank orders for supervised image hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1425–1434, 2017.
 [21] Wei Liu, Cun Mu, Sanjiv Kumar, and ShihFu Chang. Discrete graph hashing. In Advances in Neural Information Processing Systems (NIPS), pages 3419–3427, 2014.
 [22] Zhaosong Lu and Yong Zhang. Sparse approximation via penalty decomposition methods. SIAM Journal on Optimization, 23(4):2448–2478, 2013.
 [23] Yadan Luo, Yang Yang, Fumin Shen, Zi Huang, Pan Zhou, and Heng Tao Shen. Robust discrete code modeling for supervised hashing. Pattern Recognition, 75:128–135, 2018.
 [24] Sanjay Mehrotra. On the implementation of a primaldual interior point method. SIAM Journal on Optimization, 2(4):575–601, 1992.
 [25] Walter Murray and KienMing Ng. An algorithm for nonlinear optimization problems with binary variables. Computational Optimization and Applications, 47(2):257–288, 2010.
 [26] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
 [27] Carl Olsson, Anders P Eriksson, and Fredrik Kahl. Solving large scale binary quadratic problems: Spectral methods vs. semidefinite programming. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
 [28] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
 [29] Sekharipuram S Ravi, Daniel J Rosenkrantz, and Giri Kumar Tayi. Heuristic and special case algorithms for dispersion problems. Operations Research, 42(2):299–310, 1994.
 [30] Polina Rozenshtein, Francesco Bonchi, Aristides Gionis, Mauro Sozio, and Nikolaj Tatti. Finding events in temporal networks: Segmentation meets densestsubgraph discovery. In 2018 IEEE International Conference on Data Mining (ICDM), pages 397–406. IEEE, 2018.
 [31] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
 [32] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. Supervised discrete hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 37–45, 2015.
 [33] Fumin Shen, Yan Xu, Li Liu, Yang Yang, Zi Huang, and Heng Tao Shen. Unsupervised deep hashing with similarityadaptive and discrete optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[34]
Xinchu Shi, Haibin Ling, Junling Xing, and Weiming Hu.
Multitarget tracking by rank1 tensor approximation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2394, 2013.  [35] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):769–790, 2018.
 [36] Meihong Wang and Fei Sha. Information theoretical clustering via semidefinite programming. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 761–769, 2011.
 [37] Peng Wang, Chunhua Shen, Anton van den Hengel, and Philip HS Torr. Largescale binary quadratic optimization using semidefinite relaxation and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):470–485, 2017.
 [38] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing, pages 1–35, 2015.
 [39] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances in Neural Information Processing Systems (NIPS), pages 1753–1760, 2009.
 [40] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
 [41] Baoyuan Wu and Bernard Ghanem. lpbox admm: A versatile framework for integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [42] Ganzhao Yuan and Bernard Ghanem. Binary optimization via mathematical programming with equilibrium constraints. arXiv preprint arXiv:1608.04425, 2016.
 [43] Ganzhao Yuan and Bernard Ghanem. Sparsity constrained minimization via mathematical programming with equilibrium constraints. arXiv preprint arXiv:1608.04430, 2016.
 [44] Ganzhao Yuan and Bernard Ghanem. An exact penalty method for binary optimization based on mpec formulation. In AAAI, pages 2867–2875, 2017.

[45]
XiaoTong Yuan and Tong Zhang.
Truncated power method for sparse eigenvalue problems.
Journal of Machine Learning Research, 14(Apr):899–925, 2013.  [46] Bin Zhang and Steve Horvath. A general framework for weighted gene coexpression network analysis. Statistical applications in genetics and molecular biology, 4(1), 2005.
 [47] Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. Binary matrix factorization with applications. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 391–400. IEEE, 2007.