I Introduction
The digital era, which is reshaping the behaviors, practices, and standards of societies, communities, and individuals, has posed a major challenge to conventional modes of data processing. Data consisting of numbers, words, and measurements become available in such huge volume, high velocity, and wide variety that they outpace human-oriented computing. It is urgent to explore intelligent tools for processing this staggering amount of data. Machine learning (ML), dedicated to providing insights into patterns in big data and extracting the information hidden inside, has arisen and been used in a wide variety of applications, such as computer vision
[1], telecommunication [2], and recommendation systems [3, 4, 5, 6]. Nevertheless, traditional ML algorithms become computationally inefficient and fail to scale up well as the dimension of the data grows. A major open issue is to find effective ML algorithms that perform well in both predictability and interpretability while being capable of tackling large-dimensional data with low latency.

I-A Related Work
Data representation, a driving force behind advancing ML-based techniques, has lately attracted a great deal of interest because it transforms large-dimensional data into low-dimensional alternatives by capturing their key features, making them amenable to processing, prediction, and analysis. The gamut of data representation techniques, including matrix factorization (MF) [7, 8], singular value decomposition (SVD)-based models [9, 10], nonnegative models (NNM) [11], and deep neural networks [12], has been shown to perform well in terms of predictive power (the capability of predicting the outcome of random variables outside the training set). Unfortunately, these techniques perform rather poorly in terms of interpretability (the capability of extracting additional information or insights hidden inside the data) because, on the one hand, they are not developed to directly model the outcome of random variables; on the other hand, they fall under the black-box category, which lacks the transparency and accountability of predictive models [13]. Recently, a Kolmogorov model (KM) that directly represents a binary random variable as a superposition of elementary events in probability space was proposed
[14]; the KM models the outcome of a binary random variable as an inner product between two structured vectors: a probability mass function vector and a binary indicator vector. This inner product structure exactly represents an actual probability. Carefully examining association rules between two binary indicator vectors grants the KM its interpretability, establishing mathematically logical/causal relations between different random variables.
Previously, KM learning was formulated as a coupled combinatorial optimization problem
[14] by decomposing it into two subproblems: i) a linearly constrained quadratic program (LCQP) and ii) a binary quadratic program (BQP), which can be solved alternately by utilizing block coordinate descent (BCD). An elegant, low-complexity Frank-Wolfe (FW) algorithm [15] was used to optimally solve the LCQP by exploiting the unit probability simplex structure. In contrast, no algorithm is known to exactly solve BQP problems in polynomial time. To get around this challenge, relaxation methods for linear [16, 17], quadratic [18], second-order cone [19, 20], and semidefinite programming (SDP) [21, 22] were proffered to produce a feasible solution close to the optimal solution of the original problem. Among these relaxation methods, semidefinite relaxation (SDR) has been shown to have a tighter approximation bound than the others [23, 24]. Thus, an SDR with randomization (SDRwR) method [25] was employed to solve the BQP of KM learning optimally in an asymptotic sense [14]. To address the high-complexity issue due to the reliance on interior point methods, a branch-reduce-and-bound (BRB) algorithm based on discrete monotonic optimization (DMO) [26, 27] was proposed. However, the DMO approach is only effective in low-dimensional settings and starts to collapse as the dimension increases. In short, the existing KM methods [14, 27] suffer from a common drawback: they are not scalable. This limitation hampers their application to large-scale datasets, for instance, the MovieLens 1 million (ML-1M) dataset
(https://grouplens.org/datasets/movielens/1m/). It is thus crucial to explore low-complexity and scalable methods for KM learning.

Duality often arises in linear/nonlinear optimization models in a wide variety of applications such as communication networks [28], economic markets [29], and structural design [30]. Moreover, the dual problem often possesses favorable mathematical, geometric, or computational structures that can be exploited to provide an alternative way of handling intricate primal problems by using iterative methods, such as first-order gradient descent (GD) [31, 32] and quasi-Newton methods [33, 34]. It is for this reason that first-order iterative methods are widely used when optimizing/training large-scale data representations (e.g., deep neural networks) and machine learning algorithms. We are motivated by these iterative first-order methods to effectively resolve the combinatorial challenge of KM learning.
I-B Overview of Methodologies and Contributions
We present a computationally scalable approach to the KM learning problem, based on dual optimization, by proposing an enhanced GD algorithm. Our main contributions are listed below.

We provide a reformulation of the BQP subproblem of KM learning as a regularized dual optimization problem that ensures strong duality and is amenable to solution by simple GD. Compared to the existing SDRwR [14] and DMO [27] methods, the proposed dual optimization approach proffers a more efficient and scalable solution to KM learning.

Motivated by the fact that an EVD is required at each iteration of GD, which introduces a computational bottleneck on large-scale datasets, we propose an enhanced GD that eliminates the EVD computation whenever this is feasible, thereby accelerating computation. When elimination is infeasible and the EVD must be computed, we explore a proximal EVD based on the Lanczos method [35], taking into account the fact that computing the exact, full EVD is usually impractical. We analyze the approximation error of the proximal EVD. A thresholding scheme is then proposed to determine the number of iterations of the proximal EVD by exploiting the upper bound of the approximation error and leveraging a normalized Minkowski norm.

Extensive numerical simulation results are presented to demonstrate the efficacy of the proposed KM learning algorithm. When applied to large-scale datasets (e.g., the ML-1M dataset), the proposed method achieves comparable training and prediction performance at a computational cost reduced by roughly two orders of magnitude compared to the existing KM learning algorithms. Finally, the interpretability of the proposed method is validated by exploiting the mathematically logical relations. We show that the accuracy of logical relation mining by using the proposed method exceeds .
Notation: A bold lowercase letter is a vector and a bold capital letter is a matrix. , , , , , , and denote the th entry, th column, trace, main diagonal elements, rank, largest eigenvalue, and largest singular value of , respectively. is the th entry of , , and is a diagonal matrix with on its main diagonal. is the Frobenius inner product of two matrices and , i.e., . indicates that the matrix is positive semidefinite (PSD). is the th column of the identity matrix of appropriate size. and denote the all-one and all-zero vectors, respectively. , , and denote the symmetric matrix space, the nonnegative real-valued vector space, and the binary vector space with each entry chosen from , respectively. For , where is the th eigenvalue of , . is the support set of and denotes the cardinality of a set . Finally, indicates that one outcome completely implies another one .

II System Model and Preliminaries
II-A Preliminaries
We consider a double-index set of binary random variables , , where ( and are the index sets of and , respectively) denotes the set of all index pairs. Thus, can represent any two-dimensional learning application (involving matrices) such as movie recommendation systems [11], DNA methylation for cancer detection [37], and beam alignment in multiple-antenna systems [38, 39]. We let be the probability that the event occurs. Since the random variable considered here is binary, the following holds: . Without loss of generality, we can focus on one outcome, for instance, . Then, the dimensional KM of the random variable is given by
(1) 
where is the probability mass function vector and is the binary indicator vector. Specifically, is in the unit probability simplex , i.e., , and denotes the support set of (associated with the case when ). The KM in (1) is defined on a measurable probability space ( denotes the sample space and is the event space consisting of subsets of ) and satisfies the following conditions: i) , (nonnegativity), ii) (normalization), and iii) for the disjoint events , (countable additivity) [40]. By (1), is modeled as a stochastic mixture of Kolmogorov elementary events. In addition, note that .
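As a concrete numerical illustration of the inner-product structure in (1), the sketch below computes an outcome probability from a PMF vector on the unit simplex and a binary indicator vector. The function name and the 4-dimensional numbers are hypothetical, not taken from the paper.

```python
import numpy as np

def km_probability(theta, nu):
    """Probability of an outcome under a KM of the form (1): the inner
    product of a PMF vector theta (unit simplex) and a binary indicator
    vector nu. Illustrative sketch, not the paper's implementation."""
    theta = np.asarray(theta, dtype=float)
    nu = np.asarray(nu, dtype=float)
    # Kolmogorov axioms on the elementary-event probabilities:
    # nonnegativity and normalization.
    assert np.all(theta >= 0) and np.isclose(theta.sum(), 1.0)
    assert set(np.unique(nu)).issubset({0.0, 1.0})
    return float(theta @ nu)

# Hypothetical 4-dimensional example.
theta = np.array([0.5, 0.2, 0.2, 0.1])  # PMF over elementary events
nu = np.array([1, 0, 1, 0])             # events implying the outcome
p = km_probability(theta, nu)
# The complementary outcome uses the complementary indicator 1 - nu,
# so the two outcome probabilities sum to one.
q = km_probability(theta, 1 - nu)
```

Note that the complementarity `p + q == 1` mirrors the binary-outcome property stated above.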
II-B KM Learning
Assume that the empirical probability of , denoted by , is available from the training set . Obtaining the empirical probabilities for the training set depends on the application and context in practical systems; we will illustrate an example for recommendation systems at the end of this section. The KM learning involves training, prediction, and interpretation as described below.
II-B1 Training
The KM training proceeds to optimize and by solving the norm minimization problem:
(2) 
To deal with the coupled combinatorial nature of (2), a BCD method [14, 41] was proposed by dividing the problem in (2) into two subproblems: i) LCQP:
(3) 
where , , , , and is the index of BCD iterations, and ii) BQP:
(4) 
where , , , and . By exploiting the fact that the optimization in (3) was carried out over the unit probability simplex , a simple iterative FW algorithm [15] was employed to optimally solve (3), while the SDRwR was employed to asymptotically solve the BQP in (4) [25]. It is also possible to solve (4) directly without a relaxation and/or randomization, based on the DMO approach [27]. However, the DMO in [27] was shown only to be efficient when the dimension is small (e.g., ); its computational cost blows up as increases (e.g., ).
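The simplex structure exploited by the FW algorithm can be sketched as follows. The quadratic objective, `M`, and `c` below are illustrative stand-ins for the actual LCQP data in (3); the key point is that the linear minimization step over the unit probability simplex is solved by a simplex vertex, which keeps every iterate a valid PMF.

```python
import numpy as np

def frank_wolfe_simplex(M, c, iters=200):
    """Sketch of Frank-Wolfe for a quadratic f(theta) = 0.5*||M@theta - c||^2
    minimized over the unit probability simplex (the structure exploited
    for the LCQP in (3)). M and c are illustrative problem data."""
    d = M.shape[1]
    theta = np.full(d, 1.0 / d)          # feasible start: uniform PMF
    for t in range(iters):
        grad = M.T @ (M @ theta - c)
        # Linear minimization over the simplex: put all mass on the
        # coordinate with the smallest gradient entry (a vertex).
        s = np.zeros(d)
        s[np.argmin(grad)] = 1.0
        gamma = 2.0 / (t + 2.0)          # standard FW step size
        theta = (1 - gamma) * theta + gamma * s
    return theta

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
target = np.array([0.4, 0.3, 0.2, 0.1])
c = M @ target                           # make a known PMF optimal
theta_hat = frank_wolfe_simplex(M, c)    # stays on the simplex throughout
```

Because every update is a convex combination of a feasible point and a vertex, no projection step is ever needed, which is what makes FW attractive for this subproblem.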
II-B2 Prediction
Similar to other supervised learning methods, the trained KM parameters
are used to predict probabilities over a test set as (5)
where and .
II-B3 Interpretation
The KM offers a distinct advantage, namely interpretability, by drawing on fundamental insights into the mathematically logical relations among the data. For two random variables and taken from the training set , i.e., and , if the support sets of and satisfy , then two logical relations between the outcomes of and can be inferred: the first outcome of implies the same outcome for , while the second outcome of implies the second outcome for , i.e., and [14, Proposition 1]. It is important to note that the logical relations that emerge from the KM are based on the formalism of implications. Thus, they hold from a strictly mathematical perspective and are general.
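A minimal sketch of this support-inclusion test follows; the function and variable names are ours, not the paper's, and the indicator vectors are hypothetical.

```python
import numpy as np

def support(nu):
    """Support set of a binary indicator vector."""
    return set(np.flatnonzero(np.asarray(nu) != 0))

def mine_implication(nu_u, nu_v):
    """Support-inclusion test sketching [14, Proposition 1]: if
    supp(nu_u) is a subset of supp(nu_v), the first outcome of the
    variable with indicator nu_u implies the first outcome of the
    variable with indicator nu_v (and conversely for the second
    outcomes). Names are illustrative."""
    return support(nu_u) <= support(nu_v)

# Example: every elementary event implying outcome u also implies v.
nu_u = np.array([1, 0, 1, 0])
nu_v = np.array([1, 1, 1, 0])
assert mine_implication(nu_u, nu_v)      # u's outcome implies v's
assert not mine_implication(nu_v, nu_u)  # the converse need not hold
```

Since the test is a set inclusion, the mined relations are exact implications rather than statistical associations.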
An implication of the introduced KM learning is illustrated by taking an example of movie recommendation systems as follows.

Suppose there are two users () who have rated two movie items (). In this example, denotes the event that user likes movie item , . Then, denotes the probability that user likes item (conversely, denotes the probability that user dislikes item ). Suppose in (1). Then, the four elementary events can represent four different movie genres: i) Comedy, ii) Thriller, iii) Action, and iv) Drama. The empirical probability corresponding to can be obtained by
(6)

where denotes the rating score that user has provided for item and is the maximum rating score. In a 5-star rating system (), we consider the following matrix as an example:
(7)

where is unknown (the ‘*’ entry) and constitutes the training set of empirical probabilities , where . By solving the KM learning problem in (2) for the empirical probabilities provided in (7), one can find the optimal model parameters and (an optimal solution to (2)), given by
Then, we can predict () by using the learned KM parameters and as . In this example, the following inclusion holds: . Thus, if a certain user (user 1 or 2) likes movie item 1, this logically implies that the user also likes movie item 2.
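The rating-to-probability mapping of (6) can be sketched as follows, with a hypothetical 5-star rating matrix standing in for (7); the unrated entry is marked and excluded from the training set.

```python
import numpy as np

# Sketch of (6): empirical probabilities from a 5-star rating matrix,
# obtained by dividing each observed rating by the maximum rating R.
# The specific rating values here are illustrative, not those of (7).
R = 5.0
ratings = np.array([[5.0, 4.0],
                    [np.nan, 2.0]])      # np.nan marks the unknown '*' entry
P = ratings / R                          # empirical probabilities where observed
train_mask = ~np.isnan(ratings)          # observed index pairs form the training set
```

The prediction step then fills the masked-out entry with the learned KM inner product, as in (5).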
Remark 1 (Extended Discussion on Related Work)
In contrast to the KM in (1), the state-of-the-art method, MF [7, 8], considers an inner product of two arbitrary vectors without having implicit or desired structures in place. While NNM [11] has a structure similar to (1), the distinction is that NNM relaxes the binary constraints on to a nonnegative box, i.e., , and thus sacrifices the highly interpretable nature of the KM. Unlike the existing data representation techniques, the KM can exactly represent the outcome of random variables in a Kolmogorov sense. As illustrated in Section V, this in turn improves the prediction performance of the KM compared to other existing data representation techniques. Despite this predictability benefit, the existing KM learning methods [14, 27] suffer from high computational complexity and a lack of scalability. In particular, the LCQP subproblem, which can be efficiently solved by the FW algorithm, has been well investigated, while the BQP remains a major computational bottleneck. It is thus of great importance to study more efficient and fast KM learning algorithms that are readily applicable to large-scale problems.
III Proposed Method
To scale KM learning, we propose an efficient first-order method for the BQP subproblem in (4).
III-A Dual Problem Formulation
We transform the BQP subproblem in (4) into a dual problem. To this end, we formulate an equivalent form of the BQP in (4) as
(8) 
where in (4) is ignored in (8), , , and . For simplicity, the iteration index is omitted hereinafter. By introducing and , the problem in (8) can be rewritten as
(9a)  
s.t.  (9b)  
(9c)  
(9d) 
Solving (9) directly is NP-hard due to the rank constraint in (9d); thus, we turn to convex relaxation methods. The SDR of (9) can be expressed in homogenized form with respect to as
(10a)  
s.t.  (10b)  
(10c) 
where and . Note that the diagonal constraint in (9b) has been equivalently transformed into the equality constraints in (10b). While the problem in (9) is combinatorial due to the rank constraint, the relaxed problem in (10) is a convex SDP; the relaxation consists precisely of dropping the rank constraint.
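The lifting behind this relaxation can be checked numerically. The sketch below assumes the standard ±1 reformulation of the binary variables (an assumption on our part, since the exact variable transformation is defined in (8)-(9)): any feasible binary vector, homogenized with a trailing 1, lifts to a matrix that satisfies the diagonal and PSD constraints of (10) and has rank one.

```python
import numpy as np

# Lifting sketch for the SDR in (10), assuming a +/-1 binary model:
# X = [b; 1][b; 1]^T satisfies diag(X) = 1 (cf. (10b)), is PSD
# (cf. (10c)), and has rank one, which is the constraint dropped
# to obtain the convex SDP.
b = np.array([1.0, -1.0, 1.0])           # an illustrative binary vector
z = np.append(b, 1.0)                    # homogenization variable
X = np.outer(z, z)                       # lifted matrix

assert np.allclose(np.diag(X), 1.0)      # equality constraints of (10b)
assert np.all(np.linalg.eigvalsh(X) >= -1e-9)  # PSD constraint of (10c)
assert np.linalg.matrix_rank(X) == 1     # the dropped rank-one constraint
```

Conversely, any rank-one feasible point of (10) factors back into such a vector, which is why only the rank constraint separates (9) from its relaxation.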
We further formulate a regularized SDP formulation of (10) as
(11)  
s.t.  
where is a regularization parameter. With the Frobenius-norm regularization term, the strict convexity of (11) is ensured, which in turn makes strong duality hold for the feasible dual problem of (11). In this work, we leverage the fact that the duality gap of (11) is zero (a consequence of strong duality) to solve the dual problem. In addition, the two problems in (10) and (11) become equivalent as .
Given the regularized SDP formulation in (11), its dual problem and the gradient of the objective function are of interest.
Lemma 1
Suppose the problem in (11) is feasible. Then, the dual problem of (11) is given by
(12) 
where is the vector of Lagrange multipliers associated with each of the equality constraints of (11), , and , in which and ,
, respectively, are the eigenvalues and corresponding eigenvectors of
. The gradient of with respect to is (13)
where .
Proof: See Appendix A.
It is well known that in (12) is a strongly concave function, thereby making the Lagrange dual problem (12) strongly convex with a unique global optimal solution [31]. Furthermore, the special structure of in Lemma 1, i.e., being symmetric, allows us to propose computationally efficient and scalable KM learning algorithms that can handle large-scale datasets with low latency.
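Gradient evaluations in regularized-SDP duals of this kind typically reduce to a positive-part eigenvalue projection built from the eigenvalues and eigenvectors appearing in Lemma 1. The following is a generic sketch of that projection operator, not the exact expression of Lemma 1:

```python
import numpy as np

def psd_part(M):
    """Positive-part projection [M]_+ of a symmetric matrix: keep only
    the nonnegative eigenvalues. Duals of regularized SDPs commonly
    build the objective and its gradient from such a projection (cf.
    the eigenvalue/eigenvector structure in Lemma 1); this is a
    generic sketch of the operator."""
    w, V = np.linalg.eigh(M)             # exact EVD of the symmetric input
    w_pos = np.maximum(w, 0.0)           # clip negative eigenvalues
    return (V * w_pos) @ V.T

M = np.array([[1.0, 2.0],
              [2.0, -3.0]])              # indefinite test matrix
P = psd_part(M)                          # nearest PSD matrix in Frobenius norm
```

This EVD is exactly the per-iteration cost that the enhanced GD of Section III-B2 seeks to avoid.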
III-B Fast GD Methods for the Dual Problem
III-B1 GD
The dual problem in (12), having a strongly concave function , is equivalent to the following unconstrained convex minimization problem
(14) 
with the gradient being . We first introduce a GD method, detailed in Algorithm 1, to solve (14). Note that because the dual problem in (14) is unconstrained, a simple GD method suffices; if a constraint were included, a projected GD method would be needed, whose computational complexity would be much larger (because of the projection at each iteration).
In Algorithm 1, only the gradient of , i.e., , is required to determine the descent direction. It is therefore a more practical and cost-saving method than standard Newton methods, which demand the calculation of second-order derivatives and the inverse of the Hessian matrix. Moreover, Algorithm 1 does not rely on any approximation of the inverse Hessian, unlike quasi-Newton methods [42]. To find a step size in Step 4, we apply the backtracking line search method [43], which is based on the Armijo-Goldstein condition [44]. The algorithm terminates when a preset stopping criterion (for instance, in Step 5, where is a predefined tolerance) is satisfied. Finally, the computational complexity of Algorithm 1 is dominated by the EVD of a matrix, needed to compute in Step 2, which is given as .
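A generic sketch of the scheme in Algorithm 1, gradient descent with Armijo-Goldstein backtracking, is given below; the parameter values and stopping rule are illustrative defaults, not the paper's.

```python
import numpy as np

def gd_backtracking(f, grad, y0, alpha=0.3, beta=0.5, tol=1e-8, max_iter=500):
    """Sketch of gradient descent with a backtracking line search
    enforcing the Armijo-Goldstein sufficient-decrease condition,
    mirroring the structure of Algorithm 1. Parameter names and
    defaults are illustrative."""
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        g = grad(y)
        if np.linalg.norm(g) < tol:      # stopping criterion (cf. Step 5)
            break
        t = 1.0
        # Shrink the step until sufficient decrease is achieved.
        while f(y - t * g) > f(y) - alpha * t * (g @ g):
            t *= beta
        y = y - t * g
    return y

# Usage on a toy strongly convex quadratic.
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
f = lambda y: 0.5 * y @ Q @ y
grad = lambda y: Q @ y
y_star = gd_backtracking(f, grad, np.array([5.0, -3.0]))
```

For the dual in (14), the expensive part is evaluating the gradient itself (the EVD), not the line search, which motivates the enhancement below.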
III-B2 Enhanced GD
In Algorithm 1, an EVD of is required at each iteration to determine and . However, computing an EVD at every iteration is difficult because of its high computational cost () when large-scale datasets are involved (with very large , , and ). It is critical to reduce the computational cost of Algorithm 1 by avoiding the full EVD computation or even discarding it.
In relation to the original SDP problem in (10), the PSD constraint in (10c) is now penalized via the penalty term in , i.e., . Thus, one of the key insights we will use is the following: i) if the PSD constraint is not satisfied, the penalty term equals zero, simplifying the objective function to ; in this case, the gradient is simply , eliminating the EVD computation, and ii) if the PSD constraint is satisfied, the penalty term becomes nonzero, and an EVD is required to evaluate . This fact leads to the following proposition, which states the update rule of for the enhanced GD.
Proposition 1
The enhanced GD includes two cases depending on the condition of the PSD constraint as
The key is to check whether the PSD constraint in Proposition 1 is satisfied without computing an EVD. We propose a simple sufficient condition, based on Weyl’s inequality [45], as demonstrated in the proposed Algorithm 2.
In Algorithm 2, we modify Step 2 of Algorithm 1 by using an initial with equal entries and exploiting the fact that if is not PSD (Case A in Proposition 1), thereby reducing the cost of the EVD. Step of Algorithm 2 follows from the fact that the th eigenvalue of is . One of the key insights we leverage is that the choice of the sequence of gradient directions, i.e., , , does not alter the optimality of the dual problem in (14). We design with the goal of eliminating the EVD computation to the greatest extent possible. Moreover, in Step of Algorithm 2, the condition (Case A in Proposition 1) holds because of Weyl’s inequality [45]. Thus, the EVD of is required only when both of the conditions “all the elements of are the same” and “ ” are violated, as in Phase II-B.
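The flavor of such a Weyl-based certificate can be sketched as follows. The matrix structure C − Diag(y) and the one-time precomputation of its first term's largest eigenvalue are our illustrative assumptions, not the exact quantities of Algorithm 2; the point is that Weyl's inequality turns a PSD-violation check into a scalar comparison.

```python
import numpy as np

def certify_no_positive_eigs(lam_max_C, y):
    """Sufficient condition, via Weyl's inequality, that C - Diag(y)
    has no positive eigenvalue (so the EVD can be skipped, as in
    Case A of Proposition 1):
        lambda_max(C - Diag(y)) <= lambda_max(C) + lambda_max(-Diag(y))
                                =  lambda_max(C) - min(y).
    The structure C - Diag(y) is an illustrative stand-in."""
    return lam_max_C - np.min(y) <= 0.0

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
C = 0.5 * (B + B.T)                       # symmetric test matrix
lam_max_C = np.linalg.eigvalsh(C)[-1]     # computed once, then reused
y = np.full(5, lam_max_C + 1.0)           # sufficiently large multipliers
if certify_no_positive_eigs(lam_max_C, y):
    # The cheap certificate holds, and the exact check agrees:
    assert np.linalg.eigvalsh(C - np.diag(y))[-1] <= 1e-9
```

When the scalar test passes, the iteration proceeds with the EVD-free gradient of Case A; only when it fails must an (approximate) EVD be computed.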
Notice that Step 2 of Algorithm 1 (computing ) has been transformed into two phases (each including two subphases) in Algorithm 2. Algorithm 2 executes the four subphases in order and irreversibly. Specifically, the algorithm first enters Phase I at the initial iteration and ends in Phase II. Once the algorithm enters Phase II, it cannot return to Phase I. Algorithms 1 and 2 are both based on GD, and thus the enhanced GD does not alter the convergence of Algorithm 1 [31].
Proposition 2 (Convergence of Enhanced GD)
This phenomenon is captured in Fig. 1, in which the objective function value as a function of the iteration number is depicted for Algorithms 1 and 2. In terms of flops, Algorithm 2 is more efficient than the original GD in Algorithm 1. This results in a dramatic reduction in the running time of Algorithm 2, since we mostly move along directions obtained without computing an EVD.
III-C Randomization
The solution to the dual problem in (14) (or, equivalently, (12)) produced by Algorithm 2 is not yet a feasible solution to the BQP in (4). A randomization procedure [46] can be employed to extract a feasible binary solution to (4) from the SDP solution of (11). One typical design of the randomization procedure for BQP is to generate feasible points from Gaussian random samples via rounding [47]. The Gaussian randomization procedure provides a tight approximation with probability , asymptotically in [48]. By leveraging the fact that the eigenvalues and corresponding eigenvectors of can be found in Steps 13 and 14 of Algorithm 2, we have
where and . A detailed randomization procedure is provided in Algorithm 3.
In Step 8 of Algorithm 3, the dimensional vector is first recovered from a dimensional vector by considering the structure of in (9), and then used to approximate the BQP solution based on (8). Also note that the randomization performance improves with . In practice, we only need to choose a sufficient but not excessive (for instance, ) to achieve a good approximation of the BQP solution. Moreover, the overall computational complexity is much smaller than that of conventional randomization algorithms [46, 47, 14] because the proposed Algorithm 3 does not require a Cholesky factorization.
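A sketch of Gaussian randomization with eigendecomposition-based sampling follows. The objective matrix, the ±1 feasibility model, and the sample count are illustrative stand-ins for the data in (4)/(8); the factor-based sampling (rather than a Cholesky factorization) mirrors the idea used by Algorithm 3.

```python
import numpy as np

def gaussian_randomization(X, Q, num_samples=100, rng=None):
    """Sketch of Gaussian randomization: draw samples from N(0, X)
    using the SDP solution X as covariance, round each sample to a
    feasible +/-1 vector, and keep the one with the smallest BQP
    objective x^T Q x. Q and the +/-1 model are illustrative."""
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    # Sample via the eigendecomposition of X (no Cholesky needed):
    # F @ F.T = X, so F @ z with z ~ N(0, I) is N(0, X).
    w, V = np.linalg.eigh(X)
    F = V * np.sqrt(np.maximum(w, 0.0))
    best_x, best_val = None, np.inf
    for _ in range(num_samples):
        xi = F @ rng.standard_normal(n)
        x = np.where(xi >= 0, 1.0, -1.0)  # rounding to a feasible point
        val = x @ Q @ x
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Illustrative usage on a small random instance.
rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
Qm = 0.5 * (B + B.T)
Xsol = np.eye(4)                          # stand-in for the SDP solution
x, val = gaussian_randomization(Xsol, Qm, num_samples=50)
```

Increasing the number of samples improves the best rounded point at linear extra cost, matching the remark above that a sufficient but not excessive sample count is enough.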
III-D Overall KM Learning Algorithm
Incorporating Algorithm 2 and Algorithm 3, the overall KM learning framework is described in Algorithm 4.
Note that the index of the BCD iterations, previously omitted, is restored here, and denotes the total number of BCD iterations for KM learning. In Algorithm 4, the BCD method is adopted to refine and until they converge to a stationary point of (2). In fact, the proof of convergence (to a stationary solution) for Algorithm 4 is exactly the same as that of Algorithm 1 in [14]. In practice, we can use to control the termination of Algorithm 4.
IV Proximal EVD and Error Analysis
In this section, several techniques are discussed to further accelerate Algorithm 2.
IV-A Initial Step Size
A good initial step size is crucial for the convergence speed of the enhanced GD. In Phase I-A of Algorithm 2, we have
If , the following holds, where . Therefore, in the first iteration of Phase I-A, we can set an appropriate step size so that has at least one positive eigenvalue, where . With the above modification of Algorithm 2, the execution time spent in Phase I-A is reduced, and thus the total number of iterations required by the enhanced GD can be reduced, as shown in Fig. 2.
IV-B Proximal EVD
Compared to the original GD in Algorithm 1, the enhanced GD in Algorithm 2 substantially reduces the number of costly EVDs. Nevertheless, an EVD is still necessary in Algorithm 2 when the algorithm enters Phase II-B. To further accelerate the algorithm, we employ and modify the Lanczos method to numerically compute a proximal EVD of in Algorithm 2.
The Lanczos algorithm [49] is a special case of the Arnoldi method [50] when the matrix is symmetric. In principle, it is based on an orthogonal projection of onto the Krylov subspace where denotes the dimension of Krylov subspace. An algorithmic description of a modified Lanczos method is presented in Algorithm 5.
Different from the Arnoldi method, the matrix constructed by Algorithm 5 is tridiagonal and symmetric, i.e., the entries of in Algorithm 5 satisfy , , and , . Also, Algorithm 5 iteratively builds an orthonormal basis, i.e., , for such that and , where . Let , , be the eigenpairs of . Then, the eigenvalues/eigenvectors of can be approximated by the Ritz pairs , i.e.,
(15) 
As the dimension of the Krylov subspace increases, the approximation performance improves at the price of additional computation. Thus, in practice, we adopt a value of that balances the tradeoff between approximation accuracy and computational complexity.
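The core Lanczos iteration behind Algorithm 5 can be sketched as follows; this is a plain three-term-recurrence implementation without the paper's modifications, and the matrix sizes are illustrative.

```python
import numpy as np

def lanczos(A, m, rng=None):
    """Sketch of the Lanczos iteration underlying Algorithm 5: build an
    orthonormal basis Q of the m-dimensional Krylov subspace and a
    symmetric tridiagonal T whose eigenpairs (Ritz pairs) approximate
    extremal eigenpairs of the symmetric matrix A, as in (15)."""
    rng = rng or np.random.default_rng(0)
    n = A.shape[0]
    Q = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)                # random unit starting vector
    q_prev, b_prev = np.zeros(n), 0.0
    for j in range(m):
        Q[:, j] = q
        w = A @ q - b_prev * q_prev       # three-term recurrence
        alpha[j] = q @ w                  # diagonal entry of T
        w -= alpha[j] * q
        if j < m - 1:
            beta[j] = np.linalg.norm(w)   # off-diagonal entry of T
            q_prev, q = q, w / beta[j]
            b_prev = beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

rng = np.random.default_rng(2)
B = rng.standard_normal((50, 50))
A = 0.5 * (B + B.T)                       # symmetric test matrix
Q, T = lanczos(A, m=20)
# The largest Ritz value approximates the largest eigenvalue of A.
ritz_max = np.linalg.eigvalsh(T)[-1]
true_max = np.linalg.eigvalsh(A)[-1]
```

Each iteration costs one matrix-vector product, so a subspace dimension much smaller than the matrix size already captures the extremal eigenpairs needed by the dual gradient.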
IV-C Analysis of Approximation Error and Thresholding Scheme
In this subsection, we analyze the approximation error of the proximal EVD and propose a thresholding scheme for selecting an appropriate in Algorithm 5. The main results are provided in the following lemmas.
Lemma 2
Let be any eigenpair of and let in (15) be an approximate eigenpair (Ritz pair) of from Algorithm 5. Then the following hold:
i) The residual error is upper bounded by
(16) 
ii) The maximum approximation error of eigenvalues of is bounded by
(17) 
where is the associated true eigenvalue of .
iii) The minimum approximation error of eigenvalues of is bounded by
(18) 
Proof: See Appendix B.
Lemma 2 indicates that the error bounds of the approximate eigenvalues of obtained by the proximal EVD in Algorithm 5 depend largely on . Indeed, the upper bounds in (17) and (18) are quite tight, as will be seen in Section V. Inspired by Lemma 2, finding an upper bound of that depends only on the trace of is of interest.