The digital era, influencing and reshaping the behaviors, performances, and standards, etc., of societies, communities, and individuals, has presented a big challenge for the conventional mode of data processing. Data consisting of numbers, words, and measurements becomes available in such huge volume, high velocity, and wide variety that it ends up outpacing human-oriented computing. It is urgent to explore the intelligent tools necessary for processing the staggering amount of data. Machine learning (ML), dedicated to providing insights into patterns in big data and extracting pieces of information hidden inside, arises and has been used in a wide variety of applications, such as computer vision, telecommunication , and recommendation systems [3, 4, 5, 6]. Nevertheless, traditional ML algorithms become computationally inefficient and fail to scale up well as the dimension of data grows. A major issue that remains to be addressed is to find effective ML algorithms that perform well on both predictability and interpretability as well as are capable of tackling large-dimensional data with low latency.
I-a Related Work
Data representation, providing driving forces to the advancing ML-based techniques, has lately attracted a great deal of interest because it transforms large-dimensional data into low-dimensional alternatives by capturing their key features and make them amenable for processing, prediction, and analysis. The gamut of data representation techniques including matrix factorization (MF) [7, 8]
, singular value decomposition (SVD)-based models[9, 10], nonnegative models (NNM) 
, and deep neural networks have been shown to perform well in terms of predictive power (the capability of predicting the outcome of random variables that are outside the training set). Unfortunately, these techniques perform rather poorly on the interpretability (the capability of extracting additional information or insights that are hidden inside the data) because on the one hand, they are not developed to directly model the outcome of random variables; on the other hand, they fall under the black-box category which lacks transparency and accountability of predictive models 
. Recently, a Kolmogorov model (KM) that directly represents a binary random variable as a superposition of elementary events in probability space was proposed
; KM models the outcome of a binary random variable as an inner product between two structured vectors, one probability mass function vector and one binary indicator vector. This inner product structure exactly represents an actual probability. Carefully examining association rules between two binary indicator vectors grants the interpretability of KM that establishes mathematically logical/causal relations between different random variables.
Previously, the KM learning was formulated as a coupled combinatorial optimization problem by decomposing it into two subproblems: i) linearly-constrained quadratic program (LCQP) and ii) binary quadratic program (BQP), which can be alternatively solved by utilizing block coordinate descent (BCD). An elegant, low-complexity Frank-Wolfe (FW) algorithm  was used to optimally solve the LCQP by exploiting the unit probability simplex structure. Whereas, it is known to be unpromising to find algorithms to exactly solve the BQP problems in polynomial time. To get around this challenge, relaxation methods for linear [16, 17], quadratic , second-order cone [19, 20], and semi-definite programming (SDP) [21, 22], were proffered to produce a feasible solution close to the optimal solution of the original problem. Among these relaxation methods, the semi-definite relaxation (SDR) has been shown to have a tighter approximation bound than that of others [23, 24]. Thus, a SDR with randomization (SDRwR) method  was employed to optimally solve the BQP of the KM learning in an asymptotic sense . To address the high-complexity issue due to the reliance on the interior point methods, a branch-reduce-and-bound (BRB) algorithm based on discrete monotonic optimization (DMO) [26, 27] was proposed. However, the DMO approach only shows its efficacy in a low-dimensional setting and starts to collapse as the dimension increases. In short, the existing KM methods [14, 27]
suffer from a similar drawback, namely, being unscalable. Unfortunately, the latter limitation hampers the application of them to large-scale datasets, for instance, the MovieLens 1 million (ML1M) dataset111https://grouplens.org/datasets/movielens/1m/. It is thus crucial to explore low-complexity and scalable methods for KM learning.
Duality often arises in linear/nonlinear optimization models in a wide variety of applications such as communication networks , economic markets , and structural design . Simultaneously, the dual problem possesses some good mathematical, geometric, or computational structures that can be exploited to provide an alternative way of handling the intricate primal problems by using iterative methods, such as the first-order gradient descent (GD) [31, 32] and quasi-Newton method [33, 34]. It is for this reason that the first-order iterative methods are widely used when optimizing/training large-scale data representations (e.g., deep neural networks) and machine learning algorithms. We are motivated by these iterative first-order methods to effectively resolve the combinatorial challenge of KM learning.
I-B Overview of Methodologies and Contributions
We present a computationally scalable approach to the KM learning problem, based on dual optimization, by proposing an enhanced GD algorithm. Our main contributions are listed below.
We provide a reformulation of the BQP subproblem of KM learning to a regularized dual optimization problem that ensures strong duality and is amenable to be solved by simple GD. Compared to the existing SDRwR  and DMO  methods, the proposed dual optimization approach proffers a more efficient and scalable solution to KM learning.
Motivated by the fact that EVD is required at each iteration of GD, which introduces a computational bottleneck when applied to large-scale datasets, an enhanced GD that eliminates the EVD computation when it is feasible is proposed to accelerate the computational speed. When the elimination is infeasible and EVD must be computed, we explore a proximal EVD based on the Lanczos method  by taking account of the fact that computing exact, entire EVD is usually impractical. We focus on analyzing the approximation error of the proximal EVD. A thresholding scheme is then proposed to determine the number of iterations of the proximal EVD by exploiting the upper bound of the approximation error and leveraging a normalized Minkowski -norm.
Extensive numerical simulation results are presented to demonstrate the efficacy of the proposed KM learning algorithm. When applied to large-scale datasets (e.g., ML1M dataset), it is shown that the proposed method can achieve comparable training and prediction performance with significantly reduced computational cost of roughly two orders of magnitude, compared to the existing KM learning algorithms. Finally, the interpretability of the proposed method is validated by exploiting the mathematically logical relations. We show that the accuracy of logical relation mining by using the proposed method exceeds .
Notation: A bold lowercase letter is a vector and a bold capital letter is a matrix. , , , , , , and denote the th entry,
th column, trace, main diagonal elements, rank, largest eigenvalue, and largest singular value of, respectively. is the th entry of , , and is a diagonal matrix with on its main diagonal. is the Frobenius inner product of two matrices and , i.e., . indicates that the matrix is positive semi-definite (PSD). is the
th column of the identity matrix of appropriate size.and denote the all-one and all-zero vectors, respectively. , , and denote the symmetric matrix space, nonnegative real-valued vector space, and binary vector space with each entry chosen from , respectively. For , where is the th eigenvalue of , . is the support set of and denotes the cardinality of a set . Finally, indicates that one outcome completely implies another one .
Ii System Model and Preliminaries
We consider a double-index set of binary random variables , , where ( and are the index sets of and , respectively) denotes the set of all index pairs. Thus, can represent any two-dimensional learning applications (involving matrices) such as movie recommendation systems , DNA methylation for cancer detection , and beam alignment in multiple-antenna systems [38, 39]. We let be the probability that the event occurs. Since the random variable considered here is binary, the following holds . Without loss of generality, we can focus on one outcome, for instance, . Then, the -dimensional KM of the random variable is given by
where is the probability mass function vector and is the binary indicator vector. Specifically, is in the unit probability simplex , i.e., , and denotes the support set of (associated with the case when ). The KM in (1) is built under a measurable probability space defined on ( denotes the sample space and is the event space consisting of subsets of ) and satisfies the following conditions: i) , (nonnegativity), ii) (normalization), and iii) for the disjoint events , (countable additivity) . By (1), is modeled as stochastic mixtures of Kolmogorov elementary events. In addition, note that .
Ii-B KM Learning
Assume that the empirical probability of , denoted by , is available from the training set . Obtaining the empirical probabilities for the training set depends on the application and context in practical systems; we will illustrate an example for recommendation systems at the end of this section. The KM learning involves training, prediction, and interpretation as described below.
The KM training proceeds to optimize and by solving the -norm minimization problem:
where , , , , and is the index of BCD iterations, and ii) BQP:
where , , , and . By exploiting the fact that the optimization in (3) was carried out over the unit probability simplex , a simple iterative FW algorithm  was employed to optimally solve (3), while the SDRwR was employed to asymptotically solve the BQP in (4) . It is also possible to solve (4) directly without a relaxation and/or randomization, based on the DMO approach . However, the DMO in  was shown only to be efficient when the dimension is small (e.g., ); its computational cost blows up as increases (e.g., ).
Similar to other supervised learning methods, the trained KM parametersare used to predict probabilities over a test set as
where and .
KM offers a distinct advantage, namely, the interpretability by drawing on fundamental insights into the mathematically logical relations among the data. For two random variables and taken from the training set , i.e., and , if the support sets of and satisfy , then two logical relations between the outcomes of and can be inferred: the first outcome of implies the same one for while the second outcome of implies the second one for , i.e., and [14, Proposition 1]. It is important to note that logical relations emerged from KM are based on the formalism of implications. Thus, they hold from a strictly mathematical perspective, and are general.
An implication of the introduced KM learning is illustrated by taking an example of movie recommendation systems as follows.
Suppose there are two users () who have rated two movie items (). In this example, denotes the event that user likes the movie item , . Then, denotes the probability that user likes item (conversely, denotes the probability that user dislikes item ). Suppose in (1). Then, the four elementary events can represent four different movie genres including i) Comedy, ii) Thriller, iii) Action, and iv) Drama. The empirical probability corresponding to can be obtained by
where denotes the rating score that user has provided for item and is the maximum rating score. In a 5-star rating system (), we consider the following matrix as an example:
where is unknown (as in the ‘*’ entry) and constitutes the training set of empirical probability where . By solving the KM learning problem in (2) for the empirical probabilities provided in (7), one can find the optimal model parameters, and (an optimal solution to (2)), which is given by
Then, we can predict () by using the learned KM parameters and as . In this example, the following inclusion holds . Thus, if a certain user (user 1 or 2) likes movie item 1, this logically implies that the user also likes movie item 2.
Remark 1 (Extended Discussion on Related Work)
In co- ntrast to the KM in (1), the state-of-the-art method, MF [7, 8], considers an inner product of two arbitrary vectors without having implicit or desired structures in place. While NNM  has a similar structure as (1), the distinction is that NNM relaxes the binary constraints on to a nonnegative box, i.e., , and thus sacrifices the highly interpretable nature of KM. Unlike the existing data representation techniques, the KM can exactly represent the outcome of random variables in a Kolmogorov sense. As illustrated in Section V, this in turn improves the prediction performance of the KM compared to other existing data representation techniques. Despite its predictability benefit, the existing KM learning methods [14, 27], however, suffer from high computational complexity and a lack of scalability. In particular, the LCQP subproblem, which can be efficiently solved by the FW algorithm, has been well-investigated, while resolving the BQP introduces a major computational bottleneck. It is thus of great importance to study more efficient and fast KM learning algorithms that are readily applicable to large-scale problems.
Iii Proposed Method
To scale the KM learning, we propose an efficient, first-order method to the BQP subproblem in (4).
Iii-a Dual Problem Formulation
where and . Note that the diagonal constraint in (9b) has been equivalently transformed to equality constraints in (10b). While the problem in (9) is combinatorial due to the rank constraint, the relaxed problem in (10) is a convex SDP. Moreover, the relaxation is done by dropping the rank constraint.
We further formulate a regularized SDP formulation of (10) as
where is a regularization parameter. With a Frobenius-norm term regularized, the strict convexity of (11) is ensured, which in turn makes strong duality hold for the feasible dual problem of (11). In this work, we leverage this fact that the duality gap is zero for (11) (a consequence of strong duality) to solve the dual problem. In addition, the two problems in (10) and (11) are equivalent as .
Given the regularized SDP formulation in (11), its dual problem and the gradient of the objective function are of interest.
where is the vector of Lagrange multipliers associated with each of the equality constraints of (11), , and , in which and , , respectively, are the eigenvalues and corresponding eigenvectors of
, respectively, are the eigenvalues and corresponding eigenvectors of. The gradient of with respect to is
Proof: See Appendix A.
It is well known that in (12) is a strongly concave (piecewise linear) function, thereby making the Lagrange dual problem (12) a strongly convex problem having a unique global optimal solution . Furthermore, the special structure of of Lemma 1, i.e., being symmetric, allows us to propose computationally efficient and scalable KM learning algorithms which can be applied to handle large-scale datasets with low latency.
Iii-B Fast GD Methods For The Dual Problem
The dual problem in (12), having a strongly concave function , is equivalent to the following unconstrained convex minimization problem
with the gradient being . We first introduce a GD, which is detailed in Algorithm 1, to solve (14). Note that, due to the fact that the dual problem in (14) is unconstrained, a simple GD method is proposed here: indeed, we would need a projected GD method if there is constraint included, for which the computational complexity would be much larger (because of the projection at each iteration).
In Algorithm 1, only the gradient of , i.e., , is required to determine the descent direction. It is therefore a more practical and cost-saving method compared to standard Newton methods which demand the calculation of second-order derivatives and the inverse of the Hessian matrix. Moreover, Algorithm 1 does not rely on any approximation of the inverse of the Hessian matrix such as the quasi-Newton methods . To find a step size in Step 4, we apply the backtracking line search method , which is based on the Armijo-Goldstein condition . The algorithm is terminated when the pre-designed stopping criterion (for instance, in Step 5, where is a predefined tolerance) is satisfied. Finally, the computational complexity of Algorithm 1 is dominated by the EVD of a matrix, needed to compute in Step 2, which is given as .
Iii-B2 Enhanced GD
In Algorithm 1, an EVD of is required at each iteration to determine and . However, it is difficult to employ EVD per iteration as they require high computational cost () when large-scale datasets are involved (with very large , , and ). It is critical to reduce the computational cost of Algorithm 1 by avoiding the full computation of EVD or even discarding them.
In relation to the original SDP problem in (10), we can understand the PSD constraint in (10c) is now penalized as the penalty term in , i.e., . Thus, one of the key insights we will use is that: i) if the PSD constraint is not satisfied, the penalty term equals to zero, simplifying the objective function as ; in this case, the gradient is simply , eliminating the computation of EVD, and ii) if the PSD constraint is satisfied, the penalty term becomes nonzero and it requires the computation of EVD to find out . This fact leads to the following proposition showcasing the rule of updating for the enhanced GD.
The enhanced GD includes two cases depending on the condition of the PSD constraint as
The key is to check if the PSD constraint in Proposition 1 is satisfied or not without the need of computing EVD. We propose a simple sufficient condition, based on the Weyl’s inequality , as demonstrated in the proposed Algorithm 2.
In Algorithm 2, we focus on modifying Step 2 in Algorithm 1 by using an initial with equal entries and exploiting the fact that if is not PSD (Case A in Proposition 1) to reduce the computational cost of EVD. Step in Algorithm 2 is due to the fact that the th eigenvalue of is . One of the key insights we leverage is that the choice of the sequence of gradient directions, i.e., , , does not alter the optimality of the dual problem in (14). We approach the design of with the goal of eliminating the computation of EVD to the most extent. Moreover, in Step of Algorithm 2, the condition (Case A in Proposition 1), holds because of the Weyl’s inequality . Thus the EVD of is required only when both the conditions “all the elements of are the same” and “ ” are violated, as in Phase II-B.
Notice that Step 2 in computing of Algorithm 1 has been transformed into two different phases (each phase includes two sub-phases) in Algorithm 2. Algorithm 2 executes the four sub-phases in order and irreversibly. To be specific, the algorithm first enters Phase I at the initial iteration and ends up with Phase II. Once the algorithm enters Phase II, there is no way to return back to Phase I. Algorithms 1 and 2 are based on GD, and thus the enhanced GD does not alter the convergency of Algorithm 1 .
Proposition 2 (Convergence of Enhanced GD)
This phenomenon is captured in Fig. 1 in which the objective function value as a function of the iteration number are depicted for Algorithms 1 and 2. In terms of flops, Algorithm 2 is more efficient than the original GD in Algorithm 1. This results in a dramatic reduction in the running time of Algorithm 2 since we mainly move on the direction obtained without the computation of EVD.
The solution to the dual problem in (14) (or equivalently (12)) produced by Algorithm 2, is not yet a feasible solution to the BQP in (4). A randomization procedure  can be employed to extract a feasible binary solution to (4) from the SDP solution of (11). One typical design of the randomization procedure for BQP is to generate feasible points from the Gaussian random samples via rounding . The Gaussian randomization procedure provides a tight approximation with probability , asymptotically in . By leveraging the fact that the eigenvalues and corresponding eigenvectors of can be found by Steps 13 and 14 of Algorithm 2, we have
where and . A detailed randomization procedure is provided in Algorithm 3.
In Step 8 of Algorithm 3, the -dimensional vector is first recovered from a -dimensional vector by considering the structure of in (9), and then used to approximate the BQP solution based on (8). Also note that the randomization performance improves with . In practice, we only need to choose a sufficient but not excessive (for instance, ) achieving a good approximation for the BQP solution. Moreover, its overall computational complexity is much smaller than the conventional randomization algorithms [46, 47, 14] because our proposed Algorithm 3 does not require the computation of the Cholesky factorization.
Iii-D Overall KM Learning Algorithm
Note that the index of BCD iterations that has been omitted is recovered here and denotes the total number of BCD iterations for KM learning. In Algorithm 4, the BCD method is adopted to refine and until it converges to a stationary point of (2). In fact, the proof of convergence (to stationary solution) for Algorithm 4 is exactly the same as that of Algorithm 1 in . In practice, we can use to control the termination of Algorithm 4.
Iv Proximal EVD and Error Analysis
In this section, several techniques are discussed to further accelerate Algorithm 2.
Iv-a Initial Step Size
A good initial step size is crucial for the convergence speed of the enhanced GD. In Phase I-A of Algorithm 2, we have
If , the following holds where . Therefore, in the first iteration of Phase I-A, we can set an appropriate step size so that has at least one positive eigenvalue, where . With the above modification of Algorithm 2, we can reduce the execution time spent in Phase I-A, and thus, the total number of iterations required by the enhanced GD can be reduced as shown in Fig. 2.
Iv-B Proximal EVD
Compared to the original GD in Algorithm 1, the enhanced GD in Algorithm 2 has reduced the costly EVD substantially. Nevertheless, the EVD is still necessary in Algorithm 2 when the algorithm enters into Phase II-B. In order to further accelerate the algorithm, we employ and modify the Lanczos method to numerically compute the proximal EVD of in Algorithm 2.
The Lanczos algorithm  is a special case of the Arnoldi method  when the matrix is symmetric. In principle, it is based on an orthogonal projection of onto the Krylov subspace where denotes the dimension of Krylov subspace. An algorithmic description of a modified Lanczos method is presented in Algorithm 5.
Different from the Arnoldi method, the matrix constructed by Algorithm 5 is tridiagonal and symmetric, i.e., the entries of in Algorithm 5 satisfy that , , and , . Also, Algorithm 5 iteratively builds an orthonormal basis, i.e., , for such that and , where . Let , , be the eigenpairs of . Then, the eigenvalues/eigenvectors of can be approximated by the Ritz pairs , i.e.,
With the increase of the dimension of Krylov subspace , the approximation performance improves at the price of additional computations. Thus, in practice, we adopt the value of balancing the tradeoff between the accuracy of approximation and the computational complexity.
Iv-C Analysis of Approximation Error and Thresholding Scheme
In this subsection, we analyze the approximation error of the proximal EVD and propose a thresholding scheme for selecting an appropriate in Algorithm 5. The main results are provided in the following lemmas.
i) The residual error is upper bounded by
ii) The maximum approximation error of eigenvalues of is bounded by
where is the associated true eigenvalue of .
iii) The minimum approximation error of eigenvalues of is bounded by
Proof: See Appendix B.
Lemma 2 indicates that the error bounds of the approximate eigenvalues of by using the proximal EVD in Algorithm 5 largely depends on . Indeed, the upper bounds in (17) and (18) are quite tight as will be seen in Section V. Inspired by Lemma 2, finding an upper bound of that only depends on the trace of is of interest.
in Algorithm 5 is upper bounded by