 # Consistent Second-Order Conic Integer Programming for Learning Bayesian Networks

Bayesian Networks (BNs) represent conditional probability relations among a set of random variables (nodes) in the form of a directed acyclic graph (DAG), and have found diverse applications in knowledge discovery. We study the problem of learning the sparse DAG structure of a BN from continuous observational data. The central problem can be modeled as a mixed-integer program with an objective function composed of a convex quadratic loss function and a regularization penalty subject to linear constraints. The optimal solution to this mathematical program is known to have desirable statistical properties under certain conditions. However, the state-of-the-art optimization solvers are not able to obtain provably optimal solutions to the existing mathematical formulations for medium-size problems within reasonable computational times. To address this difficulty, we tackle the problem from both computational and statistical perspectives. On the one hand, we propose a concrete early stopping criterion to terminate the branch-and-bound process in order to obtain a near-optimal solution to the mixed-integer program, and establish the consistency of this approximate solution. On the other hand, we improve the existing formulations by replacing the linear "big-M" constraints that represent the relationship between the continuous and binary indicator variables with second-order conic constraints. Our numerical results demonstrate the effectiveness of the proposed approaches.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A Bayesian network (BN) is a probabilistic graphical model consisting of a labeled directed acyclic graph (DAG) , in which the vertex set corresponds to random variables, and the edge set

prescribes a decomposition of the joint probability distribution of the random variables based on their parents in

. The edge set encodes Markov relations on the nodes in the sense that each node is conditionally independent of its non-descendents given its parents. BNs have been used in knowledge discovery [spirtes2000causation, chen2018causal], classification [aliferis2010local][gao2015structured], latent variable discovery [lazic2013structural] and genetics [ott2003finding]. They also play a vital part in causal inference [pearl2009causal].

In this paper, we propose mixed-integer quadratic programming (MIQP) formulations for learning the optimal DAG structure of BNs given continuous observations from a system of linear structural equation models (SEMs). While there exist exact integer-programming (IP) formulations for learning DAG structure with discrete data [cussens2010maximum, cussens2012bayesian, hemmecke2012characteristic, studeny2013polyhedral, bartlett2013advances, JMLR:v17:14-479, oates2016exact, bartlett2017integer, cussens2017polyhedral, cussens2017bayesian], the development of tailored computational tools for learning the optimal DAG structure from continuous

data has received less attention. In principle, exact methods developed for discrete data can be applied to continuous data. However, such methods result in exponentially sized formulations in terms of the number of binary variables. A common practice to circumvent the exponential number of binary variables is to limit the in-degree of each node

[cussens2012bayesian, cussens2017bayesian, bartlett2017integer]. But, this may result in sub-optimal solutions. On the contrary, MIQP formulations for learning DAGs corresponding to linear SEMs require a polynomial number of binary variables. This is because for BNs with linear SEMs, the score function — i.e., the penalized negative log-likelihood (PNL) — can be explicitly written as a function of the coefficients of linear SEMs [shojaie2010penalized, van2013ell, park2017bayesian, manzour2019integer].

Continuous BNs with linear SEMs have witnessed a growing interest in the statistics and computer science communities [van2013ell, raskutti2013learning, loh2014high, ghoshal2016information, solus2017consistency]. In particular, it has been shown that the solution obtained from solving the PNL augmented by regularization achieves desirable statistical properties [peters2013identifiability, van2013ell, loh2014high]. Moreover, if the model is identifiable [peters2013identifiability, loh2014high], such a solution is guaranteed to uncover the true causal DAG when the sample size is large enough. However, given the difficulty of obtaining exact solutions, existing approaches for learning DAGs from linear SEMs have primarily relied on heuristics, using techniques such as coordinate descent [fu2013learning, aragam2015concave, han2016estimation] and non-convex continuous optimization [zheng2018dags]

. Unfortunately, these heuristics are not guaranteed to achieve the desirable properties of the global optimal solution. Moreover, it is difficult to evaluate the statistical properties of a sub-optimal solution with no optimality guarantees

[koivisto2012advances]. To bridge this gap, in this paper we develop mathematical formulations for learning optimal BNs from linear SEMs using a PNL objective with regularization. By connecting the optimality gap of the mixed-integer program to the statistical properties of the solution, we also establish an early stopping criterion under which we can terminate the branch-and-bound procedure and attain a solution which asymptotically recovers the true parameters with high probability.

Our work is related to recent efforts to develop exact tailored methods for DAG learning from continuous data. [xiang2013lasso] show that -lasso algorithm tailored for DAG structure learning from continuous data with -regularization is more effective than the previous approaches based on dynamic programming [e.g., silander2006simple] that are suitable for both discrete and continuous data. [park2017bayesian] develop a mathematical program for DAG structure learning with regularization. [manzour2019integer] improve and extend the formulation by [park2017bayesian] for DAG learning from continuous data with both and regularizations. The numerical experiments by [manzour2019integer] demonstrate that as the number of nodes grows, their MIQP formulation outperforms -lasso and the existing IP methods; this improvement is both in terms of reducing the IP optimality gap, when the algorithm is stopped due to a time limit, and in terms of computational time, when the instances can be solved to optimality. In light of these recent efforts, the current paper makes important contributions to this problem at the intersection of statistics and optimization.

• The statistical properties of optimal PNL with regularization have been studied extensively [loh2014high, van2013ell]

. However, it is often difficult to obtain an optimal solution and no results have been established on the statistical properties of approximate solutions. In this paper, we give an early stopping criterion for the branch-and-bound process under which the approximate solution gives consistent estimates of the true coefficients of the linear SEM. Our result leverages the statistical consistency of the PNL estimate with

regularization [van2013ell, peters2013identifiability] along with the properties of the branch-and-bound method wherein both lower and upper bound values on the objective function are available at each iteration. By connecting these two properties, we obtain a concrete early stopping criterion, as well as a simple proof of consistency of the approximate solution. To the best of our knowledge, this result is the first of its kind for DAG learning.

• In spite of recent progress, a key challenge in learning DAGs from linear SEMs is enforcing bounds on arc weights. This is commonly modeled using the standard “big- constraint” approach [park2017bayesian, manzour2019integer]. As shown by [manzour2019integer], this strategy leads to poor continuous relaxations for the problem, which in turn results in slow lower bound improvement in the branch-and-bound tree. In particular, [manzour2019integer] establish that all existing big- formulations achieve the same continuous relaxation objective function under a mild condition (see Proposition 4). To circumvent this issue, we present a mixed-integer second-order cone program (MISOCP), which gives a tighter continuous relaxation than existing big- formulations. This formulation can be solved by powerful state-of-the-art optimization packages. Our numerical results show the superior performance of MISOCP compared to the existing big- formulations in terms of improving the lower bound and reducing the optimality gap.

The rest of the paper is organized as follows. In Section 2, we define the DAG structure learning problem corresponding to linear SEMs, and give a general framework for the problem. In Section 3, we present our early stopping criterion and establish the asymptotic properties of the solution obtained under this stopping rule. We review existing mathematical formulations in Section 4, and present our proposed mathematical formulations in Section 5. Results of comprehensive numerical studies are presented in Section 6. We end the paper with a summary in Section 7.

## 2 Problem setup: Penalized DAG estimation with linear SEMs

Let be an undirected and possibly cyclic super-structure graph with node set and edge set ; let be the corresponding bi-directional graph with . We refer to undirected edges as edges and directed edges as arcs.

We assume that causal effects of continuous random variables in a DAG

are represented by linear regressions of the form

 Xk=∑j∈paG0kβjkXj+ϵk,k=1,…,m, (1)

where is the random variable associated with node , represents the parents of node in , i.e., the set of nodes with arcs pointing to ; the latent random variable denotes the unexplained variation in node ; and BN parameter specifies the effect of node on for . The above model is known as a linear SEM [pearl2009causal].

Let be the data matrix with rows representing i.i.d. samples from each random variable, and columns representing random variables . The linear SEM (1) can be compactly written in matrix form as , where is a matrix with for , for all , and is the ‘noise’ matrix. Then, denotes the directed graph on nodes such that arc appears in if and only if . Throughout the paper, we will use and

to denote the matrix of coefficients and its vectorized version.

A key challenge when estimating DAGs by minimizing the loss function (2) is that the true DAG is generally not identifiable from observational data. However, for certain SEM distributions, the true DAG is in fact identifiable from observational data. Two important examples are linear SEMs with possibly non-Gaussian homoscedastic noise variables [peters2013identifiability]

, as well as linear SEMs with unequal noise variances that are known up to a constant

[loh2014high]. In these special cases, the true DAG can be identified from observational data, without requiring the (strong) ‘faithfulness’ assumption, which is known to be restrictive in high dimensions [uhler2013geometry, sondhi2019reduced]. Given these important implications, in this paper we focus on learning Bayesian networks corresponding to the above identifiable linear SEMs.

The negative log likelihood for an identifiable linear SEM (1) with equal noise variances is proportional to

 l(β;X)=ntr{(I−B)(I−B)⊤ˆΣ}; (2)

here is the empirical covariance matrix, and

is the identity matrix

[shojaie2010penalized, van2013ell].

To learn sparse DAGs, van2013ell propose to augment the negative log likelihood with an regularization term. Given a super-structure , the optimization problem corresponding to this penalized negative log-likelihood (PNL) is given by

 PNLM minB∈Rm×mL(β):=l(β;X)+λn∥β∥0 (3a) s.t. G(B)induces a DAG from−−→M, (3b)

where the tuning parameter controls the degree of the regularization, and the constraint (3b) stipulates that the resulting directed subgraph is a DAG induced from . When corresponds to a complete graph, PNL reduces to the original PNL of van2013ell.

The choice of regularization in (3) is deliberate. Although regularization has attractive computational and statistical properties in high-dimensional regression [bulmann2011statistics], many of these advantages disappear in the context of DAG structure learning [fu2013learning, aragam2015concave]. By considering regularization, [van2013ell] establish the consistency of PNL under appropriate assumptions. More specifically, for a Gaussian SEM, they show that the estimated DAG has (asymptotically) the same number of edges as the DAG with minimal number of edges (minimal-edge I-MAP), and establish the consistency of PNL for learning sparse DAGs. These results are formally stated in Proposition 1 in the next section.

###### Remark 1.

A Tikhonov () regularization term, , for a given , can also be added to the objective (3a) to obtain more stable solutions [bertsimas2016best].

In our earlier work [manzour2019integer], we observe that existing mathematical formulations are slow to converge to a provably optimal solution, , of (3) using the state-of-the-art optimization solvers. Therefore, the solution process needs to be terminated early to yield a feasible solution, with a positive optimality gap, i.e., a positive difference between the upper bound on provided by and a lower bound on provided by the best continuous relaxation obtained by the branch-and-bound algorithm upon termination. However, statistical properties of such a sub-optimal solution are not well-understood. Therefore, there exists a gap between theory and computation: while the optimal solution has nice statistical properties, the properties of the solutions obtained from approximate computational algorithms are not known. Moreover, due to the non-convex and complex nature of the problem, characterizing the properties of the solutions provided by heuristics is especially challenging. In the next section, we bridge this gap by developing a concrete early stopping criterion and establishing the consistency of the solution obtained using this criterion.

## 3 Early stopping criterion for DAG learning

In this section, we establish a sufficient condition for the approximate solution of PNL, to be consistent for the true coefficients, ; that is , where is the number of arcs in the true DAG, and means that converges to asymptotically. This result is obtained by leveraging an important property of the branch-and-bound process for integer programming that provides both lower and upper bounds on the objective function upon early stopping, as well as the consistency results of the PNL estimate with regularization. Using the insight from this new result, we then propose a concrete stopping criterion for terminating the branch-and-bound process that results in consistent parameter estimates.

Let and respectively denote the lower and upper bounds on the optimal objective function value (3a) obtained from solving (3) under an early stopping criterion (i.e., when the obtained solution is not necessarily optimal). We define the difference between the upper and lower bounds as the absolute optimality gap: . Let and denote the structure of the DAG and coefficients of the arcs from optimization model (3) under the early stopping condition with sample size and regularization parameter . Let and denote the DAG structure and coefficients of arcs obtained from the optimal solution of (3), and and denote the true DAG structure and the coefficient of arcs, respectively. We denote the number of arcs in , , and by , , and , respectively. The score value in (3a) of each solution is denoted by where .

Next, we present our main result. Our result extends van2013ell’s result on consistency of PNL for the optimal, but computationally unattainable, estimator, to an approximate estimator, , obtained from early stopping. In the following (including the statement of our main result, Proposition 2), we assume that the super-structure is known a priori. The setting where is estimated from data is discussed at the end of the section. We begin by stating the key result from [van2013ell] and the required assumptions. Throughout, we consider a Gaussian linear SEM of the form (1). We denote the variance of error terms, , by and the true covariance matrix of the set of random variables, by the matrix .

###### Assumption 1.

For some constant , it holds that

. Moreover, the smallest eigenvalue of

, , is nonzero.

###### Assumption 2.

Let, as in [van2013ell], be the precision matrix of the vector of noise variables for an SEM given permutation of nodes. Denoting the diagonal entires of this matrix by , there exists a constant such that if is not a multiple of the identity matrix, then

 m−1m∑j=1((~ωjj)2−1)2>1/ω0.
###### Proposition 1.

(Theorem 5.1 in [van2013ell]) Suppose Assumptions 1 and 2 hold. Let . Then for an regularization parameter , it holds with probability at least that

 ∥β⋆−β0∥22+λs⋆=O(λs0).

Here, , because the loss function (2) is that of [van2013ell] scaled by the sample size . Before presenting our main result, we state one more condition on the covariance matrix of the random variables generated by linear SEM. For a given subset , let denote its complement, i.e.,

Definition [raskutti2010restricted]. Define the set for a given subset and constant . The sample covariance matrix satisfies the restricted eigenvalue (RE) condition over with parameters if

 1nv⊤X⊤Xv=1n∥Xv∥22≥γ2∥v∥22,∀v∈C(S;η).

If this condition holds for all subsets with cardinality , we say that satisfies a restricted eigenvalue (RE) condition of order with parameters . The population covariance matrix is said to satisfy the RE condition if

 ∥Σ1/2v∥2≥γ∥v∥2∀v∈C(S;η).

raskutti2010restricted show that if satisfies the RE condition, then there exists constants and such that with probability at least , also satisfies the RE condition with parameters . More specifically, their proof of Corollary 1 shows that for any ,

 ∥v∥22≤c1∥Xv∥22, (4)

where for defined in Assumption 1. In fact, in the low-dimensional setting implied by condition (5), the inequality (4) holds with probability one because, when , for any we have Thus, (4) holds with .

###### Proposition 2.

Suppose satisfies the RE condition of order with parameters and that for constants ,

 n>max{c2σ20(1+η)2γ2s0log(m),c3mlog(n)}. (5)

Suppose also that Assumptions 1 and 2 hold. Let and . Then, the estimator obtained from early stopping of the branch-and-bound process such that GAP satisfies with probability for constants and used for the RE condition.

###### Proof.

First, by the triangle inequality,

 ∥^β−β0∥22+c1λ^s≤∥^β−β⋆∥22+∥β⋆−β0∥22+c1λ^s. (6)

Further, by the sparsity of from Proposition 1, belongs to the set , where and . Thus,

 ∥^β−β⋆∥22≤c1∥X(^β−β⋆)∥22. (7)

Now, noting that (see, e.g., the expanded version in Eq. (10a)), we can write a Taylor series expansion of around to get

 ∥X(^β−β⋆)∥22=ℓ(^β;X)−ℓ(β⋆;X)−2(^β−β⋆)⊤X⊤X(β⋆−β0)+2(^β−β⋆)⊤X⊤E.

Here, we also use the fact that . Thus, using triangle inequality again and the fact that , we get

 ∥^β −β⋆∥22≤ c1[ℓ(^β;X)−ℓ(β⋆;X)]+2c1κmax(X⊤X)∥^β−β⋆∥2∥β⋆−β0∥2+2c1∥^β−β⋆∥2∥X⊤E∥2,

where denotes the maximum eigenvalue of the matrix. Let , and Then, the above inequality can be written as , which implies that . Let be the event under which . Then, on the set ,

 (8)

Plugging in (8) into (6), on the set we get

 ∥^β−β0∥22+c1λ^s ≤c1[ℓ(^β;X)−ℓ(β⋆;X)+λ^s]+∥β⋆−β0∥22+o(1) =c1[ℓ(^β;X)−ℓ(β⋆;X)+λ^s−λs⋆]L(^β)−L(β⋆)+∥β⋆−β0∥22+c1λs⋆+o(1) ≤c1GAP+∥β⋆−β0∥22+c1λs⋆+o(1), (9)

where the last inequality follows from the fact that, by definition, .

Now, by Proposition 1, we know that with probability at least , , and . Moreover, by the RE condition, with probability at least , . Finally, using concentration inequalities for the Gaussian SEM noise [e.g. bulmann2011statistics], the probability of the set is lower bounded by the probability that , which is . Thus, stopping the branch-and-bound algorithm when guarantees that, with probability at least , . ∎

Proposition 2 suggests that the algorithm can be stopped by setting a threshold on the value of for a constant , say . Such a solution will then achieve the same desirable statistical properties as the optimal solution . However, while can be chosen data-adaptively (as discussed in Section 6), the value for is not known. However, one can find an upper bound for based on the number of edges in the super-structure . In particular, if is the moral graph [pearl2009causal] with edges, then . Thus, in this case, a consistent parameter estimate can be obtained if the branch-and-bound process is stopped when .

The above results, including the specific choice of early stopping criterion, are also valid if the super-structure corresponding to the moral graph is not known a priori. That is because the moral graph can be consistently estimated from data using recent developments in graphical modeling; see drton2017structure for a review of the literature. While some of the existing algorithms based on -penalty require an additional irrepresentability condition [meinshausen2006high, saegusa2016joint], this assumption can be relaxed by using instead an adaptive lasso penalty or by thresholding the initial lasso estimates [bulmann2011statistics].

In light of Proposition 2, it is of great interest to develop algorithms that converge to a solution with a small optimality gap expeditiously. To achieve this, one approach is to obtain better lower bounds using the branch-and-bound process from strong mathematical formulations for (3). To this end, we next review existing formulations of (3).

## 4 Existing Formulations of DAG Learning with Linear SEMs

In this section, we review known mathematical formulations for DAG learning with linear SEMs. We first outline the necessary notation below.

Index Sets
: index set of random variables;
: index set of samples.
Input
: an undirected super-structure graph (e.g., the moral graph);
: the bi-directional graph corresponding to the undirected graph ;
, where and denotes th sample () of random variable ; note ;
tuning parameter (penalty coefficient for regularization).

Continuous optimization variables
: weight of arc representing the regression coefficients .

Binary optimization variables
,
.

Let . The PNL can be cast as the following optimization problem:

 min F(β,g), (10a) G(B)induces a DAG from−→M, (10b) βjk(1−gjk)=0, ∀(j,k)∈−→E, (10c) gjk∈{0,1}, ∀(j,k)∈→E. (10d)

The objective function (10a) is an expanded version of in PNL, where we use the indicator variable to encode the regularization. The constraints in (10b) rule out cycles. The constraints in (10c) are non-linear and stipulate that only if .

There are two sources of difficulty in solving (10a)-(10d): (i) the acyclic nature of DAG imposed by the combinatorial constraints in (10b); (ii) the set of nonlinear constraints in (10c), which stipulates that only if there exists an arc in . In Section 4.1, we discuss related studies to address the former, whereas in Section 4.2 we present relevant literature for the latter.

### 4.1 Linear encodings of the acyclicity constraints (10b)

There are several ways to ensure that the estimated graph does not contain any cycles. The first approach is to add a constraint for each cycle in the graph, so that at least one arc in this cycle must not exist in . A cutting plane (CP) method is used to solve such a formulation which may require generating an exponential number of constraints. Another way to rule out cycles is by imposing constraints such that the nodes follow a topological order [park2017bayesian]. A topological ordering is a unique ordering of the nodes of a graph from 1 to such that the graph contains an arc if node appears before node in the order. We refer to this formulation as topological ordering (TO). The layered network (LN) formulation proposed by [manzour2019integer] improves the TO formulation by reducing the number of binary variables. [manzour2019integer] discuss these formulations in detail.

Let be the set of all possible directed cycles and be the set of arcs defining a cycle. The CP formulation removes cycles by imposing the following constraints for (10b)

 CP∑(j,k)∈CAgjk≤|CA|−1,∀CA∈C. (11)

Define decision variables for all and for all . The variable takes value 1 if there is an arc in the network, and takes value 1 if the topological order of node equals . The TO formulation rules out cycles in the graph by the following constraints

 TO gjk≤zjk, ∀(j,k)∈→E, (12a) zjk−mzkj≤∑s∈Vs(oks−ojs), ∀(j,k)∈→E, (12b) ∑s∈Vors=1 ∀r∈V, (12c) ∑r∈Vors=1 ∀s∈V. (12d)

The third way to remove cycles is by imposing the condition that the resulting graph is a layered network. This can be achived by the following set of constraints in the LN formulation:

 LN gjk≤zjk ∀(j,k)∈→E, (13a) zjk−(m−1)zkj≤ψk−ψj ∀(j,k)∈→E. (13b)

Let be the layer value for node . The set of constraints in (13b) ensures that if the layer of node appears before that of node (i.e., there is a direct path from node to node ), then . This rules out any cycles.

The set of constraints in (13b) imposes that if and , then . Thus, additional binary vector along with the set of constraints in (13a) is needed to correctly encode the regularization. Similar reasoning applies for the TO formulation; see [manzour2019integer].

### 4.2 Linear encodings of the non-convex constraints (10c)

The nonconvexity of the set of constraints in (10c) causes challenges in obtaining provably optimal solutions with existing optimization software. Therefore, we consider convex representations of this set of constraints. First, we consider a linear representation of the constraints in (10c). Although the existing formulations discussed in Section 4.1 differ in their approach to ruling out cycles, one major commonality among them is that they replace the non-linear constraint (10c) by so called big- constraints given by

 −Mgjk≤βjk≤Mgjk,∀(j,k)∈→E, (14)

for a large enough . Unfortunately, these big- constraints (14) are poor approximations of (10c), especially in this problem, because no natural and tight value for exist. Although a few techniques have been proposed for obtaining the big- parameter for sparse regression problem [bertsimas2017sparse, gomez2018mixed], the resulting parameters are often too large in practice. Further, finding a tight big- parameter itself is a difficult problem to solve for DAG structure learning.

Consider (10a)-(10d) by substituting (10c) by the linear big- constraints (14) and writing the objective function in a matrix form. We denote the resulting formulation, which has a convex quadratic objective and linear constraints, by the following MIQP.

 MIQPmin tr[(I−B)(I−B)⊤X⊤X]+λn∑(j,k)∈→Egjk (15a) (15b) gjk∈{0,1}∀(j,k)∈→E. (15c)

Depending on which types of constraints are used in lieu of (10b), as explained in Section 4.1, MIQP (16) results in three different formulations: MIQP+CP, which uses (11), MIQP+TO, which uses (12), and MIQP+LN, which uses (13), respectively.

To discuss the challenges of the big- approach, we give a definition followed by two propositions.

###### Definition 1.

A formulation is said to be stronger than formulation if where and correspond to the feasible regions of continuous relaxations of and , respectively.

###### Proposition 3.

(Proposition 3 in [manzour2019integer]) The MIQP+TO and MIQP+CP formulations are stronger than the MIQP+LN formulation.

###### Proposition 4.

(Proposition 5 in [manzour2019integer]) Let denote the optimal coefficient associated with an arc from problem (3). For the same variable branching in the branch-and-bound process, the continuous relaxations of the MIQP+LN formulation for regularizations attain the same optimal objective function value as MIQP+TO and MIQP+CP, if .

Proposition 3 implies that the MIQP+TO and MIQP+CP formulations are stronger than the MIQP+LN formulation. Nonetheless, Proposition 4 establishes that for sufficiently large values of

, stronger formulations attain the same continuous relaxation objective function value as the weaker formulation throughout the branch-and-bound tree. The optimal solution to the continuous relaxation of MIQP formulations of DAG structure learning may not be at an extreme point of the convex hull of feasible points. Hence, stronger formulations do not necessarily ensure better lower bounds. This is in contrast to a mixed-integer program (MIP) with linear objective, whose continuous relaxation is a linear program (LP). In that case, there exists an optimal solution that is an extreme point of the corresponding feasible set. As a result, a better lower bound can be obtained from a stronger formulation that better approximates the convex hull of a mixed-integer linear program; this generally leads to faster convergence. A prime example is the traveling salesman problem (TSP), for which stronger formulations attain better computational performance

[oncan2009comparative]. In contrast, the numerical results by [manzour2019integer] show that MIQP+LN has better computational performance because it is a compact formulation with the fewest constraints and the same continuous relaxation bounds.

Our next result, which is adapted from [dong2015regularization] to the DAG structure learning problem, shows that the continuous relaxation of MIQP is equivalent to the optimal solution to the problem where the -regularization term is replaced with an -regularization term (i.e., ) with a particular choice of the penalty. This motivates us to consider tighter continuous relaxation for MIQP. Let be an optimal solution to the continuous relaxation of MIQP.

###### Proposition 5.

For , a continuous relaxation of MIQP (16), where the binary variables are relaxed, is equivalent to the problem where the regularization term is replaced with an -regularization term with penalty parameter .

###### Proof.

For , the value is in an optimal solution to the continuous relaxation of MIQP (16). Otherwise, we can reduce the value of the decision variable without violating any constraints while reducing the objective function. Note that since , we have . To show that the set of constraints in (10b) is satisfied, we consider the set of CP constraints. In this case, the set of constraints (10b) holds, i.e., , because . This implies that is the optimal solution. Thus, the objective function reduces to regularization with the coefficient .

Finally, Proposition 4 establishes that for , the objective function value of the continuous relaxations of MIQP+CP, MIQP+LN and MIQP+TO are equivalent. This implies that the continuous relaxations of all formulations are equivalent, which completes the proof. ∎

Despite the promising performance of MIQP+LN, its continuous relaxation objective function value provides a weak lower bound due to the big- constraints. To circumvent this issue, a natural strategy is to improve the big- value. Nonetheless, existing methods which ensure a valid big- value or heuristic techniques [park2017bayesian, gomez2018mixed] do not lead to tight big- values. For instance, the heuristic technique by [park2017bayesian] to obtain big- values always satisfies the condition in Proposition 3 and exact techniques are expected to produce even larger big- values. Therefore, we next directly develop tighter approximations for (10c).

## 5 New Perspectives for Mathematical Formulations of DAG Learning

In this section, we discuss improved mathematical formulations for learning DAG structure of a BN based on convex (instead of linear) encodings of the constraints in (10c).

Problem (10) is an MIQP with non-convex complementarity constraints (10c), a class of problems which has received a fair amount of attention from the operations research community over the last decade [frangioni2006perspective, frangioni2007sdp, frangioni2009computational, frangioni2011projected, gomez2018mixed]. There has also been recent interest in leveraging these developments to solve sparse regression problems with regularization [pilanci2015sparse, dong2015regularization, xie2018ccp, atamturk2019rank, wei2019convexification].

Next, we review applications of MIQPs with complementarity constraints of the form (10c) for solving sparse regression with regularization. [frangioni2011projected] develop a so-called projected perspective relaxation method, to solve the perspective relaxation of mixed-integer nonlinear programming problems with a convex objective function and complementarity constraints. This reformulation requires that the corresponding binary variables are not involved in other constraints. Therefore, it is suitable for sparse regression, but cannot be applied for DAG structure learning. [pilanci2015sparse] show how a broad class of -regularized problems, including sparse regression as a special case, can be formulated exactly as optimization problems. The authors use the Tikhonov regularization term and convex analysis to construct an improved convex relaxation using the reverse Huber penalty. In a similar vein, [bertsimas2017sparse] exploit the Tikhonov regularization and develop an efficient algorithm by reformulating the sparse regression mathematical formulation as a saddle-point optimization problem with an outer linear integer optimization problem and an inner dual quadratic optimization problem which is capable of solving high-dimensional sparse regressions. [xie2018ccp] apply the perspective formulation of sparse regression optimization problem with both and the Tikhonov regularization. The authors establish that the continuous relaxation of the perspective formulation is equivalent to the continuous relaxation of the formulation given by [bertsimas2017sparse]. [dong2015regularization] propose perspective relaxation for sparse regression optimization formulation and establish that the popular sparsity-inducing concave penalty function known as the minimax concave penalty [zhang2010nearly] and the reverse Huber penalty [pilanci2015sparse] can be obtained as special cases of the perspective relaxation – thus the relaxations of formulations by [zhang2010nearly, pilanci2015sparse, bertsimas2017sparse, xie2018ccp] are equivalent. The authors obtain an optimal perspective relaxation that is no weaker than any perspective relaxation. Among the related approaches, the optimal perspective relaxation by [dong2015regularization] is the only one that does not explicitly require the use of Tikhonov regularization.

The perspective formulation, which in essence is a fractional non-linear program, can be cast either as a mixed-integer second-order cone program (MISOCP) or a semi-infinite mixed-integer linear program (SIMILP). Both formulations can be solved directly by state-of-the-art optimization packages. [dong2015regularization] and [atamturk2019rank] solve the continuous relaxations and then use a heuristic approach (e.g., rounding techniques) to obtain a feasible solution (an upper bound). In this paper, we directly solve the MISOCP and SIMILP formulations for learning sparse DAG structures.

Next, we present how perspective formulation can be suitably applied for DAG structure learning with regularization. We further cast the problem as MISOCP and SIMILP. To this end, we express the objective function (15a) in the following way:

 tr[(I−B)(I−B)⊤X⊤X]+λn∑(j,k)∈→Egjk (16a) =tr[(I−B−B⊤)X⊤X+2BB⊤X⊤X+λn∑(j,k)∈→Egjk. (16b)

Let be a vector such that , where and means that matrix is positive semi-definite. By splitting the quadratic term in (16b), the objective function can be expressed as

 tr[(I−B−B⊤)X⊤X+BB⊤(X⊤X−Dδ)]+tr(BB⊤Dδ)+λn∑(j,k)∈→Egjk. (17)

Let . (In the presence of Tikhonov regularization with tuning parameter , we let as described in Remark 1.) Then, Cholesky decomposition can be applied to decompose as (note ). As a result, . The separable component can also be expressed as . Using this notation, the objective (17) can be written as

 tr[(I−B−B⊤)X⊤X+BB⊤Q]+m∑j=1∑(j,k)∈→Eδjβ2jk+λn∑(j,k)∈→Egjk.

The Perspective Reformulation (PRef) of MIQP is then given by

 PRefmin tr[(I−B−B⊤)X⊤X+BB⊤Q]+ (18a) m∑j=1∑(j,k)∈→Eδjβ2jkgjk+λn∑(j,k)∈→Egjk, (18b)

The objective function (18a) is formally undefined when some = 0. More precisely, we use the convention that when and when and [frangioni2009computational]. The continuous relaxation of PRef, referred to as the perspective relaxation, is much stronger than the continuous relaxation of MIQP [pilanci2015sparse]. However, an issue with PRef is that the objective function is nonlinear due to the fractional term. There are two ways to reformulate PRef. One as a mixed-integer second-order conic program (MISOCP) (see, Section 5.1) and the other as a semi-infinite mixed-integer linear program (SIMILP) (see, Section 5.2).

### 5.1 Mixed-integer second-order conic program

Let be additional variables representing . Then, the MISOCP formulation is given by

 MISOCPmin tr[(I−B−B⊤)X⊤X+BB⊤Q]+ (19a) m∑j=1∑(j,k)∈→Eδjsjk+λn∑(j,k)∈→Egjk, sjkgjk≥β2jk(j,k)∈→E, (19b) 0≤sjk≤M2gjk(j,k)∈→E, (19c) (19d)

Here, the constraints in (19b) imply that only when . The constraints in (19b) are second-order conic representable because they can be written in the form of . The set of constraints in (19c) is valid since implies and for . The set of constraints in (19c) is not required, yet they improve the computational efficiency especially when we restrict the big- value. [xie2018ccp] report similar behavior for sparse regression. When we relax and let , we obtain the continuous relaxation of MISOCP (19). Let us denote the feasible region of continuous relaxation of MISOCP (19) and MIQP (16) by MISOCP and MIQP, and the objective function values by OFV(MISOCP) and OFV(MIQP), respectively. For a more general problem than ours, [cui2013convex] give a detailed proof establishing that the feasible region of the former is contained in the feasible region of latter i.e., MISOCP . This implies that OFV(MISOCP) OFV(MIQP). Therefore, we are able to obtain stronger lower bounds using MISOCP than MIQP.

### 5.2 Mixed-integer semi-infinite integer linear program

An alternative approach to reformulate PRef is via perspective cuts developed by [frangioni2006perspective, frangioni2007sdp]. To apply perspective cuts, we use the reformulation idea first proposed in [frangioni2006perspective] by introducing dummy decision matrix to distinguish the separable and non-separable part of the objective function; we also add the additional constraint where is element of matrix and is the decision variable in the optimization problem. Following this approach, MIQP can be reformulated as an SIMILP:

 SIMILPmin tr[(I−B−B⊤)X⊤X+DD⊤Q]+ (20a) m∑j=1∑(j,k)∈→Eδjvjk+λn∑(j,k)∈→Egjk, djk=βjk(j,k)∈→E, (20b) vjk≥2¯βjkβjk−¯β2jkgjk∀¯βjk∈[−M,M]∀(j,k)∈→E, (20c) (20d) vjk≥0,(j,k)∈→E. (20e)

The set of constraints in (20c) are known as perspective cuts. Note that there are infinitely many such constraints. Although this problem cannot be solved directly, it lends itself to a delayed cut generation approach whereby a (small) finite subset of constraints in (20c) is kept, the current solution of the relaxation is obtained, and all the violated inequalities for the relaxation solution are added for (assuming ). This process is repeated until termination criteria are met. This procedure can be implemented using the cut callback function available by off-the-shelf solvers such as Gurobi or CPLEX.

### 5.3 Selecting δ

In the MISOCP and SIMILP formulations, one important question is how to identify a valid . A natural choice is diag where is the minimum eigenvalue of , is a sufficiently small number to avoid numerical instability of estimating eigenvalues, and is a column vector of ones. The issue with this approach is that if , then becomes a trivial 0 matrix. If

turns out to be a zero matrix, then MISOCP formulation reduces to the big-

formulation. [frangioni2007sdp] present an effective approach for obtaining a valid by solving the following semidefinite program (SDP)

 (21a)

This formulation can attain a non-zero even if . Numerical results by [frangioni2007sdp] show that this method compares favorably with the minimum eigenvalue approach. [zheng2014improving] propose an SDP approach, which obtains such that the continuous relaxation of MISOCP (19) is as tight as possible.

Similar to [dong2015regularization], our formulation does not require adding a Tikhonov regularization. In this case, PRef is effective when is sufficiently diagonally dominant. When and each row of is independent, then is guaranteed to be a positive semi-definite matrix [dong2015regularization]. On the other hand, when , is not full-rank. Therefore, a Tikhonov regularization term should be added with sufficiently large to make [dong2015regularization] in order to benefit from the strengthening provided by PRef.

## 6 Experiments

In this section, we report the results of our numerical experiments that compare different formulations and evaluate the effect of different tuning parameters and estimation strategies. Our experiments are performed on a cluster operating on UNIX with Intel Xeon E5-2640v4 2.4GHz. All formulations are implemented in the Python programming language. Gurobi 8.1 is used as the solver. Unless otherwise stated, a time limit of (in seconds), where denotes the number of nodes, and an MIQP relative optimality gap of are imposed across all experiments after which runs are aborted. The relative optimality gap is calculated by RGAP where UB(X) denotes the objective value associated with the best feasible integer solution (incumbent) and LB(X) represents the best obtained lower bound during the branch-and-bound process for the formulation .

Unless otherwise stated, we assume which corresponds to the Bayesian information criterion (BIC) score. To select the big- parameter, , in all formulations we use the proposal of park2017bayesian. Specifically, given