On the expressive power of query languages for matrices

09/25/2017
by Robert Brijder, et al.

We investigate the expressive power of MATLANG, a formal language for matrix manipulation based on common matrix operations and linear algebra. The language can be extended with the operation inv of inverting a matrix. In MATLANG + inv we can compute the transitive closure of directed graphs, whereas we show that this is not possible without inversion. Indeed we show that the basic language can be simulated in the relational algebra with arithmetic operations, grouping, and summation. We also consider an operation eigen for diagonalizing a matrix, which is defined so that different eigenvectors returned for the same eigenvalue are orthogonal. We show that inv can be expressed in MATLANG + eigen. We put forward the open question whether there are boolean queries about matrices, or generic queries about graphs, expressible in MATLANG + eigen but not in MATLANG + inv. The evaluation problem for MATLANG + eigen is shown to be complete for the complexity class ∃R.



1 Introduction

Data scientists often use matrices to represent their data, as opposed to using the relational data model. These matrices are then manipulated in programming languages such as R or MATLAB. These languages have common operations on matrices built in, notably matrix multiplication; matrix transposition; elementwise operations on the entries of matrices; solving nonsingular systems of linear equations (matrix inversion); and diagonalization (eigenvalues and eigenvectors). Providing database support for matrices and multidimensional arrays has been a long-standing research topic [34], originally geared towards applications in scientific data management, and more recently motivated by machine learning over big data [5, 39, 9, 31].

Database theory and finite model theory provide a rich picture of the expressive power of query languages [1, 24]. In this paper we would like to bring matrix languages into this picture. There is a lot of current interest in languages that combine matrix operations with relational query languages or logics, both in database systems [20] and in finite model theory [11, 12, 19]. In the present study, however, we focus on matrices alone. Indeed, given their popularity, we believe the expressive power of matrix sublanguages also deserves to be understood in its own right.

The contents of this paper can be introduced as follows. We begin the paper by defining the language MATLANG as an analog for matrices of the relational algebra for relations. This language is based on five elementary operations, namely, the one-vector; turning a vector into a diagonal matrix; matrix multiplication; matrix transposition; and pointwise function application. We give examples showing that this basic language is capable of expressing common matrix manipulations. For example, the Google matrix of any directed graph G can be computed in MATLANG, starting from the adjacency matrix of G.

Well-typedness and well-definedness notions of expressions are captured via a simple data model for matrices. In analogy to the relational model, a schema consists of a number of matrix names, and an instance assigns matrices to the names. Recall that in a relational schema, a relation name is typed by a set of attribute symbols. In our case, a matrix name is typed by a pair α × β, where α and β are size symbols that indicate, in a generic manner, the number of rows and columns of the matrix.

In Section 3 we show that our language can be simulated in the relational algebra with aggregates [23, 28], using a standard representation of matrices as relations. The only aggregate function that is needed is summation. In fact, MATLANG is already subsumed by aggregate logic with only three nonnumerical variables. Conversely, MATLANG can express all queries from graph databases (binary relational structures) to binary relations that can be expressed in first-order logic with three variables. In contrast, the four-variable query asking if the graph contains a four-clique is not expressible.

In Section 4 we extend MATLANG with an operation inv for inverting a matrix, and we show that the extended language is strictly more expressive. Indeed, the transitive closure of binary relations becomes expressible. The possibility of reducing transitive closure to matrix inversion has been pointed out by several researchers [26, 10, 36]. We show that the restricted setting of MATLANG + inv suffices for this reduction to work. That transitive closure is not expressible without inversion follows from the locality of relational algebra with aggregates [28].

Another prominent operation of linear algebra, with many applications in data mining and graph analysis [17, 27], is to return eigenvectors and eigenvalues. There are various ways to define this operator formally. In Section 5 we define the operation eigen to return a basis of eigenvectors, in which eigenvectors for the same eigenvalue are orthogonal. We show that the resulting language MATLANG + eigen can express inversion. The argument is well known from linear algebra, but our result shows that it can be carried out in MATLANG + eigen, once more attesting that we have defined an adequate matrix language. It is natural to conjecture that MATLANG + eigen is actually strictly more powerful than MATLANG + inv in expressing, say, boolean queries about matrices. Proving this is an interesting open problem.

Finally, in Section 6 we look into the evaluation problem for MATLANG + eigen expressions. In practice, matrix computations are performed using techniques from numerical mathematics [15]. It remains of foundational interest, however, to know whether the evaluation of expressions is effectively computable. We need to define this problem with some care, since we work with arbitrary complex numbers. Even if the inputs are, say, 0-1 matrices, the outputs of the eigen operation can be complex numbers. Moreover, until now we have allowed arbitrary pointwise functions, which we should restrict somehow if we want to discuss computability. Our approach is to restrict pointwise functions to be semi-algebraic, i.e., definable over the real numbers. We will observe that the input-output relation of an expression e, applied to input matrices of given dimensions, is definable in the existential theory of the real numbers, by a formula of size polynomial in the size of e and the given dimensions. This places natural decision versions of the evaluation problem for MATLANG + eigen in the complexity class ∃R (combined complexity). We show moreover that there exists a fixed expression (data complexity) for which the evaluation problem is ∃R-complete, even restricted to input matrices with integer entries. It also follows that equivalence of expressions, over inputs of given dimensions, is decidable.

2 MATLANG

We assume a sufficient supply of matrix variables, which serve to indicate the inputs to expressions in MATLANG. Variables can also be introduced in let-constructs inside expressions. The syntax of MATLANG expressions e is defined by the grammar:

e ::= M (matrix variable)
  | let M = e_1 in e_2 (local binding)
  | e* (conjugate transpose)
  | 1(e) (one-vector)
  | diag(e) (diagonalization of a vector)
  | e_1 · e_2 (matrix multiplication)
  | apply[f](e_1, …, e_n) (pointwise application, f : C^n → C)

In the last rule, f is the name of a function f : C^n → C, where C denotes the complex numbers. Formally, the syntax of MATLANG is parameterized by a repertoire of such functions, but for simplicity we will not reflect this in the notation.

Example 1.

Let c be a constant; we also use c as a name for the constant function C → C : z ↦ c. Then

is an example of an expression. At this point, this is a purely syntactical example; we will see its semantics shortly. The expression is actually equivalent to one without the let-construct. The let-construct is useful to give names to intermediate results, but is not essential for now. It will become essential later, when we enrich MATLANG with the eigen operation. ∎

In defining the semantics of the language, we begin by defining the basic matrix operations. Following practical matrix sublanguages such as R or MATLAB, we will work throughout with matrices over the complex numbers. However, a real-number version of the language could be defined as well.

Transpose:

If A is a matrix then A* is its conjugate transpose. So, if A is an m × n matrix then A* is an n × m matrix and the entry (A*)_{ij} is the complex conjugate of the entry A_{ji}.

One-vector:

If A is an m × n matrix then 1(A) is the m × 1 column vector consisting of all ones.

Diag:

If v is an m × 1 column vector then diag(v) is the m × m diagonal square matrix with v on the diagonal and zero everywhere else.

Matrix multiplication:

If A is an m × n matrix and B is an n × p matrix then the well-known matrix multiplication AB is defined to be the m × p matrix C where C_{ij} = Σ_{k=1}^{n} A_{ik} B_{kj}. In MATLANG we explicitly denote this as A · B.

Pointwise application:

If A_1, …, A_n are matrices of the same dimensions m × p, then apply[f](A_1, …, A_n) is the m × p matrix C where C_{ij} = f((A_1)_{ij}, …, (A_n)_{ij}).

Figure 1: Basic matrix operations of MATLANG. The matrix multiplication example is taken from Axler’s book [3].
Example 2.

The operations are illustrated in Figure 1. In the pointwise application example, we use the function f : C² → C defined by f(x, y) = 1 if x and y are both real numbers with x ≤ y, and f(x, y) = 0 otherwise.
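To make the basic operations concrete, here is a minimal numpy sketch of all five; it is our illustration, not part of the paper, and the Python function names are our own.

```python
import numpy as np

# A sketch of the five basic MATLANG operations on complex matrices.
# MATLANG itself writes e*, 1(e), diag(e), e1 · e2, and apply[f](e1, ..., en).

def conjugate_transpose(A):               # e*
    return A.conj().T

def one_vector(A):                        # 1(e): the m x 1 all-ones column vector
    return np.ones((A.shape[0], 1), dtype=complex)

def diag_vec(v):                          # diag(e): only defined on column vectors
    assert v.shape[1] == 1
    return np.diagflat(v)

def matmul(A, B):                         # e1 · e2: dimensions must match
    assert A.shape[1] == B.shape[0]
    return A @ B

def pointwise_apply(f, *Ms):              # apply[f](e1, ..., en): equal dimensions
    assert len({M.shape for M in Ms}) == 1
    return np.vectorize(f)(*Ms)

A = np.array([[0, 1], [1, 0]], dtype=complex)
print(matmul(conjugate_transpose(one_vector(A)), one_vector(A)))  # 1x1 matrix holding m = 2
```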

2.1 Formal semantics

The formal semantics of MATLANG expressions is defined in a straightforward manner, as shown in Figure 2. An instance I is a function, defined on a nonempty finite set var(I) of matrix variables, that assigns a matrix to each element of var(I). Figure 2 provides the rules that allow us to derive that an expression e, on an instance I, successfully evaluates to a matrix A. We denote this success by e(I) = A. The reason why an evaluation may not succeed can be found in the rules that have a condition attached to them. The rule for variables fails when an instance simply does not provide a value for some input variable. The rules for diag, apply, and matrix multiplication have conditions on the dimensions of matrices that need to be satisfied for the operations to be well-defined.

Example 3 (Scalars).

The expression from Example 1, regardless of the matrix assigned to M, evaluates to the 1 × 1 matrix whose single entry equals c. We introduce the shorthand c for this constant expression. Obviously, in practice, scalars would be built into the language and would not be computed in such a roundabout manner. In this paper, however, we are interested in expressiveness, so we start from a minimal language and then see what is already expressible in this language.

M ∈ var(I)  ⟹  M(I) = I(M)

e_1(I) = A and e_2(I[M := A]) = B  ⟹  (let M = e_1 in e_2)(I) = B

e(I) = A  ⟹  e*(I) = A*

e(I) = A  ⟹  1(e)(I) = 1(A)

e(I) = A and A is a column vector  ⟹  diag(e)(I) = diag(A)

e_1(I) = A and e_2(I) = B and the number of columns of A equals the number of rows of B  ⟹  (e_1 · e_2)(I) = A · B

e_k(I) = A_k for all k = 1, …, n, and all A_k have the same dimensions  ⟹  apply[f](e_1, …, e_n)(I) = apply[f](A_1, …, A_n)

Figure 2: Big-step operational semantics of MATLANG. The notation I[M := A] denotes the instance that is equal to I, except that M is mapped to the matrix A.
Example 4 (Scalar multiplication).

Let A be any matrix and let B be a 1 × 1 matrix; let c be the value of B’s single entry. Viewing c as a scalar, we define the operation c ⊙ A as multiplying every entry of A by c. We can express c ⊙ A as

let C = 1(A) · B · (1(A*))* in apply[×](A, C)

If A is an m × n matrix, we compute in variable C the m × n matrix where every entry equals c. Then pointwise multiplication is used to do the scalar multiplication.
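For illustration, a numpy transcription of this construction, reusing the helpers from our earlier sketch (again our code, not the paper's):

```python
def scalar_mult(B, A):
    # c ⊙ A: B is a 1 x 1 matrix holding the scalar c.
    assert B.shape == (1, 1)
    # C = 1(A) · B · (1(A*))* is an m x n matrix with every entry equal to c.
    C = one_vector(A) @ B @ conjugate_transpose(one_vector(conjugate_transpose(A)))
    return pointwise_apply(lambda x, y: x * y, A, C)
```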

Example 5 (Google matrix).

Let A be the adjacency matrix of a directed graph (modeling the Web graph) on nodes numbered 1, …, n. Let 0 < d < 1 be a fixed “damping factor”. Let k_i denote the outdegree of node i. For simplicity, we assume k_i is nonzero for every i. Then the Google matrix [7, 6] of A is the n × n matrix G defined by

G_{ij} = d · A_{ij}/k_i + (1 − d)/n.

The calculation of G from A can be expressed in MATLANG as follows:

let J = 1(A) · (1(A))* in
let K = A · J in
let N = (1(A))* · 1(A) in
let P = apply[/](A, K) in
apply[+](d ⊙ P, ((1 − d) ⊙ apply[1/x](N)) ⊙ J)

In variable J we compute the n × n matrix where every entry equals one. In K we compute the matrix where all entries in the ith row equal k_i. In N we compute the 1 × 1 matrix containing the value n. The pointwise functions applied are addition, division, and reciprocal. We use the shorthand for constants (d and 1 − d) from Example 3, and the shorthand ⊙ for scalar multiplication from Example 4.
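A numpy rendering of the same computation (our sketch; it mirrors the let-bindings above but is not the paper's literal expression):

```python
def google_matrix(A, d=0.85):
    J = one_vector(A) @ conjugate_transpose(one_vector(A))   # all-ones n x n matrix
    K = A @ J                                                # row i holds the outdegree k_i
    N = conjugate_transpose(one_vector(A)) @ one_vector(A)   # 1 x 1 matrix holding n
    P = pointwise_apply(lambda a, k: a / k, A, K)            # P[i, j] = A[i, j] / k_i
    teleport = scalar_mult((1 - d) / N, J)                   # (1 - d)/n in every entry
    return pointwise_apply(lambda x, y: x + y, scalar_mult(np.array([[d]]), P), teleport)
```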

Example 6 (Minimum of a vector).

Let v be an m × 1 column vector of real numbers; we would like to extract the minimum from v. This can be done as follows:

let V = v · (1(v))* in
let L = apply[≤](V, V*) · 1(v) in
let M = (1(v))* · 1(v) in
let s = apply[=](L, M ⊙ 1(v)) in
let r = apply[1/x](s* · s) in
r ⊙ (s* · v)

The pointwise functions applied are ≤, which returns 1 on (x, y) if x ≤ y and 0 otherwise; =, defined analogously; and the reciprocal function. In variable V we compute a square matrix holding m copies of v. Then in variable L we compute the column vector where L_i counts the number of j such that v_i ≤ v_j. If L_i = m then v_i equals the minimum. Variable M computes the scalar m, and column vector s is a selector where s_i = 1 if v_i equals the minimum, and s_i = 0 otherwise. Since the minimum may appear multiple times in v, we compute in r the inverse of the multiplicity. Finally we sum the different occurrences of the minimum in v and divide by the multiplicity.
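The same trick reads as follows in numpy (our sketch; v is assumed to be an m × 1 real-valued column vector):

```python
def minimum_of_vector(v):
    m = v.shape[0]
    V = v @ conjugate_transpose(one_vector(v))        # V[i, j] = v[i]: m copies of v
    LE = pointwise_apply(lambda x, y: 1.0 if x.real <= y.real else 0.0,
                         V, conjugate_transpose(V))
    L = LE @ one_vector(v)                            # L[i] = #{j : v[i] <= v[j]}
    s = pointwise_apply(lambda l: 1.0 if l == m else 0.0, L)   # selector for the minimum
    mult = conjugate_transpose(s) @ s                 # 1 x 1: multiplicity of the minimum
    total = conjugate_transpose(s) @ v                # 1 x 1: minimum times multiplicity
    return pointwise_apply(lambda t, k: t / k, total, mult)
```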

2.2 Types and schemas

We have already remarked that, due to conditions on the dimensions of matrices, expressions are not well-defined on all instances. For example, if I is an instance where M is a 2 × 3 matrix and N is a 3 × 3 matrix, then the expression M · M is not defined on I. The expression M · N, however, is well-defined on I. We now introduce a notion of schema, which assigns types to matrix names, so that expressions can be type-checked against schemas.

Our types need to be able to guarantee equalities between numbers of rows or numbers of columns, so that apply and matrix multiplication can be typechecked. Our types also need to be able to recognize vectors, so that diag can be typechecked.

Formally, we assume a sufficient supply of size symbols, which we will denote by the letters α, β, γ. A size symbol represents the number of rows or columns of a matrix. Together with an explicit 1, we can indicate arbitrary matrices as α × β, square matrices as α × α, column vectors as α × 1, row vectors as 1 × α, and scalars as 1 × 1. Formally, a size term is either a size symbol or an explicit 1. A type is then an expression of the form s_1 × s_2 where s_1 and s_2 are size terms. Finally, a schema S is a function, defined on a nonempty finite set var(S) of matrix variables, that assigns a type to each element of var(S).

The typechecking of MATLANG expressions is now shown in Figure 3. The figure provides the rules that allow us to infer an output type τ for an expression e over a schema S. To indicate that a type can be successfully inferred, we use the notation S ⊢ e : τ. When we cannot infer a type, we say e is not well-typed over S. For example, when S(M) = α × β and S(N) = β × β, then the expression M · M is not well-typed over S. The expression M · N, however, is well-typed with output type α × β.

To establish the soundness of the type system, we need a notion of conformance of an instance to a schema.

Formally, a size assignment σ is a function from size symbols to positive natural numbers. We extend σ to any size term by setting σ(1) = 1. Now, let S be a schema and I an instance with var(I) = var(S). We say that I is an instance of S if there is a size assignment σ such that for all M ∈ var(S), if S(M) = s_1 × s_2, then I(M) is a σ(s_1) × σ(s_2) matrix. In that case we also say that I conforms to S by the size assignment σ.

We now obtain the following obvious but desirable property.

Proposition 7 (Safety).

If S ⊢ e : s_1 × s_2, then for every instance I conforming to S by size assignment σ, the matrix e(I) is well-defined and has dimensions σ(s_1) × σ(s_2).

M ∈ var(S)  ⟹  S ⊢ M : S(M)

S ⊢ e_1 : τ_1 and S[M := τ_1] ⊢ e_2 : τ_2  ⟹  S ⊢ (let M = e_1 in e_2) : τ_2

S ⊢ e : s_1 × s_2  ⟹  S ⊢ e* : s_2 × s_1

S ⊢ e : s_1 × s_2  ⟹  S ⊢ 1(e) : s_1 × 1

S ⊢ e : s × 1  ⟹  S ⊢ diag(e) : s × s

S ⊢ e_1 : s_1 × s_2 and S ⊢ e_2 : s_2 × s_3  ⟹  S ⊢ e_1 · e_2 : s_1 × s_3

n > 0 and f : C^n → C and S ⊢ e_k : τ for all k = 1, …, n  ⟹  S ⊢ apply[f](e_1, …, e_n) : τ

Figure 3: Typechecking MATLANG. The notation S[M := τ] denotes the schema that is equal to S, except that M is mapped to the type τ.

3 Expressive power of MATLANG

It is natural to represent an m × n matrix A by a ternary relation

Rel_2(A) = {(i, j, A_{ij}) : 1 ≤ i ≤ m and 1 ≤ j ≤ n}.

In the special case where A is an m × 1 matrix (column vector), A can also be represented by a binary relation Rel_1(A) = {(i, A_{i1}) : 1 ≤ i ≤ m}. Similarly, a 1 × n matrix (row vector) A can be represented by Rel_1(A) = {(j, A_{1j}) : 1 ≤ j ≤ n}. Finally, a 1 × 1 matrix (scalar) A can be represented by the unary singleton relation Rel_0(A) = {(A_{11})}.
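For concreteness, here is a small Python sketch of these three representations (our illustration; the names rel2, rel1, rel0 mirror the representations above, and indices are 1-based as in the definition):

```python
def rel2(A):                                   # general m x n matrix
    m, n = A.shape
    return {(i + 1, j + 1, A[i, j]) for i in range(m) for j in range(n)}

def rel1(v):                                   # column vector (m x 1)
    return {(i + 1, v[i, 0]) for i in range(v.shape[0])}

def rel0(s):                                   # scalar (1 x 1)
    return {(s[0, 0],)}
```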

Note that in MATLANG, we perform calculations on matrix entries, but not on row or column indices. This fits well with the relational model with aggregates as formalized by Libkin [28]. In this model, the columns of relations are typed as “base”, indicated by b, or “numerical”, indicated by n. In the relational representations of matrices presented above, the last column is of type n and the other columns (if any) are of type b. In particular, in our setting, numerical columns hold complex numbers.

Given this representation of matrices by relations, MATLANG can be simulated in the relational algebra with aggregates. Actually, the only aggregate operation we need is summation. We will not reproduce the formal definition of the relational algebra with summation [28], but note the following salient points:

  • Expressions are built up from relation names using the classical operations union, set difference, cartesian product (×), selection (σ), and projection (π), plus two new operations: function application and summation.

  • For selection, we only use equality and nonequality comparisons on base columns. No selection on numerical columns will be needed in our setting.

  • For any function f : C^n → C, the corresponding function-application operation can be applied to any relation r having columns j_1, …, j_n, which must be numerical. The result is the relation {(t, f(t.j_1, …, t.j_n)) : t ∈ r}, adding a numerical column to r. We allow n = 0, in which case f is a constant.

  • The summation operation can be applied to any relation r having columns j, j_1, …, j_n, where column j must be numerical. In our setting we only need the operation in cases where columns j_1, …, j_n are base columns. The result of the operation is the relation

    {(t.j_1, …, t.j_n, s_t) : t ∈ r},

    where

    s_t = Σ { t′.j : t′ ∈ r and t′.j_1 = t.j_1, …, t′.j_n = t.j_n }.

    Again, n can be zero, in which case the result is a singleton.
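To illustrate, here is a small Python sketch of this grouping-with-summation semantics on relations encoded as sets of tuples; the function and its signature are our own illustration, not Libkin's formal syntax.

```python
from collections import defaultdict

def summation(r, j, group_cols):
    # r: relation as a set of tuples; column j is numerical;
    # group_cols: positions of the (base) grouping columns j_1, ..., j_n.
    sums = defaultdict(complex)
    for t in r:
        sums[tuple(t[c] for c in group_cols)] += t[j]
    return {key + (total,) for key, total in sums.items()}

# With group_cols = [] the result is a singleton holding the overall sum.
```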

3.1 From MATLANG to relational algebra with summation

To state the translation formally, we assume a supply of relation variables, which, for convenience, we can take to be the same as the matrix variables. A relation type is a tuple of b’s and n’s. A relational schema is a function, defined on a nonempty finite set of relation variables, that assigns a relation type to each element of that set.

One can define well-typedness for expressions in the relational algebra with summation, and define the output type. We omit this definition here, as it follows a well-known methodology [42] and is analogous to what we have already done for MATLANG in Section 2.2.

To define relational instances, we assume a countably infinite universe dom of abstract atomic data elements. It is convenient to assume that the natural numbers are contained in dom. We stress that this assumption is not essential but simplifies the presentation. Alternatively, we would have to work with explicit embeddings from the natural numbers into dom.

Let τ be a relation type. A tuple of type τ is a tuple of the same arity as τ, whose components are elements of dom in the base positions of τ, and complex numbers in the numerical positions. A relation of type τ is a finite set of tuples of type τ. An instance I of a relational schema S is a function defined on the relation variables of S so that I(R) is a relation of type S(R) for every relation variable R of S.

We must connect the matrix data model to the relational data model. Let τ = s_1 × s_2 be a matrix type. Let us call τ a general type if s_1 and s_2 are both size symbols; a vector type if s_1 is a size symbol and s_2 is 1, or vice versa; and the scalar type if τ is 1 × 1. To every matrix type τ we associate a relation type Rel(τ): (b, b, n) if τ is general, (b, n) if τ is a vector type, and (n) if τ is the scalar type.

Then to every matrix schema S we associate the relational schema Rel(S) where Rel(S)(M) = Rel(S(M)) for every M ∈ var(S). For each instance I of S, we define the instance Rel(I) over Rel(S) by letting Rel(I)(M) be Rel_2(I(M)), Rel_1(I(M)), or Rel_0(I(M)), depending on whether S(M) is general, a vector type, or the scalar type. Here we use the relational representations Rel_2, Rel_1 and Rel_0 of matrices introduced in the beginning of Section 3.

Theorem 8.

Let S be a matrix schema, and let e be a MATLANG expression that is well-typed over S with output type τ. Let R be Rel_2, Rel_1, or Rel_0, depending on whether τ is general, a vector type, or scalar, respectively.

  1. There exists an expression E in the relational algebra with summation, that is well-typed over Rel(S) with output type Rel(τ), such that for every instance I of S, we have E(Rel(I)) = R(e(I)).

  2. The expression E uses neither set difference, nor selection conditions on numerical columns.

  3. The only functions used in E are those used in pointwise applications in e; complex conjugation; multiplication of two numbers; and the constant functions 0 and 1.

Proof.

We only give a few representative examples.

  • If e is of a general type then the translation of e* is obtained from the translation of e by swapping the two base columns and applying the function c, where c is the complex conjugate. If e is of the scalar type, however, the translation of e* simply applies c to the single numerical column.

  • If e is of a general type then the translation of 1(e) projects the translation of e on its first base column and attaches the value one to every tuple. Here, 1 stands for the constant function.

  • If e is of a vector type then the translation of diag(e) takes the cartesian product of the base column with itself, uses selection on equality of the two base columns to place the entries of e on the diagonal, and uses the constant function 0 for the off-diagonal positions.

  • If e_1 is of type α × β and e_2 is of type β × γ, then the translation of e_1 · e_2 joins the translations of e_1 and e_2 on the middle index (using cartesian product, equality selection on base columns, and projection), multiplies the two numerical columns, and applies summation, grouping by the two outer base columns.

    If, however, e_1 is of type 1 × β and e_2 is of type β × 1, then the two translations are binary relations; we join them on their base column, multiply, and sum. We use pointwise multiplication.

  • If e_1 and e_2 are of the same type τ then the translation of apply[f](e_1, e_2) joins the two translations on their base columns and applies f to the two numerical columns.

We may ignore the let-construct as it does not add expressive power. ∎

Remark.

The different treatment of general types, vector types, and scalar types is necessary because in our version of the relational algebra, selections can only compare base columns for equality; in particular we cannot select for the value 1.

Remark.

We can sharpen the above theorem a bit if we work in the relational calculus with aggregates. Every MATLANG expression can already be expressed by a formula in the relational calculus with summation that uses only three distinct base variables (variables ranging over values in base columns). The details are given in the Appendix.

3.2 Expressing graph queries

So far we have looked at expressing matrix queries in terms of relational queries. It is also natural to express relational queries as matrix queries. This works best for binary relations, or graphs, which we can represent by their adjacency matrices.

Formally, define a graph schema to be a relational schema where every relation variable is assigned the type (b, b) of arity two. We define a graph instance as an instance I of a graph schema, where the active domain of I equals {1, …, n} for some positive natural number n. The assumption that the active domain always equals an initial segment of the natural numbers is convenient for forming the bridge to matrices. This assumption, however, is not essential for our results to hold. Indeed, the logics we consider do not have any built-in predicates on base variables, besides equality. Hence, they view the active domain elements as abstract data values.

To every graph schema S we associate a matrix schema Mat(S), where Mat(S)(R) = γ × γ for every relation variable R of S, for a fixed size symbol γ. So, all matrices are square matrices of the same dimension. Let I be a graph instance of S, with active domain {1, …, n}. We will denote the adjacency matrix of a binary relation R over {1, …, n} by Adj(R). Now any such instance I is represented by the matrix instance Mat(I) over Mat(S), where Mat(I)(R) = Adj(I(R)) for every R.

A graph query over a graph schema S is a function q that maps each graph instance I of S to a binary relation on the active domain of I. We say that a MATLANG expression e expresses the graph query q if e is well-typed over Mat(S) with output type γ × γ, and for every graph instance I of S, we have e(Mat(I)) = Adj(q(I)).

We can now give a partial converse to Theorem 8. We assume active-domain semantics for first-order logic [1]. Please note that the following result deals only with pure first-order logic, without aggregates or numerical columns. The proof, while instructive, has been relegated to the Appendix.

Theorem 9.

Every graph query expressible in FO³ (first-order logic with equality, using at most three distinct variables) is expressible in MATLANG. The only functions needed in pointwise applications are boolean functions on {0, 1}, and testing whether a number is positive.

We can complement the above theorem by showing that the quintessential first-order query requiring four variables is not expressible in MATLANG. The proof is given in the Appendix.

Proposition 10.

The graph query over a single binary relation R that maps R to itself if R contains a four-clique, and to the empty relation otherwise, is not expressible in MATLANG.

4 Matrix inversion

Matrix inversion (solving nonsingular systems of linear equations) is a ubiquitous operation in data analysis. We can extend MATLANG with matrix inversion as follows. Let S be a schema and e be an expression that is well-typed over S, with output type of the form s × s. Then the expression inv(e) is also well-typed over S, with the same output type s × s. The semantics is defined as follows. For an instance I, if e(I) = A is an invertible matrix, then inv(e)(I) is defined to be the inverse of A; otherwise, it is defined to be the zero square matrix of the same dimensions as A. The extension of MATLANG with inversion is denoted by MATLANG + inv.

Example 11 (PageRank).

Recall Example 5 where we computed the Google matrix G of A. In the process we already showed how to compute the matrix P defined by P_{ij} = A_{ij}/k_i, and the scalar holding the value n. So, in the following expression, we assume we already have P and n. Let I_n be the n × n identity matrix, and let 1 denote the column vector consisting of all ones. The PageRank vector x of A can be computed as follows [13]:

x = ((1 − d)/n) · (I_n − d · P*)⁻¹ · 1

This calculation is readily expressed in MATLANG + inv as

((1 − d) ⊙ apply[1/x](N)) ⊙ (inv(apply[−](I_n, d ⊙ P*)) · 1(A))

where N is the 1 × 1 matrix holding n, as in Example 5, and I_n is expressible as diag(1(A)).

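In numpy, the inversion-based PageRank computation can be sketched as follows; this is our illustration, with inv implementing the convention above (the zero matrix on non-invertible input).

```python
def inv(A):
    # MATLANG semantics of inv: the inverse if A is invertible, else the zero matrix.
    try:
        return np.linalg.inv(A)
    except np.linalg.LinAlgError:
        return np.zeros_like(A)

def pagerank(A, d=0.85):
    n = A.shape[0]
    k = A.sum(axis=1, keepdims=True)     # outdegrees, assumed nonzero
    P = A / k                            # P[i, j] = A[i, j] / k_i
    return (1 - d) / n * inv(np.eye(n) - d * P.T) @ np.ones((n, 1))
```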
Example 12 (Transitive closure).

We next show that the reflexive-transitive closure of a binary relation is expressible in MATLANG + inv. Let A be the adjacency matrix of a binary relation R on {1, …, n}. Let I_n be the n × n identity matrix, expressible as diag(1(A)). From earlier examples we know how to compute the 1 × 1 scalar matrix holding the value 1/(n + 1). The matrix B = (1/(n + 1)) ⊙ A has 1-norm strictly less than 1, so Σ_{k≥0} B^k converges, and is equal to (I_n − B)⁻¹ [15, Lemma 2.3.3]. Now (i, j) belongs to the reflexive-transitive closure of R if and only if the (i, j) entry of (I_n − B)⁻¹ is nonzero. Thus, we can express the reflexive-transitive closure of R as

apply[f](inv(apply[−](I_n, B)))

where f(x) is 1 if x ≠ 0 and 0 otherwise. We can obtain the transitive closure by multiplying the above expression with A. ∎
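A numpy sketch of this reduction (our illustration; with floating-point arithmetic we compare against a small tolerance rather than testing for exact zero):

```python
def reflexive_transitive_closure(A):
    n = A.shape[0]
    B = A / (n + 1)                      # 1-norm < 1, so the Neumann series converges
    S = np.linalg.inv(np.eye(n) - B)     # equals the sum of B^k over all k >= 0
    return (np.abs(S) > 1e-12).astype(int)

def transitive_closure(A):
    T = A @ reflexive_transitive_closure(A)
    return (np.abs(T) > 1e-12).astype(int)
```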

By Theorem 8, any graph query expressible in MATLANG is expressible in the relational algebra with aggregates. It is known [18, 28] that such queries are local. The transitive-closure query from Example 12, however, is not local. We thus conclude:

Theorem 13.

MATLANG + inv is strictly more powerful than MATLANG in expressing graph queries.

Once we have the transitive closure, we can do many other things, such as checking bipartiteness of undirected graphs, checking connectivity, or checking cyclicity. MATLANG is expressive enough to reduce these queries to the transitive-closure query, as shown in the following example for bipartiteness. The same approach, via Theorem 9, can be used for connectedness or cyclicity.

Example 14 (Bipartiteness).

To check bipartiteness of an undirected graph, given as a symmetric binary relation R without self-loops, we first compute the transitive closure T of the composition of R with itself. Then the condition T ∩ R = ∅ expresses that the graph is bipartite (no odd cycles). The result now follows from Theorem 9.

Example 15 (Number of connected components).

Using transitive closure we can also easily compute the number of connected components of a binary relation R on {1, …, n}, given as an adjacency matrix A. We start from the union of R and its converse. This union is expressible by Theorem 9. We then compute the reflexive-transitive closure C of the union. Now the number of connected components of R equals Σ_i 1/d_i, where d_i is the degree of node i in C. This sum is simply expressible as (1(A))* · apply[1/x](C · 1(A)).
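In numpy this count can be sketched as follows (our illustration, reusing reflexive_transitive_closure from the previous sketch):

```python
def number_of_components(A):
    n = A.shape[0]
    U = ((A + A.T) > 0).astype(float)        # union of the relation and its converse
    C = reflexive_transitive_closure(U).astype(float)
    d = C @ np.ones((n, 1))                  # d[i] = size of the component of node i
    return (np.ones((1, n)) @ (1.0 / d)).item()
```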

5 Eigenvalues

Another workhorse in data analysis is diagonalizing a matrix, i.e., finding a basis of eigenvectors. Formally, we define the operation eigen as follows. Let A be an n × n matrix. Recall that A is called diagonalizable if there exists a basis of C^n consisting of eigenvectors of A. In that case, there also exists such a basis where eigenvectors corresponding to the same eigenvalue are orthogonal. Accordingly, we define eigen(A) to return an n × n matrix, the columns of which form a basis of C^n consisting of eigenvectors of A, where eigenvectors corresponding to the same eigenvalue are orthogonal. If A is not diagonalizable, we define eigen(A) to be the n × n zero matrix.

Note that eigen is nondeterministic; in principle there are infinitely many possible results. This models the situation in practice where numerical packages such as R or MATLAB return approximations to the eigenvalues and a set of corresponding eigenvectors, but the latter are not unique. Hence, some care must be taken in extending MATLANG with the eigen operator. Syntactically, as for inversion, whenever e is a well-typed expression with a square output type, we now also allow the expression eigen(e), with the same output type. Semantically, however, the rules of Figure 2 must be adapted so that they do not infer statements of the form e(I) = A, but rather of the form A ∈ e(I), i.e., A is a possible result of e on I. The let-construct now becomes crucial; it allows us to assign a possible result of eigen to a new variable, and work with that intermediate result consistently.

In this and the next section, we assume notions from linear algebra. An excellent introduction to the subject has been given by Axler [3].

Remark (Eigenvalues).

We can easily recover the eigenvalues from the eigenvectors, using inversion. Indeed, if A is diagonalizable and B ∈ eigen(A), then B⁻¹ · A · B is a diagonal matrix with all eigenvalues of A on the diagonal, so that the ith eigenvector in B corresponds to the eigenvalue in the ith column of B⁻¹ · A · B. This is the well-known eigendecomposition. However, the same can also be accomplished without using inversion. Indeed, suppose B ∈ eigen(A), and let λ_i be the eigenvalue to which the ith column of B corresponds. Then the ith column of A · B equals λ_i times the ith column of B. Each eigenvector is nonzero, so we can divide away the entries from B in A · B (setting division by zero to zero). We thus obtain a matrix where the ith column consists of zeros or λ_i, with at least one occurrence of λ_i. By counting multiplicities, dividing them out, and finally summing, we obtain λ_1, …, λ_n in a column vector. We can apply a final diag to get it back into diagonal form. The expression for doing all this uses similar tricks as those shown in Examples 5 and 6. ∎
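As an illustration, the inversion-free recovery of eigenvalues can be sketched in numpy as follows; this is our code, where np.divide with a where-mask implements “division by zero gives zero”.

```python
def eigenvalues_from_eigenvectors(A, B):
    AB = A @ B                                   # column i equals lambda_i times column i of B
    Q = np.divide(AB, B, out=np.zeros_like(AB), where=(B != 0))
    # Column i of Q now holds lambda_i at the positions where B is nonzero.
    counts = (B != 0).sum(axis=0)                # multiplicity per column
    return Q.sum(axis=0) / counts                # lambda_1, ..., lambda_n
```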

The above remark suggests a shorthand in MATLANG + eigen where we return both the eigenvectors B and the eigenvalues D together:

let ⟨B, D⟩ = eigen(A) in e

This models how the eigen operation works in the languages R and MATLAB. We agree that D, like B, is the zero matrix if A is not diagonalizable.

Example 16 (Rank of a matrix).

Since the rank of a diagonalizable matrix equals the number of nonzero entries in its diagonal form, we can express the rank of a diagonalizable matrix A as follows:

let ⟨B, D⟩ = eigen(A) in (1(A))* · apply[f](D) · 1(A)

where f(x) is 1 if x ≠ 0 and 0 otherwise.
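For illustration, our numpy sketch of this rank computation, with np.linalg.eig standing in for eigen and a tolerance standing in for exact zero-testing (the input is assumed diagonalizable):

```python
def rank_of_diagonalizable(A, tol=1e-9):
    eigenvalues, B = np.linalg.eig(A)    # B holds a basis of eigenvectors as columns
    # The rank equals the number of nonzero eigenvalues, counted with multiplicity.
    return int((np.abs(eigenvalues) > tol).sum())
```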

Example 17 (Graph partitioning).

A well-known heuristic for partitioning an undirected graph without self-loops is based on an eigenvector corresponding to the second-smallest eigenvalue of the Laplacian matrix [27]. The Laplacian can be derived from the adjacency matrix A as let D = diag(A · 1(A)) in apply[−](D, A). (Here D is the degree matrix.) Now let ⟨B, E⟩ = eigen(L), where L is the Laplacian. In an analogous way to Example 6, we can compute a matrix E′, obtained from E by replacing the occurrences of the second-smallest eigenvalue by 1 and all other entries by 0. Then the eigenvectors corresponding to this eigenvalue can be isolated from B (and the other eigenvectors zeroed out) by multiplying B · E′. ∎
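A numpy sketch of the heuristic (our illustration; the Laplacian of an undirected graph is symmetric, so np.linalg.eigh applies and returns real eigenvalues in ascending order):

```python
def fiedler_vector(A):
    n = A.shape[0]
    D = np.diagflat(A @ np.ones((n, 1)))     # degree matrix, diag(A · 1(A))
    L = D - A                                # the Laplacian
    eigenvalues, B = np.linalg.eigh(L)       # ascending, real, for symmetric L
    return B[:, 1]                           # eigenvector for the second-smallest eigenvalue

# Nodes are then partitioned by the sign of the entries of the Fiedler vector.
```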

It turns out that inv is subsumed by eigen. The proof is in the Appendix.

Theorem 18.

Matrix inversion is expressible in MATLANG + eigen.

A natural question to ask is if MATLANG with eigen is strictly more expressive than MATLANG with inv. In a noninteresting sense, the answer is affirmative. Indeed, when evaluating a MATLANG + inv expression on an instance where all matrix entries are rational numbers, the result matrix is also rational. In contrast, the eigenvalues of a rational matrix may be complex numbers. The more interesting question, however, is: Are there graph queries expressible deterministically in MATLANG + eigen, but not in MATLANG + inv? This is an interesting question for further research. The answer may depend on the functions that can be used in pointwise applications.

Remark (Determinacy).

The stipulation “deterministically” in the above open question is important. Ideally, we use the nondeterministic operation eigen only as an intermediate construct. It is an aid to achieve a powerful computation, but the final expression should have only a single possible output on every input. The expression of Example 16 is deterministic in this sense, as is the expression for inversion described in the proof of Theorem 18.

6 The evaluation problem

The evaluation problem asks, given an input instance I and an expression e, to compute the result e(I). There are some issues with this naive formulation, however. Indeed, in our theory we have been working with arbitrary complex numbers. How do we even represent the input? For practical applications, it is usually sufficient to support matrices with rational numbers only. For MATLANG + inv, this approach works: when the input is rational, the output is rational too, and can be computed in polynomial time. For the basic matrix operations this is clear, and for matrix inversion we can use the well-known method of Gaussian elimination.

When adding the eigen operation, however, the output may become irrational. Much worse, the eigenvalues of an adjacency matrix (even of a tree) need not even be definable in radicals [14]. Practical systems, of course, apply techniques from numerical mathematics to compute rational approximations. But it is still theoretically interesting to consider the exact evaluation problem.

Our approach is to represent the output symbolically, following the idea of constraint query languages [21, 25]. Specifically, we can define the input-output relation of an expression, for given dimensions of the input matrices, by an existential first-order logic formula over the reals. Such formulas are built from real variables, integer constants, addition, multiplication, equality, inequality (<), disjunction, conjunction, and existential quantification.

Example 19.

Consider the expression eigen(M) over the schema consisting of a single matrix variable M. Any instance where M is an n × n matrix A can be represented by a tuple of 2n² real numbers. Indeed, let x_{ij} be the real part of A_{ij} (the real part of a complex number), and let x′_{ij} be the imaginary part. Then A can be represented by the tuple of the x_{ij} and x′_{ij}. Similarly, any possible result B of eigen(A) can be represented by a similar tuple. We introduce the variables x_{ij}, x′_{ij}, y_{ij}, and y′_{ij}, for 1 ≤ i, j ≤ n, where the x-variables describe an arbitrary input matrix and the y-variables describe an arbitrary possible output matrix. Denoting the input matrix by A and the output matrix by B, we can now write an existential formula expressing that B is a possible result of eigen applied to A:

  • To express that the columns of B form a basis, we write that there exists a nonzero matrix C such that C · B is the identity matrix. It is straightforward to express this condition by a formula.

  • To express, for each column vector b of B, that b is an eigenvector of A, we write that there exists λ such that A · b = λ · b.

  • The final and most difficult condition to express is that distinct eigenvectors b and b′ that correspond to the same eigenvalue are orthogonal. We cannot write the antecedent “b and b′ correspond to the same eigenvalue” with a universally quantified eigenvalue, as this is not a proper existential formula. (Note though that the conjugate transpose of b is readily expressed.) Instead, we avoid an explicit quantifier and rewrite the antecedent as the conjunction, over all positions k, of (A · b)_k · (b′)_k = (b)_k · (A · b′)_k.

  • A final detail is that we should also be able to express that A is not diagonalizable, for in that case we need to define eigen(A) to be the zero matrix. Nondiagonalizability is equivalent to the existence of a Jordan form with at least one 1 on the superdiagonal. We can express this as follows. We postulate the existence of an invertible matrix T such that the product J = T⁻¹ · A · T has all entries zero, except those on the diagonal and the superdiagonal. The entries on the superdiagonal can only be 0 or 1, with at least one 1. Moreover, if an entry J_{i,i+1} on the superdiagonal is nonzero, the diagonal entries J_{i,i} and J_{i+1,i+1} must be equal. ∎

The approach taken in the above example leads to the following general result. The operations of MATLANG are handled using similar ideas as illustrated above for the eigen operation, and are actually easier. The let-construct, and the composition of subexpressions into larger expressions, are handled by existential quantification.

Theorem 20.

An input-sized expression consists of a schema S, an expression e in MATLANG + eigen that is well-typed over S with output type s_1 × s_2, and a size assignment σ defined on the size symbols occurring in S. There exists a polynomial-time computable translation that maps any input-sized expression as above to an existential first-order formula φ over the vocabulary of the reals, expanded with symbols for the functions used in pointwise applications in e, such that

  1. Formula φ has the following free variables:

    • For every M ∈ var(S), let S(M) = t_1 × t_2. Then φ has the free variables x^M_{ij} and x′^M_{ij}, for 1 ≤ i ≤ σ(t_1) and 1 ≤ j ≤ σ(t_2).

    • In addition, φ has the free variables y_{ij} and y′_{ij}, for 1 ≤ i ≤ σ(s_1) and 1 ≤ j ≤ σ(s_2).

    The set of these free variables is denoted by V.

  2. Any assignment ρ of real numbers to these variables specifies, through the x-variables, an instance I conforming to S by σ, and through the y-variables, a matrix B.

  3. Formula φ is true over the reals under such an assignment ρ, if and only if B ∈ e(I).

The existential theory of the reals is decidable; actually, the full first-order theory of the reals is decidable [2, 4]. But specifically, the class of problems that can be reduced in polynomial time to the existential theory of the reals forms a complexity class of its own, known as ∃R [37, 38]. The above theorem implies that the partial evaluation problem for MATLANG + eigen belongs to this complexity class. We define this problem as follows. The idea is that an arbitrary specification, expressed as an existential formula over the reals, can be imposed on the input-output relation of an input-sized expression.

Definition 21.

The partial evaluation problem is a decision problem that takes as input:

  • an input-sized expression (S, e, σ), where all functions used in pointwise applications are explicitly defined using existential formulas over the reals;

  • an existential formula ψ with free variables in V (see Theorem 20).

The problem asks if there exists an instance I conforming to S by σ and a matrix B ∈ e(I) such that the pair (I, B) satisfies ψ.

For example, ψ may completely specify the matrices in I by giving the values of the entries as rational numbers, and may express that the output matrix has at least one nonzero entry.

An input is a yes-instance to the partial evaluation problem precisely when the existential sentence ∃V (φ ∧ ψ) is true in the reals, where φ is the formula obtained by Theorem 20. Hence we can conclude:

Corollary 22.

The partial evaluation problem for MATLANG + eigen belongs to ∃R.

Since the full theory of the reals is decidable, our theorem implies many other decidability results. We give just two examples.

Corollary 23.

The equivalence problem for input-sized expressions is decidable. This problem takes as input two input-sized expressions (S, e_1, σ) and (S, e_2, σ) (with the same S and σ) and asks if for all instances I conforming to S by σ, we have e_1(I) = e_2(I).

Note that the equivalence problem for expressions on arbitrary instances (size not fixed) is undecidable by Theorem 9, since equivalence of FO³ formulas over binary relational vocabularies is undecidable [16].

Corollary 24.

The determinacy problem for input-sized expressions is decidable. This problem takes as input an input-sized expression (S, e, σ) and asks if for every instance I conforming to S by σ, there exists at most one matrix B with B ∈ e(I).

Corollary 22 gives an upper bound on the combined complexity of query evaluation [43]. Our final result is a matching lower bound, already for data complexity alone. The proof is in the Appendix.

Theorem 25.

There exists a fixed schema S and a fixed expression e in MATLANG + eigen, well-typed over S, such that the following problem is hard for ∃R: Given an instance I over S with integer entries, decide whether the zero matrix is a possible result of e on I. The pointwise applications in e use only simple functions definable by quantifier-free formulas over the reals.

Remark (Complexity of deterministic expressions).

Our proof of Theorem 25 relies on the nondeterminism of the eigen operation. Coming back to our remark on determinacy at the end of the previous section, it is an interesting question for further research to understand not only the expressive power, but also the complexity of the evaluation problem, for deterministic MATLANG + eigen expressions.

7 Conclusion

There is a commendable trend in contemporary database research to leverage, and considerably extend, techniques from database query processing and optimization, to support large-scale linear algebra computations. In principle, data scientists could then work directly in SQL or related languages. Still, some users will prefer to continue using the matrix sublanguages they are more familiar with. Supporting these languages is also important so that existing code need not be rewritten.

From the perspective of database theory, it then becomes relevant to understand the expressive power of these languages as well as possible. In this paper we have proposed a framework for viewing matrix manipulation from the point of view of the expressive power of database query languages. Moreover, our results formally confirm that the basic set of matrix operations offered by systems in practice, formalized here in the language MATLANG, really is adequate for expressing a range of linear algebra techniques and procedures.

In the paper we have already mentioned some intriguing questions for further research. Deep inexpressibility results have been developed for logics with rank operators [33]. Although these results are mainly concerned with finite fields, they might still provide valuable insight into our open questions. Also, we have not covered all standard constructs from linear algebra. For instance, it may be worthwhile to extend our framework with the operation of putting matrices in upper triangular form, with the Gram-Schmidt procedure (which is now partly hidden in the eigen operation), and with the singular value decomposition.

Finally, we note that various authors have proposed to go beyond matrices, introducing data models and algebras for tensors or multidimensional arrays [34, 22, 35]. When moving to more and more powerful and complicated languages, however, it becomes less clear at what point we should simply move all the way to full SQL, or extensions of SQL with recursion.

Acknowledgment

We thank Bart Kuijpers for telling us about the complexity class ∃R. We thank Lauri Hella and Wied Pakusa for helpful discussions, and Christoph Berkholz and Anuj Dawar for their help with the proof of Proposition 10.

References

  • [1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
  • [2] D.S. Arnon. Geometric reasoning with logic and algebra. Artificial Intelligence, 37:37–60, 1988.
  • [3] S. Axler. Linear Algebra Done Right. Springer, third edition, 2015.
  • [4] S. Basu, R. Pollack, and M.-F. Roy. Algorithms in Real Algebraic Geometry. Springer, second edition, 2008.
  • [5] M. Boehm, M.W. Dusenberry, D. Eriksson, A.V. Evfimievski, F.M. Manshadi, N. Pansare, B. Reinwald, F.R. Reiss, P. Sen, A.C. Surve, and S. Tatikonda. SystemML: Declarative machine learning on Spark. Proceedings of the VLDB Endowment, 9(13):1425–1436, 2016.
  • [6] A. Bonato. A Course on the Web Graph, volume 89 of Graduate Studies in Mathematics. American Mathematical Society, 2008.
  • [7] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30:107–117, 1998.
  • [8] J.-Y. Cai, M. Fürer, and N. Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.
  • [9] L. Chen, A. Kumar, J. Naughton, and J.M. Patel. Towards linear algebra over normalized data. Proceedings of the VLDB Endowment, 10(11):1214–1225, 2017.
  • [10] S. Datta, R. Kulkarni, A. Mukherjee, T. Schwentick, and T. Zeume. Reachability is in DynFO. In M.M. Halldórsson, K. Iwama, N. Kobayashi, and B. Speckmann, editors, Proceedings 42nd International Colloquium on Automata, Languages and Programming, Part II, volume 9135 of Lecture Notes in Computer Science, pages 159–170. Springer, 2015.
  • [11] A. Dawar. On the descriptive complexity of linear algebra. In W. Hodges and R. de Queiroz, editors, Logic, Language, Information and Computation, Proceedings 15th WoLLIC, volume 5110 of Lecture Notes in Computer Science, pages 17–25. Springer, 2008.
  • [12] A. Dawar, M. Grohe, B. Holm, and B. Laubner. Logics with rank operators. In Proceedings 24th Annual IEEE Symposium on Logic in Computer Science, pages 113–122, 2009.
  • [13] G.M. Del Corso, A. Gulli, and F. Romani. Fast PageRank computation via a sparse linear system. Internet Mathematics, 2(3):251–273, 2005.
  • [14] C.D. Godsil. Some graphs with characteristic polynomials which are not solvable by radicals. Journal of Graph Theory, 6:211–214, 1982.
  • [15] G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, fourth edition, 2013.
  • [16] E. Grädel, E. Rosen, and M. Otto. Undecidability results on two-variable logics. Archive of Mathematical Logic, 38:313–354, 1999.
  • [17] D.J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
  • [18] L. Hella, L. Libkin, J. Nurmonen, and L. Wong. Logics with aggregate operators. Journal of the ACM, 48(4):880–907, 2001.
  • [19] B. Holm. Descriptive Complexity of Linear Algebra. PhD thesis, University of Cambridge, 2010.
  • [20] D. Hutchison, B. Howe, and D. Suciu. LaraDB: A minimalist kernel for linear and relational algebra computation. In F.N. Afrati and J. Sroka, editors, Proceedings 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, pages 2:1–2:10, 2017.
  • [21] P.C. Kanellakis, G.M. Kuper, and P.Z. Revesz. Constraint query languages. Journal of Computer and System Sciences, 51(1):26–52, August 1995.
  • [22] M. Kim. TensorDB and Tensor-Relational Model for Efficient Tensor-Relational Operations. PhD thesis, Arizona State University, 2014.
  • [23] A. Klug. Equivalence of relational algebra and relational calculus query languages having aggregate functions. Journal of the ACM, 29(3):699–717, 1982.
  • [24] Ph.G. Kolaitis. On the expressive power of logics on finite models. In Finite Model Theory and Its Applications, chapter 2. Springer, 2007.
  • [25] G. Kuper, L. Libkin, and J. Paredaens, editors. Constraint Databases. Springer, 2000.
  • [26] B. Laubner. The Structure of Graphs and New Logics for the Characterization of Polynomial Time. PhD thesis, Humboldt-Universität zu Berlin, 2010.
  • [27] J. Leskovec, A. Rajaraman, and J.D. Ullman. Mining of Massive Datasets. Cambridge University Press, second edition, 2014.
  • [28] L. Libkin. Expressive power of SQL. Theoretical Computer Science, 296:379–404, 2003.
  • [29] M. Marx and Y. Venema. Multi-Dimensional Modal Logic. Springer, 1997.
  • [30] J. Matoušek. Intersection graphs of segments and ∃R. arXiv:1406.2636, 2014.
  • [31] H.Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. In-database factorized learning. In J.L. Reutter and D. Srivastava, editors, Proceedings 11th Alberto Mendelzon International Workshop on Foundations of Data Management, volume 1912 of CEUR Workshop Proceedings, 2017.
  • [32] M. Otto. Bounded Variable Logics and Counting: A Study in Finite Models, volume 9 of Lecture Notes in Logic. Springer, 1997.
  • [33] W. Pakusa. Linear Equation Systems and the Search for a Logical Characterisation of Polynomial Time. PhD thesis, RWTH Aachen, 2015.
  • [34] F. Rusu and Y. Cheng. A survey on array storage, query languages, and systems. arXiv:1302.0103, 2013.
  • [35] T. Sato. Embedding Tarskian semantics in vector spaces. arXiv:1703.03193, 2017.
  • [36] T. Sato. A linear algebra approach to datalog evaluation. Theory and Practice of Logic Programming, 17(3):244–265, 2017.
  • [37] M. Schaefer. Complexity of some geometric and topological problems. In D. Eppstein and E.R. Gansner, editors, Graph Drawing, volume 5849 of Lecture Notes in Computer Science, pages 334–344. Springer, 2009.
  • [38] M. Schaefer and D. Štefankovič. Fixed points, Nash equilibria, and the existential theory of the reals. Theory of Computing Systems, 60(2):172–193, 2017.
  • [39] M. Schleich, D. Olteanu, and R. Ciucanu. Learning linear regression models over factorized joins. In Proceedings 2016 International Conference on Management of Data, pages 3–18. ACM, 2016.
  • [40] A. Tarski and S. Givant. A Formalization of Set Theory Without Variables, volume 41 of AMS Colloquium Publications. American Mathematical Society, 1987.
  • [41] L.G. Valiant. Completeness classes in algebra. In Proceedings 11th ACM Symposium on Theory of Computing, pages 249–261, 1979.
  • [42] J. Van den Bussche, D. Van Gucht, and S. Vansummeren. A crash course in database queries. In Proceedings 26th ACM Symposium on Principles of Database Systems, pages 143–154. ACM Press, 2007.
  • [43] M. Vardi. The complexity of relational query languages. In Proceedings 14th ACM Symposium on the Theory of Computing, pages 137–146, 1982.

Appendix

Proof of Theorem 9.

It is known [40, 29] that FO³ graph queries can be expressed in the algebra of binary relations with the operations all, identity, union, set difference, converse, and relational composition. These operations are well known, except perhaps for all, which, on a graph instance I, evaluates to the cartesian product of the active domain of I with itself. Identity evaluates to the identity relation on the active domain of I. Each of these operations is easy to express in MATLANG. For all we use 1(R) · (1(R))*, where for R we can take any relation variable from the schema. Identity is expressed as diag(1(R)). Union is expressed as apply[∨](e_1, e_2), and set difference as apply[∧¬](e_1, e_2), using the obvious boolean functions on {0, 1}. Converse is transpose. Relational composition is expressed as apply[f](e_1 · e_2), where f(x) is 1 if x is positive and 0 otherwise. ∎

The relational calculus with aggregates.

In this logic, we have base variables and numerical variables. Base variables can be bound to base columns of relations, and compared for equality. Numerical variables can be bound to numerical columns, and can be equated to function applications and aggregations. We will not recall the syntax formally [28]. The advantage of the relational calculus is that variables, especially base variables, can be repeated and reused. For example, matrix multiplication, with e_1 of type α × β and e_2 of type β × γ, can be expressed by a formula that sums, over all values of a middle index k, the products x · y of an entry x of e_1 in row i and column k with an entry y of e_2 in row k and column j.

Here, i, j and k are base variables and x, y and z are numerical variables. Only two base variables, i and j, are free; in the subformula for e_1 only i and k are free, and in the subformula for e_2 only k and j are free. So, if e_1 or e_2 had been a subexpression involving matrix multiplication in turn, we could have reused one of the three variables. The other operations of MATLANG need only two base variables. We conclude:

Proposition 26.

Let S, Rel(S), and Rel be as in Theorem 8. For every MATLANG expression e that is well-typed over S with output type τ, there is a formula over Rel(S) in the relational calculus with summation, using only three distinct base variables, such that

  1. If τ is general,