1 Introduction
Differential equations are used throughout science and engineering to model and predict the behavior and performance of aircraft, cars, trains, robots, chemical powerplants, the stock exchange, and many more. For example, the aerodynamics of an airplane can be represented by the Navier–Stokes equations [2]; the Riccati differential equation [6] is encountered in the problems of optimal control; and the Black–Scholes equation [30]
is a partial differential equation that is used to model valuation of stock options
[4] in mathematical finance. Differential equations are thus pervasive in almost every aspect of science and engineering, and being able to solve those differential equations precisely and accurately, but also while trusting that the solutions are accurate, is of utmost importance.However, in many cases, solving the differential equations analytically is intractable. Therefore, the differential equations are solved numerically in a finite computational domain, using methods such as finite differences or finite elements. In many cases, the desired precision is so high that it requires a large finite domain, and the exact numerical computation – typically involving the inversion of a large matrix – cannot be performed exactly. In those cases, a common approach is to use a family of methods called iterative methods [27]
. Instead of inverting a large matrix, those methods build a series of approximations of the solution that eventually converge to the direct solution. Many general purpose ordinary differential equation (ODE) solvers use some kind of iterative method to solve the linear system. For instance, ODEPACK
[3], which is a collection of FORTRAN solvers for initial value problem for ODEs, uses iterative (preconditioned Krylov) methods instead of direct methods for solving linear systems. Since most codes in scientific computing are still written and maintained in FORTRAN, these solvers are being widely used. Another widely used suite of ODE solvers is SUNDIALS [7]. SUNDIALS has support for a variety of direct and Krylov iterative methods for solving the system of linear equations. SUNDIALS solvers are used by the mixed finite element (MFEM) package for solving nonlinear algebraic systems and by NASA for spacecraft trajectory simulation [7]. Because those iterative methods are widely used, it is important to obtain formal guarantees for the convergence of iterative solutions to the “true” solutions of differential equations. In this paper we use the Coq theorem prover to formalize the convergence guarantees for iterative methods.A number of works have recently emerged in the area of formalization of numerical analysis. This has been facilitated by advancements in automatic and interactive theorem proving [24, 12, 16, 23]. Some notable works in the formalization of numerical analysis are the formalization of Kantrorovich theorem for proving convergence properties of the Newton methods in Coq by Ioana Pasca [25], the formalization of the matrix canonical forms in Coq by Cano et al. [14], and the formalization of the PerronFrobenius theorem in Isabelle/HOL [29] for determining the growth rate of for small matrices . Brehard [13], worked on rigorous numerics that aims at providing certified representations for solutions of various problems, notably in functional analysis. Boldo and coworkers [9, 11, 10] have made important contributions to formal verification of finite difference schemes. They proved consistency, stability and convergence of a secondorder centered scheme for the wave equation. Tekriwal et al. [28] formalized the Lax equivalence theorem for finite difference schemes. The Lax equivalence theorem provides formal guarantees for convergence of a finite difference scheme if it is stable and consistent. Besides Coq, numerical analysis of ordinary differential equations has also been done in Isabelle/ HOL [19]. Immler et al. [18, 20, 21] formalized flows, Poincaré map of dynamical systems, and verified rigorous bounds on numerical algorithms in Isabelle/HOL. In [17], Immler formalized a functional algorithm that computes enclosures of solutions of ODEs in Isabelle/HOL. However, the problem of iterative convergence has not been dealt with formally before. Over the recent years, the mathcomp library [22] in Coq has made significant contributions in formalizing the algebraic structures and theorems and lemmas from linear algebra. The time is right to exploit and leverage those developments to address problems in verified scientific computing. We believe that our formalization on iterative convergence is an important step towards making fully verified scientific computing possible.
Overall this paper makes the following contributions:

We provide a formalization of the necessary and sufficient conditions for iterative convergence in the Coq proof assistant;

We formalize conditions for convergence of the GaussSeidel classical iterative method;

We formalize the convergence of the Jacobi iteration, another classical iterative method, by instantiating it with an example;

During our formalization, we develop libraries for dealing with complex matrices and vectors on top of the mathcomp complex formalization.
This paper is structured as follows. In Section 2
, we formalize the necessary and sufficient conditions for the iterative convergence in the Coq proof assistant. We formally prove that an iterative method converges if and only if all the eigenvalues of an iterative matrix have magnitude less than 1. In Section
3, we formalize the sufficient conditions for convergence of the Gauss–Seidel iterative process. The sufficient condition states that all the eigenvalues of the iterative matrix of a real and symmetric matrix have a magnitude less than 1 if is positive definite. One can then apply these conditions and the theorem from Section 2 to formally prove that the sequence of iterative solutions obtained by Gauss–Seidel method converges to the solution obtained using direct matrix inversion methods. In Section 4, we formalize the convergence of the Jacobi iterative process and instantiate it with an example. Both Gauss–Seidel and Jacobi iteration are instances of iterative methods. In Section 5, we conclude by summarizing key takeaways and discussing future work.2 Necessary and sufficient conditions for iterative convergence
Iterative methods are a common numerical technique to obtain solution of a system of linear equations. The following presentation is a summary of the general theory of iterative methods described in [27].
Let be a direct solution obtained by inverting the linear system as
(1) 
Here, the matrix and the vector are known and we are computing for unknown vector . Here, we assume that is nonsingular. So, there exists a unique solution of the linear system . Let be the iterative solution obtained after iterations by solving the iterative system:
(2) 
for some choice of initial solution vector . is the solution vector at the iteration step. The iterative system (2) is obtained by splitting the original matrix into two matrices and as,
(3) 
such that is easily invertible. The choice of matrices and defines an iterative method. For instance, if we choose to be the diagonal entries of the matrix and to be strictly lower and upper triangular entries of , we get the Jacobi method. We discuss about the Jacobi method in detail in section (4). If we choose to be the lower triangular entries of and to be the strictly upper triangular entries of , then we get the Gauss–Seidel iterative method. We discuss about the Gauss–Seidel method in detail in section (3).
The solution vector is then obtained by multiplying the inverse of , i.e., on both sides of equation (2) as:
(4) 
The iterative error after iterations is then defined as:
(5) 
An iterative solution is then said to converge to if and only if
(6) 
The following theorem provides necessary and sufficient conditions for the convergence of an iterative method [8] in terms of an iterative matrix defined as:
(7) 
Theorem 2.1
Spectral radius of a matrix is defined as the maximum of the magnitude of its eigenvalues.
Next we provide an informal proof of the Theorem 2.1.
Proof
(8) 
Since the system is linear, equation (8) can be written in terms of the initial iteration error () as:
(9) 
Taking limits of the vector norms on both sides of equation (9),
(10) 
To prove Theorem 2.1, we split the proof into two parts:

(11) assuming . The case for is trivial, since one can then simply check the solution by satisfying the definition of the linear system .

(12) where is the spectral radius of the iterative matrix .
We next provide an informal explanation of both the sub proofs.
Proof of (11):
(Necessity):
To prove
Proof
(Sufficiency): To prove
Proof
By definition, the matrix norm,
Since,
Proof of (12):
(Sufficiency): To prove:
Proof
Since, ,
Applying, , to the goal statement, we need to prove:
under the hypothesis .
(13) 
But the compatibility relation for vector and matrix norms dictates,
(14) 
Since, by definition of an eigensystem,
(15) 
We can then apply the sandwich theorem to prove that .
(Necessity): To prove:
To prove necessity, we obtain the Jordan decomposition of the iterative matrix, . From the Jordan normal form theorem [27], we know that there exists , nonsingular and block diagonal such that:
(16) 
with
where is the Jordan block corresponding to the eigenvalue . Thus, . Since is block diagonal,
Now, a standard result on the power of an Jordan block states that, for
To prove that , we need to prove that , since is nonsingular, i.e. . Since the 2norm of a matrix is bounded above by the Frobenius matrix norm as: , we can prove by proving . The Frobenius norm of matrix is defined as:
Thus, we need to prove that
(17) 
The zero entries of the Jordan block diagonal matrix tend to zero trivially. Since, , the diagonal elements of the Jordan block, tend to zero. Proving that the off diagonal elements tend to zero is a bit tricky though.
We have to prove that . The function increases as increases while the function decreases as increases since . Informally, it can be argued that decreases at a faster rate than the function . Hence, the limit should tend to zero. But proving it formally in Coq was challenging. We need to first bound the function with a function to obtain a sequence, , for which it would be easy to prove the limit. Therefore, we split the proof into two sub proofs:
In order to prove , we use the ratio test for convergence of sequences. The formalization of ratio test has not yet been done in Coq to our knowledge. So, we formalized the ratio test since it provides an easier test for proving convergence of sequences as compared to first bounding the function with an easier function for which the convergence could be proved with the existing Coq libraries. The process of bounding the function with an easier function to prove convergence was challenging for us due to the behavior of . Using plotting tools like wolfram plot or MATLAB plot, we observed that for , the function was monotonously decreasing, while for , increases first and then decreases. Moreover, the location of the maxima of in the interval depends of the number of iterations. Hence, bounding with a monotonously decreasing function is challenging for . But we know that eventually decays. For such scenarios, just comparing the terms and provides a much simpler test for the convergence of the sequence . This is where ratio test for the convergence of sequence comes to rescue. The ratio test for the convergence of sequences is stated as follows.
Theorem 2.2
[5]
If is a sequence of positive real numbers such that
and , then converges and .
Formalizing in the Coq proof assistant:
We state the Theorem 2.1 in Coq as follows:
Since we deal with a generic case where a real matrix is allowed to have complex eigenvalues and eigenvectors, we also work in the complex field.
RtoC_mat transforms a real matrix to a complex matrix so as to be consistent with types. In mathcomp, a complex number is defined on top of a real closed field [15, cohen:inria00593738]. Thus, given a real matrix , RtoC_mat transforms a real entry to a complex number . In Coq, we formally define RtoC_mat as follows:Let us now discuss the hypothesis of the theorem statement. The first hypothesis means that for a given coefficient matrix and a vector , the direct solution satisfies the definition of a linear system as defined in (1). The hypothesis A = addmx A1 A2 corresponds to the definition (3). As discussed earlier, we would want the matrix to be easily invertible so as to construct the iterative matrix . The invertibility condition is stated by the hypothesis
The hypothesis
states that for any initial guess vector , the vector is different from the direct solution . If they were the same, the iterative error would trivially be zero. We define the iterative solution after steps, from the iterative system (2) as
We define the iterative matrix in the let binding of the theorem statement in Coq.
In our formalization of the iterative convergence, we rely on the existing formalization of the Jordan canonical forms by Guillaume Cano and Maxime Dénès [14]. We use their definition of eigenvalues of a matrix derived from the roots of the characteristic polynomials of the Smith Normal form of a matrix . We thus define the eigenvalue of a complex matrix in Coq as
where sp is the sequence of pairs, . is the eigenvalue of a matrix and is its algebraic multiplicity. We assert that the lambda that we obtain from the Smith normal form of a matrix is an eigenvalue of the iterative matrix using the eigenvalue predicate in mathcomp
It should be noted that a square matrix is assumed to be of size at least 1. This design choice is justified by Cano et al. in [14]. A square matrix only forms a ring when their size is convertible to the successor of an integer [14]. The ring property of a square matrix was really helpful in the formalization of diagonal block matrices by Cano et al. Therefore, to use their formalization of canonical forms, we also stick with denoting a complex matrix as ’M[complex R]_n.+1. Here, we instantiate a generic ring type with a complex ring type.
As discussed earlier in the section on informal proof, we split the proof of the Theorem (2.1) into two sub proofs. The statement of these sub proofs are formalized in Coq as:
(Part 1):
(Part 2):
We have used the is_lim_seq predicate from the Coquelicot library to define the limits for the sequence of solution vectors converging to .
To prove the sufficiency, we had to prove the following lemma:
While a lemma exists for other direction in the Coquelicot formalization of limits, lemma is_lim_seq_geom_nec was missing.
As discussed earlier, to prove sufficiency condition for iterative convergence, we had to formalize the ratio test for convergence of sequences which was missing in the existing Coq libraries. In Coq, we state Theorem 2.2 as:
We then use the lemma ratio_test to formally prove which we state in Coq as:
To bound the binomial coefficient with , we formally state the inequality (18) as follows
Since the inequality holds in a complex field, we have to first convert the complex inequality to a real inequality. This was possible using the existing formalization of complex fields in mathcomp. Since the choice function and the factorials are naturals in Coq, we then define a lemma which proves that if the inequality holds for naturals, the inequality also holds in real field after a coercion is defined from naturals to reals. In Coq, we formalize the lemma as:
Applying the lemma nat_ring_mn_le, we obtain the following inequality
(20) 
It should be noted that the division in (20) is an integer division. The inequality (20) can easily be proved as discussed in the previous section. However, to use the inequality (20) in the proof of lemma choice_to_ring_le, one needs to be careful with converting an integer division to a real division.
To prove that each element of the Jordan block matrix converges to zero as in equation (17), we prove the following lemma in Coq
A key challenge we faced when proving the above lemma was extracting each Jordan block of the diagonal block matrix. The diagonal block matrix is defined recursively over a function which takes a block matrix of size denoting the algebraic multiplicity of each eigenvalues . We had to carefully destruct this definition of diagonal block matrix and extract the Jordan block and the zeros on the offdiagonal entries. We can then prove the limit on each Jordan block by exploiting its upper triangular structure.
We state the lemma to extract a Jordan block from the block diagonal matrix as follows:
where size_sum is the sum of the algebraic multiplicities of the eigenvalues and equals the total size of the matrix . Limits of the offdiagonal elements can then be trivially proven to zero. This completes the proof of sufficiency condition for convergence of iterative convergence error.
We will next discuss about convergence of classical iterative methods. We will specifically consider two common iterative methods: Gauss–Seidel iteration and Jacobi iteration.
3 Gauss–Seidel iteration
The Gauss–Seidel iterative method is an instance of iterative methods to solve the system of linear equations , using the following update formula
(21) 
where is the value of after iterations, and is the value of after iterations. From equation (21), we can construct matrix and as [27]:
(22) 
The following theorem by Reich [26] provides a criteria the convergence of the Gauss–Seidel iterative method.
Theorem 3.1
[26] If A is real, symmetric nthorder matrix with all terms on its main diagonal positive, then a sufficient condition for all the n characteristic roots of to be smaller than unity in magnitude is that A is positive definite.
From an application point of view, only the sufficiency condition is important. This is because to apply Theorem (2.1), we only need to know that the magnitude of the eigenvalues are less than 1. Since computing eigenvalues are not very trivial in most cases, the positive definite property of the matrix provides an easy test for for Gauss–Seidel iteration matrix. The proof of necessity uses an informal topological argument that would be difficult to formalize. Therefore, it is enough to just formalize the sufficiency condition.
Next we present an informal proof [26] followed by its formalization in the Coq proof assistant.
Proof
Let be the characteristic vector of corresponding to the characteristic root . Then
(23) 
Multiplying by on both sides,
(24) 
where is the conjugate transpose of obtained by taking the conjugate of each element of followed by transpose of the vector. Equation (24) then simplifies to:
(25) 
Consider the bilinear form, ,
(26) 
Taking conjugate transpose of equation (26) on both sides,
(27) 
Let be the diagonal matrix defined as:
(28) 
It can be shown that
(29) 
Substituting equation (29) in equation (27),
(30) 
Simplifying equation (30),
(31) 
But, and Hence, equation (31) simplifies to,
(32) 
Since, the diagonal elements of is positive, i.e., , is positive definite, i.e., . Since and , .
Formalization in Coq
The Theorem 3.1 is formalized in Coq as follows:
where positive definiteness of a complex matrix is defined as:
.
is the complex conjugate transpose of vector and is the real part of the complex scalar . We defined is_positive_definite in Coq as:
The hypothesis
states that the matrix is not trivially zero.
We can then apply the theorem iterative_convergence with
Reich_sufficiency to prove convergence of the Gauss–Seidel iteration method. We formalize the convergence of Gauss–Seidel iteration method in Coq as
We next demonstrate the convergence of the Gauss–Seidel iteration on a symmetric tridiagonal matrix of size , corresponding to the central difference discretization of differential equation for and the boundary conditions being and . The difference equation is given by [28]:
(33) 
where is the number of interior points in the 1D domain . When expressed in form of a matrix, the matrix would then be:
(34) 
For the formalization, we choose . Hence, the matrix
(35) 
To show that iterative system for the system converges, we need to show that is positive definite. In Coq, we prove that is positive definite as the following lemma statement
Proving that is positive definite by definition for a generic is tedious and does not add much to our line of argument. Hence, we chose to do it for of size . One can perform the exercise for any choice of and get the same result.
In our formalization, we spent considerable effort in developing theory for handling complex vectors and matrices. We build upon the existing formalization of complex numbers in mathcomp. Some properties of conjugates that we formalized in Coq are as follows:
The first property states that the conjugate transpose of the conjugate transpose of a vector is the vector itself. The second property states that and the third property states that where is a complex scalar and is its conjugate. scal_mat_C scales a complex matrix by the complex scalar , and is defined in Coq as:
More details on the formalization of the complex theory for vectors and matrices can be referred to in the attached code.
4 Jacobi iteration
The Jacobi method is another instance of iterative methods to solve the system of linear equations by successive approximations. We solve for the values at the iteration, by keeping the other elements of the vector fixed at the value obtained after iterations. This gives us the following update formula [1]: