Automatized Evaluation of Formalization Exercises in Mathematics

06/02/2020 ∙ by Merlin Carl, et al.

We describe two systems for supporting beginner students in acquiring basic skills in expressing statements in the formalism of first-order predicate logic. The first, called "math dictations", presents users with the task of formalizing a given natural-language sentence, while the second, called "Game of Def", challenges users to give a formal description of a geometric pattern displayed to them. In both cases, the answer is checked automatically.




1 Introduction

Learning the correct use of mathematical language frequently poses a challenge for beginner students. At the same time, it is a basic skill, required both for understanding mathematical texts and for presenting one’s own work.

In mathematical lectures and typical textbooks, this is rarely discussed explicitly, though some offer a brief discussion, along with some formalization exercises (see, e.g., [Ha]).

In this note, we present two pieces of software that aim to support beginner students in learning the use of formal language.

The first one, called “Math Dictations” (a term that we learned from M. Junk, who used the concept, though without automating it, in introductory courses at the University of Konstanz), challenges students to translate a proposition given in natural language, such as “the real function $f$ is strictly increasing”, into a quantifier formula such as $\forall x \forall y (x < y \rightarrow f(x) < f(y))$. It is similar to the formalization exercises that form part of the “Mathematical Logic Tutor” by A. Moreno (see [BM]), but goes beyond this in (i) allowing first-order logic rather than propositional logic and (ii) using a restricted automated theorem prover for evaluating solutions, so that many solutions, rather than a single one, are recognized as correct answers.

The second one, which, with a bow to the legacy of J. Conway and his “Game of Life”, we call “Game of Def”, has exercises that ask students to give descriptions of graphically depicted sets in a specified logical language with words such as “right”, “above”, “neighbour” or “equal distance”.

Both programs are written in Prolog and form part of the Diproche system, a proof checker for natural language proofs specifically adapted to the area of beginner exercises. The Diproche system is modelled on the Naproche system due to P. Koepke, B. Schröder, M. Cramer and others (see, e.g., [Cr1] or [CFKKSV]). The current Diproche version covers the topics of propositional calculus, Boolean set theory, sets and functions, elementary number theory, induction proofs and axiomatic geometry. Presentations of the checking mechanism and further components of Diproche can be found in [CK] or [C].

2 Math Dictations

The idea of “math dictations” is simple: the student is given a natural language expression, which she or he is then to translate into a quantifier formula. The quantifier formula is then checked for correctness. As mentioned above, we first learned this concept from M. Junk in Konstanz.

The automatization is rather straightforward: A dictation problem (Id,Nat,Formal,FreeVars) consists of an identifier Id, a natural language sentence (i.e., a string) Nat, a list Formal of formal expressions in the internal Prolog list format and a list FreeVars of free variables that should occur in a solution. Here, Formal is a list of possible formalizations of the sentence given in Nat. The reason we use a list rather than a single formalization is that we want to cover cases in which several substantially different approaches should count as equally correct.
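As an illustration only, the four components of a dictation problem can be mirrored in a small record type. The sketch below is in Python rather than the Prolog of the actual system; the field names, the identifier and the string notation for the formula are hypothetical and not taken from Diproche.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DictationProblem:
    ident: str            # Id: identifier of the exercise
    nat: str              # Nat: the natural language sentence shown to the student
    formal: List[str]     # Formal: accepted formalizations (plain strings here, an assumption)
    free_vars: List[str]  # FreeVars: free variables that a solution must contain


# Hypothetical example problem; the formula notation is made up for this sketch.
example = DictationProblem(
    ident="strictly_increasing",
    nat="The real function f is strictly increasing.",
    formal=["forall x forall y (x < y -> f(x) < f(y))"],
    free_vars=["f"],
)
```

Keeping `formal` as a list reflects the point made above: several substantially different formalizations may be stored for the same sentence.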

The accepted syntax of the current version is as follows:

  • Small Latin letters are used for variables and constants; both variables and constants are terms.

  • Each natural number (written as a finite string of decimal digits) is a term.

  • If $s$ and $t$ are terms and $s$ is not a number, then $s(t)$ is a term which describes the application of $s$ to $t$ (clearly, this only makes sense when $s$ is a function).

  • When $s$ and $t$ are terms, then $s = t$, $s < t$, $s > t$, $s \leq t$ and $s \geq t$ are formulas.

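  • When $\phi$ and $\psi$ are formulas, then so are $\neg\phi$, $\phi \wedge \psi$, $\phi \vee \psi$, $\phi \rightarrow \psi$ and $\phi \leftrightarrow \psi$.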

  • When $\phi$ is a formula and $x$ is a small Latin letter, then $\forall x\, \phi$ and $\exists x\, \phi$ are formulas.
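The grammar above lends itself to a recursive representation, on which the free variables of a formula can be computed by structural recursion. The following is a minimal sketch under assumed conventions: formulas are nested Python tuples, and the operator tags (`"app"`, `"forall"`, `"->"` and so on) are hypothetical names chosen for this example, not the internal Prolog list format.

```python
def free_vars(expr, bound=frozenset()):
    """Return the set of letters that occur free in a nested-tuple expression."""
    if isinstance(expr, int):
        # Numerals contain no letters.
        return set()
    if isinstance(expr, str):
        # A small Latin letter: free unless an enclosing quantifier binds it.
        return set() if expr in bound else {expr}
    op, *args = expr
    if op in ("forall", "exists"):
        # Quantifiers are encoded as (op, variable, body).
        var, body = args
        return free_vars(body, bound | {var})
    # Application, relations and junctors: collect from all arguments.
    result = set()
    for arg in args:
        result |= free_vars(arg, bound)
    return result


# "f is strictly increasing", encoded with the hypothetical tags.
increasing = (
    "forall", "x",
    ("forall", "y",
     ("->", ("<", "x", "y"),
            ("<", ("app", "f", "x"), ("app", "f", "y")))),
)
```

Here `free_vars(increasing)` yields the single letter `f`, which is the kind of information needed to compare an input against the free variables expected by an exercise.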

All of these symbols have their usual meaning; as a convention, quantifiers range over real numbers. This language is sufficient to express, in the realm of real numbers, statements like the following:

  1. Strictly between any two distinct real numbers, there is a third one.

  2. $f$ is a strictly increasing function.

  3. $f$ has a zero whenever $g$ has a zero.

  4. $f$ globally dominates $g$.

  5. $f$ converges to $a$.

Thus, this language is already sufficient for a variety of formalization exercises.
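For illustration, here is one possible formalization of each of the first three statements, writing $f$ and $g$ for the two functions involved. These are sketches of acceptable answers under the convention that quantifiers range over the real numbers; they are not taken from the exercise database of the system, and other, equivalent formulas would serve equally well.

```latex
\begin{align*}
\text{(1)}\quad & \forall x\, \forall y\, \bigl(x < y \rightarrow \exists z\, (x < z \wedge z < y)\bigr) \\
\text{(2)}\quad & \forall x\, \forall y\, \bigl(x < y \rightarrow f(x) < f(y)\bigr) \\
\text{(3)}\quad & \bigl(\exists x\; g(x) = 0\bigr) \rightarrow \bigl(\exists x\; f(x) = 0\bigr)
\end{align*}
```

In (1), the case of two distinct numbers is reduced to the case $x < y$, which loses no generality over the reals.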

In the program, the natural language formulation is displayed to the user, who also has a text window for entering a formula. When the user clicks the “check” button, the checking is initiated and feedback is provided.

The checking works as follows: First, it is checked whether the input is a well-formed formula in which the right free variables appear (i.e., the same ones that appear in the natural language formulation). If not, an error message is displayed and no further processing takes place. Otherwise, the given expression $\psi$ is converted into an internal Prolog list format and a Prolog Tableau prover (as described, e.g., in [Fi]) is used to check, for each $\phi$ belonging to the list Formal in the specification of the problem, whether $\psi \rightarrow \phi$ and whether $\phi \rightarrow \psi$. If there are $\phi$, $\phi'$ in Formal such that both $\psi \rightarrow \phi$ and $\phi' \rightarrow \psi$ can be verified, then the input is considered correct and the user is congratulated for solving the problem. If there is $\phi$ such that $\psi \rightarrow \phi$, but no $\phi'$ with $\phi' \rightarrow \psi$, then a message is returned saying that $\psi$ is sufficient, but not necessary and that the condition should be loosened. If there is $\phi'$ such that $\phi' \rightarrow \psi$, but no $\phi$ with $\psi \rightarrow \phi$, then a message is returned saying that $\psi$ is necessary, but not sufficient and that the input should be made more restrictive. If there is neither such a $\phi$ nor such a $\phi'$, the user is told that $\psi$ is neither sufficient nor necessary and that she or he should try again. (Unfortunately, the current version of the Tableau prover has a bug. It will be corrected soon.)
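The case distinction described above can be sketched as follows. This is not the tableau prover of the system: as a stand-in, the sketch decides implications between propositional formulas by brute force over truth assignments, and the tuple encoding, the function names and the verdict strings are assumptions made for this example.

```python
from itertools import product


def evaluate(formula, assignment):
    """Evaluate a propositional formula given as nested tuples."""
    if isinstance(formula, str):
        return assignment[formula]
    op, *args = formula
    values = [evaluate(arg, assignment) for arg in args]
    if op == "not":
        return not values[0]
    if op == "and":
        return values[0] and values[1]
    if op == "or":
        return values[0] or values[1]
    if op == "->":
        return (not values[0]) or values[1]
    raise ValueError(f"unknown connective: {op}")


def atoms(formula):
    """Collect the propositional letters of a formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(atoms(arg) for arg in formula[1:]))


def implies(premise, conclusion):
    """Stand-in for the prover: is premise -> conclusion a tautology?"""
    names = sorted(atoms(premise) | atoms(conclusion))
    return all(
        evaluate(("->", premise, conclusion), dict(zip(names, row)))
        for row in product([False, True], repeat=len(names))
    )


def verdict(user_input, formal):
    """Classify the input against the list of stored formalizations."""
    sufficient = any(implies(user_input, phi) for phi in formal)
    necessary = any(implies(phi, user_input) for phi in formal)
    if sufficient and necessary:
        return "correct"
    if sufficient:
        return "sufficient, not necessary"
    if necessary:
        return "necessary, not sufficient"
    return "neither"
```

For example, with the single stored formalization `("and", "p", "q")`, the input `("and", "q", "p")` is classified as correct, while the input `"p"` alone is classified as necessary but not sufficient.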

Of course, the Tableau prover needs to be restricted in some way: First, due to the undecidability of first-order logic, the checking might not terminate. Second, logical equivalence is a rather poor criterion for the adequacy of a formalization. To take an extreme example, we should certainly not accept the statement of Fermat’s last theorem as a formalization of example (1) claiming the density of the real numbers, just because both are provable! In our case, propositional equivalence is accepted without restriction, but the number of instantiations of universally quantified statements that can be used is restricted to a fixed bound. (This value is not chosen for any particular reason, but experience shows that it is sufficient for all cases attempted so far and does not yield unacceptably long running times.)

3 The ‘Game of Def’

Math dictations as above only give a “right” or “wrong” answer, refined only by “sufficient” and “necessary”. This is of little help in refining a wrong solution. It would be better if one could see what one actually defined, in contrast to what one was supposed to define. A good teacher could respond by giving examples that match the given solution but are not intended, or that are wrongly not covered by an attempted formalization. However, automating this in general is quite difficult. For this reason, the “Game of Def” was designed.

It addresses a different problem: directly modelling, in a formal way, a situation that is given not by a natural language expression, but rather by a picture (or in some other way).

The syntax of the formal language accepted by the system is as follows:

  • Small Latin letters denote variables and constants.

  • When $a$, $b$, $c$, $d$ are variables or constants, then rechts($a$,$b$), links($a$,$b$), ueber($a$,$b$), unter($a$,$b$), nachbar($a$,$b$) and dist($a$,$b$)=dist($c$,$d$) are formulas. (The meaning of these German terms will be explained below when we specify the semantics.)

  • When and are formulas, then , , , and are formulas.

  • When is a formula and is a small latin letter, then and are formulas.

This syntax is adhered to strictly: neither the omission of brackets (e.g., by priority rules) nor the addition of extra brackets is allowed. Though it would not be difficult to loosen these rules somewhat, this strictness is in line with the didactical goal of helping students get used to expressing themselves within the borders of a formalism. (As it turns out, some of the advanced levels also raised the interest of advanced mathematicians, who took them as a kind of puzzle game. If this interest persists, loosening the syntactic rules will be reconsidered.)

The somewhat odd notation for the existential and universal quantifier and the logical junctors is due to the implementation in Prolog. An improved interface with a more appealing input format is certainly desirable, though it should be kept in mind that beginners should not be expected to be familiar with LaTeX.

The semantics now works as follows: The domain on which the game is played is a finite square grid, with the middle square marked. Variables and constants refer to squares in this grid. Then:

  • $a = b$ means that $a$ and $b$ denote the same square.

  • rechts($a$,$b$) means that the square $a$ is somewhere to the right of, but in the same row as, $b$; i.e., if one were to use coordinates (which the game syntax does not), we would say that the $x$-coordinate of $a$ is larger than that of $b$, while the $y$-coordinates agree.

  • links($a$,$b$) means that the square $a$ is somewhere to the left of, but in the same row as, $b$.

  • ueber($a$,$b$) means that the square $a$ is somewhere above, but in the same column as, $b$.

  • unter($a$,$b$) means that the square $a$ is somewhere below, but in the same column as, $b$.

  • nachbar($a$,$b$) means that $a$ and $b$ are neighbours, i.e., share exactly one common border line. In coordinates, that means that they have one common coordinate, while they differ by $1$ in the other.

  • dist($a$,$b$)=dist($c$,$d$) means that $a$ and $b$ lie in the same row or column, that $c$ and $d$ lie in the same row or column, and that the distance from $a$ to $b$ is the same as the distance from $c$ to $d$.
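The semantics of the basic predicates can be made concrete with a small sketch. It is written in Python for illustration and is not the Prolog implementation of the system. Squares are modelled as coordinate pairs `(x, y)` with `y` growing upwards, and `dist_equal` is a hypothetical name for the relation written with two `dist` terms; both are assumptions of this example.

```python
# Squares are (x, y) pairs; x grows to the right, y grows upwards (an assumption).

def rechts(a, b):
    """a lies somewhere to the right of b, in the same row."""
    return a[1] == b[1] and a[0] > b[0]


def links(a, b):
    """a lies somewhere to the left of b, in the same row."""
    return a[1] == b[1] and a[0] < b[0]


def ueber(a, b):
    """a lies somewhere above b, in the same column."""
    return a[0] == b[0] and a[1] > b[1]


def unter(a, b):
    """a lies somewhere below b, in the same column."""
    return a[0] == b[0] and a[1] < b[1]


def nachbar(a, b):
    """a and b share a border: one coordinate agrees, the other differs by 1."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1


def same_line(a, b):
    """a and b lie in the same row or in the same column."""
    return a[0] == b[0] or a[1] == b[1]


def dist_equal(a, b, c, d):
    """Both pairs are aligned and the two distances agree."""
    if not (same_line(a, b) and same_line(c, d)):
        return False
    dist_ab = abs(a[0] - b[0]) + abs(a[1] - b[1])
    dist_cd = abs(c[0] - d[0]) + abs(c[1] - d[1])
    return dist_ab == dist_cd
```

Note that, as in the description above, diagonal squares are not neighbours, and the distance relation fails as soon as one of the two pairs is not aligned.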

Junctors and quantifiers have their usual meaning; note that universal and existential quantifiers only quantify over squares in the grid, not some infinite extension thereof. Thus, there are squares with no right neighbours etc. Formulas that contain more than two nested quantifiers are accepted syntactically, but their semantic evaluation, which is based on an exhaustive search whenever nested quantifiers are involved, takes too long for all practical purposes. Thus, nesting more than two quantifiers should be avoided and is also not required for any solution.
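The effect of bounded quantification and the cost of exhaustive search can be seen in a few lines. The sketch below assumes a hypothetical grid size `N` (the value is chosen only for the example) and evaluates quantifiers by iterating over all squares; it illustrates the principle and is not the evaluation algorithm of the system.

```python
from itertools import product

N = 5  # hypothetical grid size, chosen only for this illustration
GRID = list(product(range(N), repeat=2))  # all squares as (x, y) pairs


def rechts(a, b):
    """a lies somewhere to the right of b, in the same row."""
    return a[1] == b[1] and a[0] > b[0]


def has_square_to_the_right(a):
    # "exists b: rechts(b, a)", with b ranging over the finite grid only
    return any(rechts(b, a) for b in GRID)


def near_right_edge(a):
    # "forall b: rechts(b, a) -> not exists c: rechts(c, b)"
    # Two nested quantifiers: the inner search runs once per candidate b.
    return all(
        (not rechts(b, a)) or not any(rechts(c, b) for c in GRID)
        for b in GRID
    )


def search_space(depth):
    """Upper bound on variable assignments tried with `depth` nested quantifiers."""
    return len(GRID) ** depth
```

A square in the last column has no square to its right, so the existential formula fails there. Each additional nested quantifier multiplies the number of assignments by the number of squares, which is why deep nesting quickly becomes impractical.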

The “Game of Def” now works as follows: In each exercise, one is given an image of the grid, with some squares marked yellow. Some of the squares may be labeled by letters, which means that those letters are constants that can be used as parameters. In addition, one is given an informal description of the set of yellow squares in natural language (currently German). The task is then to write down a formula of the above language with exactly one free variable (the choice of the variable is up to the user, with the only restriction that constant letters used in the exercise description cannot be used) such that the set of squares satisfying the formula is exactly the set of yellow squares.

Users can write a string into an input window and press the “check” button. If the input is not a well-formed formula or does not have exactly one free variable, an error message is displayed and no further processing takes place. Otherwise, let us denote by $\phi$ the input formula, by $D$ the set of squares described by it, and by $T$ the set of yellow squares that is to be defined. The system then does the following:

  • Squares in $T \cap D$ are colored green.

  • Squares in $D \setminus T$ are colored red.

  • Squares in $T \setminus D$ remain yellow.

Furthermore, the user receives the following text feedback:

  • When $D = T$, (s)he is congratulated that the solution is correct.

  • When $T \subsetneq D$, a message is returned saying that the given condition is necessary, but not sufficient and that further restriction should be imposed.

  • When $D \subsetneq T$, a message is returned saying that the given condition is sufficient, but not necessary and that it should be made more inclusive.

  • When none of the above cases hold, the user is told to try again.
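The feedback step amounts to a comparison of two finite sets. The following sketch assumes that the target set and the set defined by the input are already available as Python sets of squares; the assignment of colors to set differences (green for correctly covered squares, red for wrongly included ones, yellow for missed ones) is an assumption of this example, as are the function name and the message strings.

```python
def feedback(target, defined):
    """Compare the target set with the set defined by the user's formula."""
    colors = {
        "green": target & defined,   # wanted and covered
        "red": defined - target,     # covered but not wanted
        "yellow": target - defined,  # wanted but not covered
    }
    if defined == target:
        message = "correct"
    elif target < defined:
        # Every wanted square satisfies the formula, but so do others.
        message = "necessary, not sufficient"
    elif defined < target:
        # Every square satisfying the formula is wanted, but some are missed.
        message = "sufficient, not necessary"
    else:
        message = "try again"
    return colors, message


# Hypothetical target: the three leftmost squares of the bottom row.
target = {(0, 0), (1, 0), (2, 0)}
colors, message = feedback(target, {(0, 0), (1, 0), (2, 0), (3, 0)})
```

In this example the defined set contains one square too many, so that square would be shown in red and the condition would be reported as necessary but not sufficient.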

Here is an example of an exercise with the feedback as it is returned to the user:

The interested reader may now want to entertain her- or himself with the following exercises, which are part of the current version of the system:

[Figure: twelve exercise grids, panels (a) Problem 1 through (l) Problem 12.]

4 Further Work

Clearly, automated theorem provers and truth predicate evaluation offer many further possibilities for supporting formalization exercises. In particular, it is easy to extend the syntax of the math dictation program to cover other areas of mathematics, such as number theory or geometry. Concerning the Game of Def, it would be desirable to lift the limit on the number of nested quantifiers by improving the running time of the evaluation algorithm.

There is a more general topic in the background here, which we plan to take up in future work: namely, to look systematically for theories that are both simple in terms of model theory and complexity theory (o-minimality, quantifier elimination and decidability (see, e.g., [Ma]) seem to be particularly relevant properties) and didactically suitable, in that their realm of objects is either known to or easy to explain to beginner students and that they allow for many non-trivial, but realistically solvable formalization exercises, preferably those with a visualizable aspect. The theories of Presburger arithmetic and real closed fields may be suitable candidates, provided that the complexity issues (Presburger arithmetic has a double-exponential lower time bound on a decision algorithm, see [FR]; however, the situation is considerably less bad in the case of real closed fields, see, e.g., [Gr]) turn out to be irrelevant for the intended application (simple formalization exercises). We hope for a stimulating interaction of mathematical logic (in particular model theory), computer science and the didactics of mathematics.


  • [C] M. Carl. Using Automated Theorem Provers for Mistake Diagnosis in the Didactics of Mathematics. arXiv:2002.05083v1 (2020)
  • [CK] M. Carl, R. Krapf. Das Diproche-System – ein automatisierter Tutor für den Einstieg ins Beweisen. Submitted (2019)
  • [Cr1] M. Cramer. Proof-checking mathematical texts in controlled natural language. PhD thesis (2013)
  • [CFKKSV] M. Cramer, B. Fisseni, P. Koepke, D. Kühlwein, B. Schröder and J. Veldman. The Naproche Project – Controlled Natural Language Proof Checking of Mathematical Texts. Proceedings of the Controlled Natural Language (CNL) Workshop. (2009)
  • [BM] N. Budesca, A. Moreno. Mathematical Logic Tutor - Propositional Calculus. Available online: (2000)
  • [Fi] M. Fitting. First-Order Logic and Automated Theorem Proving. Springer New York (1996)
  • [FR] M. Fischer, M. Rabin. Super-Exponential Complexity of Presburger Arithmetic. In: Caviness B.F., Johnson J.R. (eds) Quantifier Elimination and Cylindrical Algebraic Decomposition. Texts and Monographs in Symbolic Computation (A Series of the Research Institute for Symbolic Computation, Johannes-Kepler-University, Linz, Austria). Springer, Vienna (1998)
  • [Gr] D. Grigor’ev. Complexity of Deciding the First-Order Theory of Real Closed Fields. Journal of Soviet Mathematics, vol. 55 (1991)
  • [Ha] R. Hammack. Book of Proof. Available online:
  • [Ma] D. Marker. Model Theory: An Introduction. Springer New York (2002)