Binary intersection formalized

We provide a reformulation and a formalization of the classical result by Juhani Karhumäki characterizing intersections of two languages of the form {x,y}^*∩{u,v}^*. We use the terminology of morphisms, which allows us to formulate the result in a shorter and more transparent way, and we formalize the result in the proof assistant Isabelle/HOL.




1 Introduction

One of the classical results that deserve to be better known is the description Juhani Karhumäki gave in [3] for the intersection of two free monoids of rank two, that is, for languages of the form {x,y}^*∩{u,v}^*, where x and y, as well as u and v, do not commute. The purpose of this article is twofold. First, we reformulate the result in terms of morphisms, which allows an exposition that is much shorter, and hopefully also more transparent. This layer of the article is a slightly modified version of [2]. Second, we complement the improved “human” proof with a formalization in the proof assistant Isabelle/HOL.

It is well known that an intersection of two free submonoids of a free monoid is free. On the other hand, the intersection can have infinite rank. Theorem 2 in [3] gives two possible forms: and . The original proof spans about fifteen pages (without Preliminaries). The proof often crucially relies on “the way” certain words are “built up” from words x and y, and/or u and v. This is exactly the kind of argument that is much easier to make if x and y (u and v) are seen as images of a binary morphism, which is demonstrated in the present article. An important feature of our reformulation is that it allowed us to identify the difficult core of the proof, namely Lemma 8. Given this lemma, the rest of the proof is fairly straightforward. We refer to [2] for a more detailed comparison of the two approaches.

Our second contribution is a formalization of the result in the proof assistant Isabelle/HOL. To our knowledge, this is the first formalization of a comparable result in Combinatorics on Words. We believe that computer-assisted proofs are highly desirable in our field, which typically features a high level of technicality. The verified formalization not only makes sure that the result is correct, but also allows one to outsource tedious and uninspiring work where it belongs, namely to computers. We try to give a reader without any experience with this kind of research a rough idea of what it entails. It may perhaps serve as a very modest introduction to some basic features of formalization using Isabelle/HOL. The full working formalization is published in the repository [5].

2 Preliminaries

Words are lists of letters from a given alphabet. They form a (free) monoid with the operation of concatenation and the neutral element, the empty word, that is denoted . If the alphabet is then the monoid of lists is typically denoted by using the Kleene star. There is an ambivalence in this notation. If is a subset of a monoid , then denotes the submonoid generated by in , that is, more algebraically, the submonoid . However, elements of the alphabet are not words! This is typically ignored, or at best glossed over by identification of letters with words of length one. However, in the context of the formalization, we have to keep in mind the difference. In our convention, the expression is equivalent to , which means that is not the set of letters but the set of singleton words, that is, words of length one. In the particular case of the binary alphabet, we shall use the generating set where is the word and the word .

The fact that is a prefix (suffix resp.) of is denoted ( resp.). If ( resp.) and , then is a proper prefix (suffix resp.) of . We shall denote the longest common prefix (suffix resp.) of and by ( resp.). Two words are prefix-comparable, denoted , (suffix-comparable, denoted , resp.) if one of them is a prefix (suffix resp.) of the other. If we want to say that is a prefix (suffix resp.) of some sufficiently large power of , we say that is a prefix (suffix resp.) of . Concepts of concatenation, prefix and suffix are extended to pairs in the obvious way.

We shall use the standard notation of regular expressions to describe certain sets of words. Note that is an alternative notation for . In regular expressions, the empty word is represented by .

If is a prefix (suffix resp.) of , then ( resp.) denotes the unique word such that ( resp.). The expression ( resp.) is undefined otherwise.
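The notions introduced so far (prefix, suffix, longest common prefix and suffix, comparability, prefix of a sufficiently large power, and the quotient that removes a prefix) can be mirrored on ordinary strings. The following Python sketch is only an illustration for experimenting with small examples; it is not part of the formalization, and all function names are hypothetical.

```python
from os.path import commonprefix


def is_prefix(u: str, v: str) -> bool:
    """True if u is a prefix of v."""
    return v.startswith(u)


def is_suffix(u: str, v: str) -> bool:
    """True if u is a suffix of v."""
    return v.endswith(u)


def lcp(u: str, v: str) -> str:
    """Longest common prefix of u and v."""
    return commonprefix([u, v])


def lcs(u: str, v: str) -> str:
    """Longest common suffix of u and v (lcp of the reversed words)."""
    return lcp(u[::-1], v[::-1])[::-1]


def prefix_comparable(u: str, v: str) -> bool:
    """True if one of the two words is a prefix of the other."""
    return is_prefix(u, v) or is_prefix(v, u)


def is_prefix_of_power(u: str, v: str) -> bool:
    """True if u is a prefix of some sufficiently large power of v."""
    if not v:
        raise ValueError("v must be nonempty")
    repetitions = len(u) // len(v) + 1  # enough copies of v to cover u
    return (v * repetitions).startswith(u)


def left_quotient(u: str, v: str) -> str:
    """The unique z with v == u + z; undefined if u is not a prefix of v."""
    if not is_prefix(u, v):
        raise ValueError("undefined: u is not a prefix of v")
    return v[len(u):]
```

Raising an exception in `left_quotient` reflects the convention that the quotient is undefined when the prefix condition fails.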

A pair of noncommuting words is also called a binary code. We need the following properties of binary codes (see [1, Lemma 3.1]). If and do not commute, then the word is prefix-comparable with all words in . Moreover, there are distinct letters and such that is prefix-comparable with each word in and is prefix comparable with each word in . We shall use these facts for suffixes analogously. They directly imply a weak version of the Periodicity lemma in the following form:

Lemma 1

If is a common prefix (suffix resp.) of and and , then and commute.
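The prefix-comparability property of binary codes quoted above lends itself to a brute-force sanity check. The Python sketch below rests on an assumption: the distinguished word is taken to be the longest common prefix of xy and yx, which is the usual choice in the literature but is not recoverable from the text as reproduced here. The helper names are hypothetical, and checking products of boundedly many factors is an experiment, not a proof.

```python
from itertools import product
from os.path import commonprefix


def commute(x: str, y: str) -> bool:
    """Two words commute if xy == yx."""
    return x + y == y + x


def products(x: str, y: str, max_factors: int):
    """All concatenations of at most max_factors words from {x, y}."""
    for k in range(max_factors + 1):
        for choice in product((x, y), repeat=k):
            yield "".join(choice)


def alpha(x: str, y: str) -> str:
    """Assumed distinguished word: longest common prefix of xy and yx."""
    if commute(x, y):
        raise ValueError("x and y must not commute")
    return commonprefix([x + y, y + x])


def alpha_comparable_with_all(x: str, y: str, max_factors: int) -> bool:
    """Check that alpha is prefix-comparable with every bounded product."""
    a = alpha(x, y)
    return all(w.startswith(a) or a.startswith(w)
               for w in products(x, y, max_factors))
```

For instance, `alpha("ab", "aba")` evaluates to `"aba"`, and the check succeeds for all products of up to five factors.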

A binary morphism (defined on ) is called marked if , where denotes the first letter of . For a general binary morphism , its marked version is the morphism defined by where . It is easy to see, from the facts mentioned above, that the definition of is correct, and that is marked.
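To make the notion of a marked version concrete, here is a small Python sketch. It assumes that the marked version is obtained by conjugation with the longest common prefix α of g(ab) and g(ba), that is, that the image of a letter c becomes α⁻¹·g(c)·α; this formula is an assumption, since the symbols are not recoverable from the text as reproduced here. Morphisms are represented as dictionaries on the letters "a" and "b", and all names are hypothetical.

```python
from os.path import commonprefix


def marked_version(g: dict) -> tuple:
    """Return (alpha, g_m) for a binary morphism g given on letters 'a', 'b'.

    Assumption: alpha = lcp(g(ab), g(ba)) and g_m(c) = alpha^{-1} g(c) alpha.
    """
    ga, gb = g["a"], g["b"]
    if ga + gb == gb + ga:
        raise ValueError("images must not commute (g must be a binary code)")
    alpha = commonprefix([ga + gb, gb + ga])

    def conjugate(image: str) -> str:
        shifted = image + alpha
        # alpha is a prefix of g(c) + alpha for a binary code
        if not shifted.startswith(alpha):
            raise ValueError("alpha is not a prefix of the shifted image")
        return shifted[len(alpha):]

    return alpha, {c: conjugate(g[c]) for c in "ab"}


def is_marked(g: dict) -> bool:
    """A binary morphism is marked if the two images start with distinct letters."""
    return bool(g["a"]) and bool(g["b"]) and g["a"][0] != g["b"][0]
```

For the morphism with images aab and aba, the sketch yields α = a and the marked images aba and baa, which indeed start with distinct letters.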

We remark that, compared to [2], we adopt a more elementary approach, and do not use the powerful technique of the free basis and the Graph lemma. While using the Graph lemma in general makes certain arguments much more comfortable, in our particular case it turns out that the exposition is only negligibly affected by this choice.

3 Formalizing the proof using Isabelle/HOL automatic proof assistant

Isabelle is a generic proof assistant allowing a formalization of mathematical formulas and their proofs. Isabelle was originally developed at the University of Cambridge and Technische Universität München, but now includes numerous contributions from institutions and individuals worldwide. The most important instantiation of Isabelle to higher-order logic is Isabelle/HOL; the reader may consult, for instance, [4] for more details on Isabelle/HOL. The freely available distribution of the proof assistant also contains detailed documentation.

As mentioned in the Introduction, one of the goals of this article is to provide a formalization of the presented result (and of its proof). This is done in Isabelle/HOL. The full formalization is available at [5]. In this article, we give an overview of key concepts, with comments suitable for readers not familiar with Isabelle/HOL. A reader who is not interested in this formalization may skip these sections.

We start by introducing the formalization of the main ideas of Preliminaries. The core building blocks of Isabelle are datatypes, terms and formulae. Our basic datatype, used for a word, is a list, which in Isabelle comes equipped with many of the tools we need, such as concatenation, denoted as multiplication.

3.1 Words and their datatype

To capture a word over a binary alphabet, we use a custom datatype which allows us to work with all binary words. The following code defines the datatype consisting of two values bin0 and bin1:

datatype binA = bin0 | bin1

The next declarations set up abbreviations for the two words of length 1, denoted by and (these are the lists of length 1, denoted by [bin0] and [bin1]).

abbreviation binword0 :: "binA list" where "binword0 ≡ [bin0]"

abbreviation binword1 :: "binA list" where "binword1 ≡ [bin1]"

As an example, we exhibit the claim that all lists over the constructed datatype binA are generated by the two words of length 1. The keyword UNIV stands for the set of all elements of a given type (types are inferred automatically).

lemma Agenerates   UNIV
  by (metis Asingletons basisgenmonoid binUNIV listsUNIV listsbasis wordsunivFMonoidaxioms)

The proof verified by Isabelle is given on the second line. It gives the proof method (here metis) and the names of the claims used (supplied in the full code). This formalization includes most of the concepts mentioned in Preliminaries (in general, when possible, we keep the same notation in the formalization). For instance, let us exhibit the definition of a prefix and its notation ≤p:

definition Prefix (infixl "≤p" 50) where prefdef[simp]: "u ≤p v ≡ ∃z. v = u ⋅ z"

As morphisms and their marked versions form an important part of the tools used, we next give details of their formalization, along with further core concepts of Isabelle.

3.2 Morphisms and their marked versions

We formalize the concept of a (general) morphism using a locale, Isabelle’s environment for dealing with parametric theories. In particular, a locale allows one to introduce global parameters (introduced by the keyword fixes) and assumptions (introduced by the keyword assumes), and thus prevents unnecessary repetition of assumptions in every lemma. As an illustration, we exhibit a simple claim and its proof using these assumptions, which form what Isabelle calls a context, delimited by the keywords begin and end.

locale morphism =
  fixes f
  assumes morph: "f (u ⋅ v) = f u ⋅ f v"
begin
  lemma emptytoempty: "f ε = ε"
    by (metis morph selfappendconv2)
end

Such a lemma in the context in fact produces a claim named morphism.emptytoempty, which is equivalent to the following lemma:

lemma "morphism f ⟹ f ε = ε"
  by (simp add: morphism.emptytoempty)

Note that the assumption named morph contains a term with two free variables, u and v, with no quantifiers. As customary, such variables are understood to be universally quantified, that is, the assumption holds for all u and v. Since types are inferred automatically, this assumption implies that u and v are lists, and f is a mapping from lists to lists.

As mentioned in the Introduction, we see elements of a binary code as images under a morphism. Accordingly, a binary code is formalized by extending the locale morphism by an additional assumption on the images of singletons as follows. This gives rise to a new locale binarycode:

locale binarycode  morphism f  binA list  a list  for f   assumes bincode f   f   f   f 

The declaration f :: binA list ⇒ 'a list specifies the datatype of the parameter f. The given datatype is a mapping from all lists over the datatype binA to the lists over a generic unspecified datatype 'a, thus setting the domain to be all binary words.

The next formalization step we point out consists of the definitions of and in the context of binarycode, i.e., for a given morphism f.

definition   where LCPsimp    f    p f   

definition  fm where fmdefsimp fm   w   f w  

The definition of fm uses an anonymous function written with the conventions of λ-calculus.

The next claim is also in the context of binarycode, giving an essential statement on : is a prefix of for every . We display the formalized proof as well; it is done by induction on the list w (that is, the base case is the empty list, and the induction step proves the claim for the list [a]w assuming that it holds for w).

lemma w  p f w
proof (induct w)
  case Nil (* case *)
  then show ?case
    by simp
  case (Cons a w) (* induction step: case , *)
  then show ?case
  proof
    have  p f a
      using 0 1 alphabetor by metis
    show ?thesis
      using prefprolongOF  p f a     p f w           hdwordof  a w
      by (metis appendassoc morph)
  qed
qed

This proof gives a rough idea about the level of detail contained in the formalization. Note that the induction step uses the validity of the claim for singletons (facts named 0 and 1) and the simple fact (called prefprolong) which claims that if and , then . The latter claim illustrates what can be considered a single step in the formalization. Note nevertheless that even this step is based on an auxiliary lemma which is proved elsewhere using even more elementary auxiliary lemmas.
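The statement proved by induction above can also be tested experimentally on small inputs. The Python sketch below assumes that the lemma says that α, taken as the longest common prefix of f(ab) and f(ba), is a prefix of f(w)·α for every binary word w; this reading is an assumption, since the symbols are missing from the listing as reproduced here. The function names are hypothetical, and a bounded search is of course no substitute for the verified proof.

```python
from itertools import product
from os.path import commonprefix


def apply(f: dict, word: str) -> str:
    """Image of a binary word under the morphism f given on letters 'a', 'b'."""
    return "".join(f[c] for c in word)


def alpha_of(f: dict) -> str:
    """Assumed definition: longest common prefix of f(ab) and f(ba)."""
    return commonprefix([apply(f, "ab"), apply(f, "ba")])


def alpha_prefix_property(f: dict, max_len: int) -> bool:
    """Check that alpha is a prefix of f(w) + alpha for all |w| <= max_len."""
    alpha = alpha_of(f)
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            w = "".join(letters)
            if not (apply(f, w) + alpha).startswith(alpha):
                return False
    return True
```

The case n = 0 corresponds to the base case of the induction, where f(w) is empty and the claim is trivial.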

4 The result

Let and be two binary codes, that is and . Our aim is to describe the intersection . The aim is achieved by a series of reformulations.

First, we shall see the languages and as ranges of the morphisms and over , defined by and . The structure of the intersection of and will follow from a stronger result: a characterization of the coincidence set of and , defined by

Indeed, we have

Second, instead of we shall investigate

where is the marked version of , and is the marked version of . The set is easier to investigate since both and are marked. The more difficult part of the result is establishing the relationship between and .

Assume that contains a nonempty word, that is, that there are nonempty words and such that . Then both and are prefixes of for a sufficiently large , which implies that and are prefix comparable. Without loss of generality we shall suppose . Let . Then


Formalization: basic locales and the coincidence set

The morphisms and are formalized as two instances of the locale binarycode, producing a new locale binaryintersectionpossiblyempty. This gives access to the words and and to marked versions of and , which obtain their expected names using notation. (It also gives access to the auxiliary claims of binarycode for the two morphisms.) The assumption is then added in yet another locale.

locale binaryintersectionpossiblyempty =
  g0: binarycode "g0 :: binA list ⇒ 'a list" +
  h0: binarycode "h0 :: binA list ⇒ 'a list"
  for g0 h0
begin
notation h0 h (* setting the notation to from the parent locale representing *)
notation g0 g
notation h0fm h
notation g0fm g
end

locale binaryintersection = binaryintersectionpossiblyempty +
  assumes alphas h p g
begin
definition  where   h  g
end

Using the datatype used for a binary morphism (binA list  a list), we define the coincidence set as follows:

definition CoincidenceSet :: "(binA list ⇒ 'a list) ⇒ (binA list ⇒ 'a list) ⇒ (binA list × binA list) set"
  where "CoincidenceSet g h ≡ {(r,s). g r = h s}"
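The coincidence set is the set of all pairs (r, s) with g(r) = h(s), so a finite portion of it can be enumerated by brute force. The following Python sketch does this for arguments of bounded length. It is only meant for experimenting with small examples, the function names are hypothetical, and the two sample morphisms in the usage note are not taken from the paper.

```python
from itertools import product


def apply(morphism: dict, word: str) -> str:
    """Image of a binary word under a morphism given on letters 'a', 'b'."""
    return "".join(morphism[c] for c in word)


def words_up_to(max_len: int, alphabet: str = "ab"):
    """All words over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def coincidence_set(g: dict, h: dict, max_len: int) -> set:
    """All pairs (r, s) with |r|, |s| <= max_len and g(r) == h(s)."""
    by_image = {}
    for s in words_up_to(max_len):
        by_image.setdefault(apply(h, s), []).append(s)
    return {(r, s)
            for r in words_up_to(max_len)
            for s in by_image.get(apply(g, r), [])}
```

With the sample morphisms g(a) = ab, g(b) = ba and h(a) = a, h(b) = bab, the pairs of length at most two are (ε, ε), (aa, ab) and (bb, ba); their common images abab and baba lie in the intersection of the two ranges.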

The crucial relation between and is formalized as an equivalence (denoted by ):

lemma solutionmarkedversion g0 r  h0 s    g r  h s    using gmarkedfmconjugates hmarkedfmconjugates def  by smt appendassoc appendsameeq g0fmconjugates h0fmconjugates sameappendeq

Again, the displayed proof references auxiliary claims that are not present in the excerpt from the whole formalization which consists of formalizing many “obvious” steps.

4.1 Block structure of

We call pairs solutions. is a free semigroup and the elements of its minimal generating set are minimal solutions.

The structure of heavily depends on the existence of the following three pairs of words, called blocks: We say that is the starting block if , and for any . Note that . We say that is the -block if is a prefix , and is a minimal solution of and . The -block and -block are also called letter blocks. Since and are marked, the process of the construction of a solution is deterministic in the following sense. For any comparable and such that , there is at most one extension of either or which keeps the images comparable. This implies the following facts:

  • each block (the starting block, the -block and the -block) is unique if it exists;

  • any solution in has a unique decomposition into letter blocks.

Similarly, we obtain the following characterization of morphisms without the starting block.

Lemma 2

If the starting block does not exist, then contains at most one minimal solution.


Note that for , the pair is the starting block. Therefore, the word is not empty, and since and are marked and there is no starting block, the words and satisfying

are constructed deterministically, using the mentioned procedure, letter by letter and keeping the images prefix comparable. If such solution exists, then the first one produced by this procedure is a prefix of any other nonempty solution, and using (1), it is thus the unique minimal solution of .
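The deterministic letter-by-letter construction used above can be sketched in Python for marked morphisms. The function below tries to build the minimal solution whose first component starts with a given letter by always extending the argument with the shorter image, using the only letter that keeps the two images prefix-comparable. All names are hypothetical, the sample morphisms in the usage note are not from the paper, and a result of `None` after the step bound is exhausted only means that no solution was found within that bound.

```python
def apply(morphism: dict, word: str) -> str:
    """Image of a binary word under a morphism given on letters 'a', 'b'."""
    return "".join(morphism[c] for c in word)


def is_marked(morphism: dict) -> bool:
    """Images are nonempty and start with distinct letters."""
    return all(morphism.values()) and morphism["a"][0] != morphism["b"][0]


def letter_with_image_starting(morphism: dict, ch: str):
    """The unique letter whose image starts with ch, or None."""
    for c in "ab":
        if morphism[c][0] == ch:
            return c
    return None


def letter_block(g: dict, h: dict, first: str, max_steps: int = 1000):
    """Search for the minimal (e, f) with e starting with `first`, g(e) == h(f)."""
    if not (is_marked(g) and is_marked(h)):
        raise ValueError("both morphisms must be marked")
    e, f = first, ""
    for _ in range(max_steps):
        ge, hf = apply(g, e), apply(h, f)
        if not (ge.startswith(hf) or hf.startswith(ge)):
            return None  # images diverged: no solution on this path
        if ge == hf:
            return e, f  # first equality reached: minimal solution
        if len(ge) > len(hf):
            c = letter_with_image_starting(h, ge[len(hf)])
            if c is None:
                return None
            f += c
        else:
            c = letter_with_image_starting(g, hf[len(ge)])
            if c is None:
                return None
            e += c
    return None  # bound exhausted without reaching equality
```

For g(a) = ab, g(b) = ba and h(a) = a, h(b) = bab, the search started with a returns (aa, ab), and the search started with b returns (bb, ba).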

Let us further suppose that the starting block exists. Then we have the following reduction of elements of to elements of .

Lemma 3

If the starting block exists, and , then is a prefix of , and .


As is the starting block, and using (1), we have . Thus, is a prefix of . We may write and obtain

This implies that each solution has a block decomposition by which we mean the decomposition of into letter blocks.

However, the structure of does not necessarily mirror the simple structure of . Although we may be tempted to conclude that consists of elements where , the problem is that is ill-defined if is not a suffix of . Instead, we have the following characterization:

Lemma 4

The inclusion is Lemma 3.

To see the inclusion , we first verify, using the properties of the starting block, that implies,

The claim now follows from (1).

Formalization of minimal solutions and blocks

The definition of a minimal solution (for a morphism g, word r, morphism h, and word s, in this order) is formalized in the following way, introducing a useful short notation g r m h s:

definition MinimalSolution  binA list  a list  binA list  binA list  a list   binA list  bool   m   80808080 51   where minsoldef  MinimalSolution g r h s  r    s    g r  h s   r s r np r  s p s  g r  h s  r  r  s  s (* np stands for nonempty prefix *)

Formalization of Lemma 2, dealing with the case of no starting block, is rewritten and proven as:

lemma nopqoneminimal   assumes  p q   g p  h q    and g0 r m h0 s    and g0 r m h0 s  shows rs  rs

The fact that there is at most one starting block is stated (and proven) in the second basic way of writing assumptions and claims in Isabelle, using implications ⟹.

lemma atmostonepq z    z  g r  h s  p q z  g p  h q   r s z  g r  h s  p p r  q p s

Note that the lemma has two assumptions, namely and , and the conclusion is a complicated logical formula, which itself contains an implication which is nevertheless written as ⟶. This illustrates two levels on which the formalization operates, and which reflect the composed name “Isabelle/HOL” of the proof assistant we use. While the formula of the conclusion is formulated in the object logic, namely HOL (see [4]), the implication ⟹ is part of the metalogic proper to Isabelle, called Pure. This metalogic is best seen as an abbreviation for the natural language construction “if … then”. That is, the whole claim should be read as: “If , and if , then the following formula holds ….”

Finally, the assumption of existence of such a starting pair is realized using a locale, with two additional assumptions called pq and pqminimal. Lemma 4 is formalized within this locale.

locale binaryintersectionpq  binaryintersection  for p q
  assumes pq   g p  h q
      and pqminimal   g p  h q  p p p  q p q
begin
lemma charsolutions g0 r  h0 s   e f g e  h f  p s p  e  q s q  f  r  pep  s  qfq (* Lemma 4 *)
end

4.2 Letter blocks as morphisms

Since the elements of decompose into letter blocks, we define morphisms and on where is the -block. The morphisms are partial if some letter block does not exist. The characterization is finally reduced to characterizing the set satisfying the condition of Lemma 4. Namely we set

Lemma 5

If , then .


Since is a suffix of , we have that is a suffix of . Since is also a suffix of , we deduce that is a suffix of . Similarly, we obtain that is a suffix of . Hence .

We also have the following simple property.

Lemma 6

If , where is positive and , then also .


If is a suffix of , then is a suffix of . It implies that is a suffix of . Similarly, if is a suffix of , it is a suffix of .

We point out three more auxiliary arguments.

Lemma 7

If , with , then

  1. is a proper suffix of .

  2. .

  3. .


1. If is not a proper suffix of , then implies that is a suffix of . From , and we deduce that is a suffix of , contradicting the minimality of .

2. Recall that is suffix comparable with any , since . This implies that and are suffix comparable. It is therefore enough to show that is shorter than . From 1 we have

and the claim follows from .

3. If is not a suffix of , then is a proper suffix of since . This contradicts 2 in view of and .

The most challenging part of the proof is the following lemma. It constitutes the real core of the proof.

Lemma 8

If for some and , then also .


Without loss of generality, let . The claim follows from Lemma 6 if . Let therefore , and assume

We want to show that is a suffix of . This is equivalent to showing that is a suffix of . Assume the contrary.

The equality and Lemma 7 1 imply that is a suffix of . Since , we have that is for some .

Let , and let and be distinct letters such that and . Let, moreover, . Then is the longest common suffix of and . Since is a suffix of both and , we deduce that is a suffix of and hence

Since is a suffix of and not a suffix of , we obtain that is a suffix of . From and , we now have which yields

The two inequalities above imply that and . Since is a suffix of , is a suffix of and and are suffix comparable, the Periodicity lemma implies that and commute (see Lemma 1). Since both and are marked, we obtain that which contradicts .

We can now characterize the slightly surprising possibility that the intersection of two free binary monoids is infinitely generated. This happens when both letter blocks exist, but one of the singletons is not in . By symmetry, we shall therefore suppose, in the following classification lemma, that and .

Lemma 9

Assume that both letter blocks exist, and . Then is a minimal element of if and only if or

where is the least non negative integer such that .


From , we have that is a suffix of . Hence there exists a least non negative integer such that is a suffix of , that is, such that .

Lemma 7, items 1 and 3 yield that


which implies that for all . We may now characterize the minimal generating set of .

As , using Lemma 5, we have that the only minimal generating element starting with is .

Assume now that is a minimal generating element of starting with . By (2), is a suffix of with , hence . Let us write . As , Lemma 5 implies , and minimality of implies . Hence, the only occurrence of in is as its suffix.

Assume now that has prefix , suffix , and there is no other occurrence of . Let with and nonempty. As is a prefix of , we may write , and thus by (2) we have , which produces an occurrence of , and thus . Therefore, there is no decomposition of , and it is a minimal element of .

Formalization of letter blocks, the set and the result

We skip the formal construction of morphisms and , as many more Isabelle concepts would need to be introduced in order to explain its technical details. We invite the reader to inspect it in the full code.

The case when only one letter block exists is treated rather implicitly in the human proof. Nevertheless, in the formalization, we have the following explicit claim.

lemma uniqueblock  assumes g e m h f     and  e f g e m h f ef  ef    and g0 r m h0 s   shows rs  p  e  p q  f  q

The assumption of existence of both letter blocks is introduced as a locale which is used further on.

locale binaryintersectionblocks  binaryintersectionpq   assumes  minblock0  g   m h   and           hdblock0    0  bin0 and
  (*  0 is the first element of the list   *)            minblock1  g   m h   and           hdblock1    0  bin1

The set is introduced as the predicate of its elements, which is more suitable for further use.

definition Tpred  binA list  bool where Tpred   p s p     q s q    definition T where T     Tpred 

The relation between the solutions, the morphisms and , and the set (i.e., the predicate Tpred), is now a consequence of a few more straightforward lemmas in Isabelle resulting in the following:

corollary KeyRelation g0h0  p     pq     q    Tpred

Formalizations of Lemmas 5 and 8 are straightforward:

lemma Tprefixcode assumes Tpred 1 and Tpred 1  2 shows Tpred 2 (* Lemma 5 *)

lemma lastblock Tpred z  c  Tpred c (* Lemma 8 *)

The human proof of Lemma 8 contains several steps which depend on some level of insight into properties of binary codes. The formalization of this proof is therefore particularly interesting and important (and demanding). The main proof is preceded by a dedicated locale that contains forty-three claims, including the claims of Lemma 7. In a sense, therefore, the proof of the lemma is fragmented into forty-three smaller steps. It should be made clear, however, that the fragmentation is to a great extent a matter of taste, since a single proof can often be quite naturally divided into several lemmas, or vice versa. Moreover, fourteen of the forty-three lemmas are of a purely preparatory nature, allowing claims formulated for prefixes to be used in a reversed way for suffixes. This is something which in the given human proof is done by a simple appeal to a “mirrored situation”, an insight that can hardly be formalized in a uniform way.

We do not list the formalized equivalents of Lemmas 6 and 7, as they are split in the code into several lemmas.

The characterization of the set is concluded in the two following locales, the first, called binaryintersectionblockstrivial, is for the case , the second, named binaryintersectionblocksnontrivial, for the case . The term B T stands for a basis of , i.e., the set of its minimal elements, and the term Suc t represents .

locale binaryintersectionblockstrivial  binaryintersectionblocks
  assumes easyblock0 p s p     q s q
      and easyblock1 p s p     q s q
begin
theorem Tpred  (* i.e., *)
end

locale binaryintersectionblocksnontrivial  binaryintersectionblocks  for t
  assumes easyblock p s p     q s q
      and tblock   q s q      t
      and tblocksuc  q s q      Suc t
begin
corollary Tbasis B T        p   Suc t s    Suc t f butlast  (* Lemma 9 *)
end

Let us explain the notation in the claim Tbasis: f stands for “is factor of” and the function butlast returns the list without its last element.

5 Summary of the proof

Returning from the coincidence set back to the intersection properly speaking, the main claim (Theorem 2) of [3] is that if and are binary codes, then the intersection has one of the following forms:


Let us summarize our proof and show that it agrees with the formulation from [3]. Recall that, by definition, we have and .

0. If , then the claim holds for .

1. Let therefore contain a nonempty word. That is, contains at least one minimal solution. Then and are prefix comparable. By symmetry, we assume and is well defined.

1.1. If there is no starting block, then the construction of a solution is deterministic, hence contains a unique minimal solution . Then with and .

1.2. Let now the starting block exist, i.e., there exist such that . Then each solution has a block decomposition . We define non erasing morphisms such that, for a solution with the block decomposition , we have . Let be the set of block decompositions of all solutions. That is, let

Note that at this moment we do not guarantee that

, , that is, need not be defined. Because of the existence of at least one minimal solution, we may however assume, by symmetry, that for some . Then by Lemma 8, in particular .

1.2.1. If is not a letter block, then , and with

1.2.2. Suppose that is a letter block. If , then , and with If , then by Lemma 9, there is a non negative integer such that

Using Lemma 7 2, we now have


This last case, in which the intersection is infinitely generated, is further specified in [3, Theorem 3]. The generating set is of one of the following forms (we keep the notation of words from [3], although it is not compatible with the notation above; however, we modify integer variables):


for some , where and are nonempty and .



The possibility corresponds to the situation when is a suffix of , where is a suffix of such that (and ). In other words, the difference between () and () is whether contributes to the eventual occurrence of as a suffix of .

We finally illustrate the theory by several examples. The first two are from [3].

Example 1
Example 2

The noteworthy property of the following example is that is a suffix of . The example therefore illustrates the possibility () above.

Example 3
Example 4

Finally, Table 1 lists various situations in which the intersection is generated by at most one word. The last line is of particular interest: all three blocks exist, yet the intersection contains the empty word only. Note that is not a suffix of for any nonempty in that case.



aabb ab aba bab a
aa ab aba ba a
aabb ab aba babb a
aab aba aba baa a
aab abb aba bba a
aabb ab abaa bb a
aab abb aa bb a
aab abb aab bba a
aab abb aba bab a
abaab ababab a ba aba


Table 1:
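Claims of this kind can be cross-checked on bounded word lengths by computing both submonoids by brute force. The Python sketch below does so. It assumes that the first four columns of a table row list the generators x, y, u and v of the two monoids; the column headers are not recoverable from the text as reproduced here, so this reading is an assumption. All function names are hypothetical.

```python
def submonoid_words(generators, max_len: int) -> set:
    """All words of length <= max_len in the submonoid generated by `generators`."""
    words = {""}
    frontier = {""}
    while frontier:
        new_words = set()
        for w in frontier:
            for gen in generators:
                candidate = w + gen
                if len(candidate) <= max_len and candidate not in words:
                    words.add(candidate)
                    new_words.add(candidate)
        frontier = new_words
    return words


def intersection_up_to(x: str, y: str, u: str, v: str, max_len: int) -> set:
    """Words of length <= max_len lying in both {x,y}^* and {u,v}^*."""
    return submonoid_words([x, y], max_len) & submonoid_words([u, v], max_len)
```

Under the stated reading of the columns, the last row gives x = abaab, y = ababab, u = a and v = ba. Every nonempty product of x and y ends with the letter b, while every nonempty product of u and v ends with the letter a, so the bounded intersection contains only the empty word, in agreement with the remark on the last line.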


The authors acknowledge support by the Czech Science Foundation grant GAČR 20-20621S.


  • [1] Christian Choffrut and Juhani Karhumäki. Combinatorics of words. In Handbook of Formal Languages, vol. 1, pages 329–438. Springer-Verlag, Berlin, Heidelberg, 1997.
  • [2] Štěpán Holub. Binary intersection revisited. In Robert Mercaş and Daniel Reidenbach, editors, Combinatorics on Words, pages 217–225, Cham, 2019. Springer International Publishing.
  • [3] Juhani Karhumäki. A note on intersections of free submonoids of a free monoid. Semigroup Forum, 29(1):183–205, Dec 1984.
  • [4] Lawrence C. Paulson, Tobias Nipkow, and Makarius Wenzel. From LCF to Isabelle/HOL. Formal Aspects of Computing, 31:675–698, 2019.
  • [5] Štěpán Holub and Štěpán Starosta. Combinatorics on Words Formalized - Binary Intersection Formalized, June 2020.