Quantum Natural Language Processing on Near-Term Quantum Computers

by Konstantinos Meichanetzidis, et al.

In this work, we describe a full-stack pipeline for natural language processing on near-term quantum computers, aka QNLP. The language modelling framework we employ is that of compositional distributional semantics (DisCoCat), which extends and complements the compositional structure of pregroup grammars. Within this model, the grammatical reduction of a sentence is interpreted as a diagram, encoding a specific interaction of words according to the grammar. It is this interaction which, together with a specific choice of word embedding, realises the meaning (or "semantics") of a sentence. Building on the formal quantum-like nature of such interactions, we present a method for mapping DisCoCat diagrams to quantum circuits. Our methodology is compatible both with NISQ devices and with established Quantum Machine Learning techniques, paving the way to near-term applications of quantum technology to natural language processing.




1 Introduction

In recent years, research has flourished in the rapidly emerging fields of quantum machine learning and quantum artificial intelligence [38, 15, 14, 42]. These terms cover a vast range of topics, from consideration of agent-environment interaction in the quantum domain to potential gains in using quantum devices as subroutines for machine learning algorithms. For the purposes of this work, quantum machine learning will refer to supervised or unsupervised machine learning employing variational quantum circuits in place of deep neural networks [4]. The study of variational quantum circuits is important in itself, as they constitute the setting for quantum computational supremacy experiments in the current era of noisy intermediate scale quantum (NISQ) devices [34]. In this work, we focus on natural language processing (NLP), a sub-field of machine learning covering a diverse interdisciplinary landscape. Our contribution to the field will be the introduction of a framework for quantum natural language processing (QNLP), tailored for implementation on NISQ devices.

We consider the distributional-compositional models of meaning (DisCoCat) for natural language [12], mediating between the rule-based approaches to language syntax and the statistical approach to language semantics, most famously associated with John Rupert Firth’s assertion that “You shall know a word by the company it keeps”. In DisCoCat, structure is introduced via a compositional grammar model, that of pregroup grammar, which is then endowed with a “distributional” embedding of words into a vector space, where vector geometry captures the correlations between words according to some corpus. The interplay between compositionality of the grammar and the distributional word representation gives rise to semantics for phrases and sentences: starting from embeddings for individual words (extracted from a corpus), the compositional structure of grammar makes it possible to give meaning to larger syntactic units. Tasks such as concept similarity or question answering can then be straightforwardly translated into geometric questions about vectors and tensors, and solved computationally.

Compositional grammar models, such as pregroup grammars and context-free grammars (CFG), have a natural tensor structure [19, 32, 22] and can be considered quantum-native [43, 3, 8]. Building on the recent proposal of quantum algorithms for NLP tasks by Zeng and Coecke [43], we take advantage of the tensor structure in order to construct a map from DisCoCat models to variational quantum circuits, where ansätze corresponding to lexical categories, aka parts-of-speech (POS), are connected according to the grammar to form circuits for arbitrary syntactic units. In some of its applications, the original Zeng-Coecke algorithm relies on the existence of a quantum random access memory (QRAM) [21], which is not yet known to be efficiently implementable in the absence of fault-tolerant scalable quantum computers [2, 7]. Here we take a different approach, using the classical ansatz parameters to encode the distributional embedding and avoiding the need for QRAM entirely. The cost function for the parameter optimisation is informed by a corpus, already parsed and POS-tagged by classical means. Taken all together, the pipeline is as follows:


In the pipeline, a POS-tagged sentence in a corpus is first parsed to a diagram capturing its grammatical structure, which is further simplified to some other diagram more suitable for implementation. The simplified diagram is then turned into a variational quantum circuit, which is finally compiled for NISQ devices. This can be done by state-of-the-art compilers, such as CQC’s t|ket⟩, which is a platform-agnostic compiler that interfaces with current NISQ architectures [39]. A Python wrapper for t|ket⟩ can be found at github.com/CQCL/pytket.

Beyond the fact that variational quantum circuits are amenable to implementation on existing NISQ hardware, a reason for constructing such variational embeddings is to exploit an entirely novel feature space in which to encode the distributional semantics [24]. Quantum-enhanced feature spaces provide a dimension which is exponential in the number of qubits, so that QNLP models have the potential to take advantage of the space for data-intensive tasks. Furthermore, the optimisation landscapes spanned by the variational quantum circuits are of a different shape from those appearing in artificial neural networks, so there is the possibility for alternate performance profiles over equivalent benchmark tasks.


This work was originally commissioned by Cambridge Quantum Computing (CQC) and was carried out independently by the CQC team and the Hashberg team.

2 From Sentence to Diagram

For the purposes of constructing our QNLP model, we build on work which uses pregroup grammars, but context-free grammars (CFG) would be equally suitable for our construction. Note that pregroup grammars are weakly equivalent to CFGs [6]. To construct DisCoCat models of meaning, diagrams encoding the pregroup grammatical structure are constructed directly inside of compact closed categories (a special case of rigid categories), giving semantics to the model.

Specifically, the diagrams in this work represent complex matrices, i.e. they live in the compact closed category fHilb. Each atomic pregroup type is associated with a finite-dimensional Hilbert space, each individual (typed) word is associated with a pure state in the Hilbert space associated to its type, and the pregroup grammatical structure is realised as an interaction between the word states mediated by certain entangling effects.

2.1 Compositionality by Grammar

Definition 2.1.

A pregroup P is the rigid category (with chosen duals) freely generated by a finite set $B$ of atomic types. Specifically, the objects in P are generated from $B$ as follows:

  • every atomic type $b \in B$ is a type (aka object) in P;

  • for every type $t$ in P, the left adjoint $t^l$ and the right adjoint $t^r$ are also types in P;

  • for every pair of types $s, t$ in P, the product type $s \cdot t$ is also a type in P (typically written $st$);

  • the product operation is strictly associative and has a bilateral unit, the unit type $1$.

The pregroup P is a poset category, i.e. every pair of types $s, t$ in P has at most one morphism $s \to t$. As is conventional in poset categories, we write $s \leq t$ for the unique morphism $s \to t$, if it exists. The morphisms of P are generated as follows:

  • for every type $t$ in P, we have morphisms $t^l \cdot t \leq 1$ and $t \cdot t^r \leq 1$, known as contractions or caps;

  • for every type $t$ in P, we have morphisms $1 \leq t \cdot t^l$ and $1 \leq t^r \cdot t$, known as expansions or cups.

All further equalities between objects and between morphisms follow from the requirement that P be a poset category.

Remark 2.2.

Some useful equalities which can be derived from the requirement that P be a poset category include: the snake equations between caps and cups; the cancellation of left and right duals, $(t^l)^r = t = (t^r)^l$; the stability of the unit type under duals, $1^l = 1 = 1^r$; the interplay between duals and products, $(s \cdot t)^l = t^l \cdot s^l$ and $(s \cdot t)^r = t^r \cdot s^r$.
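The equalities of Remark 2.2 can be checked mechanically on a small encoding of pregroup types. The sketch below is illustrative only and is not taken from the original work: it assumes the common integer encoding in which a simple type is a pair (atom, z), with z = 0 for the atom itself, z = -1 for its left adjoint and z = +1 for its right adjoint. The names `left` and `right` are hypothetical.

```python
from typing import List, Tuple

# Assumed encoding: a simple type is (atom, z), where z counts adjoints
# (z = 0 is the atom, z = -1 its left adjoint, z = +1 its right adjoint).
Simple = Tuple[str, int]
Type = List[Simple]  # a product of simple types; [] is the unit type


def left(t: Type) -> Type:
    """Left adjoint of a product: reverse the factors and lower each z by one."""
    return [(atom, z - 1) for atom, z in reversed(t)]


def right(t: Type) -> Type:
    """Right adjoint of a product: reverse the factors and raise each z by one."""
    return [(atom, z + 1) for atom, z in reversed(t)]


n: Type = [("n", 0)]
s: Type = [("s", 0)]

# A transitive verb type, built as (right adjoint of n) . s . (left adjoint of n).
verb: Type = right(n) + s + left(n)
```

With this encoding, the cancellation of duals, the stability of the unit type and the order-reversing behaviour of duals on products all hold by construction, which makes it a convenient data structure for the parsing steps that follow.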

We use a graphical calculus for autonomous categories to depict morphisms in a pregroup (which are also known as reductions, following the tradition of rule-based grammar). In particular, the contractions and expansions are depicted as follows:


In our diagrams, objects are multiplied left-to-right and morphisms are composed top-to-bottom: in the above, the two morphisms on the left are the caps/contractions and the two morphisms on the right are the cups/expansions. The empty type is the tensor unit, hence it is not depicted.

Definition 2.3.

Given a pregroup P, a pregroup grammar G for P is a pair consisting of a lexicon (a finite set of words) together with a typing map associating a pregroup type to each word.

When working with pregroup grammars, pregroup types subsume the role traditionally played by lexical categories such as nouns, adjectives, verbs, adverbs, etc. [28]. If the lexicon is already POS-tagged by other means, then a pregroup grammar can be obtained by associating pregroup types to each POS-tag. A pregroup grammar G can equivalently be seen as the strict monoidal category generated from P by adding, for each word, a state labelled by that word on the type assigned to it. This is the same as saying that G is the rigid category (with chosen duals) freely generated by the atomic types of P and by one such state for every word in the lexicon.

Remark 2.4.

Taking the left/right duals gives monoidal functors $P \to P^{op}$ and by extension monoidal functors $G \to G^{op}$, where the opposite categories $P^{op}$ and $G^{op}$ (those with arrows reversed) are equipped with the opposite monoidal product with respect to the original categories P and G.

Definition 2.5.

Consider a pregroup P whose atomic types include a chosen sentence type $s$. A grammatical sentence is a non-empty sequence of words $w_1, \ldots, w_n$ together with a sequence of contractions and expansions witnessing that the product type associated to the sequence reduces to the sentence type:

$$\tau(w_1) \cdot \ldots \cdot \tau(w_n) \leq s$$

where $\tau$ denotes the typing map of the grammar. In this work, empty products are identified with the unit type and non-empty products are expanded left-to-right. (Footnote 1: Such a choice of convention is made necessary by non-commutativity of the product operation on types.)

Deciding grammaticality, that is, deciding whether the product type associated to a given sequence of words by a pregroup grammar G reduces to the chosen sentence type, is an efficiently solvable problem [33, 17]. In the graphical calculus, the witness of grammaticality for a sequence of words is a pattern of nested caps (and possibly cups) connecting the atoms in the product type in such a way as to leave only one type open:


There exist several measures of how grammatical a sentence is, such as Harmony. For pregroup grammars, Harmony can be defined as the number of non-sentence atomic types left open after parsing [30]. Harmony maximisation, i.e. finding the parsing closest to a witness of sentence grammaticality, is a non-trivial problem which enjoys a polynomial quantum speed-up [40].
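To make the notion of reduction concrete, here is a minimal sketch of a contraction-only parser. It is an illustration, not the efficient algorithm of [33, 17]: it assumes the integer encoding of adjoints (z = -1 for left, z = +1 for right), uses a hypothetical three-word lexicon, and contracts greedily from left to right, which can miss valid reductions when several contractions compete. The helper `open_non_sentence_types` simply counts the leftover types that the text uses to define Harmony.

```python
from typing import Dict, List, Tuple

Simple = Tuple[str, int]  # (atom, adjoint order); -1 = left, +1 = right

# Hypothetical toy lexicon, not taken from any corpus.
LEXICON: Dict[str, List[Simple]] = {
    "Alice": [("n", 0)],
    "Bob": [("n", 0)],
    "loves": [("n", 1), ("s", 0), ("n", -1)],
}


def reduce_greedily(words: List[str]) -> List[Simple]:
    """Apply contractions x^(z) . x^(z+1) -> 1 with a stack, left to right."""
    stack: List[Simple] = []
    for word in words:
        for atom, z in LEXICON[word]:
            if stack and stack[-1] == (atom, z - 1):
                stack.pop()  # contraction: the adjacent pair cancels
            else:
                stack.append((atom, z))
    return stack


def is_sentence(words: List[str]) -> bool:
    """True if the greedy reduction leaves exactly the sentence type open."""
    return reduce_greedily(words) == [("s", 0)]


def open_non_sentence_types(words: List[str]) -> int:
    """Number of simple types other than the sentence type left open."""
    return sum(1 for t in reduce_greedily(words) if t != ("s", 0))
```

For instance, `is_sentence(["Alice", "loves", "Bob"])` holds, while the scrambled sequence `["loves", "Alice", "Bob"]` leaves two noun-like types open next to the sentence type.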

2.2 Distributional Meaning

Given a pregroup grammar G, semantics for the grammar are given by monoidal functors $G \to C$, where C is some suitable rigid category (with compact closed categories as a special case). In distributional semantics, dagger compact categories of finite-dimensional Hilbert spaces are often of interest, such as the category fHilb of complex matrices used in this work or its real matrices analogue, used in most traditional approaches to NLP. There is a vast literature on methods for associating distributional semantics to words, from the early bag-of-words approaches to more modern ones based on artificial neural networks [31]. Non-vectorial representations have also appeared in the literature [5]. Compositional distributional semantics has been successfully benchmarked against more traditional approaches, outperforming several of the contemporary techniques [23, 25, 26, 41].

We work in a presentation of fHilb where objects are the positive integers (the possible dimensions for finite-dimensional Hilbert spaces) and morphisms $n \to m$ are $m$-by-$n$ complex matrices. The presentation is made dagger compact by taking the conjugate transpose of matrices as the dagger, together with the following chosen duals:

  1. we pick our dual objects as $n^* := n$;

  2. we make a choice of orthonormal basis (the computational basis) for all $n$ prime;

  3. we extend our chosen computational bases to all $n$ by considering product bases;

  4. we define the cap $n \otimes n \to 1$ by setting $|i\rangle \otimes |j\rangle \mapsto \delta_{ij}$ on the chosen orthonormal basis for $n$;

  5. we define the cup $1 \to n \otimes n$ as the adjoint of the cap.
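The chosen duals listed above can be checked numerically. The following NumPy sketch assumes the convention used in this text, in which the cap is the effect sending a pair of equal computational basis vectors to 1 (and all other pairs to 0) and the cup is its adjoint; the dimension `d = 3` is an arbitrary choice for illustration. It verifies both snake equations.

```python
import numpy as np

d = 3  # arbitrary dimension, chosen only for illustration

# Cap as an effect on d (x) d: a 1-by-d^2 matrix sending |i>|j> to delta_ij.
cap = np.eye(d).reshape(1, d * d)
# Cup as the adjoint of the cap: a d^2-by-1 matrix (an unnormalised Bell state).
cup = cap.conj().T

identity = np.eye(d)

# First snake: (cap (x) id) after (id (x) cup) should be the identity on d.
snake_a = np.kron(cap, identity) @ np.kron(identity, cup)
# Second snake: (id (x) cap) after (cup (x) id) should also be the identity.
snake_b = np.kron(identity, cap) @ np.kron(cup, identity)
```

Both `snake_a` and `snake_b` come out as the d-by-d identity matrix, which is what makes bending wires in the diagrams a sound manipulation.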

With the above presentation, giving distributional semantics concretely means the following:

  • a finite dimension is associated to each atomic type of the pregroup P (Footnote 2: linearity and finite-dimensionality actually force the same dimension to be associated to all duals of an atomic type);

  • a complex vector is associated to each word, of dimension equal to the product of the dimensions of the atomic types appearing in the word’s type.

We refer to the data above as a word embedding. For example, the word “haunt” would have a complex vector associated to it by a word embedding, of dimension equal to the product of the dimensions of the three atomic types appearing in its type, which we can represent as follows in the graphical calculus:

Remark 2.6.

Because information about the factors of the dimension is derivable from the word embedding together with the word type, we can equivalently treat the above as a vector in the full product dimension or as a tensor of arity 3 in the three individual dimensions. In general, the arity of the tensor associated to a word is the number of atomic types appearing in its type.

Given a word embedding, every pregroup grammatical parsing can be turned into a tensor contraction by sending the caps and cups of G to the chosen caps and cups of fHilb, as in the following example:


As a special case, grammatical sentences find interpretation as vectors of the dimension associated to the sentence type, as the witness of grammaticality results in contraction of all tensor legs except for one of sentence type.
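As a concrete instance of this tensor contraction, the sketch below computes the meaning of a subject-verb-object sentence with NumPy. Everything specific in it is an assumption made for illustration: the dimensions `d_n` and `d_s`, the random embeddings, and the words themselves. The transitive verb is an arity-3 tensor whose outer legs are contracted with the subject and object vectors, leaving a single open leg of sentence type.

```python
import numpy as np

rng = np.random.default_rng(0)
d_n, d_s = 4, 2  # assumed dimensions for the noun and sentence types


def random_complex(shape):
    """Random complex array standing in for a trained word embedding."""
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)


alice = random_complex(d_n)            # noun: arity-1 tensor
bob = random_complex(d_n)              # noun: arity-1 tensor
loves = random_complex((d_n, d_s, d_n))  # transitive verb: arity-3 tensor

# The two contractions pair the subject with the first verb leg and the
# object with the last verb leg; only the sentence leg remains open.
sentence = np.einsum("i,isj,j->s", alice, loves, bob)
```

The result `sentence` is a vector of dimension `d_s`, so sentences of different grammatical shape can be compared in the same space.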

In our presentation of fHilb, each Hilbert space has a chosen classical structure (i.e. a special commutative †-Frobenius algebra) associated to it, corresponding to our choice of computational basis. This means that spiders are available as additional ingredients to our semantics:


Spiders—with cups and caps as special two-legged cases—have been used in the past to associate semantics to functional and connective words—such as “does”, “is” and “are”—or to relative pronouns—such as “which” and “that” [9, 35, 36]:


It has been argued that pregroup-based models of meaning by no means provide a complete account of linguistic phenomena. In fact, the original Lambek calculus—which pregroups later simplified—is richer and can itself instantiate a semantic model, as argued in [11]. Spiders provide one possible way of enriching such semantic models.
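The spiders used in this section have a simple matrix form in the computational basis, which the sketch below builds with NumPy. The function name `spider` and the default dimension are illustrative choices, not part of the original work. A spider with a given number of inputs and outputs sends the all-i input basis vector to the all-i output basis vector and every other basis vector to zero.

```python
import numpy as np


def spider(n_in: int, n_out: int, d: int = 2) -> np.ndarray:
    """Matrix of the computational-basis spider with n_in inputs and n_out outputs."""
    matrix = np.zeros((d ** n_out, d ** n_in))
    for i in range(d):
        # Index of the basis vector |i i ... i> written in base d.
        row = sum(i * d ** k for k in range(n_out))
        col = sum(i * d ** k for k in range(n_in))
        matrix[row, col] += 1.0
    return matrix


copy = spider(1, 2)   # |i> -> |i i>
merge = spider(2, 1)  # |i i> -> |i>, everything else -> 0
cup = spider(0, 2)    # two-legged special case with two outputs
cap = spider(2, 0)    # two-legged special case with two inputs
```

Composing `merge` after `copy` gives the identity (the "special" condition), while composing `copy` after `merge` gives the four-legged spider, an instance of spider fusion.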

3 Diagram Rewriting

On the way to quantum circuits, we need to simplify our diagrams to optimise our ultimate use of quantum resources. Specifically, we present two diagram simplification methods which aim to reduce circuit width and depth independently of the choice of word ansätze. Both methods require additional flexibility in the manipulation of diagrams and hence take place in the following symmetric version of the pregroup grammar.

Definition 3.1.

A symmetric pregroup grammar is the compact closed category obtained by introducing symmetry isomorphisms to a pregroup grammar G, keeping the same objects.

If G is a pregroup grammar, then there is a faithful monoidal functor from G to the associated symmetric pregroup grammar which is the identity on objects. Any monoidal functor from G towards a compact closed category C factors through this embedding, via a unique monoidal functor from the symmetric pregroup grammar to C. As a consequence of this observation, the introduction of a symmetric pregroup grammar provides additional degrees of freedom when it comes to diagram rewriting, without imposing any additional restrictions to the compact closed semantics traditionally considered in the DisCoCat framework.

3.1 The bigraph method

The first rewrite method we present, which we call the bigraph method, completes and improves the original Zeng-Coecke algorithm [43]. We start with the simplest scenario, described in the original algorithm: the diagram has a single open wire (e.g. it is a grammatical sentence) and the cups/caps connect words in such a way as to form an acyclic (undirected) graph. For example, we could consider the grammatical sentence from (4).

As its first step, the bigraph method turns the diagram into a bipartite graph, based on the distance from the “root” word, which is defined to be the one connected to the unique open wire. Words at even distance from the root are left in place as states, while words at odd distance from the root are transposed into effects:


This is essentially the method originally described in [43], except that the transpose in the computational basis is used to turn states into effects, instead of the dagger used in the original formulation:


One issue not originally foreseen with this approach is the introduction of wire crossings. This is a problem when it comes to implementation on NISQ devices: a swap between neighbouring qubits involves up to three entangling gates, which in turn leads to a significant increase in circuit depth and an exponential decrease in fidelity.
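The first step of the bigraph method, the split of words into states and effects by parity of distance from the root, can be sketched with a breadth-first search. The code below is an illustration under simplifying assumptions: the diagram is given as an undirected, acyclic list of word-to-word connections, and the function name `bipartition` is hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


def bipartition(edges: List[Tuple[str, str]], root: str) -> Dict[str, str]:
    """Label each word 'state' (even distance from root) or 'effect' (odd)."""
    adjacency: Dict[str, List[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    # Breadth-first search computes the distance of every word from the root.
    distance = {root: 0}
    queue = deque([root])
    while queue:
        word = queue.popleft()
        for neighbour in adjacency.get(word, []):
            if neighbour not in distance:
                distance[neighbour] = distance[word] + 1
                queue.append(neighbour)

    return {w: "state" if dist % 2 == 0 else "effect" for w, dist in distance.items()}


# Toy example: the verb carries the open sentence wire, so it is the root.
classes = bipartition([("Alice", "loves"), ("loves", "Bob")], root="loves")
```

In the toy example the verb stays a state and both nouns become effects, so the sentence is evaluated by preparing the verb and post-selecting on the nouns.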

To tackle this issue, the bigraph method attempts to minimise the number of crossings by altering the linear order of words in the two classes. (Footnote 3: Because the semantic relationships between words are now encoded in the tensor contractions, their linear ordering is no longer relevant and can be used as an additional degree of freedom in the optimisation.) For example, consider the following grammatical sentence:


After the initial transposition step, the following bipartite graph drawing is obtained:


The drawing above involves 5 crossings: if each wire is mapped to a qubit and we use a reasonably optimised implementation of swaps between non-adjacent qubits, the crossings alone would increase the circuit depth by about 8 CNOTs. However, a simple re-ordering of the words in the two classes leads to a graph drawing involving a single crossing:


The bigraph method does not prescribe a specific algorithm to use when minimising crossings: this is because the general problem of minimising crossings in the planar drawing of bipartite graphs is NP-complete [20]. (Footnote 4: It is an open question whether the bipartite graphs that arise from pregroup grammars form a sub-class which is sufficiently restricted, e.g. due to the localised range of the connections, to bring the complexity down to P.)
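Since no specific algorithm is prescribed, the sketch below only illustrates the objective being minimised. It makes a strong simplifying assumption: each word is treated as a single point, whereas in the actual diagrams a word may carry several wires. It counts crossings in a two-row drawing and, for small inputs only, searches all orderings by brute force. The function names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]  # (word in the top class, word in the bottom class)


def crossings(top: Sequence[str], bottom: Sequence[str], edges: List[Edge]) -> int:
    """Count pairs of edges that cross when both classes are drawn on a line."""
    pos_top = {w: i for i, w in enumerate(top)}
    pos_bottom = {w: i for i, w in enumerate(bottom)}
    ends = [(pos_top[a], pos_bottom[b]) for a, b in edges]
    count = 0
    for k, (a1, b1) in enumerate(ends):
        for a2, b2 in ends[k + 1:]:
            # Two edges cross exactly when their endpoints are ordered oppositely.
            if (a1 - a2) * (b1 - b2) < 0:
                count += 1
    return count


def best_orders(top: Sequence[str], bottom: Sequence[str], edges: List[Edge]):
    """Exhaustive search over both orderings; factorial cost, small inputs only."""
    return min(
        ((t, b) for t in permutations(top) for b in permutations(bottom)),
        key=lambda tb: crossings(tb[0], tb[1], edges),
    )


edges = [("a", "y"), ("b", "x")]
before = crossings(["a", "b"], ["x", "y"], edges)
top_order, bottom_order = best_orders(["a", "b"], ["x", "y"], edges)
after = crossings(top_order, bottom_order, edges)
```

In practice the exhaustive search would be replaced by a heuristic, given the NP-completeness noted above.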

The bigraph method relies on ansätze which can be easily transposed: failing that, each transposition would naively involve a doubling of the number of qubits and the preparation or measurement of nested Bell states. In Section 4 we shall see that our chosen ansätze have this property. In fact, circuit ansätze such as those used in this work often have more symmetries, which can be used to further optimise the resulting quantum circuit. For example, it is easy to transform them in such a way as to reverse the order of their outputs: this means that any combination of swaps resulting in a complete reversal of the outputs of a single word is not going to ultimately increase the depth of the quantum circuit. For example, the optimal arrangement for (9) using this additional assumption is as follows:


In a more general scenario, a pregroup grammatical parsing might: (i) involve cups/expansions as well as caps/contractions; (ii) result in a cyclic graph. The presence of cups/expansions is a non-issue when it comes to semantics in compact closed symmetric monoidal categories: thanks to the existence of symmetry isomorphisms, all cups will be cancelled out by caps in the target category. The case of cyclic graphs requires more careful handling. For example, consider the following odd parsing:


In the presence of cycles, distance from the root is no longer a well-defined notion and it may not be possible to re-arrange the diagram so as to form a bipartite graph. Given any partition of words into two linearly ordered classes (a “pseudo-bipartite” drawing, let’s call it), each edge between words of the same class can be “dragged” to the other side as if it were a word, increasing the circuit width by two wires. This can be seen in the following re-arrangement for the odd parsing above:


In this more general scenario, a cost function is required by the bigraph method to establish a trade-off between minimising the number of crossings and minimising the number of intra-class edges in a “pseudo-bipartite” drawing of the diagram.

Having handled all of the above, there is a final issue to consider: when dealing with parsings other than grammatical sentences, it is not necessary (nor necessarily desirable) that a single wire be left open. To handle this most general scenario, the bigraph method operates as in the cyclic case (i.e. it looks for a bipartite drawing optimising some trade-off between lack of crossings and lack of intra-class edges) but restricts the partitions in such a way that all words having one or more open wires are placed in the same class. This ensures that the result is always a state, as was the case so far.

(A Python implementation of the bigraph method will be available at github.com/hashberg-io/qnlp.)

3.2 The snakeremoval method

The second rewrite method we present, which we call the snakeremoval method, is based on previous results from Refs. [16] and [13]. Instead of working with the full symmetric pregroup grammar, the snakeremoval method considers its full sub-category spanned only by the atomic types and their products. This sub-category does not contain any word states which involve any adjoint types: instead, it contains the partial transposes of those states where all output wires with adjoint type have been bent into input wires. A set of generators for this sub-category can be obtained by picking one representative for each word. This procedure is called the autonomisation of diagrams [13] and some examples of its application can be found below:


The snakeremoval method prescribes the autonomisation of each word and subsequent yanking of the wires, as done in Def. 2.12 of Ref. [16]:


The end result is a “snake-free” diagram with no cups and caps, which can be interpreted in any symmetric monoidal category. For example, consider the following grammatical sentence:


The snakeremoval method would result in the following “snake-free” diagram (where we have used the classical structure ansatz for “that” from (8)):


When translating the resulting snake-free diagrams into quantum circuits, it is important to note that process-state duality requires all linear maps to be available in the autonomisation process, not only the unitary ones. The realisation of non-unitary maps requires ancillary states and post-selection: this leads to an increase in circuit width and, ceteris paribus, an exponentially higher number of samples required during computation. If post-selection is not a viable option, then the restriction to unitary maps in turn imposes significant restrictions on the states available for words. For example, adjectives cannot change the semantic distance between the nouns they modify: a “spherical cow” and a “spherical chicken” will have the same semantic distance that the unmodified “cow” and “chicken” previously had, regardless of whether they are in a vacuum or not.
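The limitation of unitary-only adjectives follows from the fact that unitaries preserve inner products. The NumPy sketch below illustrates this with random data: the noun vectors and the "adjective" unitary are arbitrary stand-ins generated for the example, not trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # arbitrary noun dimension for illustration


def random_state(dim: int) -> np.ndarray:
    """Random normalised complex vector standing in for a noun embedding."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)


# A random unitary standing in for an adjective, obtained from the QR
# decomposition of a complex Gaussian matrix.
spherical, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))

cow, chicken = random_state(d), random_state(d)

overlap_before = np.vdot(cow, chicken)
overlap_after = np.vdot(spherical @ cow, spherical @ chicken)
```

The two overlaps agree up to numerical precision, whatever unitary is chosen, so any change in similarity has to come from non-unitary maps realised with ancillas and post-selection.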

(A Python implementation of the snakeremoval method, as part of the DisCoPy toolbox for monoidal categories, is available at github.com/oxford-quantum-group/discopy. For more details see the accompanying technical paper, Ref. [18].)

4 From Diagram to Circuit

The last step in our pipeline is the association of ansätze to words, either in the form of state ansätze (for the bigraph method) or in the form of process ansätze (for the snakeremoval method). We consider two generic families of unitary qubit ansätze, the CNOT+U(3) ones and the IQP ones. Each atomic type is mapped to one or more qubits. More general ansätze are derived from the unitary ones as follows:

  • state ansätze are obtained by application of the unitary ansätze to the Pauli Z state (the zero state of the computational basis);

  • effect ansätze are obtained by transposition of the unitary ansätze and post-selection onto the Pauli Z measurement outcome corresponding to the effect (Footnote 5: Note that the word post-selection is used here to denote the linear process where no re-normalisation of probabilities is performed.);

  • more general linear map ansätze are obtained by using ancillary qubits prepared in the zero state and/or post-selecting onto the measurement outcome corresponding to the zero effect.

For the bigraph method, semantics are given by associating each word to a linear map ansatz and then constructing the following monoidal functor:

  • word states are mapped to the state ansätze described above;

  • word effects are mapped to the effect ansätze described above;

  • wire crossings are mapped to swaps, cups are mapped to preparation of a Bell state, caps are mapped to post-selection onto the measurement outcome corresponding to the same Bell state.

For the snakeremoval method, semantics are given by associating each word to a linear map ansatz and then constructing the following monoidal functor:

  • the chosen word representatives in the sub-category are mapped to the linear map ansätze;

  • wire crossings are mapped to swaps.

With the exception of functional and connection words, the ansätze are parametrised. Typically we associate a single parametric ansatz to all words with the same POS, with the specific values of the parameters distinguishing between the words.

4.1 CNOT+U(3) ansätze

This family of ansätze consists of unitary quantum circuits formed by alternating layers of single-qubit rotations in X and Z with layers of CNOT gates between neighbouring qubits. The examples below, for 1, 2 and 3 qubits respectively, are written in ZX calculus notation [10]. Single-qubit white and black dots are rotations in Pauli Z and Pauli X respectively, while a black and a white dot connected by a horizontal line is a CNOT gate:


State ansätze are obtained by applying the unitary ansätze to the zero state of the computational basis, following the convention on IBMQ devices, and effect ansätze are obtained by transposing the state ansätze in the computational basis:


The effect is post-selection (without re-normalisation) onto the Pauli Z measurement outcome corresponding to the effect. This family of ansätze transforms nicely under reversal of all inputs/outputs, as shown by the following 3-qubit example:


We have said before that functional and connection words are often modelled in the DisCoCat literature using spiders [9, 35, 36]. As a consequence, it is interesting to see how spiders can be realised as non-parametric CNOT+U(3) ansätze. Spiders with the same number of input and output legs are obtained from alternating CNOT-TONC ladders with preparation and post-selection on ancillary qubits:


Spiders with a different number of input and output legs are then obtained by application of input legs to $|+\rangle$ states or post-selection of output legs onto the Pauli X measurement outcome corresponding to the effect $\langle +|$.
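The parametrised CNOT+U(3) family described at the start of this subsection can be simulated directly with small matrices. The sketch below is one possible reading of it rather than the exact circuits of the original figures: it assumes that each rotation layer applies a Z-X-Z Euler decomposition on every qubit, that each entangling layer applies CNOTs on neighbouring pairs in sequence, and that the circuit ends with a rotation layer. All function names are hypothetical.

```python
import numpy as np


def rz(theta: float) -> np.ndarray:
    return np.array([[np.exp(-1j * theta / 2), 0], [0, np.exp(1j * theta / 2)]])


def rx(theta: float) -> np.ndarray:
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)


def kron_all(mats) -> np.ndarray:
    out = np.eye(1, dtype=complex)
    for m in mats:
        out = np.kron(out, m)
    return out


def rotation_layer(angles: np.ndarray) -> np.ndarray:
    """One Z-X-Z rotation per qubit; angles has shape (n_qubits, 3)."""
    return kron_all([rz(c) @ rx(b) @ rz(a) for a, b, c in angles])


def cnot_layer(n: int) -> np.ndarray:
    """CNOTs on neighbouring pairs (0,1), (1,2), ... applied in sequence."""
    layer = np.eye(2 ** n, dtype=complex)
    for q in range(n - 1):
        gate = kron_all([np.eye(2)] * q + [CNOT] + [np.eye(2)] * (n - q - 2))
        layer = gate @ layer
    return layer


def ansatz(params: np.ndarray) -> np.ndarray:
    """Unitary ansatz; params has shape (n_rotation_layers, n_qubits, 3)."""
    n = params.shape[1]
    u = np.eye(2 ** n, dtype=complex)
    for angles in params[:-1]:
        u = cnot_layer(n) @ rotation_layer(angles) @ u
    return rotation_layer(params[-1]) @ u


rng = np.random.default_rng(0)
params = rng.uniform(0, 2 * np.pi, size=(2, 3, 3))  # 2 rotation layers, 3 qubits
unitary = ansatz(params)
state = unitary[:, 0]          # the unitary applied to the zero state |000>
effect = state.reshape(1, -1)  # transpose in the computational basis, no conjugation
```

Note that the effect is the plain transpose of the state, not its conjugate transpose, which is the difference from the dagger-based formulation mentioned in Section 3.1.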

4.2 IQP ansätze

This family of ansätze consists of instantaneous quantum polynomial (IQP) circuits. IQP circuits consist of one or more layers, each layer consisting of a row of Hadamard gates followed by a ladder of parametrised controlled-Z rotations (the rotations commute, hence the name “instantaneous”). At the end, a final row of Hadamards is applied. Here follows a schematic representation of one such circuit:


As with the previous family, state ansätze are obtained by application to states and effect ansätze are obtained by post-selection against effects. Transposition of an IQP ansatz results in another IQP ansatz with layers and rotations in reverse order. Reversal of all inputs and outputs of an IQP ansatz results in another IQP ansatz with layers in the same order but rotations in reverse order.
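The transposition property of IQP ansätze can be verified numerically. The sketch below is a small simulation under stated assumptions: the controlled-Z rotation is modelled as a controlled phase gate diag(1, 1, 1, e^{iθ}) between neighbouring qubits, and the function names are hypothetical. Because Hadamard rows and diagonal ladders are both symmetric matrices, transposing the circuit simply reverses the order of the layers.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)


def hadamards(n: int) -> np.ndarray:
    """A row of Hadamard gates on n qubits."""
    out = np.eye(1)
    for _ in range(n):
        out = np.kron(out, H)
    return out


def phase_ladder(thetas) -> np.ndarray:
    """Diagonal ladder of controlled phases between neighbouring qubits q, q+1."""
    n = len(thetas) + 1
    diag = np.ones(2 ** n, dtype=complex)
    for basis in range(2 ** n):
        bits = [(basis >> (n - 1 - q)) & 1 for q in range(n)]
        for q, theta in enumerate(thetas):
            if bits[q] and bits[q + 1]:
                diag[basis] *= np.exp(1j * theta)
    return np.diag(diag)


def iqp(layers) -> np.ndarray:
    """IQP unitary: each layer is Hadamards then a ladder; a final Hadamard row."""
    n = len(layers[0]) + 1
    u = np.eye(2 ** n, dtype=complex)
    for thetas in layers:
        u = phase_ladder(thetas) @ hadamards(n) @ u
    return hadamards(n) @ u


layers = [[0.3, 1.1], [2.0, -0.7]]  # two layers on three qubits, arbitrary angles
unitary = iqp(layers)
transposed = iqp(layers[::-1])      # same family, layers in reverse order
```

Here `unitary.T` coincides with `transposed`, so an effect ansatz can be obtained from the same circuit family without any extra qubits.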

5 Future Work

In this work, we have described a pipeline for the implementation of NLP tasks on quantum devices, by compositional translation of lexical structures to parametrised quantum circuits. We have provided two methods—named bigraph and snakeremoval—for optimising the resulting circuits, developed with the goal of near-term implementation on NISQ devices.

These are humble first steps in uncharted territory and much work remains to be done on the practical side of things. Firstly, our optimisation methods are limited by the assumption that qubits be arranged linearly, and the job of making optimal use of each machine-specific arrangement is left to the transpiler of the specific device, such as IBM’s, or an independent compiler, such as t|ket⟩. Future work will take qubit topology into consideration when minimising the number of crossings in the bigraph method. Secondly, we intend to put a lot of work into benchmarking the optimisation algorithms, ansatz choices, training methods and various hyper-parameters, including an investigation of the relationship between corpus size, wire dimensionality and generalisation. Finally, we will explore alternative quantum computing models, such as continuous-variable, adiabatic and measurement-based.

A lot also remains to be done on the theoretical side. As an example, we note that not all linguistic phenomena are well approximated by the use of context-free grammars. Lambek himself proposed the introduction of a “meet” operation, combining two or more pregroup grammars in a non-context-free way [27]. It will be interesting to investigate the various ways in which such a model can be mapped onto quantum circuits, to accommodate our choice of distributional semantics in fHilb. Another interesting question concerns the incorporation of mixed behaviour in the semantics themselves, moving from fHilb to the operator model of quantum theory. Density matrices have already been used in the literature to model entailment and ambiguity [37, 29] and they can be practically realised with quantum circuits by incorporating measurements and controlled operations, with polynomial overhead.


KM and SG would like to acknowledge useful and interesting discussions with Vojtěch Havlíček and Antonin Delpeuch. KM is supported by a Research Fellowship by the Royal Commission for the Exhibition of 1851 (royalcommission1851.org). All diagrams were drawn with TikZiT (tikzit.github.io). KM, GDF, AT and BC would like to acknowledge financial support from Cambridge Quantum Computing Ltd. SG and NC would like to acknowledge financial support from Hashberg Ltd.