Conversation is an activity that differentiates Homo sapiens from all other species. It is through conversation that we express our consciousness of reality and our ability to think. Conversation is the foundation of human society and of human culture.
It is therefore no surprise that, when Alan Turing proposed what was later to be called the “Turing Test” to determine whether machines can emulate the human ability to think, he framed his test around the idea of a machine that was able to mimic a human interlocutor in what he calls the “imitation game”. Specifically, a human is assigned the role of interrogator and engages in separate dialogues with a computer and a second human, both of whom remain anonymous. The interrogator then has the task of discerning which of the two dialogue partners is a machine by asking a series of questions to the other two participants, who are not themselves allowed to ask questions (turing:1950). [Footnote 1: We provide more details about the setup of the test at the beginning of section 3 below.] The machine passes the test if it can “play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.” [Footnote 2: Op. cit., §6. We note that these criteria seem quite arbitrary. They amount to what, in software engineering, one would call a “hack”, rather than to a criterion of the sort that one might use in science. Indeed, as has often been pointed out, Turing’s criterion does not in fact test whether a machine has the ability to think, but rather only whether the machine has the ability to fool at least 30% of a population of average interrogators for 5 minutes.]
Turing believed that by the end of the 20th century a “learning machine” (op. cit. §7) could be built that would pass this test, and he assumed that building such a machine is only a matter of memory size, computation speed and “proper programming”. To pass the test a machine would need to display what is called Artificial General Intelligence (also referred to as “general AI”), defined as the ability of a machine to emulate the cognitive performance and general experiential understanding of its environments that humans possess (muehlhauser:2013). The implementation of general AI has been a major goal of research since the 1950s. Although its precise definition is disputed in the literature, we agree with the AI research community that a machine that could conduct a convincing dialogue with a human would possess general AI (and that a machine which cannot conduct such a dialogue does not have general AI).
Many in the AI community are convinced that it is possible to create a machine of this sort, because they share Turing’s view that building such a machine is just a matter of storage and computation. This in turn reflects a common assumption that human cognition is itself just a matter of storage and computation. Searle’s Chinese room argument (searle:1980) picks up on this point, and shows that, even if a machine could be programmed in a way that provides a perfect emulation of a human being’s overt behaviour when engaging in dialogue, it would still be problematic to conclude from this that the machine had any counterpart of human understanding or of human consciousness of meaning.
Interestingly, Turing already (at §4 of his 1950 paper) provides what, in our current context, can be interpreted as a rejoinder to this part of Searle’s argument, arguing that to follow Searle would be to commit oneself to a solipsistic position. This is because we ourselves, from the perspective of our own consciousness, do not know whether another human being has self-consciousness except to the degree that we can infer this from dialogue. If we can have a credible dialogue with a machine, therefore, then Turing holds that we should adopt the same “polite convention that everyone thinks” and apply it to the machine also. Here however Turing commits the fallacy of petitio principii, since he presupposes an equivalence between dialogue-ability (as established on the basis of his criterion) and possession of consciousness, which is precisely what his argument is setting out to prove.
1.1 AI conversation emulation and its failures
Conversation machines – which are what is needed in order to pass the Turing test – are in fact already being built. Efforts are directed mainly towards what are called dialogue systems, or in other words systems able to engage in two-party conversations, which are optimistically projected to be widely used in commercial agent-based applications in areas such as travel booking or service scheduling. However, despite major efforts – from ELIZA (weizenbaum:1966) to the computer-driven dialogue systems of the present day (including Siri and Alexa) – nothing close to dialogue emulation has thus far been achieved. [Footnote 3: See 4.3.2 below.]
This is so even though the machines we have today surpass the storage capacity and computing power that Turing estimated in 1950 would be required for his “learning machine”.
The tenacious optimism in the field is, we hold, based on the one hand on an unrealistic view of human cognitive behaviour and on the other hand on a series of impressive successes in reinforcement learning, for example in solving the game of Go (silver:2016) or achieving mastery in first-person shooter games such as Doom and Counter-Strike.
Counting against this optimism, however, is the repeated failure of attempts to build machines able to perform in a satisfactory way in dialogue with humans. This rests, we believe, on the radical complexity of human language use, something recognized by philosophers starting as early as Thomas Reid (schuhmannSmith:1990), later by Schopenhauer [Footnote 4: “It is by the help of language alone that reason accomplishes its most important achievements, – the united action of several individuals, the planned cooperation of many thousands, civilisation, the state; also science, the storing up of experience, the uniting of common properties in one concept, the communication of truth, the spread of error, thoughts and poems, dogmas and superstitions.” The World as Will and Representation, §8.], and then by Adolf Reinach (mulligan:1987).
Most important for our purposes, however, is Wittgenstein who, at Philosophical Investigations §23, asks:
“How many kinds of sentence are there? Say assertion, question, and command? – There are countless kinds: countless different kinds of use of what we call ‘symbols’, ‘words’, ‘sentences’. And this multiplicity is not something fixed, given once for all; new types of language, new language-games, as we may say, come into existence, and others become obsolete and get forgotten.”
Wittgenstein then asks his reader to review the multiplicity of language-games in the following examples, and in others:
Giving orders, and obeying them,
Describing the appearance of an object, or giving its measurements,
Constructing an object from a description (a drawing),
Reporting an event,
Speculating about an event,
Forming and testing a hypothesis,
Presenting the results of an experiment in tables and diagrams,
Making up a story; and reading it,
Making a joke; telling it,
Solving a problem in practical arithmetic,
Translating from one language into another,
Asking, thanking, cursing, greeting, praying.
Wittgenstein here reveals a huge landscape of variance in natural language utterance, and his initial probings of this landscape gave way to a significant enhancement in our understanding of how (especially spoken) language works through the contributions of philosophers such as Austin and Grice, since consolidated in the huge body of research in linguistic semantics and pragmatics conducted since the 1950s, upon which we draw extensively in what follows.
1.2 Main arguments of this paper
We take as our starting point the more than 175-year-old statement made by Ada Lovelace to the effect that the Analytical Engine [Footnote 5: This is the machine designed by Charles Babbage which, as Turing points out (op. cit.), is mathematically equivalent to a Turing machine.] “has no pretensions to originate anything. It can do [only] whatever we know how to order it to perform” (lovelace:1842). The machine will not develop consciousness or motives or intentions, and thus not learn in the way humans do, until we know how to tell it to do so, and to achieve this, we would have to create models of these human characteristics. However – and this is a point often made by Turing himself – we can only model what we can describe mathematically. [Footnote 6: This is true, as we shall see below, even for cases in which machines create new patterns, as in reinforcement learning, an AI approach founded on Markov decision process models (sutton:2018), or adversarial learning. These are notable exceptions to the proposition that the machine cannot learn anything new.] Viewed from this perspective, the idea that consciousness or motives or intentions will somehow emerge spontaneously when storage and computing power reach a certain threshold is a product of magical thinking.
1.2.1 Mathematical models of human dialogue
What, now, of the mathematical representation of human dialogues? A dialogue is a stochastic temporal process. Some processes of this sort can be modelled mathematically, using what are called stochastic models. But for this to be possible, we need to have a collection of input-output tuples of data, where the inputs are connected to the outputs probabilistically, which means that there is a certain (measurable) likelihood that a given input will be associated with a given output (landgrebeSmith:2019).
This is the basis of so-called “machine learning”, where (in the most straightforward case) the learning process starts when we have collected large bodies of input data together with output data humans create therefrom in situations of a given sort, for example when humans identify spam in their email. Here the sender, subject and text of an email serve as input while the human decision to put this email into the spam folder provides the output. The process of training the machine with these data yields a gigantically large equation that models the relationship between the input and output data that have been fed into it. This model is narrowly tied to the training data that generated it, and does not generalise. If the equation (the trained model) is applied to new data, not drawn from the same distribution as the original training data, it will compute undesired outputs. [Footnote 7: We deal with this matter at greater length in our discussion of the spam filter in (landgrebeSmith:2019).]
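The idea of input-output tuples linked by a measurable likelihood can be illustrated with a deliberately tiny sketch. It is not the spam filter discussed in (landgrebeSmith:2019); the function names `train` and `predict` and the toy data are hypothetical, and a real system would use far richer inputs and a far larger model. The sketch simply estimates, by counting, how likely each human decision is for a given input, and it has nothing to say about an input that never occurred in the training data.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Count how often each output was observed for each input."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    return counts


def predict(counts, x):
    """Return (most frequent output, estimated probability) for input x.

    Returns None when x never occurred in the training data: the model
    is tied to the data that generated it.
    """
    if x not in counts:
        return None
    label, n = counts[x].most_common(1)[0]
    return label, n / sum(counts[x].values())


# Hypothetical toy data: (keyword in subject line, human decision).
training_pairs = [
    ("lottery", "spam"), ("lottery", "spam"),
    ("lottery", "spam"), ("lottery", "inbox"),
    ("invoice", "inbox"), ("invoice", "inbox"), ("invoice", "spam"),
]

model = train(training_pairs)
print(predict(model, "lottery"))     # ('spam', 0.75)
print(predict(model, "condolence"))  # None
```

The second call shows, in miniature, the limitation described above: an input from outside the training distribution yields no useful output.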
Only if we have a sufficiently large collection of input-output tuples, in which the outputs have been appropriately tagged, can we use the data to train a machine so that it is able, given new inputs sufficiently similar to those in the training data, to predict corresponding outputs with a certain (useful) degree of accuracy.
But what ‘sufficiently large’ and ‘sufficiently similar’ mean, here, are questions of mathematics. We shall see that when these questions are raised in relation to those sorts of stochastic temporal processes which are human dialogues it becomes immediately clear that there are two insurmountable hurdles to realizing the scenario in which a machine would pass the Turing test, namely
that we could never have sufficiently large amounts of data to train it because the variance in dialogue situations is as huge as the variance in human culture and behaviour, and
that the processes in question do not meet the conditions needed for the application of any known type of mathematical model.
We shall see that nothing has changed in this respect even with all the advances made in machine learning in recent years (including reinforcement learning (see 4.2.6) and unsupervised sequence learning (see 4.3.4)), for again: the limitations on what machine learning can do are of a mathematical nature.
We thus go one step further than Searle’s Chinese room in that we challenge the very idea that a machine emulation of human conversational behaviour is possible at all.
1.2.2 There can be no general AI
This conclusion regarding human dialogue allows the following syllogism concerning the creation of general AI:
There can be no mathematical models for the type of behaviour illustrated by human dialogue.
Therefore, there can be no computer programs implementing such models.
Therefore, there can be no general AI.
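The logical form of this syllogism can be made explicit in a short Lean 4 sketch. The proposition names and the two connecting hypotheses are our own labels, introduced only for illustration: `program_needs_model` stands for the claim that a program emulating dialogue can exist only if a corresponding mathematical model exists, and `general_ai_needs_program` for the claim that general AI requires such a program. The sketch checks only that the conclusion follows from these premises; it says nothing about whether the premises are true.

```lean
-- Propositions are left abstract; the names are illustrative assumptions.
theorem no_general_ai
    (Model Program GeneralAI : Prop)
    (no_model : ¬ Model)
    (program_needs_model : Program → Model)
    (general_ai_needs_program : GeneralAI → Program) :
    ¬ GeneralAI :=
  fun h => no_model (program_needs_model (general_ai_needs_program h))
```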
This does not mean, however, that all is lost for AI in general. For we also review the current state of the art in dialogue system building, and conclude by identifying what we see as the potential for dialogue systems that would still be useful even though they fall far short of the general AI that would be needed to pass the Turing test. This essay thus complements our previous paper (landgrebeSmith:2019), where we defended a sceptical attitude to the current euphoria surrounding “deep neural networks”, while at the same time pointing to AI applications which provide significant utility in addressing specific sets of real-world problems.
2 The nature of human dialogues
People engage in dialogues in order to interact with other people to achieve certain ends (berating, guiding, informing, persuading, socializing, and many more).
A dialogue – as opposed to a lecture or a poetry reading (which are examples of one-way communication) – involves turn-taking: the participants take turns in speaking and interpreting. The drivers of the conversation are our intentions, the goals we want to achieve by means of our utterances (grice:1957; austin:1962; searle:1983). [Footnote 8: If the dialogue arises spontaneously, only the first utterer may have an intention; but the interpreter will very quickly form intentions of her own as soon as she is addressed, including the intention to refuse engagement in a dialogue.] When it is Mary’s turn to speak in a dialogue with Jack, she tries to fulfil her intentions by conveying content meaningful to Jack. Mary’s intention may be to influence Jack’s intentions through her utterances, as Mary in turn may be influenced by the utterances of Jack. In this way, a conversation may bring about changes in the intentions of its participants. This is sometimes even the explicit intent of the conversation, as when people sit down together to reach decisions or to make plans.
Utterances and interpretations are occurrences. They take place in time and (more or less) in sequence. Both involve the making of conscious and unconscious choices, which are implicit in the sense that they are accessible to the dialogue partner – and to any external human or machine observer – only indirectly, via the utterances to which they lead. Acts of making choices, too, are occurrences, ordered in time in (rough) synchrony with the associated acts of speaking and interpreting. However, all of these occurrences take place against an enduring, and typically slowly changing, background consisting not merely of the evolving linguistic competences of the interlocutors but also of their respective personalities, habits and other elements drawn from their rich personal biographies.
2.1 Habits, capabilities and intentions
These form what we shall call the ‘identity’ of a human being, by which we mean that highly complex individual pattern of dispositions, both cognitive and affective, that determines each person’s reactions to internal or external stimuli. [Footnote 9: The underlying account of dispositions is sketched in hastings:2011. This draws in turn on the ontology framework described in arp:2015.]
Your identity, in this sense, results from the combination of genotypic and environmental influences which affect your neural substrate as it develops through time. [Footnote 10: Our ‘identity’ thus comes close to what Searle calls ‘The Background’ (searle:1979), of which Searle says that it is at one and the same time (i) “derived from the entire congeries of relations which each biological-social being has to the world around itself” and (ii) purely a matter of that being’s neurophysiology (searle:1983, p. 154).]
We can distinguish three main families of dispositions through the realization of which our identity is manifested:
habits, tendencies, personality traits (to stutter, to yawn, to fret, to avoid commitment, to bear grudges, to behave pompously, to behave honestly, …)
capabilities (to speak a language, to play the piano, to manage complex activities, to do long division, to play championship tennis, to practice law, …)
intentions, goals, objectives (to pass an exam, to marry Jack, to impress Jack’s mother, to lose weight, to heal the rift with your brother, …)
Our intentions are the drivers of our behaviour. Typically, they are short-lived, as in the course of a dialogue about plans for supper, when intentions of both dialogue participants may be adjusted with each successive utterance. Our habits and capabilities are longer-lasting. They rest on more enduring patterns in the underlying neural substrate and shape which intentions we develop and how (and whether) they are realized.
In spite of all the advances in neurology in recent years, this neural substrate is still little understood. Indeed, it is not understood at all if we define ‘understanding’ as the ability to model and predict. Thus, it cannot be captured in a formal, let alone machine-processable, way, and the same applies to the array of dispositions – which we share to a greater or lesser extent with our fellow human beings – which are founded upon it. It is this array of dispositions that makes conversation (and indeed all use of language, indeed all human activity) possible.
It shapes and determines the repertoire of both the types of speech acts that we have at our disposal and also of the contents of those speech acts. At the same time, it ensures that the deployment of this repertoire is to a large extent a matter of ingrained reflex – or at least a matter over which we have only very fragmentary conscious control (smith:1987).
Realizations of our linguistic dispositions are triggered in various ways, including by the utterances of our dialogue partners. Sometimes, such realizations involve conscious choices, for instance the choice of whether to adopt a retaliatory or conciliatory tone in response to a threatening utterance, or the choice of which answer to give to a difficult (perhaps a trick) question. More often, however, selection takes place spontaneously and unconsciously. It occurs, moreover, on a number of different levels, affecting both verbal and non-verbal aspects of communication. Matters are made still more complicated by the fact that the contexts in which communicative acts take place play a decisive role in the formation of both utterances and interpretations (fetzer:2017). We shall see that there are also multiple levels of such contexts. And, to make matters worse, the range of possible choices is not static or stable (verschueren:1999, p. 59). The result, as we shall see, is that there are so many different variables involved in a communicative act that the possibilities of forming an utterance are practically infinite.
2.2 Sources of variance in human dialogue
First, the utterer can draw on multiple sets of options at many levels of language production, starting with: which language to use (for example when travelling in a foreign country); the topic to be addressed; intonation, pitch, syntax, vocabulary, volume, as well as code and style of language (brazen, cautious, elegant, pious, rough, wistful, …); and so on.
Second, the utterer can draw on a wide repertoire of non-verbal utterance accompaniments, such as gesture, facial expression, gaze and posture. These elements (documented in section 2.5) evince (or mask) intentions of the speaker, which can be argumentative, jocular, overbearing, serious, submissive, supplicative, teasing, threatening, and so forth (smith:2001, section 4).
The utterer, according to her intentions of the moment, uses combinations of elements from the levels of language production just mentioned as she adjusts (again, normally unconsciously) to the responses of the recipient in accordance with the physical (temporal, spatial), social and conversational context within which the conversation takes place. The recipient of an utterance will similarly face many options on the basis of which to attribute meaning to the utterances he hears. He can be suspicious, trusting, fully or only partially attentive, and so on. Which options are engaged on either side may of course change as the conversation unfolds, either for reasons internal to the content of the conversation itself, or because the interlocutors are influenced by external factors such as the effects of alcohol or inclement weather.
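How quickly such options multiply can be seen from a back-of-the-envelope calculation. The sketch below is purely illustrative: the option counts are hypothetical and deliberately small, and it makes the simplifying assumption that the levels can be chosen independently of one another. In actual dialogue the levels interact and the range of options is itself unstable, so the figure should be read only as a crude indication of scale.

```python
import math

# Hypothetical, deliberately conservative number of options per level.
options_per_level = {
    "language": 2,
    "topic": 50,
    "intonation": 5,
    "pitch": 3,
    "syntax": 10,
    "vocabulary": 20,
    "volume": 3,
    "style": 6,
    "gesture": 5,
    "gaze": 3,
    "posture": 3,
}

# Number of distinct ways of forming a single utterance, assuming
# (unrealistically) that the levels vary independently.
per_utterance = math.prod(options_per_level.values())


def dialogue_variants(turns):
    """Number of distinct utterance sequences over the given number of turns."""
    return per_utterance ** turns


print(per_utterance)                    # 243000000
print(dialogue_variants(10) > 10**80)   # True
```

Even with these toy numbers, a ten-turn exchange admits more variants than could ever be sampled in a training corpus.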
2.2.1 Levels of language production and interpretation
To document the resultant enormous potential for variation in dialogue interactions, we describe in detail the different levels on which the context and structure of the dialogue and the form of its dynamic interaction processes are determined.
Loosely following verschueren:1999, we distinguish five levels of language production and interpretation, namely:
context (including the identities of the interlocutors),
language economics (deixis and implicit meaning),
dialogue structure (words, sentences, gestures, …) [Footnote 11: We follow Verschueren (op. cit.) in using the term “structure” to designate what might otherwise be referred to as “content” or “material”. Utterance structure can be both verbal and non-verbal (for example when it involves use of gestures).],
When humans engage in conversation all of these levels interact. Their separate treatment here is necessary merely in order to enable systematic description; in reality they can never be properly spliced apart.
2.3 Types of dialogue context
The dialogue context is a “setting”, where this term is to be understood in a broad sense to embrace, for instance: one’s place at the dinner table, one’s place in society, a geographical place, the time of day at which a dialogue occurs, and many more (barker:1968). In each case the context is determined by an interplay between the wider environment and the identities of the parties involved, including their mental attitudes, intentions, and capabilities.
Contexts can be nested: a dialogue relating to the food on the dinner table suddenly switches its context as the diners become aware that someone is banging hard on the front door.
In addition, such contexts are marked not by sharp boundaries but by what is called a “horizon” of possibilities, for example the possibility that our dialogue partner might be lying, or intending to report our conversation to his superiors. The horizon of a spatial context, for instance, might include the possibility of leaving through the back door; the horizon of a temporal context that one’s husband may return at any moment. For each interlocutor, the dialogue context is thus in some ways analogous to the visual field of an individual subject: now more things, now fewer things fall within its compass. And the things which do fall within its compass do so in a way that encompasses a penumbra of possibilities. [Footnote 12: Compare Husserl: “The world is pregiven to us, the waking, always somehow practically interested subjects, not occasionally but always and necessarily as universal field of all actual and possible practice, as horizon.” In our natural, normal life “we move in a current of ever new experiences, judgments, valuations, decisions”, in each of which consciousness “is directed towards objects in its surrounding world” surrounded by a horizon of fluently moving potentialities (husserl:1989, pp. 142, 149).] Consider for example how facial expressions become apparent as we move closer to persons in our visual field, and how these facial expressions themselves bring to light new potentialities for greeting and embracing.
In each dialogue, each dialogue participant will have at any given stage a dialogue horizon, which results from the interaction of all his salient dialogue contexts at that stage. This dialogue horizon encompasses all possibilities that fall within his intentional scope in the widest sense, as determined not only by his linguistic competences and biography, but also by the social and cultural setting of the dialogue, by contextual factors relating to the time and space in which it occurs, and by his intentions of the moment. The way each interlocutor shifts his intentions alters his dialogue horizon, which in turn determines the way he perceives new utterance material. This then has a dynamic effect on new intentions, which further shape his interpretation and the way new speech acts are formed, new contexts for interpretation are created.
2.3.1 Social, cultural and environmental contexts
2.3.1.1 Social context
is the social setting of the conversation (hanks:1996), for example the context of a family outing, of two strangers bumping into each other on a railway platform, of a teacher berating a failing student, of a session in parliament. As the latter cases make clear, a social context may include institutional elements, and in such cases we can refer also to an institutional context. The social context exists in virtue of the fact that the participants in the conversation have formally or informally defined roles in virtue of which they are subject to certain norms. The social and institutional rewards and sanctions associated with these norms then form part of the dialogue horizon. They influence not only what the dialogue partners say (and what they do not say), but also the ways they speak and act.
2.3.1.2 Cultural context
is a special sub-type of social context. It is the setting created by those socialisation patterns which come into play where the participants in a dialogue draw on a common cultural background passed on from one generation to the next. The cultural context is thus determined by those habits, norms and values which result from similar types of upbringing, education, and so forth.
The social context of a conversation constrains in each case the space of permissible utterances. The broadest space is obtained where social peers speak in private, the narrowest when institutional or social inferiors and superiors speak in an institutional setting, for example judge and defendant in a criminal case. (We note that even here both parties will sometimes step outside the institutionally accepted norms. As in every other type of dialogue, the possibility that a participant forms the desire, for example, to shock or bamboozle his interlocutor can never be ruled out.) On the other hand, if dialogue partners do not share any cultural context or tradition, and do not know about each other’s social roles, then they will likely choose a very general communication context that is appropriate simply for an encounter between fellow humans (or between fellow human beings belonging, for example, to a given age cohort). Turing’s test resembles a context of this sort; the interrogator does not know his interlocutors. This increases the likelihood of shallow (dull) conversations.
2.3.1.3 Contextual constraints on language use
There is a variety of social contexts which constrain our dispositions and choices when producing language, and conversation participants may engage one or more of these within a single conversation. Each determines a particular variety (a “code” or “register”) of the language used in the conversation. A sociolect is an expression of the constrained dispositions and choices of those language users who share a social background resulting from a shared pattern of socialisation. Age cohorts also have sociolects, as do criminals (so-called ‘Rotwelsch’ or criminal cant). A dialect is a sociolect of those language users who share a social background that is regionally determined. A grapholect is a written language as standardized for example in a dictionary. A cognolect reflects the constraints imposed on an utterer by her intellectual abilities and education level, which may include a common professional or disciplinary socialization in, for example, architecture or rap music.
2.3.2 Spatial and temporal context
Spatial context is the site of the dialogue, formed by the physical place (the park bench, spaceship, bus, hospital, pub, bed, and so on) in which it takes place. Temporal context is the time (dusk, Christmas, tea break) in which the dialogue takes place. Both temporal and spatial context can include (at several levels) other spaces and times nested within them, for instance when a dialogue happens at one time but the interlocutors speak about other times and about their temporal order. Consider a conversation between a police officer and the various parties, including witnesses, involved in a car accident. Consider such a conversation where, among the various parties, there are some who speak different languages.
Both spatial and temporal context are determined in part by the communication channel used in the dialogue. This can be local (which means: with physical presence) or remote. It can be spoken versus written, and direct versus asynchronous, with different degrees of delay (such as chat – text message – email – letter). Skype combines verbal, visual and textual (chat) elements, and both of the latter can be enhanced in turn with emojis. Again, there are different sorts of rules and norms associated with different sorts of channel, and different channels are more or less adequate or appropriate to different sorts of communication. A text message channel may be adequate for announcing one’s arrival time; not however for expressing condolence on the occasion of someone’s death (westmyer:1998).
2.3.2.1 Environmental context
is the setting formed by that part of the world in which the conversation takes place. It is a combination of spatial and social context, and thus includes both physical and social constraints. It is made up of what Barker calls “ecological units” (barker:1968), for example the kitchen while Raymond is having breakfast, the interior of the school bus while he is travelling to school, his classroom while a lesson is taking place, the school yard during break. [Footnote 13: wright:1951 records a sequence of some thousands of settings through which one boy progresses in a single day.]
The environmental contexts of participants in a dialogue may differ, as for example when Mary is driving and Jack, sitting next to her, is communicating navigation instructions. Here the environmental contexts have in common the car interior, the road, the route ahead, the destination. Jack’s environmental context includes in addition the map he is using to navigate; Mary’s environmental context includes the entire set of driver affordances making up the car cockpit. That dialogues of this sort so often go wrong rests in part on the fact that there are different ways in which space itself is demarcated in different registers (matthiessen:2014).
Relations between environmental contexts may involve also elements of territoriality, for example when Jack seeks to engage Mary in dialogue by inserting himself into her personal space through displays of dominance or enticement. Environmental context also comprises those environments where political or military power is projected (popitz:2017), such as the layout of a prison in which an overseer can interact via intercom with the prison inmates. Here the environmental context of the overseer comprehends his fellow prison officers together with multiple prison security, video surveillance and communication systems; the environmental context of the inmate extends hardly beyond the walls of her cell.
2.3.3 Discourse context and interpretation
The dialogue is its own context at all levels of language production and interpretation. What this means is that, just as the constituents of a sentence contextualise each other, so do the successive sentences themselves. Each utterance is contextualised by its preceding utterances, and potential future utterances form part of the context horizon of each present utterance. The degree to which preceding statements influence the interpretation of the current statement is called the contextual weight of these statements. In prototypical conversations this weight decreases over time, so that the immediately preceding utterance has the strongest weight and more remote utterances have less as they fall away into the background. There are however cases where interlocutors can suddenly reach back to utterances made much earlier in the dialogue and bring them once more into the foreground.
One important family of cases of this sort results from misunderstandings. We pointed out already that the acts of choosing how to respond to a dialogue utterance are implicit. The same applies also to the interpretations of an utterance on the part of the receiver. The latter are observable only indirectly – via the utterances the interpreter produces after a role switch between the interlocutors has occurred. This implicitness of the interpretation often forces an utterer to revise a statement from further back in the conversation when he realises, on the basis of how his interlocutor is now responding, that he has been misunderstood.
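The notion of contextual weight can be made concrete with a small Python sketch. It is an illustration only: the exponential decay, the `decay` and `boost` parameters and the function names are assumptions introduced here, not part of the account given above. The sketch assigns the largest weight to the immediately preceding utterance, and then shows how reaching back to an earlier utterance (as happens when a misunderstanding is repaired) brings that utterance into the foreground again.

```python
from typing import List


def contextual_weights(n_utterances: int, decay: float = 0.5) -> List[float]:
    """Weights for the n preceding utterances, oldest first, summing to 1.

    Assumption: weight falls off exponentially with distance from the
    current utterance; `decay` (0 < decay < 1) is a hypothetical parameter.
    """
    if n_utterances <= 0:
        return []
    # The most recent utterance gets exponent 0, the one before it 1, etc.
    raw = [decay ** (n_utterances - 1 - i) for i in range(n_utterances)]
    total = sum(raw)
    return [w / total for w in raw]


def bring_to_foreground(weights: List[float], index: int,
                        boost: float = 1.0) -> List[float]:
    """Model an interlocutor reaching back to an earlier utterance.

    The weight at `index` is raised by `boost` (a hypothetical amount) and
    the weights are renormalised so that they again sum to 1.
    """
    boosted = list(weights)
    boosted[index] += boost
    total = sum(boosted)
    return [w / total for w in boosted]


weights = contextual_weights(4)
# The most recent utterance (last position) carries the strongest weight.
most_recent_is_strongest = weights[-1] == max(weights)

# After a misunderstanding, the first utterance is revisited and dominates.
repaired = bring_to_foreground(weights, index=0)
earliest_is_strongest = repaired[0] == max(repaired)
```

Any monotonically decreasing function would serve equally well here; the point is only that the ordering of weights can be overturned at any moment by the interlocutors themselves.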
2.4 Discourse economy: implicit meaning
Discourse economy is the feature of language production which requires the interpreter to take account of context for interpretation. This is because in almost all dialogues the intended meaning remains partially implicit. Such implicit meaning is generated almost always unconsciously, because parties to a dialogue automatically assume that they share sufficient general as well as context-specific knowledge to allow each of them to contextualise successfully the utterances of the other. Thus, they can still effectuate an adequate interpretation, even though not everything is said explicitly. This is of importance not least because it reflects the way in which the structure of the dialogue is influenced by interactions between the respective identities of its participants, above all by shared intentions and shared (linguistic and other) capabilities.
Verschueren gives an example of our almost universal reliance on implicitness by describing his attempt to make fully explicit the colloquial statement: “Go anywhere today?” This resulted in a text of 15 lines that still does not achieve full explicitness (op. cit., p. 26). The major reason for such reliance turns on the need for economy in use of language. The speaker will in normal circumstances want to obtain from her speech acts maximal effect in a limited time, and implicitness at the right level allows her to pass over details that would otherwise disturb the conversational flow or be boring to the interlocutor. Avoiding explicitness can also be used as a conversational tactic, for example to maintain politeness or mask deception.
To achieve a dialogue that is productive on both sides, the preponderance of implicit meaning on the side of what is communicated by the utterer needs to be understood by the interpreter in a way that is close to the utterer’s intention. In his Studies in the Way of Words, grice:1989 formulates in this connection what he calls the “Cooperative Principle”, in which he recognises not only the need for dialogue economy but also its two-sided nature. For cooperativeness, as Grice understands it, incorporates both a maxim of quantity – be as informative as you possibly can, and give as much information as is needed – and a maxim of manner – be as clear, as brief, and as orderly as you can in what you say, and avoid ambiguity. These requirements are clearly in competition with each other; if brevity is taken too far, for example, then the interlocutors will typically later require more explicitness in order to resolve potential misunderstandings.
The most important form of implicit meaning is deixis, which is the use of language elements whose reference is determined by the specific context of an utterance. Deictic expressions – such as “him”, “next week”, “there” – need to be interpreted by the receiver using contextual clues. [Footnote 14: talmy:2018 provides a survey of such cues as part of an account of how the utterer in a dialogue draws the attention of the interpreter to the particular entity that she wants to communicate about by using both speech-external and speech-internal context. He describes the vast array of strategies humans use to bring this about, given that the utterer cannot somehow reach into the hearer’s mind and directly place his focus of attention on that target.] Four important forms of deixis are: person deixis, temporal deixis, spatial deixis and discourse deixis.
Person deixis
means: references to a person, where who the person is can be inferred only if contextual information is available (meibauer:2001; sidnell:2017). The utterer knows who he himself is, and in the setting of a face-to-face communication the interpreter knows who the utterer is, and is thus able to resolve the deictic pronouns “I” and “you”. Not, however, in other settings, for instance in the case of conversations using a teleprinter (as in one form of the Turing test as described by Turing himself in 1950).
Spatial deixis
is a phenomenon arising when reference to space requires context for disambiguation (lyons:1977). It can be seen at work in the use of prepositions such as “in”, “out”, “below”; also of verbs such as “enter”, “go to”, “leave”; of adverbs such as “here”, “there”; and of demonstrative pronouns such as “these” and “those”. For example, the utterance “Let’s go downtown” when uttered in Berlin needs context to be disambiguated, since “downtown” can mean (at least) Berlin Zoologischer Garten and Berlin Mitte. Between 1961 and 1990 the term “Berlin” itself needed context for disambiguation.
Temporal deixis
is the analogous phenomenon involving reference to time (lyons:1977). To resolve the meaning of utterances like “Yesterday Trump met Kim” or “Next February I will travel to Rome”, the event time point, the time point of utterance and the reference time scale all need to be applied in disambiguation (thomsenSmith:2018).
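How the referent of a temporal deictic expression depends on the time point of utterance can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the function name, the tiny vocabulary and the chosen reading of “next February” (the first February that begins after the utterance date, resolved only to the scale of a month) are all hypothetical choices made here, and real usage admits other readings.

```python
from datetime import date, timedelta


def resolve_temporal_deixis(expression: str, utterance_date: date) -> date:
    """Map a deictic time expression to a calendar date.

    The same expression yields different dates for different utterance
    dates, which is what makes the expression deictic.
    """
    day_offsets = {"yesterday": -1, "today": 0, "tomorrow": 1}
    key = expression.strip().lower()
    if key in day_offsets:
        return utterance_date + timedelta(days=day_offsets[key])
    if key == "next february":
        # Assumed reading: the first February starting after the utterance.
        # Only the month is determined, so the first day stands in for it.
        if utterance_date.month < 2:
            year = utterance_date.year
        else:
            year = utterance_date.year + 1
        return date(year, 2, 1)
    raise ValueError(f"no rule for expression: {expression!r}")


# One expression, two utterance dates, two different referents.
first = resolve_temporal_deixis("yesterday", date(2018, 6, 13))
second = resolve_temporal_deixis("yesterday", date(2019, 2, 28))
```

Without the second argument the function cannot return anything at all, which mirrors the situation of an interpreter who lacks the time point of utterance.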
The need to keep track of temporal order inside a dialogue is illustrated by a statement such as
(1) After Paris I need to get to Abbeville before nightfall.
This involves four temporal references, one (implicit) present and three (explicit) in successive futures, as well as three spatial references: present location at time of utterance (implicit), Paris and Abbeville (explicit). We can use this example to illustrate how the context and horizon of a conversation influence each other mutually. On the one hand, if the sentence is used in a conversation between two British tourists planning a trip from Paris to Normandy, the horizon might include potential closing times on Somme battlefield memorial sites. If, on the other hand, it is used in a conversation between two Oklahoma truck drivers, then the dialogue horizon might include potential traffic holdups on Route 7 on the way from Paris, Texas to Abbeville, Louisiana.
Discourse deixis
is the usage of an utterance in a conversation to refer to this utterance itself or to previous or future parts of the conversation (levinson:1983). Examples are: “What you just said contradicts your previous statements”, or “So what does it feel like, participating in a conversation like this one?”; or again: “This conversation must stop immediately!” or “I contest the legitimacy of these entire proceedings!” While change in dialogue horizon normally takes place gradually, and without being noticed, the employment of discourse deixis brings the ongoing dynamics of horizon change into the foreground. Discourse deixis is often an element of a meta-discourse, for example when three persons leave the room and then one of the remaining interlocutors says: “That was a strange conversation.”
2.4.2 Other forms of implicit meaning
Non-deictic reference
is a way of expressing the relation to an entity using a fixed reference, as in proper names (such as “G. Pico de la Mirandola”) or definite descriptions (like “the highest mountain in Europe”) (abbott:2017). Proper names and other fixed references, too, require background (world) knowledge to be interpreted correctly.
Presupposition
is the usage of an implicit unit of meaning in a way that implies that the interpreter will have to draw on contextual knowledge to understand the intended meaning, as in the sentence “Let us meet the chancellor”, which carries the presupposition that the interlocutor knows who the chancellor is. A variant type of presupposition (as in: “Have you stopped beating your wife?”) is sometimes used as a way of tricking a dialogue partner in unfriendly interactions.
Implicature
occurs where there is a unit of meaning which the speaker does not make explicit in his utterance, but which the interpreter can deduce from this utterance. huang:2017 gives the following example: “The soup is warm” implies that the soup is neither hot nor cold. This differs from presupposition, because the implication can be resolved without background knowledge; only minimal language competence at the lexeme level is required.
2.5 Structural elements of dialogue
Next to context, another important layer contributing to dialogue variance arises from choice of language structure, including both verbal structures (utterances) and non-verbal structures (for example gestures).
Layers of structure are, from the coarse to the fine-grained: [Footnote 15: ingarden:1973, pp. 29f., identifies a similar pattern of layers in his analysis of the ontology of the literary work of art, and points out how each can contribute to the aesthetic quality of the work as a whole. He emphasises that, despite the heterogeneous character of these layers, the work nonetheless constitutes an organic unity, since the layers are unified unproblematically by the reader in virtue of the dimension of meaning which runs through them all. Something similar applies in the dialogue case, though here there are two – potentially conflicting – chains of meaning which unify the layers, one for each of the two dialogue partners.]
non-verbal level: including facial expression, gestures and body language,
whole language level: including language choice, language code and language style,
level of single dialogue contributions: sentential and suprasentential utterance units,
level of morphemes and words,
level of sound structures.
2.5.1 Non-verbal structural elements of dialogue
Facial expression, glances, gestures and body language are important ways in which uses of language are supported by non-verbal structures (verschueren:1999, p. 100f.). All of them can potentially transform the sense of a verbal utterance, so that even a statement of condolence can be accompanied by gestures and facial expressions that make it appear cynical to the interpreter. In negotiations (and negotiation-based games such as poker) body language and facial expression are indispensable to obtaining the desired results. It has been shown that their effect on the interpretation of contexts, situations and even the personality of the interlocutor is quite strong (ambady:1992). Sign language is also an important non-verbal component of dialogue, and is used very often to disambiguate spatial from person deixis, for example by means of pointing (sidnell:2017).
2.5.2 Language code and style
Code – also called “register” – is a matter of the language choices systematically made by a social group, such as the inhabitants of an area or the members of a social class or profession. Dialogue participants can switch codes, for example, to convey special meaning or emphasis, or to communicate mockery.
Style concerns the level of formality of language use (verschueren:1999); a speaker may switch, for example, to a more aggressive style, in order to intimidate or punish his dialogue partner. Both code and style are important dimensions of variance in utterance formation and interpretation.
2.5.3 Sentential and suprasentential utterances
A sentential utterance expresses a relatively closed unit of meaning encompassing the basic functions of reference and predication. Subkinds are: statement, question, imperative, request and exclamation (duerscheid:2012). Statements are characterised by features such as reference (subject in noun phrase) and predication (verb phrase). Typically, they are expressed as complete sentences, but ellipses are also used (as in “Guilty, your honour”). A suprasentential utterance is a sequence of sentential utterances which the utterer uses to optimise the fulfilment of her intentions by conveying a complex meaning. The way this is done depends on context.
Incompleteness and ellipses
Sentential and suprasentential utterances are often incomplete or elliptical. This may result from interruption or the inability of the speaker to finish his thought. But often, such utterances can be completed by elements of the situation and are not pragmatically incomplete (mulligan:1997). In such cases humans can interpret even incomplete utterances in a sense that is close to the meaning intended by the utterer.
Force and modality
Force describes utterance styles characteristic of assertion, command, request, question, and so forth. In addition, there are varying degrees of force, so that, depending on the emotional involvement and inclinations of the speaker, a request to obtain something might be phrased either as a question or as an imperative.
With frege:1879 and searle:1979 one might take the view that an expressed proposition can be evaluated independently of the force with which it is communicated. hanks:2007, however, gives strong evidence to the effect that propositional content and force interact and must be understood together. Thus, while logicians and computer scientists have sometimes held that the linguistic subdiscipline of semantics can hold itself cleanly separate from concerns with matters of pragmatics, a view of this sort cannot be maintained even for the language used in silent monologue (clark:1996). Such a view will certainly be inadequate when it comes to that sort of language whose mastery is needed to pass the Turing test.
The philosopher’s understanding of force is closely related to the linguistic notion of modality, which describes aspects of attitude – of how the utterer relates to his utterance, signalling properties such as: degree of certainty, optionality, doubt, vagueness, possibility, necessity, and so forth (verschueren:1999).
But modality as understood by linguists comprehends also other aspects of the utterer’s attitude, for example that he is joking, lying, flattering, ordering, arguing, interrogating, pleading. The verbal expression of modality is often combined with non-verbal language-supporting elements (see 2.5.1), for example when the utterer is holding a gun to the head of the interpreter, or is kneeling before the interpreter in the middle of the street while holding a ring in his hand.
Lexemes are the carriers of the minimal units of linguistic meaning – for example run or hat. The building blocks of sentences are lexemes in their inflected forms, which are called wordforms – for example runs, ran, running or hats, hat’s, behatted. For any given language there is a relatively small set of lexemes that has to cover a very wide range of possible topics. This is because it is not possible to have an exact word for each and every aspect of reality if the size of the lexicon is to be kept small enough that it can be managed by a single human being. Lexemes are therefore prototypes (rosch:1975). They obtain part of their meaning from the context created by the other lexemes they are used with in a sentence, as well as by all the other contextual dimensions identified above. For example, the lexeme freedom has a very different meaning in (2) and (3):
(2) Do not clutter my desk with stuff, I need freedom to move.
(3) We want freedom of speech!
Lexemes are organised in hierarchies of greater and lesser generality. Depending on intention and context, lexemes of varying degrees of abstraction may be chosen in the course of a single dialogue. In everyday usage it is the mid-level that dominates. For instance, when talking about pets, the participants in a dialogue will typically use mid-level terms such as “dog” or “cat” rather than the low-level “dachshund” or the high-level “animal”. Something similar holds when we describe an ailment (where we refer in a dialogue to a fracture of the foot, rather than of the fifth metatarsal bone). When we introduce ourselves in a dialogue, we (prototypically) talk about our place of origin by referring to city or region rather than to neighbourhood or street. Utterers, normally unconsciously, select the level of abstraction/generality that is salient to the dialogue context (compare 2.3.3).
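The preference for mid-level terms within a hierarchy of generality can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the miniature hierarchy, the set of terms marked as mid-level and the function names are hypothetical, and in real dialogue the salient level shifts with intention and context rather than being fixed in a table.

```python
from typing import Dict, List, Set

# Hypothetical miniature hierarchy: each lexeme points to its more general term.
MORE_GENERAL: Dict[str, str] = {
    "dachshund": "dog",
    "siamese": "cat",
    "dog": "animal",
    "cat": "animal",
}

# Assumption: these are the mid-level terms that dominate everyday usage.
MID_LEVEL: Set[str] = {"dog", "cat"}


def generality_chain(lexeme: str) -> List[str]:
    """Return the lexeme followed by ever more general terms."""
    chain = [lexeme]
    while chain[-1] in MORE_GENERAL:
        chain.append(MORE_GENERAL[chain[-1]])
    return chain


def everyday_term(lexeme: str) -> str:
    """Pick the first mid-level term at or above the lexeme, if there is one."""
    for term in generality_chain(lexeme):
        if term in MID_LEVEL:
            return term
    # No mid-level term is available (e.g. the lexeme is already high-level).
    return lexeme


chain_for_dachshund = generality_chain("dachshund")
term_for_dachshund = everyday_term("dachshund")
```

A context-sensitive variant would replace the fixed `MID_LEVEL` set with a choice that depends on the dialogue, for example preferring “dachshund” in a conversation between breeders.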
2.5.5 Sound structure
In its sound structure, speech is built out of elementary phonetic segments (vowels and consonants), which are combined into composite sounds beginning with syllables and words and proceeding to entire sentential and suprasentential utterances. We can compare the former to single notes in music, and the latter to melodic structures formed by notes in sequential combination. [Footnote 16: The elementary phonetic segments of vocal utterances have features comparable to the pitch, overtone composition, and amplitude of single notes in music.] Each entire utterance is made with a specific prosody, by which is meant that aspect of speech sound that inheres in composite sound units. Among the various dimensions of prosody, intonation and pace are the most important.
Variations in intonation, for example suddenly switching to a high-pitched voice, are used to express emotions or attitudes of the speaker, or to distinguish sentential units of different modalities (for example statements from questions), or for purposes of emphasizing or highlighting certain aspects of the dialogue, or to regulate the conversational flow. [Footnote 17: Note that the pitch variation in intonation is different from tone, another type of pitch modulation, that is used to distinguish grammatical or lexical meaning. In Mandarin, for example, lexemes are differentiated via differences in what is called syllable pitch.] Another aspect of sound structure that can influence interpretation is voice quality, such as the use of a soft or hard voice, or the use of mere vocal cues such as throat-clearing, grunts, sniffs, unintelligibly muttering under one’s breath. [Footnote 18: We focus here on sound structure as it appears in the flow of a spoken dialogue. But sound structure can play a role, too, in written dialogue, for example when our minds associate the words in an email message with a certain intonation. This is an example of the subtlety and massive complexity of language interpretation as it occurs in the dynamic flow of inner mental experience.]
Pace comprises rhythm and speed – for example speeding up, pausing or hesitating in mid-sentence – all of which can be selected, consciously or unconsciously, to shape the ways an utterance or sequence of utterances is interpreted. Pausing can also be used as a device to signal to one’s interlocutor that a conversation is reaching its end. Different types and layers of sound can be used together, for example when a dialogue partner responds to an utterance with a slow hand clap, or when Romeo serenades his sweetheart with musical accompaniment.
2.6 Dialogue dynamics
The production of meaning in the course of a dialogue is a highly dynamic process, which involves all of the levels distinguished above: the intentions of the interlocutors, the dialogue horizon generated by the interaction of relevant dialogue contexts (see 2.3), deixis and other forms of implicit meaning, and all the dialogue’s structural elements.
Both interlocutors start the dialogue with certain intentions. As the dialogue evolves, these intentions change as they dynamically drive the meanings the interlocutors want to convey to each other. Depending on how they interpret each other’s utterances, each can draw on a huge variety of interacting combinations of the structural elements involved in utterance production. Moreover, while this is happening, the dialogue horizon itself is evolving: some things move into the field of what is relevant to the dialogue, other things fall away. The contributions of each interlocutor modify the dialogue contexts that had existed when the dialogue began. In some cases the dialogue itself becomes on its own account an important (even the most important) context. When a dialogue uses its own unfolding history as context in this way, this leads to a refocussing of all earlier contexts, and this refocussing then influences how subsequent (unconscious and conscious) choices will be made in utterance formation and interpretation as the conversation proceeds.
2.6.1 Dialogue flow
In a prototypical dialogue, the interlocutors take turns. The roles of utterer and interpreter change periodically, and this switching of roles determines the role context of the dialogue participants. Either each interlocutor waits for the other to finish his utterance before replying, or other principles of turn-taking are used, for example that questions should be followed by answers (as in Turing’s vision of his own test) or that greetings expect a greeting in return (sacks:1974). [Footnote 19: Turns can also be guided using what Sacks (op. cit.) calls turn-constructional units, an important type of which are “possible completion points”, which are signals in the dialogue that indicate the opportunity for a role switch.] These principles seem to be universally valid, in the sense that they are found in all human cultures (schegloff:2017; stivers:2009).
Conversational turn-taking is displayed in its ideal form in the strings of characters printed by a teleprinter on a moving paper tape, where only one person can have control over the input mechanism at any one time. This ideal form is illustrated also by a published interview after an editor has worked to create a polished textual flow. In actually occurring spoken dialogues, however, there are of course frequent deviations from this ideal. The utterer may pause or hesitate or stutter, create false starts, make mistakes, interrupt herself or try to add retrospective corrections to what she has said earlier, suddenly change the subject of the dialogue entirely. The interpreter may seize the speaker role by forcing a role switch before the utterer has finished her statement. If the utterer does not yield to the interruption, this may lead to utterances occurring simultaneously, so that the flow of meaning transmission breaks. Sometimes, the interpreter anticipates the next statement of the utterer and takes a turn before the latter finishes. All these deviations increase the complexity of the role context and add to the pressures on the dialogue participants both in forming and in interpreting dialogue utterances. They often go hand in hand with emotional layers to the dialogue flow, which support specific sorts of interpretation of dialogue utterances, for example where one dialogue partner seeks to influence the other by (as we say) playing on his emotions.
2.7 Concluding remarks on dialogue variance
We invite the reader to note not merely the many levels of variance in dialogues – the many more or less independent dimensions of variation that have been documented by both philosophers and linguists – but also the degree to which these variations depend on multiple factors (indeed multiple levels of multiple factors) both inside and outside the dialogue itself, factors that can extend to include almost any matter within the biographies and within the scope of the knowledge and interests of the dialogue partners.
We note also the degree to which many of these factors are a matter of continuous variation in the sense that the range of options forms a continuum, as for example between speaking with a soft and a loud voice, or with a calm and an angry voice. Movements along multiple such continua may take place within a single dialogue, and when such movements are effected by one dialogue partner they will typically call forth some concordant movement on the side of her interlocutor.
In all respects, indeed, preserving the flow of a dialogue rests on the capacity of humans to adjust their contributions to fit those of the dialogue partner, for example to adjust their respective intentions. This capacity is applied even in the most heated of disputes between friends or lovers, where even the most acrimonious of dialogue partners are able to maintain a conversation flow for considerable periods of time. This is achieved through a type of homeostatic process, whereby, when the conversation seems to be going completely off the rails, one or other partner succeeds in pulling it back from the brink and initiating another phase of what is once more recognizable as coherent verbal exchange.
3 Why machines cannot conduct real dialogues
In introducing his “new form of the problem” whether machines can think (turing:1950), Turing describes a second sort of “imitation game”, one in which an interrogator (Turing’s own word) would attempt to determine, on the basis of their answers to a series of questions, which of two persons was a man, and which a woman. The woman, here, is an ally of the interrogator, and has the strategy of telling the truth about herself, for example when answering questions about hairstyle or taste in clothes. The man, on the other hand, has the strategy of fooling the interrogator so that he will make the wrong decision.
The interrogator, for his part, must, precisely, interrogate. This means: try to gain the advantage by using the sorts of techniques interrogators use in other contexts in order to detect gaps, contradictions, confabulations and denials on the part of an interrogee (walton:2003). This means potentially also: drawing on all the combinations of levels of style, modality, prosody, facial gesture and posture distinguished in the above, including pleading, chastising, banging the table, until he gets the answers he wants.
In the Turing test proper, now, the interrogees are a human being and a machine. The former has a role similar to that of the woman in the foregoing, and thus adopts the strategy of telling the truth about herself. The machine, on the other hand, is programmed to attempt to fool the interrogator.
We have already raised doubts as to whether a demonstration that a machine is able to fool an interrogator in at least 70% of a set of five-minute rounds would truly provide evidence for the claim that a machine is in fact thinking. We can now also point out that these conditions do not do justice to what is involved in a genuine interrogation. Most importantly, meeting them would not, in our view, demonstrate general AI.
A demonstration that a machine is able to fool an interrogator purely on the basis of a written exchange – for example through the vehicle of a teleprinter – would also not demonstrate general AI. For it would not come close to testing the machine’s ability to emulate the “general experiential understanding of its environments that humans possess”. [Footnote 20: See again muehlhauser:2013.] Very many of the sources of variance which we must command in order to achieve mastery of human dialogue are conveyed through what the dialogue participant sees or hears. [Footnote 21: nyiri:1989 describes how, in focusing on language use in his theory of meaning, Wittgenstein in effect reintroduces into philosophy a feeling for language as oral discourse. For Wittgenstein, the experience of spoken – rather than written – language is experience of language alive.] Accordingly, we will complement the rather weak form of the test proposed by Turing himself with what we shall call the strong Turing test, which would require:
that the machine has the capability to engage with a human interrogator in dialogues of arbitrary length (we postulate that this will in practice be of the order of 4-6 hours, which was the typical length of an interrogation segment performed by one Stasi officer in the days of the DDR);
that the turns in such a dialogue are not restricted to cases where the machine merely reacts to a human trigger, such as in a succession of question-answer-pairs; rather, the interlocutors should behave exactly as they would in a normal dialogue;
that the dialogue would be – if not face-to-face – then at least in spoken form; and [Footnote 22: A spoken dialogue of this sort would require a solution to the (hard) problem of engineering a machine with convincing voice production.]
that the machine (and thus also the human interrogee) would see the interrogator, since, if it is to pass the test, the machine has to demonstrate that it can react appropriately to the whole habitus of its human dialogue partner and not just to her speech. [Footnote 23: The interrogator should not of course see her two interlocutors, for otherwise she would be in a position to recognize the machine on the basis of how it looks.]
We believe that a machine convincing a human interrogator under these conditions would indeed indicate the realisation of general AI.
3.1 Human and machine biography
When a mentally healthy human being [Footnote 24: By this we mean a human being without a mental disorder of the sort that would seriously impair the display of personality.] engages in conversation, she is able to draw on her entire personal history and repertoire of capabilities not just of a linguistic nature, but also all the capabilities she has acquired in navigating all aspects of reality throughout her life. She is able to manifest, in other words, her identity.
Many of the conversations in which humans engage at least touch upon biography and identity. Humans want to know who it is they are dealing with; to learn about their origins, views, values, wishes, preferences, habits; and as they become more acquainted, they typically want to know and understand more and more about their respective interlocutors. Such information may, after all, be of practical relevance to them in realizing their intentions, for instance, in determining the trustworthiness of an interlocutor.
Machines, on the other hand, have neither biography nor identity. Yet if they are to pass the Turing test, they must present themselves to their interlocutors as human, and for this purpose fake biographies and identities must be manufactured. As the history of the clandestine services shows, cover stories are difficult to create, and living a cover identity is hard to carry off successfully in a way that will withstand real scrutiny. The sorts of pseudo-personae employed by covert operatives can often survive only superficial examination, and will likely break down under intense questioning.
It is of course possible to store large bodies of factual knowledge in Turing machines, including knowledge about a person’s biography and identity. A fundamental difference between the machine and the human case, however, is that a human’s life history is expressed not as the recital of a series of pieces of information but rather as realizations of that integrated complex of evolving dispositions which is the person’s identity. These dispositions exist because the human being has a certain neurophysiology which is at any given stage the combined outcome of interactions between his genotype and his experiences up to that stage. [Footnote 25: To avoid misunderstandings: We do not wish to imply that only something with our kind of DNA, neurons, and so forth, could pass the Turing test. We think that any entity with real intentions and the ability to undergo auto-modifications analogous to inherited changes of genotype could evolve to master the Turing test given enough time and environmental pressure. This could even happen within the lifetime of a single entity if it lived long enough to survive the modifications.]
Recounting one’s biography is not merely generating a list of facts. It is a part of realizing one’s identity, and is thus tied to one’s aims and goals. And it is a capability whose realization essentially involves inner mental experience. For example, when we are asked to recall aspects of our past, remembered experiences immediately become part of the context of our present experiences, where they are coloured by present feelings such as pride, regret, sadness, shame. These feelings will shape the way we will answer the interrogator’s questions about our past life, even if we try to hide our emotions. The interrogator acting as interpreter will consciously or unconsciously sense these emotions and this will again shape the way he interprets our response.
Because machines lack an inner mental life – as we do not know how to engineer such a thing – they also lack those capabilities, including capabilities relating to the emotional life, used in language production which draw on the implicit level of interpretation. The interrogator, because she is tuned to the expression of capabilities of these sorts, will soon notice this absence.
Furthermore, the life history upon which the human can draw is complete; it has no gaps. What this means is that, however hard we drill down to the details of a life, there will always be a fact of the matter that is at least in principle able to be determined, even though not all of these facts can be known directly by the subject in question (for example because he does not remember them, or because they pertain to events which took place at or before the time of his birth).
Every cover story, in contrast, like every work of fiction, is of necessity incomplete. It has what Ingarden called loci of indeterminacy, by which he means gaps left unspecified.262626Loci of indeterminacy are part of the stratum of represented objects in the structure of the literary work. Because the entities in this stratum (for example the protagonists in a novel) are defined using only a finite number of sentences, many aspects – relating to hair colour or weight or temperature or mood – must be left undetermined. See Ingarden, op. cit. This will hold, too, of the cover stories assigned to the machine. It is impossible to program a machine without such gaps, in the same way that it is impossible (and in any case not desirable) to write a novel without them. The designer of a “learning machine” would need to find a way to make it respond to challenges pertaining to such loci by allowing it to fill in gaps when challenged. The results, however, would be in some ways analogous to the confabulations produced by patients suffering from dementia who do exactly this – often displaying considerable creativity – in order to conceal their memory gaps.
If, as under the conditions of the strong Turing test, the interrogator is allowed – is indeed required – to dispense with the rather shallow sorts of question-answer sessions typically discussed in the Turing test literature, he will thereby be able to force his interrogees to bare their souls, to reveal exactly who they are, how they have realized, or failed to realize, their goals in life. A normal linguistically articulate human interrogee is easily able to pass this test. For the machine-that-is-pretending-to-be-a-human it will take more, much more, than a convincing cover story coupled with algorithms for filling gaps.
3.2 Static aspects of language production and interpretation
In providing an account of the powers that would be required of a machine purporting to emulate human dialogue behaviour, we distinguish between static and dynamic aspects of language production and interpretation.
An act of producing or interpreting a linguistic utterance is said to be static if it can be performed without the need to take temporal context into account. It is said to be dynamic if this is not the case – in other words if account needs to be taken of the switching of roles over time. We will begin with the static aspects, and show that the machine struggles even with these. We will then move on to the much more demanding dynamic aspects.
3.2.1 Static production by humans and machines
Static production is only found in the first utterance of a dialogue (the dialogue-initiating utterance); after this, dynamic utterance production takes place. In this situation, the machine faces the problem of generating a first utterance. This problem is fundamental, since it relates directly to the fact that the machine has no intentions and therefore cannot perform human-like speech acts. An initiating speech act of this sort when made by a human is always a reaction not just to some perceived external situation but also to some inner mental experience which shapes the intentions that drive the human to act. Such initiation is easy in those instances where there is a practical need – for example to ask a question of a salesperson before buying, to alleviate acute boredom, or to warn someone of an imminent danger. In many cases, however, human dialogue initiation involves overcoming a series of hurdles,272727Different sorts of hurdles are faced by different sorts of people in different contexts. In polite social circles in England, for example, a dialogue initiation step can be made only between persons who have already been introduced. Men and women face different hurdles in almost all contexts. of which the fear of rejection is only the most apparent. For this reason, humans take great care when initiating dialogue with persons with whom they are unacquainted, drawing on resources of prudence, confidence, bravado, and above all on an ability to anticipate the likely reactions to such an overture on the part of a stranger. They are required, on the basis of physical appearance alone, in what may be a very complex and rapidly changing situation involving what may be very many non-verbal clues, to make decisions regarding a major source of human behavioural failure. (Another such major source is failing correctly to interpret an overture from a stranger.)
On one Turing test scenario the interrogator will simply remain silent until the interrogee makes an opening conversational move. How is a machine to make the initial utterance in such a scenario? This could, of course, involve appeal to some routine formula such as: “Hi! My name is Hal. How do you do?” But it will be hard for a machine to vary the dialogue start in a convincing way. This is because, unlike humans, the machine has no intuitive understanding of the situation its interlocutor is in. It will indeed have some of the same non-verbal clues as the human, such as facial expression and body language. It can comment, for example, on the interrogator’s clothes. But it needs to draw out from what it can see the beginnings of a coherent narrative in the way that humans have been trained to do over the course of evolution.
3.2.2 Static interpretation by humans
What, now, of the interpretation of a single utterance, of the sort we are called upon to perform in relation to the first utterance in a dialogue? For humans, according to current understanding, this task has two steps: the first is a syntactic step, which is realized through a dynamic process of syntactical sentence parsing and construction using the structural elements constituting the sentence.282828There are several grammatical theories about how this happens, ranging from generative to constraint-based theories. mueller:2016 gives an overview.
This syntactical analysis yields the basis for the second, semantic step, which is the context-dependent assigning of meaning to the sentence (loebner:2013). Even for one sentence this process has a dynamic aspect. This is because, beginning with the very first word, the syntactic construction and semantic interpretation interact. This can require several successive cycles of revision, as an initial syntactic construction is revised as earlier parts of the sentence are re-interpreted in light of the ways they interact with parts coming later.292929auer:2009 has coined the term “on-line syntax” to describe this phenomenon.
For a single utterance, the context required for its interpretation by a human is what barker:1968 refers to as the ecological setting. This is the salient part of the environmental (physical) context in which the dialogue takes place and which will typically be centred on the person by whom the utterance is made. The absence of such a context explains why humans find it hard to speak on the phone with someone they have never met or spoken to before. The absence of a shared physical environment severely reduces the amount of context usable by the interlocutors and thereby creates a barrier to the transmission of meaning.
When interpreting the single utterance, the human has to apply contexts available to her from her own biography together with any clues she can draw from her interlocutor’s physical appearance and behaviour. Discourse economy forces her to make assumptions on this basis in her attempt to understand those aspects of meaning left implicit by a speaker, for example in order to disambiguate ambiguous aspects of his utterance, or gauge the force of turns of phrase that might in some contexts be threatening. In real conversations, humans use a massive amount of contextual non-verbal utterance-structure to achieve this. For example, when negotiating the purchase of a used car, the buyer will look for non-verbal cues indicating the reliability and honesty of the seller to make up for the information asymmetry inherent to the situation (akerlof:1970).
In addition, humans interpret static utterances by using knowledge they have derived through processing their own experiences over time, above all knowledge acquired through practical experience of the way the world around them is structured causally. From these experiences (combined with innate capabilities) they acquire an ability to reason about the relationships which link together entities in our environment into different families of predictable patterns.303030This ability is sometimes called ‘common sense’ (smith:1995). Compare also section 2.2 of (landgrebeSmith:2019). The latter are then extended also to the entities referred to in dialogue utterances, and this enables these utterances to be interpreted, for example in terms of their practical relevance to the interpreter.
3.2.2.1 Human interpretation of multi-sentence static utterances
Single utterances consisting of more than one sentence are still more challenging for humans to interpret than single-sentence utterances. This is because the sentences now contextualise each other: there are syntactic and semantic as well as explicit and implicit interdependencies which link them together. For example, sentences may be connected explicitly, via anaphora, or as chains of steps in an argument or chronological narrative, or implicitly, through analogies or historical resonances attached to certain words or phrases.
3.2.3 Static interpretation by machines
How, then, does the machine interpret the single sentence utterance? Here again two steps are involved: of syntactic construction and semantic interpretation. We deal with these in turn.
The syntactic construction using structural elements that humans perform according to the grammatical theories referred to in section 3.2.2 can be mimicked by the machine quite effectively for written text, when no non-lexematic structural language material has to be taken into account.313131By ‘lexematic material’, here, we mean those structural elements that can be directly reduced to lexemes – essentially wordforms and all their variants and composites. Machines fail, however, as soon as non-lexematic structural material such as facial expression, gestures, posture, or sound structures (2.5.5) come into play. This is because the world knowledge enabling the interpretation of this material – which can be combined in arbitrary forms to create many different sorts of contexts – cannot be learned without life experience and it cannot be mathematically formalised (compare paragraph 3.2.5).
For the interpretation of a single sentence – abstracting for now from non-lexematic material – the machine needs to reproduce the syntactic construction achieved by humans if the static interpretation pattern used by the human brain (syntactical analysis followed by semantic step) is to be reproduced.323232The mainstream machine-learning-NLP community thinks this is no longer necessary. All is supposed to be computed implicitly using “end-to-end deep neural networks”. This requires use of computational phrase structure grammar, dependency grammar or compositional grammar parsers.333333An overview is given in manning:1999. All of these create trees which represent the syntactic structure of the sentence. The parsers work well if the input sentences are syntactically valid. However, if a sentence is syntactically valid but semantically ambiguous, as in:
He saw old men and women,
an ideal computational parser will create two syntactic trees representing each sense.343434This feature is only available with compositional grammar parsers (moortgat:1997). With a sufficiently sophisticated computational setup, a context-dependent disambiguation may be possible.
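The two readings can be made concrete with a minimal sketch. It assumes a toy context-free grammar and lexicon that we have made up for this one sentence (`BINARY_RULES`, `LEXICON` and `parse` are hypothetical names); it is a plain exhaustive chart-style enumerator, not one of the compositional grammar parsers referred to above.

```python
from functools import lru_cache

# Toy grammar in binary form; an illustrative assumption, not a real grammar of English.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("Adj", "NP"): ["NP"],       # "old" modifies whatever NP follows it
    ("NP", "CoordNP"): ["NP"],   # coordination: NP + "and NP"
    ("Conj", "NP"): ["CoordNP"],
}
LEXICON = {
    "he": ["NP"], "saw": ["V"], "old": ["Adj"],
    "men": ["NP"], "women": ["NP"], "and": ["Conj"],
}

def parse(sentence, goal="S"):
    """Return every bracketed tree with root `goal` that covers the sentence."""
    words = tuple(sentence.lower().split())

    @lru_cache(maxsize=None)
    def trees(i, j, label):
        # All trees with root `label` spanning words[i:j].
        found = []
        if j - i == 1 and label in LEXICON.get(words[i], []):
            found.append(f"({label} {words[i]})")
        for k in range(i + 1, j):  # try every split point of the span
            for (left, right), parents in BINARY_RULES.items():
                if label in parents:
                    for lt in trees(i, k, left):
                        for rt in trees(k, j, right):
                            found.append(f"({label} {lt} {rt})")
        return tuple(found)

    return list(trees(0, len(words), goal))

PARSES = parse("He saw old men and women")
for tree in PARSES:
    print(tree)
```

Under this toy grammar the enumerator returns two trees: one in which "old" attaches to the coordinated phrase "men and women", and one in which it attaches to "men" alone. The syntax thus exposes the ambiguity, but nothing in the grammar selects between the readings.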
It is with the interpretation of the syntactic structure – in other words with the move from syntax to semantics – that machines struggle, and this holds even in the static single sentence utterance case. For what is the context which the machine could use to assign meaning to a single sentence? The machine cannot decide this on its own. The multitude of combinations of language elements (compare section 2.2) allow for a huge number of interpretation possibilities even at the single sentence level. The machine cannot decide, for instance, how to fill in implicit meaning generated by the frequent usage of language economics, incomplete utterances or ellipses.
To achieve this the machine would need an appropriate context and dialogue horizon. Background information would thus need once more to be given to the machine, analogous to the cover story background information needed to enable mimicry of a human dialogue-partner. If the scope of the anticipated subsequent sentences is very narrow, one can create a library of contexts and use a classifier to determine an appropriate context choice for a given input sentence. This context can be loaded and used to assign a meaning to the sentence with the help of logical inference. To achieve this, the logical language to be used needs to have the properties of completeness and compactness (boolos:2007). This means, however, that the expressiveness of both the sentence to be interpreted and the specification of contexts must be severely restricted, thus marking one more dimension along which the machine will fall short of general AI.
What can be achieved in this fashion is illustrated in the field of customer correspondence management, where there are repetitive customer concerns that can be classified and for which pre-fabricated narrow background contexts can be stored in the machine using first-order logic. Customer texts can then be understood by relating them to this knowledge base.353535This is the approach described in section 3.2 of landgrebeSmith:2019. However, it can be applied only in those special sorts of situation where the relevant contexts can be foreseen and documented in advance.
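The context-library idea can be sketched as follows. This is a minimal illustration under strong assumptions: the three contexts and their keyword sets are invented, and a simple keyword-overlap score stands in for a trained classifier; the subsequent step of logical inference over the loaded context is not modelled. All names (`CONTEXT_LIBRARY`, `classify_context`, `min_overlap`) are hypothetical.

```python
import re

# Hypothetical library of narrow, pre-fabricated contexts (invented for illustration).
CONTEXT_LIBRARY = {
    "billing_dispute": {"invoice", "charged", "refund", "bill", "payment"},
    "address_change": {"address", "moved", "relocate", "postcode"},
    "cancellation": {"cancel", "terminate", "contract", "notice"},
}

def tokenize(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify_context(sentence, min_overlap=2):
    """Pick the context sharing the most keywords with the sentence.

    Returns None when no context reaches `min_overlap` keywords, i.e. when
    the sentence falls outside the range foreseen by the library.
    """
    tokens = tokenize(sentence)
    best_label, best_score = None, 0
    for label, keywords in CONTEXT_LIBRARY.items():
        score = len(tokens & keywords)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= min_overlap else None

print(classify_context("I was charged twice, please refund my invoice."))
print(classify_context("Been in touch with the Holy Spirit, or with the Pope, lately?"))
```

The first call selects the billing context; the second returns `None`, since no stored context matches. A sentence from outside the foreseen range leaves such a system with nothing to load.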
In both the weak and the strong versions of the Turing test, however, the range of contexts and context combinations is as vast as the human imagination. The human interlocutor can speak about anything he has experienced, read about or can imagine, depending on his biography, his current mood and intentions and their interaction with the situation he is in. It is impossible to build a library of contexts that would prepare a Turing machine for this kind of variation. In nearly all situations, therefore, the machine will not have any context to load in order to assign a meaning to the sentence, let alone to disambiguate personal pronoun anaphora of the sort illustrated in a sentence such as:
Been in touch with the Holy Spirit, or with the Pope, lately? How has he been?
3.2.4 Machine interpretation of suprasentential utterances
The space of possible contexts is all the more immense when we consider multi-sentence (suprasentential) utterances. Here interpretation requires the ability to identify and interpret complicated relationships between sentences, including all the syntactic and semantic as well as explicit and implicit sentence interdependencies of the sorts identified in 3.2.2.1. In open text-understanding tasks363636Closed tasks are those in which a large proportion of the texts to be understood contain repetitive patterns, such as customer or creditor correspondence or notices of tax assessment. it is impossible to foresee the possible sentence relationships and to provide in advance knowledge of the sort that would enable the machine to interpret them adequately.
Consider, to take a toy example, the tasks the machine would face in interpreting the following sentences:
(6) The salmon caught the smelt because it was quick.
(7) But the otter caught it because it was slow.
First, to understand that the explicit anaphora ‘it’ refers to ‘salmon’ in both (6) and (7) – even though two contradictory properties (‘slow’ and ‘quick’) are attributed to it – and thus to understand the reason for the adversative ‘but’ in (7), the machine needs biological knowledge about the species involved and about their respective hunting behaviours.373737The interpretation of the second ‘it’ as referring to the smelt is perhaps still possible. Ambiguity is often simply not fully resolvable. Given such knowledge it can contextualise the two adjectives by tying them to different parts of the total situation described in the sentence pair. Already this is difficult – but infinitely many such combinations with much higher difficulty are possible (for instance, consider this very text which you, the reader, now have before you).
3.2.5 Machine interpretation of static non-lexematic material
In the strong Turing test, the machine has to interpret the entire structural material of an utterance, including the non-lexematic parts, which means: facial expressions, gestures, body language, as well as sound structures emanating from the interrogator. All of these can transform the interpretation of the lexematic structure: a sardonic grin, for example, when associated with what would otherwise be a compassionate utterance, can transform it into something sadistic; body language, pace, or intonation can make an expression of interest or a compliment sound hollow or sarcastic.
We will see that it is impossible for machines to detect such clues and to combine them with lexematic material in a way that would make it possible for them to achieve the sort of adequate interpretation that would be required to pass the strong Turing test – for the reason that the variance resulting from such combinations is huge, and each combination is a rare event which cannot be learned from training material. Furthermore, each combination of structural utterance material allows different interpretations. To make a selection from them and to create a reply based thereon that seems natural (non-stereotypical) to a human interlocutor requires an array of capabilities and intentions (an identity) which the machine lacks.
4 Modelling dialogue dynamics mathematically
In the previous section we have seen that it is very hard to make machines utter and interpret single static utterances. What happens in a real dialogue? As described in section 2.6, the evolution of a dialogue can be highly dynamic. The interlocutors switch roles as utterers and interpreters as they take turns based on signals in ordered or unordered form (cutting each other short, interrupting, speaking at the same time). While this happens, their respective dialogue horizons are in constant movement, and so are the intentions and speech acts based thereon. New utterances interact with older ones; the dialogue creates its own context.
From a mathematical perspective, a dialogue is a temporal process in which each produced utterance is drawn from an extremely high dimensional, multivariate distribution.383838A multivariate distribution is a distribution that can be modelled using the vector spaces employed in stochastics (klenke:2013). Each produced utterance can relate to the utterances that preceded it in an erratic manner. In other words: there is no way to formalise the relationship between the utterance and what preceded it.393939Falling in love at first sight is a classic example of an event relating in an erratic manner to the events that precede it. Each utterance interpretation is drawn from a distribution of similar complexity, and can relate also to the utterance that preceded it in an erratic manner.
To see the sorts of problems that can arise, consider a dialogue between Mary and Jack involving several rounds of role-switching. Mary makes an utterance at round 7, which requires Jack to take into account an utterance from round 3. Based on this, Jack associates with Mary’s utterance an experience from his own past, of which Mary knew nothing, and provides an answer relating to this experience. This utterance from Jack is for Mary quite unexpected (erratic) given her utterance in round 7. But it is perfectly coherent from the perspective of Jack, given his inner experience. Given phenomena such as this, there is no way to formalize the relationship between an interpretation of an utterance and this utterance itself.
4.1 Modelling dialogues as temporal processes
To understand the dialogue as a temporal process, four types of events need to be distinguished:
1. (static) initial utterance production, followed by
2. (static) initial utterance interpretation, followed by
3. (dynamic) dialogue-dependent responding utterance production, followed by
4. (dynamic) dialogue-dependent utterance interpretation.
These are linked via relations of dependence. In the prototypical case, pairs of events of types 3. and 4. are repeated until the dialogue concludes.
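The prototypical ordering of the four event types can be written down as a short sketch. The names `Event` and `prototypical_dialogue` are our own illustrative choices, and the sketch captures only the ordering of event types, not their content.

```python
from enum import Enum

class Event(Enum):
    """The four event types distinguished above, numbered as in the list."""
    INITIAL_PRODUCTION = 1        # static
    INITIAL_INTERPRETATION = 2    # static
    RESPONSE_PRODUCTION = 3       # dynamic, dialogue-dependent
    RESPONSE_INTERPRETATION = 4   # dynamic, dialogue-dependent

def prototypical_dialogue(n_responses):
    """Yield the event types of a strictly turn-taking dialogue."""
    yield Event.INITIAL_PRODUCTION
    yield Event.INITIAL_INTERPRETATION
    for _ in range(n_responses):
        # Pairs of types 3 and 4 repeat until the dialogue concludes.
        yield Event.RESPONSE_PRODUCTION
        yield Event.RESPONSE_INTERPRETATION

print([event.value for event in prototypical_dialogue(2)])
```

Only this skeleton is regular. What is uttered or understood at each step is left entirely unspecified by it.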
The distribution from which each of these events is drawn varies massively with the passage of time, as ever new utterances are generated and interpreted.
Both utterer and interpreter have a huge number of choices to make when generating and interpreting meaning, and these choices depend on the various dialogue contexts (including the dialogue itself) and on their respective horizons, as well as on the biographies, personalities, capabilities, intentions (and so forth) of the participants themselves. Each utterance and each interpretation is therefore erratic, which means: it cannot be modelled as depending on the set of observations made before the utterance or interpretation choice is made.
To make matters worse, this does not even take into account the fact that most human dialogues deviate from the turn-taking prototype. For example, it is not conceivable that we could create a mathematical model that would enable the computation of the appropriate interpretation of interrupted statements, or of statements made by people who are talking over each other, or of the appropriate length of a pause in a conversation, which may depend on context (remembrance dinner or cocktail party), on emotional loading of the situation, on knowledge of the other person’s social standing or dialogue history, or on what the other person is doing – perhaps looking at his phone – when the conversation pauses. Pauses are context modifiers which influence or are important ingredients of the overall dialogue interpretation. They often contain subtle non-verbal cues, for example the fiddling of the interlocutor with a small object indicating irritation or nervousness.
4.2 Mathematical models of temporal processes
Two types of explicit mathematical methods are available to model temporal processes: differential equations and stochastic process models. Given that there are no other methods available,404040That is, nothing else exists in the body of knowledge of mathematics and theoretical physics. any Turing machine able to model such processes would have to draw from these alternatives or from some combination.
4.2.1 Differential equation models
Such models can be used to provide adequate representations of the changes in related variables when their relationships follow deterministic patterns of the sort that can be observed in the physical realm (for example radioactive decay over time). By “adequate model”, we mean a scientific model that is able (in ascending order of scientific utility) to 1. describe, 2. explain or 3. predict phenomena and their relationships. Description is the minimum requirement for any scientific model, but often the other two properties are also needed to make the model useful. Differential equations can be used, for example, to make predictions regarding changes such as are involved in the distribution of heat from a source in space, which is modelled using the heat equation, a partial differential equation first developed by Fourier in 1822. But such models can only deal with a small number of variables and their interrelations. Whether the equations are adequate models of reality needs to be (and always can be) verified or falsified using physical experiments.
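The radioactive decay case illustrates what such a model delivers. The sketch below assumes the standard decay law dN/dt = -λN with an arbitrarily chosen half-life, and compares its closed-form solution with a simple forward-Euler integration; the function names and parameter values are illustrative assumptions.

```python
import math

def decay_exact(n0, decay_rate, t):
    """Closed-form solution N(t) = N0 * exp(-lambda * t) of dN/dt = -lambda * N."""
    return n0 * math.exp(-decay_rate * t)

def decay_euler(n0, decay_rate, t, steps=10_000):
    """Forward-Euler integration of dN/dt = -lambda * N from time 0 to t."""
    dt = t / steps
    n = n0
    for _ in range(steps):
        n += -decay_rate * n * dt  # one deterministic update step
    return n

HALF_LIFE = 5.0                        # illustrative value, arbitrary time units
DECAY_RATE = math.log(2) / HALF_LIFE   # lambda = ln 2 / half-life

print(decay_exact(1000.0, DECAY_RATE, HALF_LIFE))
print(decay_euler(1000.0, DECAY_RATE, HALF_LIFE))
```

One variable, one parameter and one law suffice here, and the numerical prediction agrees closely with the analytic one: after one half-life, half of the initial quantity remains.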
To pass the Turing test, adequate utterances have to be produced by the machine. Mathematically speaking, an utterance produced by the machine – no matter what algorithm is used – is a model-based prediction conditioned on the previous utterances in the dialogue and on the context (just as a Go-playing machine predicts its own next move conditioned on the opponent’s last move and the overall situation on the board). Such predictions need to be highly accurate414141Accuracy of stochastic models is measured by the percentage of predictions that match human expectations. In this case, this would mean a high degree of utterance salience. and their quality has to be maintained over the entire course of the dialogue, otherwise the human interrogator will realise that his interlocutor is non-human and the machine will fail the test.
It is a problem, therefore, that as concerns social processes – and biological phenomena in general – differential equations cannot even provide descriptions, much less explanations or predictions, of the changes involved. This follows already from the fact that the number of variables involved in such phenomena is too large and their interdependences too complex to make such modelling possible. Evidence that proposed models do not work in these sorts of contexts is provided by the fact that they are falsified by empirical observations over and over again. Where differential equations are used successfully in biology this is because the number of variables is limited, for example in modelling organism population growth under simplified laboratory conditions.
The application of differential equation-based models to the problem of dialogue production and interpretation is for the same reason impossible. There are far too many variables, and we cannot even begin to formulate equations that would describe their relationships. The reason for this is that, although all the parts of the brain function in accordance with the laws of nature, the system behaviour is deterministic-chaotic for the phenomenon of language production424242A deterministic-chaotic system obeys deterministic laws but cannot be modelled due to overcomplexity. An erratic event, on the other hand, is an event – such as a sudden movement of the oil price – which seems unrelated to the events that precede it, and for which we have neither explanation nor model. In a Laplacian view of the universe, an erratic event is just an instance of a deterministic-chaotic process step. and thus cannot be modelled using differential equations (schuster:2005). We cannot even describe it in these terms, much less form predictions.
4.2.2 Stochastic process models
These can describe the behaviour of a one- or multi-dimensional stochastic process $X$, but only if the process has additional properties that allow mathematical modelling (specifically, it must have independent and stationary increments, as further specified below).
The most expressive family of stochastic models, and thus the models that have had the widest usage in describing phenomena based on human interactions, is the Wiener process model (also referred to under the heading “Brownian motion”), which belongs to a family of models used extensively (indeed, too extensively, as we shall see) to model financial market processes such as movements in stock or derivative prices (jeanblanc:2009). Such prices are an expression of the aggregated intentions of very many market participants. The models make strong mathematical assumptions, for example that a price change process is a case of Brownian motion, or in other words that it satisfies the following conditions:
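1. it has independent r.v. increments: for any pair of time points $s \le t$ and any $u \ge 0$, $W_{t+u} - W_t \perp W_s$, where $s$ models the time before $t$,
2. it is stationary: $W_{t+u} - W_t \overset{d}{=} W_u - W_0$ for all $t, u \ge 0$, and
3. for any time point $t > 0$, $W_t \sim \mathcal{N}(0, t)$.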
Condition 1. expresses the fact that each increment of the r.v. is independent of what happened in the past; condition 2. that the unconditional444444This means: involving no reference to any particular starting value. joint probability distribution of the process does not change when shifted in time; condition 3. expresses that the r.v. is distributed according to the Gaussian distribution.
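What these conditions amount to can be checked empirically on a simulated process. The following sketch assumes the textbook construction of a Wiener process from Gaussian increments with mean 0 and variance equal to the time step; the step size, sample size, seed and helper names are illustrative choices of ours.

```python
import random
import statistics

def simulate_increments(n_steps, dt, rng):
    """Draw n_steps Wiener increments, each distributed as N(0, dt)."""
    sigma = dt ** 0.5
    return [rng.gauss(0.0, sigma) for _ in range(n_steps)]

def lag1_correlation(xs):
    """Sample correlation between each increment and its successor."""
    mean = statistics.fmean(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

RNG = random.Random(0)  # fixed seed so the run is repeatable
DT = 0.01
INCREMENTS = simulate_increments(100_000, DT, RNG)
FIRST_HALF, SECOND_HALF = INCREMENTS[:50_000], INCREMENTS[50_000:]

# Condition 1: successive increments should be (nearly) uncorrelated.
print("lag-1 correlation:", lag1_correlation(INCREMENTS))
# Condition 2: the variance should be the same early and late in the series.
print("variance ratio:", statistics.pvariance(FIRST_HALF) / statistics.pvariance(SECOND_HALF))
# Condition 3: increments over a step dt should have mean 0 and variance dt.
print("mean:", statistics.fmean(INCREMENTS), "variance / dt:", statistics.pvariance(INCREMENTS) / DT)
```

For data generated in this way the three diagnostics come out close to 0, 1, and (0, 1) respectively. The same diagnostics can in principle be applied to any observed series, which is how the applicability of the model to a given process can be assessed.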
Unfortunately, processes satisfying these conditions are nowhere to be found in actual markets. This is, again, because the preferences and intentions of human beings are erratic (in part because they depend on real world events, for example geopolitical events, which are also erratic). This is why, whenever collective decisions are off-trend, financial stochastic process models fail (mccauley:2009).
Dialogues, too, as we have seen, are multivariate processes, with the r.v. – utterances and interpretations – drawn from immense, typically unknown, and in any case not modellable multivariate distributions. Neither utterances nor interpretations are distributed according to a multivariate Gaussian distribution, since they are non-stationary and non-independent. And, to make matters even worse, interpretations are not directly observable (see paragraph 2.4).
The Brownian motion model is therefore not applicable, as none of its three conditions is satisfied.
4.2.3 Stochastic differential equation models
Differential equations can be extended to model temporal processes subjected to stochastic effects (noise), for example to model molecular dynamics. Again, however, even stochastically modified differential equations would still not be applicable to the problem of dialogue process modelling, since this would require that the assumptions needed for the applicability of both differential equations and stochastic process models would hold simultaneously for processes of language use. In fact however both of these sets of assumptions fail.
4.2.4 Deep Neural Network (dNN) models
These are a subclass of stochastic models that have in recent years been associated with considerable enthusiasm, triggered above all by:
the successes achieved since 2014 in improving automated translation through use of dNNs,
the popularisation by goodfellow:2014 of generative adversarial networks (GANs) [Footnote 45: Invented by Schmidhuber in (schmidhuber:1990).], and
the invention of reinforcement learning, which brought the capability to outperform human beings, for example in the game of Go (silver:2016).
dNNs were accordingly tested early on in the domain of process modelling. We review the potential capability of three seemingly promising dNN-methods to model human dialogues, before looking at the empirical evidence yielded by experiments in dialogue emulation.
4.2.5 Deep recurrent neural network (rNN) models
These are dNNs which contain connections between the nodes of the dNN graph in a way that allows the modelling of temporal sequences. They are often called sequence-to-sequence dNNs because they can be used to create one sequence from another (for example, a translation from an input sentence). Often long short-term memory (LSTM) (schmidhuber:1997) and its numerous extensions are used in practice. Because classical stochastic process models are not able to model multivariate processes, the ability of rNNs to model temporal processes has been investigated in recent years as a potential saving alternative (dasgupta:2016; neil:2016; lai:2018). These models have performed well on certain sorts of processes, for example road traffic occupancy, solar power production, or electricity consumption over time (lai:2018). As the latter reports, they have outperformed classical stochastic process modelling in certain tasks, especially when two processes with different patterns are overlaid in a series of observations.
We can infer from these examples several reasons why rNNs work well on such numerical time-series data:
data of these sorts approximately fulfil the assumptions needed for stochastic process modelling in general (of which dNNs, and thus rNNs, are a special case),
the data are repetitive and huge historical datasets are available for training purposes,
the dimensionality and the variance of the data are low [Footnote 46: Exchange rates, another example modelled using dNNs, form a special case. This is because outcome dimensionality and variance are here relatively low in the short term, but mid-term outcomes are erratic. Thus the models work less successfully (lai:2018).],
dNN architectures can be used to model temporal pattern overlays of the sort observed for example in traffic occupancy, which has a circadian and a workday vs. weekend rhythm.
Unfortunately, human dialogues
are not repetitive, but erratic,
do not fulfil the central assumptions presupposed by temporal process models,
are of extremely high dimensionality, and
manifest variance that is as large as the sum of the results of all human activities since the emergence of our species.
Moreover, because the interpretations involved at each stage of the dialogue are implicit (see 2.4), we can never use them as a source of training data. This will mean that there can never be training data to cover the dependencies that hold between successive utterances occurring over time, since interpretations are an essential link in the chain between one utterance and the next.
Note again, however, that all of this holds only for dialogues in general, the mastery of which is a criterion of general AI. As we shall see, for very stereotypical dialogues, for example the telephone scheduling of a haircut or the reservation of a hotel room, there could eventually be sufficient training data for a dNN-based approach to be of value.
Generative adversarial network (GAN) models
GANs work using two networks, one discriminative, the other generative (goodfellow:2014). The former is trained to discriminate classes of input data using annotated training material, often pictures tagged by human beings (for example pictures in which humans can be distinguished from other items represented). The generative network is then tasked to create new samples of one desired class (for example pictures of humans, which it can indeed do (karras:2018)). The two networks are then chained together by having the samples yielded by the generative network passed on to the discriminative network for classification. The system is then optimised to minimise the rate at which samples are generated that are not classified by the discriminative network as belonging to the desired class. This approach works very well with pictures, because the discriminative net can be pretrained with adequate training material (data tagged by humans).
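The interplay of the two networks is usually written as a minimax objective. The formula below is the standard objective from the GAN literature (goodfellow:2014), not a formula stated in this paper. The symbols $D$ (discriminative network), $G$ (generative network), $p_{\text{data}}$ (distribution of real samples) and $p_z$ (noise distribution fed to the generative network) are introduced here for illustration only.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Here $D(x)$ is the probability that the discriminative network assigns to $x$ being a genuine sample of the desired class. In this standard formulation the two networks are trained jointly, which differs in detail from the pretraining setup described above. Either way, the objective presupposes a discriminative network that can reliably separate genuine samples from generated ones.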
Again, however, GANs are not applicable to language. To build an utterance-generating GAN that creates meaningful output, one would need to pretrain a discriminative net that can distinguish meaningful from non-meaningful utterances. The problem is that, because the meaningfulness of an utterance depends on its context and interpretation, there is no conceivable way in which a sufficient body of training material could be assembled to cover the practically infinite variance of human utterances. It is therefore not possible to create a discriminative net that can be used to build a meaningful-utterance-producing GAN.
4.2.6 Models based on reinforcement learning
In reinforcement learning, a reward (score) is assigned when a certain step in a repeatable type of finite process is realized by the machine. ‘Finite’ here means that the process ends after a series of steps that is not too long, such as a game of Go or a first-person shooter game in which killing sequences are repeated. In Go, for example, a trained algorithm is used to assign a score after each action the machine performs in each game. The machine obtains one point for each stone of the opponent it captures and one point for each grid intersection of territory it occupies. The trained algorithm is optimised to maximise the total score obtained over the entire game. This is done by having the computer play the game billions of times in different situations, so that optimal paths for these situations can be found and stored in the model.
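The principle can be illustrated with a minimal sketch, assuming a toy game far simpler than Go: a walk along five positions in which reaching the last position counts as a win. The technique shown is tabular Q-learning, a textbook method chosen for brevity. It is not the method used in silver:2016, and the environment, names and parameter values are all hypothetical.

```python
import random

N_STATES = 5        # positions 0..4; position 4 is terminal (the "win")
ACTIONS = (-1, +1)  # move left or move right


def step(state, action):
    """Toy environment: returns (next_state, reward, done); the score is assigned automatically."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Play the toy game repeatedly and store the learned action values in a table."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit the current estimates, breaking ties at random.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate towards
            # reward + discounted best value of the next state.
            future = 0.0 if done else gamma * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + future - q[(state, action)])
            state = next_state
    return q


q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

The learning loop depends entirely on `step` returning a reward without human involvement, and on the same small set of situations recurring across many episodes.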
Crucial, for such optimization to be possible, is that the scores for every move can be assigned automatically by the machine. Machine learning of this sort can thus be used only in those situations in which the results of machine decisions can be scored via further machine decisions. This is primarily in games, but the method can be extended, for example, to debris cleaning scenarios where what is scored is the number of units of debris removed. In such narrowly defined situations, machines can find behavioural strategies that outperform human behaviour (jaderberg:2018). Lastly, inverse reinforcement learning (arora:2018), a technique to automatically learn an adequate reward score from observed situations, does not help in the dialogue case, because there is here no adequate set of observed situations: with very few exceptions, almost all dialogues are different, and too few patterns are repeated sufficiently many times.
Reinforcement learning cannot, therefore, be applied to the engineering of Turing-test-capable dialogue systems, for there is here nothing to which the needed sorts of scores can be assigned. (There is no winning, as we might say; or at least no winning of the sort that can be generally, and repeatedly, and consistently, and automatically scored.) We note also that the truly impressive successes of reinforcement learning do not provide evidence that general AI is about to be achieved. This is because the scope of applicability of such algorithms is narrowly limited to situations in which scoring is possible. It is also because the meta-parameters of the algorithms which compute the optimisation, the reinforcement scheme including its scores, and many other settings need to be set in each case by engineers.
4.3 Current state-of-the-art in dialogue systems: A review of what has been achieved thus far
Even given all of the above, dialogue emulation is an area of considerable activity in AI circles. The resultant dialogue systems – also called ‘agents’ (or in some circles ‘chatbots’) – are designed and built to fulfil three tasks (citing gao:2018, p. 6):
Question Answering – “the agent needs to provide concise, direct answers to user queries based on rich knowledge drawn from various data sources”
Task Completion – “the agent needs to accomplish user tasks ranging from restaurant reservation to meeting scheduling and business trip planning”
Social Chat – “the agent needs to converse seamlessly and appropriately with users – it is performance along this dimension that defines the quality of being human as measured in the Turing Test – and provide useful recommendations”.
In our view at least, the third task could only be performed by a machine with general AI. Indeed, it would be one of the purposes of general AI to perform tasks such as this. The Turing test would then be one instrument for validating performance ability.
4.3.1 Question Answering and Task Completion
Question Answering and Task Completion are areas in which dialogue systems are already of considerable commercial value, mainly because customers with relatively homogeneous cultural backgrounds can be motivated to reduce their utterance variance – for example by articulating clearly and using sentences from a pre-determined repertoire – if by interacting with a bot they can quickly obtain answers to questions or resolution of boring tasks. In the medium term, technologies of these sorts will enable systems which can satisfy a double-digit percentage of customer requests.
However, Question Answering and Task Completion clearly have nothing to do with passing the weak or the strong Turing test in a way that would be indicative of general AI. Fulfilment of each is (thus far) something that is achieved simply by using appropriately configured software tools, which every user identifies as such immediately on first engagement.
4.3.2 Social Chat
What, then, about Social Chat (also called ‘neural chitchat’) applications? Here, research is currently focused on two approaches:
reinforcement learning used to train conversational choice-patterns over time (the optimal path of machine utterances during a dialogue) [Footnote 48: Discussed in 4.2.6 above, and in loc. cit., pp. 59-61.].
Strong claims are made on behalf of such approaches, for example in zhou:2018 which describes Microsoft’s XiaoIce system – “XiaoIce” is Chinese for “little Bing” – which is said to be “the most popular social chatbot in the world”. XiaoIce was “designed as an AI companion with an emotional connection to satisfy the human need for communication, affection, and social belonging.” The paper claims that XiaoIce “dynamically recognizes human feelings and states, understands user intents, and responds to user needs throughout long conversations.” Since its release in 2014, XiaoIce has, we are told, “communicated with over 660 million users and succeeded in establishing long-term relationships with many of them.”
Like other “neural” chitchat applications, however, XiaoIce displays two major flaws, either of which will cause any interlocutor to realise immediately that they are not dealing with a human being and which will prevent any sane user from “establishing a long-term relationship” with the algorithm.
First, such applications create repetitive, generic, deflective and bland responses, such as “I don’t know” or “I’m OK”. This is because the training corpora they are parameterised from contain many such answers, and so the likelihood that such an answer might somehow fit is rated by the system as high. Several attempts have been made to improve answer quality in this respect, but the utterances obtained from the algorithms are still very poor. The reason is that the algorithms merely mimic existing input-utterance-to-output-utterance sequences without interpreting the specific (context-dependent) input utterance the system is reacting to.
Each input is treated, in fact, as if it were the input to a machine translation engine of the sort which merely reproduces sentence pairs from existing training sets. The difference is that here the training sets consist of pairs of sentences succeeding each other in one or other of the dialogues stored in a large dialogue corpus. The result is that, with the exception of a small subset of the structural elements, none of the sources of human discourse variance listed in section 2.2 are taken into account in generating output utterances. Again, no attempt is made to interpret the texts. Rather, in producing utterances, the machine simply tries to copy those utterances in the training set which immediately follow syntactically and morphologically similar input symbol sequences. This means that utterances are decoupled from context, and so responses appear ungrounded. Attempts to improve matters on this front using what are called “Grounded Conversation Models” (Gao et al., op. cit., section 5.3) – which try to include background- or context-specific knowledge – have not solved the problem. The failure to model the utterance variance sources persists.
Second, these sorts of applications create ever more incoherent utterances over time. This is first of all because they cannot keep track of the dialogue as its own context (see 2.3.3), and secondly because the datasets they are trained from are actually models of inconsistency due to the fact that they are created as mere collections of fragments drawn from large numbers of different dialogues. Attempts to alleviate the problem using “speaker” embeddings or “persona”-based response-generation models have improved the situation slightly (ghazvininejad:2017); but they do not come close to ensuring consistent conversation.
To refer to the existing social chat algorithms as “neural” is, therefore, somewhat far-fetched. The neural ability of simple species such as round worms (302 neurons) or jellyfish (5000 neurons) is much higher, for the survival of the latter through some 500 million years of evolution suggests that they have the ability to integrate situational stimuli from a range of contexts to yield adequate reactions.
Given that machines of the mentioned sorts can neither interpret utterances by taking into account the sources of variance, nor produce utterances on the basis of such interpretations together with associated (for example biographical) knowledge, the approach cannot be seen as promising when it comes to mastering the Turing test in either its weak or its strong versions.
4.3.3 Reinforcement learning in neural chitchat
The basic problem of reinforcement learning (RL) for social dialogue is that it is impossible to define a meaningful reward. XiaoIce uses CPS (conversation turns per session, (zhou:2018)), a measure that maximises the duration of a conversation. We doubt, however, that the Turing test interrogator would be impressed by dialogue behaviour generated to optimise a measure of this sort.
li:2016 used a more sophisticated reward system by training an RL-algorithm using dNN-generated synthetic utterances (because using real human utterances would be prohibitively expensive) together with a tripartite reward function rewarding
non-dull responses (using as benchmark a static list of dull phrases such as “I don’t know”)
non-identical machine utterances, and
Markov-like short-term consistency.
The results are appalling, and one wonders why this type of research is being conducted, given that – as a result of using synthetic data – it violates the basic principles of experimental design as concerns adequacy of measurement setup for observation of interest.
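To show what a reward of this tripartite kind amounts to, here is a deliberately crude sketch. It is not the reward function of li:2016: the word-overlap similarity, the phrase list, the equal weights and all names are hypothetical stand-ins chosen so that the example is self-contained.

```python
DULL_PHRASES = {"i don't know", "i'm ok"}  # hypothetical static list of dull phrases


def tokens(utterance):
    """Lower-cased whitespace tokens of an utterance, as a set."""
    return set(utterance.lower().split())


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a crude stand-in for a learned similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def reward(candidate, previous_machine_utterance, last_user_utterance,
           weights=(1.0, 1.0, 1.0)):
    """Toy tripartite reward; the three terms and the weights are illustrative assumptions."""
    w_dull, w_repeat, w_coherence = weights
    # 1. Non-dull responses: zero if the candidate is on the static list.
    non_dull = 0.0 if candidate.lower().strip(" .!?") in DULL_PHRASES else 1.0
    # 2. Non-identical machine utterances: penalise overlap with the machine's previous turn.
    non_identical = 1.0 - jaccard(candidate, previous_machine_utterance)
    # 3. Short-term consistency: overlap with the immediately preceding user turn only.
    coherence = jaccard(candidate, last_user_utterance)
    return w_dull * non_dull + w_repeat * non_identical + w_coherence * coherence


dull = reward("I don't know.", "Hello there", "Where do you live?")
informative = reward("I live in Berlin", "Hello there", "Where do you live?")
print(dull, informative)
```

Every term in this sketch is computed from surface features of the last one or two turns. Nothing in such a score refers to what the interlocutor meant.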
4.3.4 Multi-purpose dNN language models
Recently, multitask dNN language models were trained from large corpora using an unsupervised approach (radford:2018). The learning task is formulated as the ability to predict a language symbol – for example a single word – based on the symbols preceding it, with the possibility to condition the model on certain task types (mccann:2017). [Footnote 49: In the type of unsupervised learning described there, the algorithm learns models of probability distributions for symbol sequences from unlabelled input data. These models reflect symbol sequence distributions. Once created, they can be used to predict symbol sequences given conditioned input.] The model that results (dubbed “GPT-2”) is then conditioned with problem-specific input data to produce model-based predictions to solve NLP benchmarks (“zero shot predictions”). For some basic tasks amenable to sequence-modelling (including translation and text gap filling) the performance is good. For question answering, however, which is the only dialogue-related task that was tested, only 4.1% of questions [Footnote 50: Typical example: “Largest state in the US by land mass?”] were answered correctly. The limited relevance of the authors’ approach to language modelling, which views language purely as a matter of sequences, becomes apparent for example when they state that “While dialogue is an attractive approach, we worry it is overly restrictive.”
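The learning task of predicting a symbol from the symbols preceding it can be illustrated with a drastically simplified sketch. The model of radford:2018 is a large neural network conditioned on long contexts. The stand-in below merely counts which word follows which in a toy corpus, and its corpus, function names and one-word context are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for each symbol, how often each other symbol follows it (no labels needed)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        symbols = sentence.lower().split()
        for prev, nxt in zip(symbols, symbols[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent successor of `prev`, or None if `prev` was never seen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram_model(toy_corpus)
print(predict_next(model, "sat"))
print(predict_next(model, "the"))
```

The prediction is determined entirely by sequence statistics of the training corpus. This is the sense in which such an approach treats language purely as a matter of sequences.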
4.4 Problem-specific AI: Turing machines enriched by prior knowledge
Looking at the main problem of social chatbots, namely their inability to interpret utterances and to react to them with context-adequate, biography- and knowledge-grounded responses, one could indeed imagine endowing an algorithm with systematic prior knowledge of the sort required for conversations. The system presented in landgrebeSmith:2019 incorporates prior knowledge in this way, focusing on knowledge of the sort needed to complete tasks such as simple letter and email answering or repair bill validation. It uses this prior knowledge and logical inference in combination with machine learning to explicitly interpret texts on the basis of their business context and to create adequate interpretation-based responses. However, it works only because it has strong, in-built restrictions.
The range of linguistic inputs it has to deal with is very narrow (for example car glass damage repair bills or customer change of address requests), thereby avoiding the problem of complex and nested or self-referential contexts (section 2.3).
Such a system would fail in dialogues, and this would be so even if it was stuffed to the gills with (for example) biographical knowledge of the dialogue participants. This is because it could not cope with either the complex dialogue contexts or the dialogue dynamics. These phenomena aggravate the difficulties in dealing with language economy, dialogue structure and modality, because the contexts and the dynamics create a huge range of interpretation possibilities on all such levels. The resultant huge variance makes it impossible to provide the machine with sufficient knowledge to derive meaningful responses.
How, then, do humans pass the Turing test? By using language, as humans do. Language is a unique human ability that evolved over millions of years of evolutionary selection pressure. [Footnote 51: “For Wittgenstein, the main reason why primarily human beings have the ‘privilege’ of ‘having’ mental states is that ‘thinking’, ‘understanding’, ‘feeling pain’ and other psychological phenomena are inseparably tied to human life and can only be grasped against this background the ascription of psychological states is necessarily connected with the human form of life; already as children we learn to see some events as signs for human behavior as well as some specific behavior as sign for ‘inner’ states like ‘thinking’ or ‘being in pain’.” (neumaier:1986 pp. 151f.)] Using language gives us the ability to realize our intentions, for instance by generating initial utterances (engaging in dialogue as a means of expressing ideas or desires) and by dynamically interpreting an interlocutor’s utterances. This then allows us to react adequately, either with further utterances or with corresponding actions. Subconsciously or consciously, a human interlocutor is thereby able to sense the purposes of a fellow human being with whom he interacts because this is a survival-critical ability. Machines cannot use language in this sense because they lack any framework of intentions [Footnote 52: Machines lack intentions because they have nothing like human will. We could perhaps endow a machine with will and intentions if we knew how to model mathematically their underlying human neural substrates in a Turing-machine processable way. Unfortunately, we do not know what these neural substrates are nor do we know how they operate.] that could shape the way in which they interpret or generate utterances.
Ultimately, a human always notices that a machine has no intentions. An analogous lack of intentions and purpose can be experienced also when speaking to long-term schizophrenics with a consequent autistic syndrome. Their reaction patterns are immediately perceived as non-normal, because the ability to interpret and create utterances has deteriorated (bleuler:1983). Machines perform much worse than do such patients.
The AI community has so far failed to come to grips with the huge landscape of variance in dialogue utterances and with the deterministic-chaotic nature of dialogues as erratic stochastic processes. Could they take these factors into account with new system designs? We have argued that there is no way to model the human use of language – either explicitly as with classical approaches (summarised for example in russell:2014, section 22.4.1), or implicitly with deep neural networks or reinforcement learning. And again, this is for mathematical reasons: there is no known type of model which could even come close to representing human dialogue behaviour. This would still be so even if we could make available the huge quantities of data – orders of magnitude greater than the datasets used to train Google Translate, one of the world’s largest contemporary language dNNs – that such a model would need if a machine was to be trained to implement it.
Turing machines can only compute what can be modelled mathematically, and since we cannot model human dialogues mathematically, it follows that Turing machines cannot pass the Turing test. This is so in both the weak form introduced by Turing and in the strong form introduced in this paper.
Passing the strong form of the test would indeed be clear evidence of general Artificial Intelligence. But this will not happen in the short- or mid-term. For our ability to interpret and generate utterances that seem natural to a human interlocutor depends on our identity, including the intentions we form. Intentions are the ultimate source of dialogue variance, and we do not know how to teach a machine to emulate them.
For comments on an earlier version of this manuscript we would like to thank Larry Hunter, Prodromos Kolyvakis, Niels Linnemann, Emanuele Martinelli, Robert Michels, Alan Ruttenberg, and Thomas Weidhaas.