Virtual Embodiment: A Scalable Long-Term Strategy for Artificial Intelligence Research

Douwe Kiela et al.
University of Cambridge

Meaning has been called the "holy grail" of a variety of scientific disciplines, ranging from linguistics to philosophy, psychology and the neurosciences. The field of Artificial Intelligence (AI) is very much a part of that list: the development of sophisticated natural language semantics is a sine qua non for achieving a level of intelligence comparable to humans. Embodiment theories in cognitive science hold that human semantic representation depends on sensorimotor experience; the abundant evidence that human meaning representation is grounded in the perception of physical reality leads to the conclusion that meaning must depend on a fusion of multiple (perceptual) modalities. Despite this, AI research in general, and its subdisciplines such as computational linguistics and computer vision in particular, have focused primarily on tasks that involve a single modality. Here, we propose virtual embodiment as an alternative, long-term strategy for AI research that is multi-modal in nature and that allows for the kind of scalability required to develop the field coherently and incrementally, in an ethically responsible fashion.




1 Introduction

Meaning has been called the “holy grail” of a variety of scientific disciplines, ranging from linguistics to philosophy, psychology and the neurosciences [1]. The field of Artificial Intelligence (AI) is very much a part of that list: the development of sophisticated natural language semantics is a sine qua non for achieving a level of intelligence comparable to humans. Embodiment theories in cognitive science hold that human semantic representation depends on sensorimotor experience [2]; the abundant evidence that human meaning representation is grounded in the perception of physical reality leads to the conclusion that meaning must depend on a fusion of multiple (perceptual) modalities [3]. Despite this, AI research in general, and its subdisciplines such as computational linguistics and computer vision in particular, have focused primarily on tasks that involve a single modality. Here, we propose virtual embodiment as an alternative, long-term strategy for AI research that is multi-modal in nature and that allows for the kind of scalability required to develop the field coherently and incrementally, in an ethically responsible fashion.

Embodiment theory implies that the best way to acquire human-level semantics is to have machines learn through (physical) experience: if we want to teach a system the true meaning of “bumping into a wall”, we simply have to have it bump into walls repeatedly. Although this scenario shares similarities with human language acquisition, it is not (yet) a viable route: our current machine learning paradigms do not allow for the required rate of learning to make such a scenario feasible. With modern-day state-of-the-art deep learning systems requiring millions of samples to solve highly specific tasks that are trivial to humans, it is reasonable to speculate that, with current technology, it would take much longer than a human lifespan for a physically embodied agent to develop extensive linguistic capabilities. We conjecture that such limitations apply to a much lesser extent to an agent that is virtually embodied.

By virtual embodiment we mean that agents may collectively or individually acquire semantics by being embodied in a virtual, rather than a physical, world. Concretely, rather than having a physical robot learn to understand the world by physically bumping into physical walls, we would have virtual agents bump into virtual walls in a virtual world. Such virtual embodiment offers several key advantages:
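To make the idea concrete, the following toy sketch (the environment and all names are our own, purely for illustration, and not drawn from any existing system) shows a minimal virtual world in which an agent can “bump into walls” and observe the consequence:

```python
# Minimal illustrative grid world: an agent acts, observes its position,
# and can bump into walls. All names here are invented for this sketch.

class GridWorld:
    def __init__(self, width=5, height=5):
        self.width, self.height = width, height
        self.pos = (0, 0)  # agent starts in a corner

    def step(self, action):
        """Apply a move; return (observation, bumped)."""
        dx, dy = {"up": (0, 1), "down": (0, -1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x, y = self.pos[0] + dx, self.pos[1] + dy
        bumped = not (0 <= x < self.width and 0 <= y < self.height)
        if not bumped:
            self.pos = (x, y)  # the move succeeds only inside the walls
        return self.pos, bumped

world = GridWorld()
obs, bumped = world.step("left")  # walking into the boundary
print(obs, bumped)                # (0, 0) True
```

Virtual embodiment amounts to scaling such worlds up, in richness of state, observation and interaction, as agents grow more capable.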

  1. Scalability and incremental development: The complexity of virtual worlds can develop in conjunction with, i.e., scale with, the capabilities of artificial agents. This allows for a stepwise development towards general machine intelligence, rather than aiming for the end-goal without a concrete understanding of the challenges or consequences we will face when attempting to reach it.

  2. Long-term feasibility: The performance ceiling of any agent is a function of the complexity of the virtual environment. Virtual worlds may initially not be overly complex, but they can grow in complexity as technology develops. This allows for a focused long-term research strategy that is feasible now, but will remain challenging in years and decades to come.

  3. Rapid iteration: The fact that artificial agents are constrained only by parameters we set means that development can happen rapidly and iteratively, through agents learning from interacting both with humans and with each other. Rather than the extremes of either having one system solve small uni-modal tasks, or trying to solve the whole problem in a single attempt, we can improve iteratively, in an agile fashion, at great speed.

  4. No requirement for continuous human involvement: Although interaction is necessary for embodied learning, virtual interactions need not require human involvement at each step, but may rather happen between agents themselves. This unburdens humans by foregoing the need for a constant supervised signal, as is currently often seen in machine learning applications, which also facilitates rapid development.

  5. Ethical testability: Importantly, since artificial agents are exposed to a constrained environment, virtual worlds provide the ultimate testing ground for carefully fleshing out important ethical considerations in relation to artificial intelligence [4], without any potentially damaging immediate consequences in the physical world.

For these reasons, we propose virtual embodiment as one of the best and most feasible strategies for instigating a stepwise development towards artificial general intelligence. In particular, we advocate the development and use of “video games with a purpose” to facilitate virtual embodiment. In what follows, we briefly outline some of the background that led to this proposal, explain why video games are suitable for the current purposes and list the desiderata for virtual embodiment-compatible video games to facilitate research in artificial intelligence.

2 Grounding Semantics in Virtual Perception

A fundamental problem of semantics is the grounding problem [5], which concerns the circularity in defining the meaning of a symbol through other symbols. In the context of Searle’s famous Chinese Room argument [6], it can be phrased as: is it possible to learn Chinese from nothing but a (very sophisticated) Chinese dictionary? Modern representation learning approaches, including the word embeddings that have become popular in natural language processing, are exponents of the distributional hypothesis [7], which stipulates that “you shall know a word by the company it keeps” [8]. In other words, semantic representation learning defines symbols through other symbols, which exposes it to the grounding problem. In contrast, there is abundant evidence that human meaning representation is grounded in physical reality and sensorimotor experience [9, 10, 11, 12, 13].
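A toy sketch makes the circularity concrete (the corpus and window size are invented for illustration): a distributional model represents each word purely by counts of the words that occur near it, so every symbol is defined only through other symbols.

```python
from collections import Counter

# Toy distributional semantics: represent each word by the counts of
# its context words within a fixed window. Corpus is illustrative.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
window = 2

vectors = {w: Counter() for w in set(corpus)}
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            vectors[w][corpus[j]] += 1

# "cat" and "dog" end up with overlapping contexts ("the", "sat", "on"),
# so a distributional model judges them similar -- without either symbol
# ever being grounded in perception.
print(vectors["cat"] & vectors["dog"])  # shared context counts
```

The similarity between “cat” and “dog” here is real, but it rests entirely on company kept with other ungrounded symbols, which is exactly the circularity the grounding problem points at.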

Motivated by these theoretical considerations, the field of multi-modal semantics aims to ground semantic representations by introducing extra-linguistic, perceptual input into semantic models. Multi-modal semantic models lead to practical improvements in a variety of natural language processing tasks, ranging from resolving linguistic ambiguity [14] to metaphor detection [15]. Beyond vision, there has also been work aimed towards auditory [16, 17] and even olfactory [18] grounding. However, most current multi-modal semantic models suffer from two important limitations. First, images and, to a lesser extent, sound files lack the element of time, whereas temporal and sequential input are central aspects of language understanding. Second, these approaches lack any interaction, which plays an important role in language acquisition: children learn basic language understanding by interacting with the environment, and build more intricate “reflective reasoning” on top of that foundation [19].
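One common fusion scheme in this line of work is weighted concatenation of L2-normalised modality-specific vectors; the sketch below illustrates it (the particular vectors, dimensions and weighting are invented for the example):

```python
import numpy as np

def fuse(text_vec, image_vec, alpha=0.5):
    """Fuse two modalities by weighted concatenation of L2-normalised
    modality-specific vectors; alpha weights the linguistic modality."""
    t = text_vec / np.linalg.norm(text_vec)
    v = image_vec / np.linalg.norm(image_vec)
    return np.concatenate([alpha * t, (1 - alpha) * v])

# Illustrative vectors -- in practice these would come from a
# distributional text model and a vision model respectively.
text = np.array([0.2, 0.9, 0.1])
image = np.array([0.7, 0.3, 0.5, 0.4])
grounded = fuse(text, image)
print(grounded.shape)  # (7,)
```

Such a fused representation inherits the static nature of its inputs: it captures no time and no interaction, which is precisely the pair of limitations noted above.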

There has been work in linguistic grounding that allows for temporal aspects, for instance in videos [20, 21, 22], and for both time and interaction, notably in the field of robotics [23, 24, 25]. However, robotics does not currently constitute a suitable platform for language learning, since physical embodiment is not yet feasible. Virtual embodiment does not suffer from the same limitations. There has been recent work on grounding in virtual worlds, notably in video games [26], and work applying deep reinforcement learning to video games points the way towards agents learning from each other [27, 28, 29]. An alternative would be virtual or augmented reality, which offers the benefit of joint multi-modal data over time, but crucially lacks the element of interaction.
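As a deliberately tiny stand-in for the deep reinforcement learning systems cited above, the following sketch (the corridor environment and hyperparameters are our own, for illustration only) shows an agent learning from interaction alone that bumping into walls is to be avoided:

```python
import random
random.seed(0)

# Off-policy tabular Q-learning on a 1-D corridor of 5 cells: a reward
# for reaching the right end, a penalty for bumping into either wall.
N, ACTIONS = 5, (-1, +1)
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
alpha, gamma = 0.5, 0.9

for _ in range(2000):
    s = 0
    for _ in range(30):
        a = random.choice(ACTIONS)      # behave randomly (explore)...
        s2 = min(max(s + a, 0), N - 1)  # walls clip the move
        r = 1.0 if s2 == N - 1 else (-1.0 if s2 == s else 0.0)
        # ...but learn the value of acting greedily (off-policy update)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                              - Q[(s, a)])
        s = s2
        if s == N - 1:
            break

# The learned greedy policy moves right from every non-terminal cell.
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N - 1)]
print(policy)  # [1, 1, 1, 1]
```

The deep variants replace the table with a neural network and the corridor with a rich game world, but the learning-from-interaction loop is the same.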

Our position is very much aligned with recent proposals for new directions in AI research [30, 31, 32, 33]. The particular problem of language features in these proposals to a varying extent, but we take it to be a core piece of any path toward artificial general intelligence, in line with recent attempts to make machines genuinely understand human language [34, 35]. We specifically advocate multi-agent video games “with a purpose” [36], rather than alternative virtual worlds that lack gamification, since they provide interesting platforms for humans to engage with for extended periods of time, without the explicit purpose of teaching machines to achieve a certain task.

3 Desiderata

It is worthwhile outlining the properties that video games might have if they are to be suitable platforms for developing AI through virtual embodiment. For that purpose, we propose a hierarchy (inspired by the Kardashev scale for the sophistication of civilizations [37]) of the types of embodied manifestations an agent might have in a world. The same type hierarchy applies both to physical and to virtual worlds:

  • Type 0: Agents perform basic first-order interactions with the world, with full or limited access to the objective world state. No inter-agent communication is required.

  • Type 1: As above, but without any state access. Communication may be used for sharing knowledge about the state of the world.

  • Type 2: As above, but with higher-order interactions, i.e., with an element of planning, strategy and non-monotonic reasoning. Communication is essential for sharing knowledge about the world.

  • Type 3: The world should be strictly non-deterministic and multi-modal. This makes communication essential not only for sharing knowledge about the world, but also for sharing plans and strategies.

  • Type 4: Agents should be multi-objective, that is, an agent’s objective or reward function should be a weighted function of various objectives or rewards that depend both on the state of the world and on current plans and strategy.

  • Type 5: Multi-objective agents interact with and communicate about a non-deterministic world in a way that allows them to plan ahead and to form and execute sophisticated strategies.

The final type of embodiment corresponds to what biological agents are capable of in the physical world. It is much too large a leap for current technology, but the benefit of virtual embodiment is that we can grow the complexity of the world together with the sophistication of artificial agents, which makes virtual embodiment a suitable candidate for AI’s next frontier. The real world is enormously complex, and performing common-sense reasoning in such a complicated environment has long been one of AI’s classic problems, in the shape of the frame problem [38]. The frame problem is a function of the world’s complexity, and so becomes more manageable in a virtual world whose complexity we control.

Most recent work has not extended beyond Type 1 embodiment, which means that the field has a long way to go. Specifically in the context of video games, we believe that development can proceed more rapidly if games are of mixed agency, meaning that both humans and artificial systems control agents in the virtual world, and if they are carefully designed as a level playing field with a human bias, such that human agents have a slight upper hand: the superior memory of machines, for example, should not affect in-game performance, so that machines can learn from humans. To our knowledge, no video game currently exists that satisfies these properties and facilitates Type 5 embodiment.

4 Conclusion

We propose virtual embodiment, through video games, as a scalable long-term strategy for artificial intelligence research. Embodiment is essential for developing human-level natural language semantics, which we take to be a core aspect of artificial intelligence. Virtual embodiment allows for growing the complexity of virtual worlds in line with the sophistication of artificial agents, which makes it a suitable testing ground for artificial intelligence, in an ethically responsible manner.


This research was enabled by the European Research Council PoC grant GroundForce.


  • [1] R. Jackendoff. Foundations of Language. Oxford University Press, Oxford, 2002.
  • [2] Lawrence W. Barsalou. Grounded cognition. Annual Review of Psychology, 59(1):617–645, 2008.
  • [3] Lotte Meteyard and Gabriella Vigliocco. The role of sensory and motor information in semantic representation: A review. Handbook of cognitive science: An embodied approach, pages 293–312, 2008.
  • [4] Nick Bostrom. Ethical issues in advanced artificial intelligence. Science Fiction and Philosophy: From Time Travel to Superintelligence, pages 277–284, 2003.
  • [5] Stevan Harnad. The symbol grounding problem. Physica D, 42:335–346, 1990.
  • [6] John R. Searle. Minds, brains and programs. Behavioral and Brain Sciences, 3(3):417–457, 1980.
  • [7] Z. Harris. Distributional structure. Word, 10(2–3):146–162, 1954.
  • [8] John R. Firth. A synopsis of linguistic theory. Blackwell, 1957.
  • [9] Susan S Jones, Linda B Smith, and Barbara Landau. Object properties and knowledge in early lexical learning. Child development, 62(3):499–516, 1991.
  • [10] Lawrence W Barsalou. Perceptions of perceptual symbols. Behavioral and brain sciences, 22(04):637–660, 1999.
  • [11] Arthur M Glenberg and Michael P Kaschak. Grounding language in action. Psychonomic bulletin & review, 9(3):558–565, 2002.
  • [12] Max M. Louwerse. Symbol interdependency in symbolic and embodied cognition. Topics in Cognitive Science, 3(2):273–302, 2011.
  • [13] Lawrence W. Barsalou and Katja Wiemer-Hastings. Situating abstract concepts. In Grounding cognition: The role of perception and action in memory, language, and thought, pages 129–163. Cambridge UP, 2005.
  • [14] Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. Do You See What I Mean? Visual Resolution of Linguistic Ambiguities. In Proceedings of EMNLP 2015, Lisbon, Portugal, 2015.
  • [15] Ekaterina Shutova, Douwe Kiela, and Jean Maillard. Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of NAACL-HLT 2016, San Diego, 2016.
  • [16] A. Lopopolo and E. van Miltenburg. Sound-based distributional models. In Proceedings of the 11th International Conference on Computational Semantics (IWCS 2015), 2015.
  • [17] Douwe Kiela and Stephen Clark. Multi- and cross-modal semantics beyond vision: Grounding in auditory perception. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2461–2470, Lisbon, Portugal, September 2015.
  • [18] Douwe Kiela, Luana Bulat, and Stephen Clark. Grounding semantics in olfactory perception. In Proceedings of ACL, pages 231–236, Beijing, China, July 2015.
  • [19] Barbara Landau, Linda Smith, and Susan Jones. Object perception and object naming in early development. Trends in cognitive sciences, 2(1):19–24, 1998.
  • [20] Abhinav Gupta, Praveen Srinivasan, Jianbo Shi, and Larry S. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2012–2019. IEEE, 2009.
  • [21] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36, 2013.
  • [22] Haonan Yu, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. A compositional framework for grounding language inference, generation, and acquisition in video. Journal of Artificial Intelligence Research, 52(1):601–713, 2015.
  • [23] Paul Fitzpatrick and Giorgio Metta. Grounding vision through experimental manipulation. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 361(1811):2165–2185, 2003.
  • [24] Silvia Coradeschi, Amy Loutfi, and Britta Wrede. A short review of symbol grounding in robotic and intelligent systems. KI-Künstliche Intelligenz, 27(2):129–136, 2013.
  • [25] Yonatan Bisk, Deniz Yuret, and Daniel Marcu. Natural language communication with robots. In Proceedings of NAACL, San Diego, CA, 2016.
  • [26] Karthik Narasimhan, Tejas D. Kulkarni, and Regina Barzilay. Language understanding for text-based games using deep reinforcement learning. In Proceedings of EMNLP, 2015.
  • [27] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [28] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [29] Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. Towards multi-agent communication-based language learning. CoRR, abs/1605.07133, 2016.
  • [30] Tomas Mikolov, Armand Joulin, and Marco Baroni. A roadmap towards machine intelligence. arXiv preprint arXiv:1511.08130, 2015.
  • [31] Sainbayar Sukhbaatar, Arthur Szlam, Gabriel Synnaeve, Soumith Chintala, and Rob Fergus. Mazebase: A sandbox for learning from games. arXiv preprint arXiv:1511.07401, 2015.
  • [32] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
  • [33] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The Malmo platform for artificial intelligence experimentation. In International Joint Conference on Artificial Intelligence (IJCAI), 2016.
  • [34] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701, 2015.
  • [35] Sida I. Wang, Percy Liang, and Christopher D. Manning. Learning language games through interaction. In Proceedings of ACL 2016, Berlin, Germany, 2016.
  • [36] Luis von Ahn and Laura Dabbish. Labeling images with a computer game. In CHI, pages 319–326, 2004.
  • [37] Nikolai S Kardashev. Transmission of information by extraterrestrial civilizations. Soviet Astronomy, 8:217, 1964.
  • [38] John McCarthy and Patrick J Hayes. Some philosophical problems from the standpoint of artificial intelligence. Readings in artificial intelligence, pages 431–450, 1969.