Developmental robotics (also known as cognitive developmental robotics, autonomous mental development, or epigenetic robotics) is the interdisciplinary approach to the autonomous design of behavioural and cognitive capabilities in artificial agents that draws direct inspiration from the developmental principles and mechanisms observed in children’s natural cognitive systems [10, 39].
Autonomous agents in such settings learn in an open-ended manner. Crucial components of such a developmental approach are learning to autonomously generate goals and explore the environment, exploiting intrinsic motivation and computational models of curiosity [43, 37].
II Related Work
II-A Development and learning in human child
The development of executive functions (EF) in human infants and young children, whose prefrontal cortex (PFC) is still at a rudimentary stage of neurodevelopment, refers to the emergence of an array of organizing and self-regulating goal-directed behaviours that inhibit impulses and regulate behaviour from a very early age. These developments have been associated with both PFC maturation and its connectivity with other brain areas, which is enabled by the individual’s sustained interaction with the surrounding physical and social environment. The initiation of these sensorimotor interactions in young children is exploratory in nature and is often embedded in playful activities with components of motor learning [1, 27]. Visual stimuli also elicit improved EF and cognitive organization, which contributes to the development of perceptual learning.
Although exposure to visual stimuli can lead to perceptual learning, it is often insufficient to yield robust learning. Research shows that additional factors, such as attention and reinforcement, are needed to produce robust learning. The amount and strength of exposure, the relation to attention, and the interaction of multiple sensory systems are some of the factors that promote human learning; the brain mechanisms underlying these factors are among the most active targets of research into how children learn and into the association of EF development with visuo-motor integration.
In relation to the above-mentioned mechanisms, research has shown that memory and learning interact with mechanisms such as curiosity, appraisal, prediction and exploration [28, 31]. Gruber’s PACE framework suggests that curiosity is triggered by significant prediction errors that are appraised. This enhances memory encoding through increased attention, exploration and information seeking, and contributes to the consolidation of information acquired while in a curious state through dopaminergic neuromodulation of the hippocampus. The dopamine neuromodulator has also been discussed from the intrinsic and extrinsic reward perspective of reinforcement learning (RL).
From a behavioural perspective, exploration has previously been identified as a special form of curiosity, driven either intrinsically or extrinsically. Active experimentation with physical objects generates more accurate inferences about the latent properties of an object than passive observation. Exploration of the physical world is considered a phase in the human transition from behavioural events towards symbolic and conceptual thinking. The developmental process of symbol and concept emergence has been associated with the relative frequency with which certain strategies are used and with the process of abandoning an old strategy and discovering new ones.
In problem-solving tasks these mechanisms have been correlated with the child’s ability to inhibit a certain action while considering an alternative one that would be more appropriate for the optimal performance of a task. The developmental process that leads from sensorimotor events to abstract learning and the acquisition of the optimal strategy for a specific task can be measured by behavioural indicators such as task performance speed and accuracy level. However, this process appears more complex in the case of collaborative problem-solving, where the child interacts with a more knowledgeable social agent. This involves selective social learning and relies on the child’s social motivation for learning.
Research shows that humans are able to explicitly communicate their uncertainty to others at a very early stage of life. Infants are capable of monitoring their own uncertainty and communicating it non-verbally to gain knowledge from others. While playing in unstructured and uncertain environments that lack clear extrinsic reward signals, they actively seek help from other humans. In early childhood, however, children might be aware of their uncertainty but do not always proceed to help-seeking, which shows the complexity of extrinsic and intrinsic motivation.
In this complex context, examining the learning outcome alone is often not adequate for understanding children’s problem-solving activity. An emphasis on how children move from early to later levels of competence within an EF component allows their developmental trajectories to be depicted. Mapping these trajectories reveals inter-individual differences in cognitive mechanisms such as inhibition of prepotent responses, mental shifting and generalization. These changes have been associated with changing brain connectivity, which is considered both a cause and a consequence of the developmental changes. Additional input towards understanding the child’s developmental process comes from the field of child-robot interaction, in which the child can take advantage of the robot’s appropriate interventions.
II-B Child development inspired artificial agent learning
Child learning has widely inspired the way learning machines are built. One cognitive architecture for teaching robots the way infants learn demonstrates how exploiting sensitivity to sensorimotor contingencies, or affordances in developmental psychology, combined with the notion of goal, allows an agent to develop new sensorimotor skills in open-ended learning settings [21, 20]. An example of a newly discovered contingency is touching a bell to generate a sound.
Inspired by developmental psychology, interactive learning (active imitation learning and goal-babbling) is combined with autonomous exploration in a strategic learner that reuses previously learned tasks or “procedures”: Socially Guided Intrinsic Motivation with Procedure Babbling (SGIM-PB), which is able to determine the representation of a hierarchy of interrelated tasks. In hard-exploration games, novelty-seeking agents, curiosity meta-learning, and remembering promising states and exploring from them are powerful approaches for training artificial agents.
An essential brain-inspired model for open-ended learning in robotics is Long-Term Memory for Artificial Cognition, which lets robots learn to operate in different worlds under different goals when the occurrence of experiences is intertwined. In this context, a Baxter robot is shown to learn control tasks by segmenting the world into semantically loaded categories associated with contexts, which in turn can allow higher-level reasoning and planning. An architecture for lifelong learning by evolution in robots is the MDB (Multilevel Darwinist Brain) [4, 5].
Some modulation-based mechanisms embedded within a cognitive architecture for robots combine long-term memory and a motivational system in order to select candidate primitive value functions for transfer and adaptation to new situations through modulatory ANNs. These progressively form new parameterized value functions able to address more complex situations in a developmental manner in a Baxter robot, which must solve different tasks in a cooking setup, or simplify the utility space in continuous state spaces.
Charisi et al. take inspiration from inhibitory control in developmental psychology and examine child-robot collaborative problem-solving with a focus on the process rather than the outcome of the child’s acquisition of a certain strategy. The Tower of Hanoï task is used to study the initiation of voluntary requests for help in a child-robot interaction setting with child-initiated robot interventions. They observe children’s problem-solving trajectories and their need for exploratory actions. We extend this work to test whether robot learning processes, and agent learning from an expert, can be inspired by child development. Building on their analysis of when and why asking for help aids collaborative problem-solving involving inhibitory processes, in this paper we contrast the hypotheses tested with children against the same situations reproduced with an artificial agent that learns to solve the same task with reinforcement learning.
As in Charisi et al., we evaluate the learning agent (LA) on the Tower of Hanoï game, but instead of the LA being a child, our agent is a Q-learning algorithm with a learning rate, a discount factor and an exploration rate. As can be seen in Fig. 3 in the Appendix, the Tower of Hanoï game with 3 disks is a simple close-ended task with 27 possible states and at most 3 possible actions associated with each state. Each element of the reward matrix used for Q-learning represents the reward for moving from the current state to the next one. Moves leading to the goal state are assigned a reward of 100, illegal moves receive their own reward value, and all other moves a reward of 0.
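The state space and update rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tuple encoding of states, the mapping of digits to disks, and the names `legal_moves`, `reward` and `q_update` are assumptions, and the learning rate `alpha` and discount factor `gamma` are left as parameters because their values are not given here. Only legal moves are generated, so the reward for illegal moves is not needed in this sketch.

```python
import itertools

PEGS = (1, 2, 3)
N_DISKS = 3
GOAL = (2, 2, 2)      # the "222" state; which digit maps to which disk is assumed
GOAL_REWARD = 100.0   # reward for a move that reaches the goal state

# Assumed encoding: state[d] is the peg holding disk d, with d = 0 the smallest.
STATES = list(itertools.product(PEGS, repeat=N_DISKS))


def legal_moves(state):
    """Return every state reachable from `state` in one legal move."""
    successors = []
    for src in PEGS:
        on_src = [d for d in range(N_DISKS) if state[d] == src]
        if not on_src:
            continue
        disk = min(on_src)  # only the top (smallest) disk of a peg can move
        for dst in PEGS:
            on_dst = [d for d in range(N_DISKS) if state[d] == dst]
            # A disk may only land on an empty peg or on a larger disk.
            if dst != src and (not on_dst or min(on_dst) > disk):
                successors.append(state[:disk] + (dst,) + state[disk + 1:])
    return successors


def reward(next_state):
    """100 for reaching the goal, 0 for any other legal move."""
    return GOAL_REWARD if next_state == GOAL else 0.0


def q_update(Q, state, next_state, alpha, gamma):
    """One tabular Q-learning update for the transition state -> next_state."""
    if next_state == GOAL:
        future = 0.0  # the episode ends at the goal
    else:
        future = max(Q[(next_state, n)] for n in legal_moves(next_state))
    target = reward(next_state) + gamma * future
    Q[(state, next_state)] += alpha * (target - Q[(state, next_state)])


# Q-table indexed by (state, next_state), mirroring the reward matrix layout.
Q = {(s, n): 0.0 for s in STATES for n in legal_moves(s)}
```

Indexing the table by transitions rather than by abstract action labels keeps it aligned with a reward matrix whose entries describe moving from the current state to the next one.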
In order to explore whether algorithms benefit from asking for help in human-robot collaborative problem-solving in the same manner as kids do, we further formulate two hypotheses:
H1: Canonical interventions from an expert speed up learning.
H2: Getting help on demand from an expert accelerates finding the optimal solution compared to not on demand.
III-B Research Design
We manipulate the expert intervention with 2 different scenarios:
The LA1 solves the task in collaboration with the expert in a “turn-taking” scenario, which results in a canonical cognitive intervention by the expert.
The LA2 solves the task independently, with the option of asking the expert for help whenever this is needed, which results in an on-demand intervention by the expert.
In order to test the different variations between teacher-driven and learner-driven interaction in our HRI setting, we vary two main parameters:
The canonical intervention rate, i.e. the frequency of the expert’s intervention during the canonical scenario.
The ask-for-help parameter, i.e. how often the LA asks the expert to make the next move, as a proxy for the need for help, during the on-demand scenario.
Our evaluation metric is the number of movements required to solve the task after a variable number of training episodes. To make these results robust, all the experiments were repeated 100 times.
We used the above-mentioned parameters to test our hypotheses as follows.
IV-A Task Performance with and without Turn-Taking
The first configuration consists of the LA1, a Q-learning agent, playing in collaboration with an expert that knows the optimal move in each configuration. Every two turns, the expert plays instead of the LA1 and performs the optimal action. We compare this with the performance of the LA1 when it solves the task alone, and with that of a random policy.
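A minimal sketch of this turn-taking setup is given below, under stated assumptions. The expert is modelled here as a breadth-first search from the goal, which is one simple way to obtain an optimal move in every state and is not necessarily how the expert was built. The state encoding, the start state `(1, 1, 1)`, the choice that the learner moves first, and the names `expert_move` and `run_episode` are all hypothetical.

```python
from collections import deque
import random

PEGS = (1, 2, 3)
N_DISKS = 3
GOAL = (2, 2, 2)  # the "222" state; the digit-to-disk mapping is assumed


def legal_moves(state):
    """Return every state reachable from `state` in one legal move."""
    successors = []
    for src in PEGS:
        on_src = [d for d in range(N_DISKS) if state[d] == src]
        if not on_src:
            continue
        disk = min(on_src)  # only the top (smallest) disk of a peg can move
        for dst in PEGS:
            on_dst = [d for d in range(N_DISKS) if state[d] == dst]
            if dst != src and (not on_dst or min(on_dst) > disk):
                successors.append(state[:disk] + (dst,) + state[disk + 1:])
    return successors


def distances_to_goal():
    """Breadth-first search from the goal; valid because moves are reversible."""
    dist = {GOAL: 0}
    queue = deque([GOAL])
    while queue:
        state = queue.popleft()
        for nxt in legal_moves(state):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist


DIST = distances_to_goal()


def expert_move(state):
    """Assumed expert: step to a neighbouring state that is closest to the goal."""
    return min(legal_moves(state), key=DIST.get)


def run_episode(learner_move, start, expert_period=2, max_moves=10_000):
    """Count moves to reach the goal; the expert plays every `expert_period`-th move.

    Passing expert_period=0 disables the expert (the learner plays alone).
    """
    state, moves = start, 0
    while state != GOAL and moves < max_moves:
        moves += 1
        if expert_period and moves % expert_period == 0:
            state = expert_move(state)
        else:
            state = learner_move(state)
    return moves


# Usage: a random learner (hypothetical baseline) with and without the expert.
rng = random.Random(0)
random_learner = lambda s: rng.choice(legal_moves(s))
helped = run_episode(random_learner, (1, 1, 1), expert_period=2)
alone = run_episode(random_learner, (1, 1, 1), expert_period=0)
```

Changing `expert_period` to 3 or 4 gives the less frequent interventions, so the same loop covers different canonical intervention rates.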
As can be seen in Fig. 1, the LA1 is immediately more efficient when it is helped by the expert in a turn-taking scenario: before any training, it needs fewer moves to solve the task with canonical interventions than without help. This can be explained by the fact that the expert places the agent directly on the optimal sequence of actions (the left-side diagonal in Fig. 3). The expert is able to solve the task in at most 7 moves starting from any state, so after it has played, the LA1 is at most 6 moves away from victory rather than 7. Thus, during the first episodes of Q-learning, when the LA1 is not yet aware of the optimal path and acts somewhat randomly, it is still closer to solving the task when it receives help than when it does not. In other words, the help of an expert improves the performance of the random policy. However, without help the LA1 moves away from the random policy after only 10 episodes, and after 100 training episodes its mean number of moves starts to drop drastically. Over the same period, performance in the turn-taking configuration still appears random, and the agent needs 300 training episodes before it starts to converge to the optimal solution. The curves intersect after 400 training episodes, when the LA1 without help starts to outperform the helped LA1. The LA1 needs 3,000 episodes of training to reach the optimal solution with canonical interventions, whereas it needs only 1,000 episodes when it is not helped. We can therefore conclude that being helped every 2 rounds by an expert agent does not speed up the learning process; on the contrary, it slows it down.
This is not entirely surprising, because an expert giving the optimal solution every two rounds prevents the agent from exploring every possible state. As shown in Fig. 3, the objective is to reach the 222 state at the bottom left, and each move of the expert therefore leads the game to a state further to the left or further down than the previous state. This makes some states hard to reach (such as 121) or even impossible to reach (such as those below 223), which delays convergence towards the optimal solution because the agent still wastes time trying to get there. This is a drawback of the learning system used. In contrast to state-of-the-art methods such as Policy Shaping [30, 11], our Learning Agent is guided by an expert whose feedback is not formulated as policy advice: the goal is not to optimize the human feedback but to mimic how a child learns to solve the task, replacing the child of the original setting with a Q-learning agent. The learning system could be improved by optimizing the teaching, i.e. by having the expert give not always the optimal action but the one that will teach the agent the most.
A solution that would not deviate from the initial experimental setup is to let the LA1 explore the different states by involving the expert less frequently, i.e. by modifying the canonical intervention rate. We do this in Fig. 4 in the Appendix, letting the expert play every 3 or 4 turns.
IV-B On demand or canonical intervention by the expert
The second configuration consists of the LA2, a Q-learning agent, trying to solve the task independently. It can ask an expert agent for help whenever it needs to. To do this, we added an ask-for-help parameter to the Q-learning. At each turn, if the best policy value is lower than the ask-for-help parameter, the expert plays instead of the LA2. As we can see in Fig. 2, the LA2 is immediately more effective, because it always asks for help while it does not yet know what to do. After asking for help many times during the first 10 episodes, it starts solving the task by itself, resulting in a loss of efficiency. We interpret this as the LA2 gaining confidence in moves which, while not perfect, still allow the task to progress towards its resolution through state exploration and trial. Compared to the LA1 without help, the LA2 asking for help is much more efficient, but there is little variation between the canonical and the ask-for-help configurations. This is probably due to the rather simple simulation of the ask-for-help trigger.
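The ask-for-help trigger can be illustrated with a short sketch. It assumes that the "best policy value" is the largest Q-value over the moves available in the current state, that unseen transitions default to 0, and that the agent otherwise acts epsilon-greedily; the function name `choose_move` and all of its parameters are hypothetical.

```python
import random


def choose_move(Q, state, successors, expert_move, ask_threshold, epsilon, rng):
    """Pick the next state and report whether the expert was asked.

    Q maps (state, next_state) pairs to values, `successors` lists the states
    reachable by a legal move, and `expert_move` is a callable returning the
    expert's choice for a state. All of these are assumed interfaces.
    """
    best_value = max(Q.get((state, n), 0.0) for n in successors)
    if best_value < ask_threshold:
        # No available move looks good enough: the expert plays instead.
        return expert_move(state), True
    if rng.random() < epsilon:
        # Occasional exploration (assumed epsilon-greedy behaviour).
        return rng.choice(successors), False
    # Otherwise play the move with the highest Q-value.
    return max(successors, key=lambda n: Q.get((state, n), 0.0)), False


# Usage with a toy two-move state; the labels "a", "b", "c" are illustrative.
toy_Q = {("a", "b"): 5.0, ("a", "c"): 1.0}
asked = choose_move(toy_Q, "a", ["b", "c"], lambda s: "c", 10.0, 0.0, random.Random(0))
alone = choose_move(toy_Q, "a", ["b", "c"], lambda s: "c", 2.0, 0.0, random.Random(0))
```

With a zero-initialized Q-table and a positive threshold, this rule asks for help on every early move, which is consistent with the agent relying heavily on the expert during its first episodes.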
V Discussion and Future Work
This paper presents initial work towards understanding the problem-solving process with two artificial agents by simulating a child-robot interaction experimental study. We acknowledge that simulating a child’s behaviour is a complex task and that more emphasis is needed on an accurate description of multidimensional child behaviour.
Intrinsic motivation, which is only indirectly related to solving a concrete task and is concerned with learning a set of reusable skills, should be studied further when raising the level of abstraction, especially in the context of solving different tasks and taking as input larger state spaces of higher dimensionality [45, 44], in order to simplify problem-solving in an end-to-end learning manner. State representation learning may prove useful for a more realistic setting that demands less preprocessing, i.e., one that does not require human annotation of each game state when human collaboration is involved.
Our aim is to verify whether hypotheses derived from an empirical study in an HRI setting remain valid when translated from children to an RL agent. The difficulty thus lies in simulating the child’s behaviour with an artificial agent. Adding an intrinsic motivation, modelling the LA’s desire to solve the game by itself, could increase the accuracy of this comparison. Our LA asks for help when it considers that a move is not good enough to be played (i.e. when the largest Q-value among all available next states falls under a pre-set threshold), whereas in reality the mechanisms pushing the child to ask for help are much more complex [35, 7].
One of the challenges to explore is validating the tested hypotheses on more complex tasks. More elaborate methods should be devised to model the agent’s uncertainty while acting more faithfully. Future work could better mimic the presented behaviours and other behaviours inspired by human learning. For instance, one could quantify the (aleatoric and epistemic) uncertainty [54, 13, 15] of the agent’s next action in order to better simulate the ask-for-help setting when an agent is not certain enough. An accurate assessment should be made of the mechanisms that lead a child to ask for help when solving a task independently. This would allow these mechanisms to be represented in the LA’s behaviour so that it could ask for help in a more natural, human-like way.
Future work includes extending collaborative problem-solving settings to triadic interactions, e.g. two children and a robot, in order to examine features of collective problem-solving that account for social dynamics. In addition, we plan to examine shifting processes, i.e. the processes of generalizing a strategy to a different task, in human and artificial agents. Future work could also consider the trade-off between the gain the agent obtains by asking and the disruption this causes to the human, using principles of mixed-initiative interaction. CoBots approaches (http://www.cs.cmu.edu/~coral/projects/cobot/) to asking for help are a related field to explore further, e.g., planning approaches for the LA to distinguish actions that it can complete autonomously from those that it cannot [49, 48].
Finally, in order to better understand children’s developmental trajectories, we aim to replicate a similar child-robot interaction setting with a larger sample by manipulating additional variables such as the agent’s social behaviour. This would inform our testing of more complex algorithms than Q-learning, using other dopamine-based distributional RL signals and as little training data as people need.
We thank Cristina Conati for giving feedback on this work.
-  (2019) Motor development: embodied, embedded, enculturated, and enabling. Annual review of psychology 70, pp. 141–164. Cited by: §II-A.
-  (2020) Meta-learning curiosity algorithms. arXiv preprint arXiv:2003.05325. Cited by: §II-B.
-  (2002) When and where do we apply what we learn?: a taxonomy for far transfer.. Psychological bulletin 128 (4), pp. 612. Cited by: §II-A.
-  (2010) A Cognitive Developmental Robotics Architecture for Lifelong Learning by Evolution in Real Robots. Cited by: §II-B.
-  Using promoters and functional introns in genetic algorithms for neuroevolutionary learning in non-stationary problems. Neurocomputing 72 (10), pp. 2134–2145. Note: Lattice Computing and Natural Computing (JCIS 2007) / Neural Networks in Intelligent Systems Design (ISDA 2007). Cited by: §II-B.
-  (2010) A developmental perspective on executive function. Child development 81 (6), pp. 1641–1660. Cited by: §II-A, §II-A.
-  (2018) Expanding the active inference landscape: more intrinsic motivations in the perception-action loop. Frontiers in Neurorobotics 12, pp. 45. Cited by: §V.
-  (2018) Intuitive experimentation in the physical world. Cognitive psychology 105, pp. 9–38. Cited by: §II-A.
-  (2012) Algorithmic and human teaching of sequential decision tasks. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI’12, pp. 1536–1542. Cited by: §IV-A.
-  (2018) From babies to robots: the contribution of developmental robotics to developmental psychology. Child Development Perspectives. Cited by: §I.
-  (2015) Policy shaping with human teachers. In IJCAI, Cited by: §IV-A.
-  (2020) Child-robot collaborative problem-solving and the importance of child’s voluntary interaction: a developmental perspective. Frontiers in Robotics and AI 7, pp. 15. Cited by: §II-B, §III, §IV-A, Fig. 3.
-  (2019) Estimating risk and uncertainty in deep reinforcement learning. arXiv preprint arXiv:1905.09638. Cited by: §V.
-  (2018) Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in neural information processing systems, pp. 5027–5038. Cited by: §II-B.
-  (2019) Uncertainty-aware action advising for deep reinforcement learning agents. Cited by: §V.
-  Agents teaching agents: a survey on inter-agent transfer learning. Autonomous Agents and Multi-Agent Systems 34, pp. 1–17. Cited by: §III-B.
-  (2020) A distributional code for value in dopamine-based reinforcement learning. Nature, pp. 1–5. Cited by: §V.
-  (1988) The development of children’s strategies for selective attention: evidence for a transitional period. Child Development, pp. 1504–1513. Cited by: §II-A.
-  (2002) Normal development of prefrontal cortex from birth to young adulthood: cognitive functions, anatomy, and biochemistry. Principles of frontal lobe function, pp. 466–503. Cited by: §II-A.
-  (2020) DREAM Architecture: a Developmental Approach to Open-Ended Learning in Robotics. Cited by: §II-B.
-  (2018) Open-ended learning: a conceptual framework based on representational redescription. Frontiers in Neurorobotics. Cited by: §I, §II-B.
-  (2019) Learning a set of interrelated tasks by using a succession of motor policies for a socially guided intrinsically motivated learner. Frontiers in Neurorobotics 12, pp. 87. Cited by: §II-B.
-  (2019) Perceptual generalization and context in a network memory inspired long-term memory for artificial cognition. International Journal of Neural Systems 29 (06), pp. 1850053. Note: PMID: 30614325. Cited by: §II-B.
-  (2020) First return then explore. arXiv preprint arXiv:2004.12919. Cited by: §II-B.
-  (2019) Neural substrates of early executive function development. Developmental Review 52, pp. 42–62. Cited by: §II-A.
-  (2017) Unity and diversity of executive functions: individual differences as a window on cognitive structure. Cortex 86, pp. 186–204. Cited by: §II-A.
-  (1988) Exploratory behavior in the development of perceiving, acting, and the acquiring of knowledge. Annual review of psychology 39 (1), pp. 1–42. Cited by: §II-A.
-  (2018) Towards a neuroscience of active sampling and curiosity. Nature Reviews Neuroscience 19 (12), pp. 758–770. Cited by: §II-A.
-  (2016) Infants ask for help when they know they don’t know. Proceedings of the National Academy of Sciences 113 (13), pp. 3492–3496. Cited by: §II-A.
-  (2013-01) Policy shaping: integrating human feedback with reinforcement learning. Advances in Neural Information Processing Systems. Cited by: §IV-A.
-  (2019) How curiosity enhances hippocampus-dependent Memory: The prediction, appraisal, curiosity, and exploration (PACE) framework. Trends in cognitive sciences. Cited by: §II-A.
-  (1999) Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’99, New York, NY, USA, pp. 159–166. Cited by: §V.
-  (2019) Sensorimotor contingencies as a key drive of development: from babies to robots. Frontiers in Neurorobotics 13, pp. 98. Cited by: §II-B.
-  (2013) Selective social learning: new perspectives on learning from others.. Developmental Psychology 49 (3), pp. 399. Cited by: §II-A.
-  (2017) Building machines that learn and think like people. Behavioral and brain sciences 40. Cited by: §II-B, §V, §V.
-  (2018) State representation learning for control: an overview. Neural Networks 108, pp. 379–392. Cited by: §V.
-  (2020) Continual learning for robotics: definition, framework, learning strategies, opportunities and challenges. Information Fusion 58, pp. 52–68. Cited by: §I.
-  (1994) The psychology of curiosity: a review and reinterpretation.. Psychological bulletin 116 (1), pp. 75. Cited by: §II-A.
-  (2003) Developmental robotics: a survey. Connection Science 15 (4), pp. 151–190. Cited by: §I.
-  (2019) Developing together: the role of executive function and motor skills in children’s early academic lives. Early Childhood Research Quarterly 46, pp. 142–151. Cited by: §II-A.
-  (2000) The unity and diversity of executive functions and their contributions to complex “frontal lobe” tasks: a latent variable analysis. Cognitive psychology 41 (1), pp. 49–100. Cited by: §V.
-  (2007-04) Intrinsic motivation systems for autonomous mental development. Evolutionary Computation, IEEE Transactions on 11 (2), pp. 265–286. Cited by: §I, §V.
-  (2018) Computational theories of curiosity-driven learning. CoRR abs/1802.10546. Cited by: §I.
-  (2019) Introducing separable utility regions in a motivational engine for cognitive developmental robotics. Integrated Computer-Aided Engineering 26 (1), pp. 3–20. Cited by: §V.
-  (2019-07) Modulation based transfer learning of motivational cues in developmental robotics. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §V.
-  (2020) Producing parameterized value functions through modulation for cognitive developmental robots. In Robot 2019: Fourth Iberian Robotics Conference, M. F. Silva, J. Luís Lima, L. P. Reis, A. Sanfeliu, and D. Tardioli (Eds.), Cham, pp. 250–262. Cited by: §II-B.
-  (2019) Simplifying the creation and management of utility models in continuous domains for cognitive robotics. Neurocomputing 353, pp. 106–118. Cited by: §II-B.
-  (2012) Is someone in this office available to help me?. Journal of Intelligent & Robotic Systems 66 (1-2), pp. 205–221. Cited by: §V.
-  (2011) Task behavior and interaction planning for a mobile service robot that occasionally requires help. In Automated Action Planning for Autonomous Mobile Robots, Cited by: §V.
-  (2007) Microgenetic analyses of learning. Handbook of child psychology 2. Cited by: §II-A, §II-A.
-  (2005) Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 1281–1288. Cited by: §II-A.
-  (2020) Beyond origins. developmental pathways and the dynamics of brain networks. In Current Controversies in Philosophy of Cognitive Science, A. J. Lerner, Cullen,Simon, and S. Leslie (Eds.), pp. 49–62. Cited by: §II-A.
-  (1998) Introduction to reinforcement learning. Vol. 135. Cited by: §II-B.
-  Single-model uncertainties for deep learning. In Advances in Neural Information Processing Systems, pp. 6414–6425. Cited by: §V.
-  (2016) The embodied mind: cognitive science and human experience. Cited by: §II-A.
-  (2020) Explainable agents through social cues: a review. arXiv preprint arXiv:2003.05251. Cited by: §V.
-  (2017) Proactive help-seeking: preschoolers know when they need help, but do not always ask for it. Cognitive Development 43, pp. 91–105. Cited by: §II-A.
-  (1992-05) Technical note: q-learning. Machine Learning 8, pp. 279–292. Cited by: §III.
VI-A Tower of Hanoï game
All possible states of the Tower of Hanoï game are in Fig. 3.
VI-B Additional Results
With less frequent expert intervention (Fig. 4), the LA1 that receives help is always more efficient than the one that does not, regardless of the number of training episodes. We can therefore conclude that an agent is more efficient when it receives help, as long as this help does not block its exploration.