Worldwide, approximately 50 million people were living with dementia in 2018. Reminiscence therapy (RT), the most popular therapeutic intervention for persons with dementia (PwDs), exploits the PwDs’ early memories and experiences, usually with memory triggers familiar to the PwDs (e.g., photographs and music), to evoke memory and stimulate conversation. Evidence shows that RT has positive effects on PwDs’ quality of life, cognition, communication, and mood. While computer-based RTs, such as the InspireD Reminiscence App and the Memory Tracks App, have been developed to make RT more accessible to PwDs, their interaction modality is limited to 2D visual signals or sounds and lacks non-verbal interaction, e.g., eye gaze, body movement, and facial expression. Comparatively, a physically embodied social robot capable of non-verbal interaction is believed to enable more intuitive, effective, and engaging memory triggers during RT, thus stimulating more memory recall and conversation. In addition, robot-assisted RT is a promising solution for coping with the increasing number of PwDs and relieving caregiver stress, owing to the robot’s consistent execution and tireless repeatability.
A recent work applied deep learning techniques to develop a smartphone-based conversational agent that automated RT by showing a picture, asking questions about the picture, and commenting on the user’s answers. However, PwDs may have different degrees of dementia, and an individual PwD may show time-varying behaviors, emotions, personalities, and cognitive capabilities [3, 9]. It is very challenging for supervised or unsupervised learning to produce a learning agent with sufficient adaptivity to individual PwDs. We herein target robot training using RL, which allows the robot to learn continually from interacting with the PwD and arrive at an optimal conversation strategy for that PwD. Several existing works have investigated PwD-robot dialogue management using RL. For instance, Magyar et al. employed Q-learning (QL) to learn a robotic conversation strategy that promotes the PwD’s response while considering the PwD’s topics of interest and emotions. Yuan et al. developed a robotic dialogue strategy via QL to handle repetitive questioning behaviors from PwDs. However, the existing literature still lacks a pervasively applicable patient model that can i) integrate a comprehensive list of the major factors impacting the PwD’s behaviors during RT, and ii) accurately characterize the probabilistic transitions between the PwD’s mental states under different robotic actions. Such a model would provide valuable guidance for designing clinical experiments and collecting data in a more targeted way, and would serve as a customizable interface between clinical data and the design of robotic RT strategies.
To this end, we aim to build a pervasive simulation model of PwDs that characterizes their probabilistic behaviors during RT, and to develop an RL-based conversation strategy for robot-assisted RT. Specifically, our contributions are three-fold. Firstly, we design a parameterized pervasive PwD model which incorporates the PwD’s response relevance, emotion level, and confusion condition as the mental states, and depicts the probabilistic behaviors of PwDs during RT as probabilistic transitions between these mental states. Secondly, we define a Markov Decision Process (MDP) model for robot-assisted RT and design a Q-learning (QL) algorithm to obtain the optimal conversation strategy for the robot. The strategy is sensitive to the PwD’s mental states and promotes the PwD’s talking by duly adjusting the difficulty level of the prompts, repeating or explaining the prompts to clear confusion, and comforting the PwD to relieve bad moods. If bad moods persist, the strategy offers the PwD the initiative to continue, change the topic, or stop the conversation, so that the stress of RT is mitigated. The impacts of the PwD’s choices are also considered while learning the optimal strategy. Finally, simulations are conducted to demonstrate the learning convergence and validate the efficacy of the achieved strategy in promoting conversation.
The remainder of the paper is organized as follows. Section 2 describes the simulation model of a PwD in the context of RT. Section 3 elaborates on the design of the robotic conversation strategy, including the definition of the MDP and the revised QL algorithm based on the proposed PwD model. The experimental results are presented and discussed in Section 4, along with suggested future work.
2 Simulation Model for a Person with Dementia (PwD)
In robot-assisted RT, the robot provides memory triggers (e.g., photographs, music, or video clips) and stimulates the PwD to talk about relevant past memories and experiences. During the conversation, the PwD, with limited cognitive capacity, may provide a relevant, irrelevant, or even no response to the robot. In addition, the PwD may show different emotions, such as joy or discomfort, in reaction to different memory triggers and robot actions. Moreover, the PwD may become confused about a question or memory trigger provided by the robot. Thus, in the context of robot-assisted RT, we represent the current state of a PwD by their response relevance, emotion level, and confusion condition. In this model, a PwD’s response relevance can be relevant response (RR), irrelevant response (IR), or no response (NR). A PwD’s emotion level is categorized as negative (Neg), neutral (Neu), or positive (Pos). A PwD’s confusion condition is classified as confused (Yes) or unconfused (No).
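The state space described above can be sketched programmatically. The following Python snippet (with illustrative names such as `STATES` and `state_index`, not taken from the paper) enumerates every combination of response relevance, emotion level, and confusion condition:

```python
from itertools import product

# Hypothetical encoding of the PwD state space described above:
# response relevance x emotion level x confusion condition.
RELEVANCE = ("RR", "IR", "NR")   # relevant / irrelevant / no response
EMOTION = ("Neg", "Neu", "Pos")  # negative / neutral / positive
CONFUSION = ("Yes", "No")        # confused / unconfused

# Enumerate every combination as a tuple; 3 * 3 * 2 = 18 states in total.
STATES = list(product(RELEVANCE, EMOTION, CONFUSION))

def state_index(state):
    """Map a (relevance, emotion, confusion) tuple to a row index 0..17."""
    return STATES.index(state)
```

Such an index can then serve as the row index of a tabular Q-table or transition matrix.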
Depending on the robot’s actions during RT, the PwD’s state may switch from one to another. We consider the following robot actions. On one hand, to stimulate the conversation during RT, the robot can provide appropriate prompts (e.g., questions) about the current memory trigger [5, 13]. The difficulty level of the prompt can be adjusted, i.e., an easy prompt (a1; e.g., a yes-no question), a moderately difficult prompt (a2), or a difficult prompt (a3; e.g., an open-ended question). On the other hand, when the PwD gets confused or falls into a negative emotion, conditions known as "harmful" or "bad" moments, the robot may take actions to help the PwD out of these bad moments. Inspired by previous relevant studies [8, 14], the robot will repeat (a4) or explain (a5) the prompt when the PwD feels confused [5, 13], and comfort (a6) the PwD to alleviate their fear or discomfort during RT.
The probabilistic behaviors of the PwD are modelled with the transition probabilities (see Appendix I or the following URL: https://drive.google.com/drive/folders/1FmhNsXJnG_WUUKtEpBBflig3ipks1qTb?usp=sharing) among the PwD states given each robot action. Basically, as the difficulty level of the prompt increases (e.g., a3 vs. a1), the probabilities of the PwD responding relevantly and showing positive emotion decrease, and the probability of getting confused increases. When the PwD is in a negative emotion or confused, the probability of a relevant response is smaller. If the PwD gets confused and the robot chooses to repeat or explain the prompt, the probabilities of the PwD responding and showing non-negative emotion increase, with the confusion condition possibly changed. If the robot chooses to comfort the PwD in a bad moment, the probability of a relevant response increases, with the emotion level possibly improved.
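As a minimal sketch of how such a transition model can be queried in simulation, the snippet below samples a next state from a state-action transition tensor. The tensor here is randomly generated and purely illustrative; the actual probabilities are given in the paper’s Appendix I.

```python
import numpy as np

# 18 PwD states; 6 learnable robot actions (the give-choices action is
# handled deterministically outside the transition tensor).
N_STATES, N_ACTIONS = 18, 6

rng = np.random.default_rng(0)

# Placeholder transition tensor T[a, s, s'] = P(s' | s, a); drawn at
# random here purely to illustrate the interface.
T = rng.random((N_ACTIONS, N_STATES, N_STATES))
T /= T.sum(axis=2, keepdims=True)  # normalize each row into a distribution

def step(state, action):
    """Sample the PwD's next state index given the current state and robot action."""
    return rng.choice(N_STATES, p=T[action, state])
```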
Moreover, our designed RT strategy gives the PwD the initiative to control the direction of the conversation when bad moments continue. This gives the PwD a sense of control, thus mitigating RT stress. If the PwD shows confusion or negative emotion twice in a row, the PwD is offered the choice of stopping the RT, continuing to talk about the current memory trigger, or changing to another memory trigger. If the PwD chooses to stop, the current RT session terminates. If the PwD chooses to continue, the PwD’s next state remains unchanged. If the PwD chooses to change the memory trigger, the PwD’s next state is considered to be no response (NR), with neutral emotion (Neu) and no confusion. We define the robot’s action of providing these choices as a7.
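The trigger condition and the three possible outcomes of the give-choices action can be summarized in code. This is a sketch with hypothetical function names; only the two-consecutive-bad-moments rule and the outcomes follow the description above.

```python
# State is a (relevance, emotion, confusion) tuple; names are illustrative.

def is_bad_moment(state):
    """A bad moment means negative emotion or confusion."""
    _, emotion, confusion = state
    return emotion == "Neg" or confusion == "Yes"

def maybe_offer_choice(history, pwd_choice):
    """If the last two states were bad moments, apply the PwD's choice.

    history is the list of past states; pwd_choice is one of 'stop',
    'continue', 'change'. Returns (next_state, done).
    """
    if len(history) < 2 or not all(is_bad_moment(s) for s in history[-2:]):
        return history[-1], False          # no choice offered
    if pwd_choice == "stop":
        return history[-1], True           # session terminates
    if pwd_choice == "continue":
        return history[-1], False          # state remains unchanged
    return ("NR", "Neu", "No"), False      # new memory trigger resets the state
```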
Note that the state transition probabilities are set to be identical across memory triggers in our simulations, but they can be set differently according to personal preferences in the future.
3 Adaptive Robot-Assisted Reminiscence Therapy
In this section, we apply the technique of reinforcement learning (RL) to learn a conversation strategy for the robot to deliver reminiscence therapy. The goal is to maintain the RT for a target number of conversation rounds, stimulate the PwD to express as much as possible, and keep the PwD’s state as positive as possible. A revised Q-learning (QL) algorithm is used to achieve the best conversation strategy personalized to the PwD modelled in Section 2.
3.1 Definition of Markov Decision Process
In order to learn the optimal policy, we first formulate the problem of robot-assisted RT for the PwD as the following MDP model:
State Space. A state in this problem is defined as the collection of the PwD’s response relevance to the robot’s prompt, emotion level, and confusion condition. Based on the simulation model of the PwD designed in Section 2, the state space has a cardinality of 3 × 3 × 2 = 18.
Action Space. During RT, there are seven actions the robot may take, i.e., providing an easy prompt (a1), providing a moderately difficult prompt (a2), providing a difficult prompt (a3), repeating the prompt (a4), briefly explaining the prompt (a5), comforting the PwD (a6), and giving the PwD choices (a7). Note that even if the PwD responds to a prompt incorrectly, the robot will NOT correct the PwD (RT is not aimed at correcting PwDs). However, the response relevance is considered by the RL agent in the reward function. Moreover, as mentioned previously, the robot takes action a7 whenever the PwD shows confusion or negative emotion twice in a row. In other words, the condition for taking action a7 is deterministic; therefore, the actual action space for the RL agent only includes actions a1 to a6. Although the Q-value of action a7 is not learned during training, the impact of taking a7 is deliberately integrated into the Q-value updates of the other actions, as detailed in Section 3.2.
Reward Function. The design of the reward function aligns with the objectives of robot-assisted RT, i.e., stimulating the PwD to talk while keeping the PwD in a generally positive mood. Thus, the reward is a function of the PwD’s response relevance, emotion level, and confusion condition. Specifically, the robot should always try to prevent the PwD from getting trapped in bad moments, i.e., being in a negative mood or confused, as bad moments hamper the conversation and raise the chance of terminating the current session. Accordingly, the reward component for the PwD’s emotion level increases from negative through neutral to positive, and the reward component for the confusion condition penalizes being confused relative to being unconfused. As to the difficulty level of the prompt, it should be properly adjusted according to the PwD’s cognitive capability and mental state so that the PwD is more engaged and interested, thus stimulating their memory and conversation to the greatest extent. In other words, an optimal tradeoff needs to be learned between using an easy prompt, for a higher chance of keeping the PwD in a positive state, and using a more difficult prompt (e.g., an open-ended question) to encourage the PwD to talk more. Correspondingly, we provide two reward settings, R1 and R2 (listed in Table 1), for prompts as a function of the difficulty level and the resulting response relevance, for the later experimental study.
Table 1. Reward settings R1 and R2 for prompts, by difficulty level and response relevance: no response (NR), irrelevant response (IR), relevant response (RR).
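Since Table 1’s numeric values are not reproduced here, the snippet below sketches such a reward function with placeholder numbers that only preserve the stated orderings (negative < neutral < positive emotion; confusion penalized; relevant responses to harder prompts rewarded more). All values and names are illustrative, not the paper’s.

```python
# Emotion component: negative < neutral < positive (placeholder values).
EMOTION_REWARD = {"Neg": -1.0, "Neu": 0.0, "Pos": 1.0}
# Confusion component: being confused is penalized.
CONFUSION_REWARD = {"Yes": -1.0, "No": 0.0}

# Response component, indexed by prompt difficulty (0=easy .. 2=difficult):
# a relevant response to a harder prompt earns more reward.
RESPONSE_REWARD = [
    {"NR": -1.0, "IR": 0.0, "RR": 1.0},  # easy prompt
    {"NR": -1.0, "IR": 0.0, "RR": 2.0},  # moderately difficult prompt
    {"NR": -1.0, "IR": 0.0, "RR": 3.0},  # difficult prompt
]

def reward(state, difficulty):
    """Sum the response, emotion, and confusion components of the reward."""
    relevance, emotion, confusion = state
    return (RESPONSE_REWARD[difficulty][relevance]
            + EMOTION_REWARD[emotion]
            + CONFUSION_REWARD[confusion])
```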
3.2 Learning Algorithm Design and Training
Although the RL agent only learns the optimal policy over actions a1 to a6, the previously taken actions have a decisive impact on the probability of action a7 being triggered. For example, if the robot always takes action a3 (providing a difficult prompt) and ignores the PwD’s bad moments, there is a very high chance of the PwD choosing to stop. Therefore, we revise the standard QL algorithm and deliberately integrate the negative impact of triggering a7 into the Q-value updates of the other actions to avoid over-aggressive policies, as summarized in Algorithm 1.
The RL agent is trained over multiple epochs, each consisting of a number of episodes, with fixed learning rate and discount factor. At the beginning of each episode, the environment is reset to the initial state [NR, Neu, No]. In each iteration, the ε-greedy approach is used to select actions. An episode terminates if the PwD chooses to stop, the maximum number of conversation rounds is reached, or a preset number of memory triggers has been discussed.
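The training loop can be sketched as standard tabular ε-greedy Q-learning. The hyperparameter values and function names below are illustrative, and the a7-related revision of Algorithm 1 would enter through the penalty supplied by `reward_fn` and the `done` flag returned by `step_fn`:

```python
import numpy as np

def train(step_fn, reward_fn, n_states=18, n_actions=6,
          episodes=500, max_rounds=20, alpha=0.1, gamma=0.9,
          epsilon=0.1, seed=0):
    """Tabular epsilon-greedy Q-learning over the simulated PwD.

    step_fn(s, a) -> (next_state, done); reward_fn(s, a, s_next) -> float.
    Hyperparameter values here are illustrative, not the paper's.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done, rounds = 0, False, 0      # reset to the initial state index
        while not done and rounds < max_rounds:
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))    # explore
            else:
                a = int(np.argmax(Q[s]))            # exploit
            s_next, done = step_fn(s, a)
            r = reward_fn(s, a, s_next)
            # Standard one-step Q-learning update.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s, rounds = s_next, rounds + 1
    return Q
```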
3.3 Evaluation Metrics
To evaluate the performance, we compare the average return per epoch obtained by the revised QL (denoted as ε-greedy QL) with that obtained by a random policy (denoted as Random action). We also extract the intermediate policy suggested at a fixed episode within each epoch, apply it in experiments, and calculate the average return (denoted as Greedy QL). Moreover, we monitor the averaged sum of the Q-table per epoch (i.e., Q-value sum) as well as its relative change (i.e., Q-value update) to evaluate the convergence performance. Additionally, all the optimal policies suggested in the last episodes are recorded. We run experiments with the five most frequently suggested policies and choose the one that obtains the maximum return as the final policy, denoted as π*. Finally, we conduct 20 experiments with π* and observe the dynamics of the state-action transitions in each experiment.
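Extracting a greedy policy from a learned Q-table and estimating its average return per episode, as done for the Greedy QL and π* evaluations, can be sketched as follows (function names and defaults are illustrative):

```python
import numpy as np

def greedy_policy(Q):
    """Extract the greedy policy (best action per state) from a Q-table."""
    return np.argmax(Q, axis=1)

def average_return(policy, step_fn, reward_fn, n_episodes=100, max_rounds=20):
    """Run a fixed policy and report the mean undiscounted return per episode.

    step_fn(s, a) -> (next_state, done); reward_fn(s, a, s_next) -> float.
    """
    total = 0.0
    for _ in range(n_episodes):
        s, done, rounds = 0, False, 0   # reset to the initial state index
        while not done and rounds < max_rounds:
            a = int(policy[s])
            s_next, done = step_fn(s, a)
            total += reward_fn(s, a, s_next)
            s, rounds = s_next, rounds + 1
    return total / n_episodes
```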
4 Results and Discussion
The learning process of our revised QL (i.e., ε-greedy QL) with reward function R1 is shown in Fig. 1. As shown on the left of Fig. 1, the average return per epoch obtained by ε-greedy QL (blue curve) was much greater than that of the random action selection policy (black curve), which validates the efficacy of applying the RL approach for the robot to automate RT. The average return per epoch obtained by Greedy QL was greater than that of ε-greedy QL. This makes sense because Greedy QL always follows the optimal policy due to greedy action selection, provided the achieved strategy is optimal, whereas ε-greedy QL selects random actions for exploration. With respect to convergence, the curve of the average return per epoch (blue curve in the left panel of Fig. 1) indicates that the RL agent was able to converge within a limited number of epochs, as did the Q-value sum and Q-value update (the middle and right panels of Fig. 1). Additionally, we observe that the optimal policy suggested by the RL agent in the last episodes was still changing, which might be due to the design of the reward function and the model of the simulated PwD. In Table 2, we list the dynamics (i.e., state-action transitions) of one experiment using the near-optimal policy learned by Q-learning with reward R1.
Compared to reward function R1, Q-learning with reward function R2 showed similar performance, i.e., a comparable curve of average return per epoch, converging Q-value sums, and a still-changing optimal policy during the last episodes. We present the near-optimal policies learned by QL under the two reward functions, R1 (blue squares) and R2 (red circles), in Fig. 2.
The scatter plot demonstrates that the robot learns to comfort the PwD when they feel a negative emotion. For example, the optimal action suggested for the state [NR, Neg, No] under both R1 and R2 was a6, comforting. When the PwD feels confused (e.g., in state [NR, Pos, Yes]), the RL agents with R1 and R2 both suggested taking action a4, repeating the prompt to the PwD. There were states, such as [NR, Neu, Yes], [RR, Neu, No], and [RR, Pos, No], where the RL agent suggested different actions under R1 and R2. For instance, the RL agents trained with R1 and R2 suggested prompts of different difficulty levels when the PwD was in state [RR, Neu, No] or [NR, Pos, No]. Such differences make sense because the two reward functions R1 and R2 (in Table 1) reflect how much we value the PwD’s response relevance and the degree to which the PwD’s conversation is stimulated: the more aggressive of the two values the stimulated memory and conversation much more than the emotion and confusion conditions. This scatter plot also indicates that our RL approach was able to learn to adjust the difficulty level of the prompt adaptively to the PwD’s condition.
In this paper, we employed a revised QL algorithm to learn a conversation strategy for a robot to stimulate a PwD to talk as much as possible while keeping the PwD in a generally positive mood during RT. The PwD was modelled via transition probabilities among conditions comprising response relevance, emotion level, and confusion status. Our experimental results showed that the strategy learned by QL was capable of adjusting the difficulty level of the prompt (e.g., yes-no vs. open-ended questions) according to the PwD’s states, taking actions such as repeating/explaining the prompt or comforting to help the PwD out of bad moments [14, 8], and allowing the PwD to mitigate potential conversation stress during RT. To the best of our knowledge, this is the first attempt at technology-enabled RT that learns an adaptive strategy while taking into consideration PwDs’ complex mental states and the communication strategies suggested in the traditional healthcare field. This may offer a promising solution for automatic, person-centered RT for PwDs living alone.
However, there are still some limitations in this study. The patient model, i.e., the matrix of state-action transition probabilities, was created based on previous qualitative studies. As discussed earlier, this nature of our PwD model might explain why the optimal policy was still changing during the last episodes. For better RL-based learning of RT as well as real-world application of robot-assisted RT, a patient model based on real-world data should be developed, which is our next step. Additionally, we designed two types of reward function to test the feasibility of the RL approach. However, the design of the reward function may be associated with a PwD’s own personality and needs (e.g., psychological needs vs. cognitive stimulation). From this perspective, in future work we will closely collaborate with professional facilitators in this field and with PwDs to adjust the reward function, to ensure an effective, person-centered RT using RL.
- (2021) 2021 Alzheimer’s disease facts and figures. Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association 17 (3), pp. 327–406.
- (2020) Automatic reminiscence therapy for dementia. In Proceedings of the 2020 International Conference on Multimedia Retrieval, pp. 383–387.
- (2012) Behavioral and psychological symptoms of dementia. Frontiers in Neurology 3, pp. 73.
- (2019) Assessing wellbeing in people living with dementia using reminiscence music with a mobile app (Memory Tracks): a mixed methods cohort study. Journal of Healthcare Engineering 2019.
- (2004) Conversational coherence: discourse analysis of older adults with and without dementia. Journal of Neurolinguistics 17 (4), pp. 263–283.
- (2004) Designing a multimedia conversation aid for reminiscence therapy in dementia care environments. In CHI’04 Extended Abstracts on Human Factors in Computing Systems, pp. 825–836.
- (2017) Towards adaptive social behavior generation for assistive robots using reinforcement learning. In 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 332–340.
- (2017) When reminiscence is harmful: the relationship between self-negative reminiscence functions, need satisfaction, and depressive symptoms among elderly people from Cameroon, the Czech Republic, and Germany. Journal of Happiness Studies 18 (2), pp. 389–407.
- (2019) Effects of age-related cognitive decline on elderly user interactions with voice-based dialogue systems. In IFIP Conference on Human-Computer Interaction, pp. 53–74.
- (2019) Autonomous robotic dialogue system with reinforcement learning for elderlies with dementia. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 3416–3421.
- (September 2018) World Alzheimer Report 2018. Alzheimer’s Disease International (ADI), London.
- (2020) ‘There is still so much inside’: the impact of personalised reminiscence, facilitated by a tablet device, on people living with mild to moderate dementia and their family carers. Dementia 19 (4), pp. 1131–1150.
- (2012) Training family care partners to communicate effectively with persons with Alzheimer’s disease: the TRACED program. Canadian Journal of Speech-Language Pathology & Audiology 36 (4).
- (2011) Memory and communication support in dementia: research-based strategies for caregivers. International Psychogeriatrics 23 (2), pp. 256.
- (2018) Reinforcement Learning: An Introduction. MIT Press.
- (2009) The use of socially assistive robots in the design of intelligent cognitive therapies for people with dementia. In 2009 IEEE International Conference on Rehabilitation Robotics, pp. 924–929.
- (2017) Dementia, decision making, and quality of life. AMA Journal of Ethics 19 (7), pp. 637–639.
- (2018) Reminiscence therapy for dementia. Cochrane Database of Systematic Reviews (3).
- (2012) REMCARE: reminiscence groups for people with dementia and their family caregivers: effectiveness and cost-effectiveness pragmatic multicentre randomised trial. Health Technology Assessment 16 (48).
- (2021) A systematic review of robotic rehabilitation for cognitive training. Frontiers in Robotics and AI 8, pp. 105.
- (2021) A simulated experiment to explore robotic dialogue strategies for people with dementia. arXiv preprint arXiv:2104.08940.