Improving Mild Cognitive Impairment Prediction via Reinforcement Learning and Dialogue Simulation

02/18/2018 · by Fengyi Tang, et al.

Mild cognitive impairment (MCI) is a prodromal phase in the progression from normal aging to dementia, especially Alzheimer's disease. Although MCI patients show mild cognitive decline, their overall cognition remains normal, which makes MCI challenging to distinguish from normal aging. Using transcripts of recorded conversational interactions between participants and trained interviewers, and applying supervised learning models to these data, a recent clinical trial has shown promising results in differentiating MCI from normal aging. However, the substantial amount of interaction with medical staff can still incur significant medical care expenses in practice. In this paper, we propose a novel reinforcement learning (RL) framework to train an efficient dialogue agent on existing transcripts from clinical trials. Specifically, the agent is trained to sketch disease-specific lexical probability distributions, and thus to converse in a way that maximizes diagnostic accuracy and minimizes the number of conversation turns. We evaluate the performance of the proposed reinforcement learning framework on MCI diagnosis from a real clinical trial. The results show that, while using only a few turns of conversation, our framework can significantly outperform state-of-the-art supervised learning approaches.

1. Introduction

The progression of Alzheimer's Disease (AD) has long been a major focus of research in clinical medicine because, while the disease itself is incurable, early intervention during its prodromal phases has been shown to delay the onset of AD-related mental degeneration and systemic issues for months to years (Olazaran et al., 2004; Cummings et al., 2007). Consequently, much of the recent clinical research effort has focused on detecting early stages of mild cognitive impairment (MCI), a prodromal phase in AD progression occurring months to years before visible mental decline begins (Gauthier et al., 2006). If the disease is successfully detected at this stage, intervention methods may confer numerous benefits for the longevity of cognitive and physiological health in AD patients (Olazaran et al., 2004; Cummings et al., 2007).

Brain imaging, such as structural magnetic resonance imaging (MRI), has been shown to contain prime markers of AD, capturing the physiological changes in the AD pathological process (Johnson et al., 2012; Heister et al., 2011). However, distinguishing MCI from normal aging (NL) is particularly challenging because structural changes in the brain at this phase are minor and hard to detect through structural MRI (Jack Jr et al., 2010), even though decline in mental status and cognition has already begun in most cases. Recently, the structural connections among brain regions inferred from diffusion MRI have provided promising predictive performance for MCI detection (Zhan et al., 2015; Wang et al., 2016), yet sketching brain networks via imaging remains prohibitively expensive and difficult to scale. Moreover, the high dimensionality of brain imaging combined with small sample sizes usually imposes significant challenges on learning algorithms and leads to unstable generalization performance.

On the other hand, behavioral and social markers offer a much more cost-effective option for MCI detection (Dillon et al., 2013; Chapman et al., 2011; H Dodge et al., 2015; Asgari et al., 2017). A recent clinical trial studied differentiating early-stage MCI from NL cohort groups using transcripts of extensive conversations between patients and trained interviewers (H Dodge et al., 2015). In a recent preliminary study (Asgari et al., 2017), the authors trained supervised learning models on the lexical distribution of the conversation and showed that conversational responses of MCI and NL patients take on different distributions under various conversational topics. The success of (Asgari et al., 2017) in predicting MCI from human dialogue introduced an alternative natural language processing (NLP) approach to a classically expensive clinical problem. However, the use of human interviewers still requires a substantial amount of interaction with trained staff, which incurs significant expense in its current form. Thus, the bottleneck questions remain: (1) can we cut down on the amount of conversation needed to achieve accurate prediction, and (2) can we improve upon baseline performance given limited cohort-specific data?

To address the aforementioned questions, in this paper we propose a novel reinforcement learning (RL) framework, which learns an MCI diagnosis agent using only a very limited amount of offline human dialogue transcripts. The learned diagnosis agent can conduct subject-specific conversations with humans, asking questions informed by existing conversations to efficiently sketch the lexical distribution and deliver high-performance MCI prediction. In order to facilitate RL with offline transcripts, we introduce a dialogue simulator pipeline which generates new conversational episodes that are less noisy than, and outperform, the original corpus for MCI prediction. Our dialogue pipeline provides a self-contained framework for directing dialogue generation for diagnostic screening, which can potentially replace the need for human-expert interviews. Our RL-agent learns optimal dialogue strategies that adapt to unseen users, enabling medically relevant NLP data to be generated at large scale if deployed in a realistic setting. Furthermore, data generated from our dialogue simulations may be used for data augmentation or to guide the medical data collection process in the future. Ultimately, by greatly decreasing the cost of data collection and the amount of data needed for high-level performance, we introduce a clinical direction that is much more cost-effective and scalable to large-scale diagnostic screening and data collection. The combination of NLP features with our reinforcement learning framework may extend the process of diagnostic screening well beyond the confines of hospitals and primary care facilities.

2. Related Work

MCI Prediction via Utterance Data. (Asgari et al., 2017) used a classical supervised learning framework to formulate MCI prediction as a binary classification problem. For each interview, a corpus was constructed using only the participant responses to interviewer questions. For each participant, the response corpus over several interviews was preprocessed into feature vectors using the Linguistic Inquiry and Word Count (LIWC) dictionary (Pennebaker et al., 2001). The LIWC dictionary transforms each word in a given corpus into a 69-dimensional feature vector whose latent dimensions represent grammatical and semantic properties of the word. A final 69-dimensional feature vector is then constructed for each participant by aggregating all LIWC vectors over the corpus, yielding an n x 69 feature matrix for n participants. The best performing classifier in this benchmark study uses a linear support vector machine (SVM) with norm regularization (Asgari et al., 2017). The resulting performance is 72.5% AUC over 5-fold validation.
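As a rough illustration of this baseline, the sketch below averages per-word feature vectors into one vector per participant and scores a regularized linear SVM with 5-fold AUC. It is a minimal sketch under stated assumptions: `liwc_lookup` is a hypothetical stand-in for the proprietary LIWC lexicon, and `corpora`, `X`, `y` are illustrative names rather than the benchmark implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def participant_features(corpus_tokens, liwc_lookup, dim=69):
    """Aggregate per-word LIWC-style vectors into one 69-d feature vector."""
    vecs = [liwc_lookup[w] for w in corpus_tokens if w in liwc_lookup]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# X: (n_participants, 69) feature matrix, y: binary MCI/NL labels (hypothetical)
# X = np.stack([participant_features(toks, liwc_lookup) for toks in corpora])
# clf = LinearSVC(C=1.0)  # norm-regularized linear SVM, as in the benchmark study
# print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```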

Dialogue Systems. Dialogue systems provide a natural human-computer interface and have been an active research field for decades. Task-oriented dialogue systems are typically designed for retrieval tasks in which users provide queries and the chat-bot provides appropriate responses based on an external knowledge base (Wen et al., 2016; Dhingra et al., 2016; Chen et al., 2017), or identifies correct answers by looking into vast amounts of documents (Hu et al., 2017; He et al., 2017). Such dialogue systems are typically designed as a pipeline of components, including a language understanding unit that parses the intention and semantics of the human input, a dialogue manager that handles dialogue state tracking and policy learning, and a language generation unit that generates responses (Chen et al., 2017; Schatzmann et al., 2006; Singh et al., 2002). While each of these components can be handcrafted or trained individually, recent advances in deep learning allow end-to-end training (Wen et al., 2016; Dhingra et al., 2016; Li et al., 2017) and significantly improve performance and the capability to adapt to new domains (Bordes and Weston, 2016). End-to-end systems can be trained using supervised learning (Wen et al., 2016; Liu and Lane, 2017) or reinforcement learning (RL), the latter by leveraging a user simulator (Dhingra et al., 2016; Li et al., 2017). The main advantage of RL is that fewer training samples are needed to learn the high-degree-of-freedom deep models. In our work, we design a simulator to enable RL due to the limited amount of clinical data available for supervised training. We note that even though our dialogue system also tries to achieve a task (identifying MCI patients), its nature is radically different from existing task-oriented dialogue systems: its goal is to efficiently sketch a disease-specific lexical distribution by asking subject-specific questions and to output classification results.

Healthcare Applications of Dialogue Systems. Dialogue systems have been widely adopted in the healthcare domain for various applications. For example, chat-bots are available to assist the patient intake process (Ni et al., 2017), retrieve restaurant accommodation information for young adults with food allergies (Hsu et al., 2017), and perform dialogue analysis and generation for mental health treatment (Oh et al., 2017). In the context of Alzheimer's disease research, (Montenegro and Argyriou, 2017) designed a virtual-reality-based chat-bot to evaluate memory loss using predefined questions and answers. (Salichs et al., 2016) discussed applications of chat-bots as caregivers for Alzheimer's patients, providing safety, personal assistance, entertainment, and stimulation. More recently, (Tanaka et al., 2017) introduced a computer avatar that asks a list of pre-defined questions from neuropsychological tests to diagnose dementia. This work is closely related to our system as it utilizes dialogue to glean disease-related information. However, one major issue in this approach is that the questions were obtained from the mini-mental state examination (MMSE) (Tombaugh and McIntyre, 1992), which is a confirmatory measure used to define clinical dementia (such as MCI) rather than a diagnostic tool to predict it. It is more clinically meaningful to identify diagnostic markers associated with the pathological pathways, such as the lexical distribution associated with cognitive changes, for the purpose of diagnostic screening.

3. Methodology

Figure 1. Overview of the proposed methodology. Complete conversations from participants are used to build user simulators. The simulators are then used to train an MCI diagnosis agent (chat-bot), which conducts minimal turns of conversation with participants to sketch the lexical distribution that is then used to perform MCI classification.

The framework we propose in this paper uses reinforcement learning to learn the optimal set of questions to ask participants for the purpose of distinguishing MCI. On the test set, we generate new episodes from these questions for prediction, rather than using the original corpus. To actualize the RL-plus-dialogue-simulation framework, we propose a multi-step implementation that capitalizes on the vast existing body of NLP research. In the following section, we present the details of each component of the dialogue system. Figure 1 shows an overview of the components of our experimental pipeline.

3.1. Overview of Pipeline

Our proposed framework contains three key learning modules: the user simulator, the MCI classifier and the RL-agent. The proposed pipeline is illustrated in Figure 2. First, the user simulator is trained by unsupervised learning to simulate the distributed representation of user responses given feasible question inputs. Next, the MCI classifier predicts the patient label based on the averaged distributed representation of the corpus responses. These two components, together with the dialogue manager, comprise the training environment for the RL-agent. The dialogue manager utilizes the user simulator and MCI classifier to handle state transitions and also computes the reward based on the ground-truth labels from the training set and the MCI classifier prediction. After training in this environment, the RL-agent is able to deliver the optimal sequences of questions for training-set users at various stages of conversation. During testing, the RL-agent produces query inputs to the test-set user simulators, which represent unseen users. Using these new queries, the user simulators generate the corresponding distributed representations of test-set user responses for MCI prediction. In the following subsections, we present each component of the pipeline in detail and demonstrate the effectiveness of the RL framework in improving prediction accuracy while reducing conversational turns.

Figure 2. Illustration of reinforcement learning components in our proposed approach.

3.2. Construction of Turn-Based Dialogue

Since utterance data were collected in the form of conversational transcripts for each participant, we first reconstructed turn-based dialogues from the participant responses. The participant responses were unstructured, while interviewer questions ranged over preset question topics, as illustrated below.

Interviewer: so what did you do yesterday?
Participant: i had yesterday morning i yesterday was a busy day for me. i im forgetting i went to where did i go in the morning. well i went to albertsons yesterday...
Interviewer: what do you see in this picture?
Participant: we got a picture gosh. it looks like my uncle lou. but he never ...
Interviewer: when do you think this picture was taken?
Participant: this picture was probably eighteen seventy or something or nineteen twenty. so he looks too old for war he must have been ...


In total there were well over 150 possible queries from the interviewers. However, for the purposes of this study, we re-compiled the question list into 107 general questions that were ubiquitous across all conversations. A snapshot of the questions is given in Table 1.

Category      Question
Activity      Did you go outside lately?
              So what did you do yesterday?
Social        Did you run into any familiar faces lately?
              Where did you have dinner?
Picture       What do you see in this picture?
              Where do you think this picture was taken?
Tech          How are you with the computer?
              Did you use your computer lately?
Unspecified   <unspecified scheduling comment>
              <unspecified picture comment>
Table 1. Examples of questions from conversations.

We created a total of 16 question categories, including: greetings, activity check, living situation, travel, entertainment, social, picture-related, tech, occupation, hobbies, family, pets, confirmation, clarification, goodbye and unspecified comments. For some of these questions, we delexicalised certain topic words, such as "<activity>" and "<social topic>", in order to (1) control for domain expansion (Henderson et al., 2014) and (2) reduce the model complexity of our user simulators. Previously, (Henderson et al., 2014) and (Liu and Lane, 2017) have shown the effectiveness of delexicalisation in controlling domain expansion in user simulators without sacrificing the contextual meaning of sentence queries; a minimal sketch of this step is given below. Additionally, we created an unspecified comments category, which included comments that deviated from the general question prompts. These comments often result from interviewer follow-up on specific topics mentioned by the user. We consolidated these comments into a single category to distinguish context-specific comments from general questions based on the corpus. However, we do demarcate the type of unspecified comment used by the interviewer; for example, a follow-up comment to an occupational story is tagged <unspecified occupational comment>, whereas a follow-up comment about a health concern is tagged <unspecified health comment>. These comments serve to build rapport and improve the flow of conversation. In future studies we may look to generate user-specific grounding statements for these slots (Chai et al., 2016). Implemented in this way, the corpus is tokenized into turn-based responses to questions for each user.
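The following minimal sketch illustrates the delexicalisation step. The `TOPIC_SLOTS` lexicon and the specific topic words are hypothetical placeholders, not the actual slot inventory used in our question list.

```python
import re

# Hypothetical topic lexicon; the real question list uses 16 categories.
TOPIC_SLOTS = {
    "<activity>": ["gardening", "walking", "shopping"],
    "<social topic>": ["the election", "the news"],
}

def delexicalise(question):
    """Replace concrete topic words with slot tokens such as <activity>."""
    for slot, words in TOPIC_SLOTS.items():
        for w in words:
            question = re.sub(r"\b" + re.escape(w) + r"\b", slot, question)
    return question

# delexicalise("what did you like about gardening?")
# -> "what did you like about <activity>?"
```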

3.3. Unsupervised Learning for User Simulator

To effectively capture a contextual representation of user conversation style, we use vector embeddings of the user corpus at the sentence level (Kiros et al., 2015; Mikolov et al., 2013). Given that we want to capture the flow of the conversation from one response to the next, we adopt skip-thought embeddings, which have shown effectiveness over large corpora by capturing the contextual information of a sentence given its neighboring sentences (Kiros et al., 2015). For encoding sentences, we use a model that was pretrained on the BookCorpus dataset, which contains turn-based conversations from various English novels (Kiros et al., 2015). For the decoder, we train skip-thought vectors to recover the original response of the user during the NLG portion of the pipeline.

Since each user has an individual response style, we train a personalized user simulator for each user. For each user, the conversation corpus is divided into question-response turns; in our dataset, the number of turns per conversation ranged from 30 to 275. We use a multilayer perceptron (MLP) with 2 hidden layers of 512 output nodes each to train the user simulator, and introduce a norm penalty to constrain model complexity. Because the interviewer questions are preset, we use a one-hot encoding of the question, denoted $q_t \in \{0,1\}^{|A|}$, as the input for training. Given the original skip-thought vector $v_t$ of the user's response, the user simulator for user $u$ serves as a function $f_{\theta_u}: \{0,1\}^{|A|} \rightarrow \mathbb{R}^{d}$ mapping the question to a simulated response embedding. The output of the MLP is the skip-thought embedding representation of the utterance, denoted $\hat{v}_t^{(u)} = f_{\theta_u}(q_t)$. Here, $|A|$ denotes the size of our question dictionary, $d$ denotes the dimension of the skip-thought embeddings, $\theta_u$ parameterizes the MLP model for the given user, $u$ denotes the user index and $t$ denotes the turn number. The loss function of the MLP is the mean squared error (MSE) between the MLP output and the original skip-thought vector $v_t$:

$$\mathcal{L}(\theta_u) = \frac{1}{d} \left\| f_{\theta_u}(q_t) - v_t \right\|_2^2 .$$

In the case where questions are not preset, more state-of-the-art methods such as end-to-end recurrent neural network systems can be deployed to train the user simulator instead (Wen et al., 2016; Li et al., 2016b). To evaluate the performance of our user simulator, we compute the mean squared error between the output of the simulator and the original thought-vector representation of the user response at each turn; a minimal training sketch is shown below.
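A minimal sketch of one per-user simulator, assuming scikit-learn's MLPRegressor as a stand-in for our MLP implementation; `question_ids` and `response_vectors` are hypothetical array names.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

N_QUESTIONS, SKIP_DIM = 107, 4800  # question dictionary size, skip-thought dimension

def train_user_simulator(question_ids, response_vectors, alpha=1e-4):
    """Fit one per-user MLP mapping a one-hot question to a skip-thought vector.

    question_ids:     indices of the questions asked in this user's corpus
    response_vectors: (n_turns, SKIP_DIM) skip-thought encodings of the responses
    """
    X = np.eye(N_QUESTIONS)[question_ids]              # one-hot question inputs
    sim = MLPRegressor(hidden_layer_sizes=(512, 512),  # 2 hidden layers of 512 nodes
                       alpha=alpha,                    # norm penalty (regularizer)
                       max_iter=500)
    sim.fit(X, response_vectors)                       # squared-error loss, as in Section 3.3
    return sim

# mse = np.mean((sim.predict(X_test) - V_test) ** 2)   # per-turn evaluation
```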

3.4. Reinforcement Learning Components

Again, let $d$ denote the size of the skip-thought embeddings and $|A|$ denote the size of the question dictionary. We formulate the dialogue and task manager portions of the dialogue system as a standard RL setting in which an agent interacts with an environment over a finite number of steps. At time step $t$, the agent receives a state $s_t$ and samples an action $a_t$ (asks a question) based on its current policy $\pi$.

The environment transitions to the next state $s_{t+1}$ and the agent receives a scalar reward $r_t$. In this setting, the RL-agent tries to learn an optimal policy over all possible states, including ones unseen during training. To do this, the agent has to learn an approximate action-value function, which maps state-action pairs to expected future rewards (Sutton and Barto, 1998). Formally, the action-value function is defined as follows:

$$Q^{\pi}(s, a) = \mathbb{E}\!\left[\sum_{k=0}^{T-t} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a,\ \pi\right],$$

where $\gamma \in [0, 1]$ is a discount factor and $T$ is the maximum number of turns.

Environment: The environment consists of the dialogue manager (DM), the user simulator and the MCI classifier. The DM is composed of the reward and state-generating functions. In previous work, a task manager composed of a database and a query manager (Chen et al., 2017; Wen et al., 2016) is used by the DM to generate observations in retrieval tasks. In our case, however, the user simulator and MCI classifier together play the role of the task manager and are used by the DM to generate observations. Here, the DM uses the MCI classifier to (1) predict probabilities for the MCI and NL classes based on the current moving average of skip-thought vectors at each turn, and (2) predict the label of the current user at the end of the episode for reward calculation. The result of (1) is also used by the agent as part of its internal state representation. The result of (2) is used by the DM for credit assignment over the generated conversational episode. The MCI classifier is trained separately on the training-set corpus before the dialogue system phase.

Action: The RL-agent chooses its actions from a set of discrete actions consisting of the 107 predefined questions, where each question is represented by a one-hot vector in $\{0,1\}^{|A|}$. It is worth noting that we use $a_t$ and $q_t$ to differentiate the action taken by our RL-agent from the question asked during the actual interviews, respectively.

State: The state representation is used by the RL-agent to approximate the action-value function. The state vector has five main components:

  • Skip-thought vector of the utterance at the current turn: $\hat{v}_t$, the output vector from the user simulator given action $a_t$ at turn $t$.

  • Moving average of skip-thought vectors across all utterances in the current episode: $\bar{v}_t$.

  • First hidden-layer weights of the user simulator: $W^{(u)}_1$.

  • Predicted probabilities of the current user for the MCI and NL classes from the classifier.

  • Number of turns above the confidence threshold: $n_c$.

The total dimension of the state vector is the sum of the dimensions of these five components. At each turn, the DM queries the MCI classifier to output a probability vector composed of $p_{NL}$ and $p_{MCI}$, the predicted probabilities of the NL and MCI classes. This 2-dimensional vector keeps track of the classifier's confidence in the MCI prediction based on the current moving average of skip-thought vectors generated over the first $t$ turns. Keeping track of classifier confidence incentivizes the RL-agent to terminate the conversation as soon as it reaches a threshold level of confidence for the prediction task; a minimal sketch of the state construction follows.
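A minimal sketch of how these five components could be concatenated into a single state vector; the variable names are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def build_state(v_t, v_bar, w1_user, p_nl, p_mci, n_c):
    """Concatenate the five state components described above.

    v_t     : skip-thought vector of the current simulated utterance
    v_bar   : running mean of skip-thought vectors in the episode
    w1_user : first-hidden-layer weights of the user simulator
    p_nl, p_mci : classifier probabilities for the NL and MCI classes
    n_c     : number of turns spent above the confidence threshold
    """
    return np.concatenate([v_t, v_bar, w1_user.ravel(),
                           np.array([p_nl, p_mci, float(n_c)])])
```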

Reward: Since we want to minimize the number of dialogue turns, we designed the environment to output a negative reward (-10) at every time step unless a terminal state is reached (e.g., when the agent says "goodbye"). At the terminal state, the reward depends on the classification using the averaged skip-thought vector collected from the episode: if the existing classifier makes the correct prediction, the agent receives a positive reward (1000); otherwise it receives a moderately negative reward (-500). We also set a maximum episode length of $T$ turns (35 in our experiments). Additionally, we add a linearly increasing penalty for each passing turn in which the classifier already predicts either class (MCI/NL) with probability above the confidence threshold; we denote the number of such turns by $n_c$. Formally, the reward function is defined as:

$$r_t = \begin{cases} 1000, & \text{if } t \text{ is terminal and the prediction is correct} \\ -500, & \text{if } t \text{ is terminal and the prediction is incorrect} \\ -10 - \lambda \, n_c, & \text{otherwise,} \end{cases} \tag{1}$$

where $\lambda$ controls the linearly increasing confidence penalty.
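The reward logic can be sketched as follows. Only the -10/+1000/-500 values and the linear form of the confidence penalty are specified above; the `confidence_penalty` weight here is an assumption.

```python
def reward(is_terminal, prediction_correct, n_c, step_penalty=10,
           win=1000, loss=500, confidence_penalty=1.0):
    """Reward signal of Eq. (1); the confidence_penalty weight is an assumption."""
    if is_terminal:
        return win if prediction_correct else -loss
    # -10 per turn, plus a linearly increasing penalty once the classifier
    # is already confident (probability above the threshold for either class).
    return -step_penalty - confidence_penalty * n_c
```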

State transitions: The state transition function has two parts:


  • Within User. The state transition rule between turns is characterized by $P(s_{t+1} \mid s_t, a_t)$: given a policy $\pi$, the probability of the environment transitioning to state $s_{t+1}$ at turn $t+1$ depends only on the current state $s_t$ and the action $a_t$ sampled from $\pi$. Internally, the DM utilizes the user simulator to generate the skip-thought vector $\hat{v}_{t+1}$ from $a_t$.

  • Between Users. In addition to state transitions within episodes, the state-generating function changes between users, leading to different transition probabilities between similar states for different users. To capture this, we apply two changes when training the RL-agent on multiple users: (1) the first hidden-layer weights of each user's simulator are incorporated in the state representation vector so that the RL-agent can distinguish between dissimilar users. Used this way, the user simulator provides a means for the RL-agent to learn similar policies for similar users and dissimilar policies for dissimilar users. (2) During training, both the user simulator and the classifier of the training environment are reset between users by re-initializing the user simulator weights to correspond to the new user.

Deep Q-Networks (DQN).

In this work, the action-value function needs to estimate the expected reward based on the high-dimensional state representation described in the previous section. In order to approximate the action value given different users and the complicated internal state changes during the conversation, we learn a deep Q-network parameterized by $\theta$ to tackle this challenging problem. Learning is conducted by optimizing the following loss function:

$$L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\!\left[\big(y_t - Q(s_t, a_t; \theta)\big)^2\right], \tag{2}$$

with

$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}), \tag{3}$$

where $\theta^{-}$ denotes the parameters of the target Q-network. In order to learn the estimator under complex situations, two key ingredients were proposed in (Mnih et al., 2015): experience replay and a fixed target Q-network. During training, the Q-network ($\theta$) is updated in an online fashion by gradient descent on Eq. (2), while the target Q-network ($\theta^{-}$) is held fixed to compute the target values in Eq. (3) and is only updated after a certain number of iterations, which is essential to the convergence of the Q-network in this work. We also observe that when experience replay samples minibatches from previous experiences to update the Q-network, the training performance stabilizes more consistently.
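A minimal PyTorch-style sketch of one update step of Eqs. (2)-(3) with a fixed target network; this is an illustrative re-implementation of the standard DQN update, not our exact training code.

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step of Eq. (2) using targets from Eq. (3).

    batch: tensors (states, actions, rewards, next_states, terminal_mask)
    """
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():                                       # target uses theta^-
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# target_net.load_state_dict(q_net.state_dict())  # periodic target-network sync
```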

Policy-masking. One challenge in our problem is creating an environment that trains the agent to produce questions that align with the natural flow of conversation. For example, an agent may learn that the question "can you elaborate on that?" is useful for generating a wide distribution of words from the user, but it would not make sense as the first sentence of a conversation or before relevant topics are introduced. To address this, we created a policy-modifying function in which confirmation and clarification type questions are masked from the policy set at turn $t$ if the agent's action history up to turn $t$ does not include any questions from the social, activity, tech, picture-related, hobbies, occupation, travel, entertainment and family categories. At each turn, we keep track of an action history vector and construct a policy-masking vector that is applied element-wise to the agent's Q-value output. Specifically:

$$\tilde{Q}(s_t, a_i) = m_t^{(i)} \, Q(s_t, a_i; \theta), \qquad m_t^{(i)} \in \{0, 1\}, \tag{4}$$

where $m_t^{(i)}$ denotes the $i$-th element of the policy-masking vector $m_t$, $Q(s_t, \cdot\,; \theta)$ represents the action values of all 107 available actions given the current state $s_t$, and $\tilde{Q}(s_t, \cdot)$ is the valid action-value vector after policy masking. To achieve effective masking, we ensure that the elements of $Q(s_t, \cdot\,; \theta)$ are positive by using ReLU (He et al., 2016) as the activation function for the output layer of the Q-network, together with a step of pre-training of the Q-network as described in the following section.
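A minimal sketch of the element-wise masking in Eq. (4); the index sets for follow-up and topic questions are hypothetical placeholders for the actual category grouping.

```python
import numpy as np

N_ACTIONS = 107
# Hypothetical index sets; the real grouping follows the 16 question categories.
FOLLOW_UP_IDS = np.array([3, 7])     # confirmation / clarification questions
TOPIC_IDS = np.array([10, 11, 12])   # social, activity, tech, ... questions

def policy_mask(action_history):
    """Binary mask over the action set, applied element-wise as in Eq. (4).

    action_history: binary vector marking which of the 107 questions were asked so far
    """
    mask = np.ones(N_ACTIONS)
    if not action_history[TOPIC_IDS].any():   # no topic question asked yet
        mask[FOLLOW_UP_IDS] = 0.0             # block follow-up questions
    return mask

# q_valid = policy_mask(history) * q_values   # requires non-negative Q-values (ReLU output)
```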

3.5. Training the RL-Agent

We outline below the training procedure for our RL-agent. To expedite the learning process, we first train the RL-agent on the original corpus from the training set: for each user, we perform an initial pass through the entire corpus using the existing action history to generate episodes, and use these corpus-generated episodes to train the Q-estimator network. This initialization procedure is motivated by previous studies that have cited the effectiveness of pre-training with successful episodes so that the RL-agent can discover large terminal reward signals in games with delayed rewards (Anderson et al., 2015).

  Initialize replay memory D
  Initialize Task Manager with the trained MCI classifier
  Pre-train action-value function Q on corpus-generated episodes
  for each user u in the training set do
     Initialize Environment with User Simulator f_u
     Initialize the reward function with the true label y_u for user u
     for each episode do
        Reset the environment and get the initial state s_1
        for t = 1, ..., T do
           Obtain policy mask m_t as in Eq. (4)
           With probability ε select a random valid action a_t, otherwise select a_t = argmax_a [m_t ⊙ Q(s_t, a; θ)]
           Execute action a_t in the environment, observe reward r_t and state s_{t+1}
           Store transition (s_t, a_t, r_t, s_{t+1}) in D
           Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
           if terminal s_{j+1} then
              y_j = r_j
           else if non-terminal s_{j+1} then
              y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ⁻)
           end if
           Perform a gradient descent step on (y_j - Q(s_j, a_j; θ))² with respect to θ
        end for
     end for
  end for
Algorithm 1 RL-Training Protocol

During training, we hold the target Q-network fixed for minibatch target generation and transfer the weights from the learning Q-network every 50 conversational episodes. During testing, we use the RL-agent to generate new actions for each test-set user. New episodes are then generated by each user simulator from each new action set for prediction. These simulated episodes often differ from the original corpus both in the questions asked by the agent and in the skip-thought responses of the user; a sketch of the masked epsilon-greedy action selection and target-network synchronization is given below.
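A minimal sketch of masked epsilon-greedy action selection and the periodic target-network synchronization described above; the epsilon value and variable names are illustrative assumptions.

```python
import numpy as np

def select_action(q_values, mask, epsilon=0.1, rng=np.random):
    """Epsilon-greedy selection restricted to actions allowed by the policy mask."""
    allowed = np.flatnonzero(mask)
    if rng.random() < epsilon:
        return rng.choice(allowed)              # explore among valid actions only
    masked_q = mask * q_values                  # Eq. (4): element-wise masking
    return allowed[np.argmax(masked_q[allowed])]

# Every 50 conversational episodes, copy the learning network into the target:
# if episode % 50 == 0:
#     target_net.load_state_dict(q_net.state_dict())
```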

4. Experiments

Evaluation of dialogue systems differs widely depending on the task. Previous work typically uses metrics such as perplexity and average reward per response to measure the quality of the natural language generation (NLG) phase of the dialogue system (Chen et al., 2017; Wen et al., 2016; Schatzmann et al., 2006). However, because the utility of our framework comes from the quality of the questions that the chat-bot generates for the off-conversation task, we propose an evaluation framework that emphasizes the agent's off-conversation performance. We gauge the utility of the dialogue system by its ability to improve (1) prediction accuracy against baseline techniques and (2) the number of turns needed to make an accurate prediction.

Data. Data used for this study were obtained from a randomized controlled behavioral clinical trial designed to ascertain the effect of unstructured conversation on cognitive functions. Details of the study protocol are explained in (Dodge et al., 2015). In this clinical study, conversational data were collected in QA format for each participant during web-cam interviews with trained interviewers. Each participant was interviewed multiple times over the course of 4-6 weeks, and dialogue responses were transcribed for each interview session (Asgari et al., 2017). On average, there are 2.81 conversational episodes per participant, and each conversation lasted between 30 and 45 minutes (Asgari et al., 2017; Dodge et al., 2015). MCI labels were generated using clinical assessment of each participant's cognitive status by medical professionals (Asgari et al., 2017; Dodge et al., 2015).

Baselines vs. RL Performance. We first compare the performance of several baseline classifiers for the MCI prediction task. For our specific dataset, (Asgari et al., 2017) had previously achieved a benchmark performance of 72.5 AUC on 5-fold validation using a norm-regularized linear SVM and feature engineering with the Linguistic Inquiry and Word Count (LIWC) dictionary (Asgari et al., 2017). LIWC embeds each word into a 69-dimensional vector space with each dimension representing a latent feature of the English language (Asgari et al., 2017). Since 2013, various contextual representations of words and sentences have been proposed, many of which outperform classical rule-based contextual embedding techniques (Mikolov et al., 2013; Kiros et al., 2015). Distributed representations such as Word2Vec allow more flexible and corpus-dependent latent features to be learned for individual words (Mikolov et al., 2013). More recently, Skip-thought vectors (Kiros et al., 2015) have risen to popularity due to their ability to embed entire sentences into "thought vectors" that capture contextual meaning and syntactic information from neighboring sentences. For this reason, we compare various word and phrase embedding techniques to establish new baseline performances for our classification task.

Model       Feature   AUC            Sen.           Specificity    F1-Score
LR +        RD
RFC         RD
SVM +       RD
SVM +       RD
MLP         RD
LR +        W2V
RFC         W2V
SVM +       W2V
SVM +       W2V
MLP         W2V
LR +        LIWC
RFC         LIWC
SVM +       LIWC
SVM +       LIWC
MLP         LIWC
LR +        SKP
RFC         SKP
SVM +       SKP
SVM +       SKP       0.797          0.660          0.933          0.716
MLP         SKP
RL (T=1)    SKP       0.607 ± 0.109  0.380 ± 0.166  0.833 ± 0.134  0.447 ± 0.172
RL (T=3)    SKP       0.706 ± 0.092  0.500 ± 0.205  0.911 ± 0.097  0.583 ± 0.154
RL (T=5)    SKP       0.707 ± 0.072  0.480 ± 0.133  0.933 ± 0.102  0.594 ± 0.129
RL (T=10)   SKP       0.772 ± 0.115  0.600 ± 0.237  0.944 ± 0.102  0.683 ± 0.186
RL (T=15)   SKP       0.798 ± 0.115  0.640 ± 0.265  0.956 ± 0.102  0.714 ± 0.190
RL (T=20)   SKP       0.798 ± 0.121  0.640 ± 0.250  0.956 ± 0.102  0.719 ± 0.190
RL (T=25)   SKP       0.808 ± 0.111  0.660 ± 0.254  0.956 ± 0.102  0.732 ± 0.184
RL (T=30)   SKP       0.808 ± 0.119  0.660 ± 0.269  0.956 ± 0.102  0.730 ± 0.190
RL (T=35)   SKP       0.818 ± 0.102  0.680 ± 0.204  0.956 ± 0.102  0.761 ± 0.140
  • Here, LR denotes a sparse logistic regression classifier, RFC denotes a random forest classifier, SVM denotes support vector machines, and MLP denotes a multi-layer perceptron. For the feature representation of the corpus, RD represents the raw distribution of word counts, W2V denotes averaged 300-dimensional Word2Vec embeddings across all words appearing in the corpus for each user (Mikolov et al., 2013), LIWC denotes the original rule-based embedding used by (Asgari et al., 2017), and SKP denotes averaged 4800-dimensional Skip-Thought vectors across all turn-based responses for each user (Kiros et al., 2015).

Table 2. Performance of baselines vs. RL on MCI prediction over 10 stratified shuffle splits

The first four sections of Table 2 show the performance of these baseline classifiers. Using the original LIWC representation, we were able to recover close to the 72.5 AUC baseline from the original paper using the SVM and LR classifiers. When implementing skip-thought embedding, we used the pre-trained skip-thought encoders of (Kiros et al., 2015) to embed each user response across all conversational turns. The encoder was pre-trained on the BookCorpus dataset, a large collection of novels spanning numerous literary genres. The advantage of pre-training on this dataset is that BookCorpus contains an abundance of turn-based dialogues between various character types; these conversations capture a wide range of conversational response styles, idiosyncrasies and temperaments. As seen in Table 2, the best performing baseline model was the norm-regularized SVM classifier using Skip-Thought embeddings as features. For this reason, we chose this classifier for the RL portion of our pipeline. As a baseline reference, we also included performance using raw word count distributions for all models.

We then evaluate the performance of our RL-agent across 10 stratified shuffle splits; each split uses 65% of the data for training and 35% for testing (a sketch of this evaluation protocol is shown below). We compare the performance of the RL-agent when manually restricting the number of questions to 1, 3, 5, 7, 10, 15, 20, 25, 30 and 35. By restricting the number of turns, we can observe the number of questions needed to recover the original baseline performance using the SVM classifier.
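A minimal sketch of this evaluation protocol using scikit-learn, assuming hypothetical feature matrix `X`, labels `y` and classifier `clf`.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score

# 10 stratified shuffle splits, 65% train / 35% test, as used in our evaluation.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.35, random_state=0)
aucs = []
# for train_idx, test_idx in splitter.split(X, y):
#     clf.fit(X[train_idx], y[train_idx])
#     scores = clf.decision_function(X[test_idx])
#     aucs.append(roc_auc_score(y[test_idx], scores))
# print(np.mean(aucs), np.std(aucs))
```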

Figure 3. RL-Agent vs. Baseline w/ Variation on Turns

The last section of Table 2 shows the performance of the RL-agent under various turn constraints. Here, the notation RL(T=n) denotes that the agent is allowed to ask n questions before a prediction is made from the simulated user responses. It is important to note that turn 0 was set to greetings by default and was not counted toward the conversation.

We see from the turn-constrained conditions that the performance of our RL-agent surpassed the baseline performance at 25 questions and achieved comparable performance using only 15 questions. At the full conversation length of 35 turns, we achieve 0.818 AUC, an improvement upon current and previous baselines. In comparison, the mean number of conversational turns per user in the original corpus was 105.71. Additionally, since 2.81 conversations were conducted per user on average, we adjusted the number of turns allowed based on the mean number of turns per conversation, which was 37.36 per user. For this reason, we set the upper bound constraint to 35 questions, which is just slightly less than a full conversation with the user.

Figure 3 visualizes this relationship between performance and number of questions asked by the RL-agent. We see that performance improvements with additional questions saturate after 15 questions. This was expected, as the highest-yield questions discovered by the RL-agent were asked first during test conversations.

Evaluation of User Simulators. User simulators play the pivotal role of simulating the user response in the RL training environment (Schatzmann et al., 2006; Li et al., 2016a). In previous work, user simulators are evaluated based on the accuracy of the generated user query given unseen bot responses (Li et al., 2016a; Schatzmann et al., 2006). Metrics such as BLEU and perplexity are used at the NLG phase of dialogue, as the generation of the user query is pivotal in retrieval-type training systems.

In our case, however, the goal of the user simulator is quite different: the RL-agent is responsible for generating queries, while the output of the user simulator is an encoded thought vector of the user response, which is then used for state representation and downstream prediction. For this reason, we evaluate the user simulator not on the decoding portion of the dialogue system, but on its ability to generate accurate thought-vector versions of the responses.

Figure 4. Distribution of mean squared error (MSE) across all user simulators.

We compute the mean squared error (MSE) between the corpus Skip-Thought vector and the user simulator prediction at each turn, and average the resulting MSE scores across all turns of the conversation. Given that each user has on average 2.81 conversations, we evaluate the user simulator in a leave-one-out fashion: for each user, the simulator is trained on all conversations except for the last one, which is used for evaluation (as sketched below). Figure 4 visualizes the performance of the user simulators. The mean MSE is 0.00495 ± 2.93E-06, averaged across all test-set performances.
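A minimal sketch of this leave-one-out evaluation for a single held-out conversation; variable names are illustrative.

```python
import numpy as np

def simulator_mse(sim, question_ids, true_vectors, n_questions=107):
    """MSE between simulated and corpus skip-thought vectors, averaged over the
    turns of one held-out conversation."""
    X = np.eye(n_questions)[question_ids]
    pred = sim.predict(X)
    return float(np.mean((pred - true_vectors) ** 2))

# For each user: train on all but the last conversation, evaluate on the last one.
```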

Top-Performing Policies. It is interesting to note that the episodes simulated by our RL-agent were able to provide a performance boost for the prediction task. In this section, we look qualitatively at the types of questions asked by the RL-agent within the first 5, 10, 15, 20 and 35 turns, in comparison with the original corpus. We also compare the prediction performance at @5, @10, @20, @30 and @35 turns with the performance obtained using the first 5, 10, 20, 30 and 35 responses of the original corpus. Again, we note that responses to greeting and parting queries such as "Hi" and "goodbye" are not counted toward prediction.

Model        AUC            Sen            Spec           F1-Score
Corpus@5     0.504 ± 0.070  0.120 ± 0.098  0.889 ± 0.099  0.175 ± 0.145
Corpus@10    0.513 ± 0.076  0.160 ± 0.174  0.867 ± 0.130  0.193 ± 0.200
Corpus@20    0.614 ± 0.077  0.340 ± 0.254  0.889 ± 0.131  0.382 ± 0.223
Corpus@30    0.658 ± 0.121  0.360 ± 0.233  0.956 ± 0.056  0.460 ± 0.266
Corpus@35    0.699 ± 0.125  0.420 ± 0.244  0.978 ± 0.044  0.539 ± 0.248
@5           0.707 ± 0.072  0.480 ± 0.133  0.933 ± 0.102  0.594 ± 0.129
@10          0.772 ± 0.115  0.600 ± 0.237  0.944 ± 0.102  0.683 ± 0.186
@20          0.798 ± 0.121  0.640 ± 0.250  0.956 ± 0.102  0.719 ± 0.190
@30          0.808 ± 0.119  0.660 ± 0.269  0.956 ± 0.102  0.730 ± 0.190
@35          0.818 ± 0.102  0.680 ± 0.204  0.956 ± 0.102  0.761 ± 0.140
Table 3. Prediction at @5, @10, @20, @30 and @35 turns

As shown in Table 3, the optimal policy learned by our framework outperformed the original corpus for each turn constraint. For example, when our RL-agent asked only 5 questions of the test-set users, the classifier was able to achieve 0.707 AUC and 0.594 F1 using the simulated responses. In contrast, using the first 5 questions from the original corpus for each test-set user produced 0.504 AUC and 0.175 F1. When using the first full-length conversation of 35 turns, the original corpus recovers an AUC score of 0.699, which is far from the performance at @35. In Table 4, we rank the most frequently appearing questions at @5, @10, @15 and @20.

Turns   Question                                         Count
1-5     when did you start working?                       40
1-5     so how long did you go out for?                   37
1-5     when did you meet your SO?                        28
1-5     <unspecified hobby comment>                       24
1-5     what did you like about <activity>?               24
6-10    what was <occupation> like for you?               30
6-10    <unspecified tech comment>                        28
6-10    when did <tech problem> start?                    22
6-10    what do you see in this picture?                  19
6-10    <unspecified picture comment>                     19
10-15   what is your opinion on <social topic>?           42
10-15   did you see any shows lately?                     38
10-15   how many people do you think can fit in this?     33
10-15   what you were doing during this time period?      30
10-15   what type of <hobby> do you do?                   28
15-20   <goodbye>                                         27
15-20   where did you meet your SO?                       25
15-20   did you enjoy school?                             24
15-20   anyone visit you lately?                          24
15-20   what was the show about?                          20
Table 4. Most frequent questions at @5, @10, @15 and @20

@5. The most effective question at @5 appears to be "when did you start working?". In the context of our problem, this question seems to generate the most polarizing responses from the cohort. We also see that the RL-agent included a few elaboration questions, such as "what did you like about <activity>?" and "why did you do that?", for some users to expand upon previous responses. From the clinical perspective, it is also interesting to note that the RL-agent picks questions such as "what did you do yesterday?" and "how long did you go out for?", which are similar to questions used clinically to assess immediate recall in MCI patients (Folstein et al., 1975).

@10. As seen in @5, occupational questions were the most popular topic asked by the RL-agent. This is also the case with @10, where the RL-agent follows up the previous query with an elaboration question regarding past occupational experiences. It is interesting to note that the RL-agent transitions to picture-related questions, which are often used by the clinical interviewers to facilitate creative responses by participants (Asgari et al., 2017).

We also observe the RL-agent asking questions such as "<unspecified tech comment>" and "when did <tech problem> start?". These were frequently asked questions in the original dialogues, as technical difficulties with connections and webcams were often encountered during the interviews (Asgari et al., 2017). Unfortunately, the responses to these questions vary greatly and may at times be verbose. The RL-agent did not seem to recognize this caveat during training.

@20. As we approach questions 11 through 20, we arrive at the mid- to late-dialogue phase for most conversations. Overall, we observe a wider spread of topics during this portion of the conversation. The most polarizing question asked at this stage was "what is your opinion on <social topic>?". Here, we used the delexicalised slot (Liu and Lane, 2017) <social topic> to reduce model complexity, but the slot may be substituted with a wide range of social topics, from political trends to recent news.

Additionally, we observe that the RL-agent learns to say “goodbye” to terminate the conversation early in numerous cases. As mentioned previously, we designed the state function to include the predicted probability [0.0-1.0] of MCI by the classifier at each time-step. The environment penalizes the agent for additional turns in which the prediction probability exceeds 0.65 for either class. By opting to terminate the episode, the RL-agent learns to avoid dragging on the dialogue unnecessarily in cases where it is confident in the prediction.

One notable question at @20 is "how many people do you think can fit in this?". This is actually a picture-specific question related to one of the more provocative pictures. In fact, we confirmed from the original corpus that it generated more follow-up responses from users than other picture-related questions such as "when do you think this picture was taken?" and "interesting, what makes you say that?". By ranking this question highly, the RL-agent indirectly prioritizes this picture over others in generating user responses. This exemplifies how the ranking of questions by the learned policy may be used to direct the future data collection process.

@35. When approaching the end of conversations, we notice that the questions asked by the agent were spread more evenly among the remaining choices. For this reason, we rank only the top 10 questions during the final 15 turns of simulated conversations.

Rank   Question                                        Count
1      what is your opinion on using <new tech>?        112
2      did you do anything else?                        106
3      so how long did you go out for?                   98
4      what you were doing during this time period?      95
5      when do you think this picture was taken?         95
6      <goodbye>                                         94
7      anything new with you lately?                     91
8      what did you like about it?                       85
9      <unspecified picture comment>                     76
10     how often do you <do activity>?                   72


In this latter portion of the conversation, we note that the RL-agent utilized more elaboration questions such as "what did you like about it?" and "how often do you <do activity>?". We also see that technology-related questions such as "what is your opinion on using <new tech>?" are included more often compared to topics such as occupation or social items. This indicates that tech-related questions may not be as high-yield in distinguishing MCI responses, as these questions are prioritized later in the conversation by the RL-agent.

5. Discussion and Conclusion

In this paper, we introduce an RL framework for approaching a classically supervised learning problem in clinical medicine, where the data are often noisy, scarce, and prohibitively expensive to obtain. We show that a properly trained RL framework can (1) greatly cut down on the amount of data needed to make accurate predictions, and (2) synthesize relevant new data to improve performance.

To achieve this framework, we proposed a multi-step approach which capitalizes on the vast existing knowledge of human language and NLP research. First, we used a state-of-the-art distributed representation to preprocess our data. We then set up a simulation environment for reinforcement learning, using supervised learning to create customized user simulators. Lastly, we utilized the trained RL-agent to generate new question sequences that elicit more targeted responses for our prediction task.

A careful examination of the optimal policies discovered by our agent demonstrates that the overall framework is self-contained for directing dialogue generation for diagnostic screening, which can potentially replace the need for trained interviewers. Our trained RL-agent is able to discover relevant questions to ask users with whom it has no prior interaction experience. We also show various clinical insights that can be deduced from observing the ranking of questions under various turn constraints.

In order for this framework to be effectively deployed in a realistic setting, a user simulator that can be trained online and in real time should be considered. In its current form, our user simulators are trained offline, which may not scale to larger corpora and user volumes. Additionally, a natural language generation phase may be needed to make the questions more adaptable to the natural flow of human conversation. These will be areas of research we explore in future studies.

Acknowledgements.
This material is based in part upon work supported by the National Science Foundation (IIS-1565596, IIS-1615597), National Institute on Aging (R01AG056102, R01AG051628, R01AG033581, P30AG008017, P30AG053760) and Office of Naval Research (N00014-17-1-2265).

References

  • Anderson et al. (2015) Charles W Anderson, Minwoo Lee, and Daniel L Elliott. 2015. Faster reinforcement learning after pretraining deep networks to predict state dynamics. In IJCNN. IEEE, 1–7.
  • Asgari et al. (2017) Meysam Asgari, Jeffrey Kaye, and Hiroko Dodge. 2017. Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer’s & Dementia: Translational Research & Clinical Interventions 3, 2 (2017), 219–228.
  • Bordes and Weston (2016) Antoine Bordes and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683 (2016).
  • Chai et al. (2016) Joyce Y Chai, Rui Fang, Changsong Liu, and Lanbo She. 2016. Collaborative Language Grounding Toward Situated Human-Robot Dialogue. AI Mag 37, 4 (2016).
  • Chapman et al. (2011) Robert M Chapman, Mark Mapstone, John W McCrary, Margaret N Gardner, Anton Porsteinsson, Tiffany C Sandoval, Maria D Guillily, Elizabeth DeGrush, and Lindsey A Reilly. 2011. Predicting conversion from mild cognitive impairment to Alzheimer’s disease using neuropsychological tests and multivariate methods. Journal of Clinical and Experimental Neuropsychology 33, 2 (2011), 187–199.
  • Chen et al. (2017) Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A Survey on Dialogue Systems: Recent Advances and New Frontiers. arXiv preprint arXiv:1711.01731 (2017).
  • Cummings et al. (2007) Jeffrey L Cummings, Rachelle Doody, and Christopher Clark. 2007. Disease-modifying therapies for Alzheimer disease Challenges to early intervention. Neurology 69, 16 (2007), 1622–1634.
  • Dhingra et al. (2016) Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2016. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777 (2016).
  • Dillon et al. (2013) Carol Dillon, Cecilia M Serrano, Diego Castro, Patricio Perez Leguizamón, Silvina L Heisecke, and Fernando E Taragano. 2013. Behavioral symptoms related to cognitive impairment. Neuropsychiatric disease and treatment 9 (2013), 1443.
  • Dodge et al. (2015) Hiroko H Dodge, Jian Zhu, Nora C Mattek, Molly Bowman, Oscar Ybarra, Katherine V Wild, David A Loewenstein, and Jeffrey A Kaye. 2015. Web-enabled conversational interactions as a method to improve cognitive functions: Results of a 6-week randomized controlled trial. Alzheimer’s & dementia: translational research & clinical interventions 1, 1 (2015), 1–12.
  • Folstein et al. (1975) Marshal F Folstein, Susan E Folstein, and Paul R McHugh. 1975. “Mini-mental state”: a practical method for grading the cognitive state of patients for the clinician. Journal of psychiatric research 12, 3 (1975), 189–198.
  • Gauthier et al. (2006) Serge Gauthier, Barry Reisberg, Michael Zaudig, Ronald C Petersen, Karen Ritchie, Karl Broich, Sylvie Belleville, Henry Brodaty, David Bennett, Howard Chertkow, et al. 2006. Mild cognitive impairment. The Lancet 367, 9518 (2006), 1262–1270.
  • H Dodge et al. (2015) Hiroko H Dodge, Nora Mattek, Mattie Gregor, Molly Bowman, Adriana Seelye, Oscar Ybarra, Meysam Asgari, and Jeffrey A Kaye. 2015. Social markers of mild cognitive impairment: Proportion of word counts in free conversational speech. Current Alzheimer Research 12, 6 (2015), 513–519.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In ECCV. Springer, 630–645.
  • He et al. (2017) Wei He, Kai Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, et al. 2017. DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. arXiv preprint arXiv:1711.05073 (2017).
  • Heister et al. (2011) D Heister, James B Brewer, Sebastian Magda, Kaj Blennow, Linda K McEvoy, Alzheimer’s Disease Neuroimaging Initiative, et al. 2011. Predicting MCI outcome with clinically available MRI and CSF biomarkers. Neurology 77, 17 (2011), 1619–1628.
  • Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Steve Young. 2014. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In SLT. IEEE, 360–365.
  • Hsu et al. (2017) Paris Hsu, Jingshu Zhao, Kehan Liao, Tianyi Liu, and Chen Wang. 2017. AllergyBot: A Chatbot Technology Intervention for Young Adults with Food Allergies Dining Out. In CHI. ACM, 74–79.
  • Hu et al. (2017) Minghao Hu, Yuxing Peng, and Xipeng Qiu. 2017. Reinforced mnemonic reader for machine comprehension. CoRR, abs/1705.02798 (2017).
  • Jack Jr et al. (2010) Clifford R Jack Jr, David S Knopman, William J Jagust, Leslie M Shaw, Paul S Aisen, Michael W Weiner, Ronald C Petersen, and John Q Trojanowski. 2010. Hypothetical model of dynamic biomarkers of the Alzheimer’s pathological cascade. The Lancet Neurology 9, 1 (2010), 119–128.
  • Johnson et al. (2012) Keith A Johnson, Nick C Fox, Reisa A Sperling, and William E Klunk. 2012. Brain imaging in Alzheimer disease. Cold Spring Harbor perspectives in medicine 2, 4 (2012), a006213.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems. 3294–3302.
  • Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155 (2016).
  • Li et al. (2017) Xuijun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. 2017. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008 (2017).
  • Li et al. (2016b) Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016b. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688 (2016).
  • Liu and Lane (2017) Bing Liu and Ian Lane. 2017. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. arXiv preprint arXiv:1708.05956 (2017).
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
  • Montenegro and Argyriou (2017) Juan Manuel Fernandez Montenegro and Vasileios Argyriou. 2017. Cognitive evaluation for the diagnosis of Alzheimer’s disease based on turing test and virtual environments. Physiology & behavior 173 (2017), 42–51.
  • Ni et al. (2017) Lin Ni, Chenhao Lu, Niu Liu, and Jiamou Liu. 2017. MANDY: Towards a Smart Primary Care Chatbot Application. In International Symposium on Knowledge and Systems Sciences. Springer, 38–52.
  • Oh et al. (2017) Kyo-Joong Oh, Dongkun Lee, Byungsoo Ko, and Ho-Jin Choi. 2017. A Chatbot for Psychiatric Counseling in Mental Healthcare Service Based on Emotional Dialogue Analysis and Sentence Generation. In MDM. IEEE, 371–375.
  • Olazaran et al. (2004) J Olazaran, Rubén Muñiz, B Reisberg, J Peña-Casanova, T Del Ser, AJ Cruz-Jentoft, P Serrano, E Navarro, ML García de la Rocha, A Frank, et al. 2004. Benefits of cognitive-motor intervention in MCI and mild to moderate Alzheimer disease. Neurology 63, 12 (2004), 2348–2353.
  • Pennebaker et al. (2001) James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates 71, 2001 (2001), 2001.
  • Salichs et al. (2016) Miguel A Salichs, Irene P Encinar, Esther Salichs, Álvaro Castro-González, and María Malfaz. 2016. Study of scenarios and technical requirements of a social assistive robot for alzheimer’s disease patients and their caregivers. International Journal of Social Robotics 8, 1 (2016), 85–102.
  • Schatzmann et al. (2006) Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The knowledge eng. rev. 21, 2 (2006), 97–126.
  • Singh et al. (2002) Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker. 2002. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research 16 (2002), 105–133.
  • Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. Vol. 1. MIT press Cambridge.
  • Tanaka et al. (2017) Hiroki Tanaka, Hiroyoshi Adachi, Norimichi Ukita, Manabu Ikeda, Hiroaki Kazui, Takashi Kudo, and Satoshi Nakamura. 2017. Detecting Dementia Through Interactive Computer Avatars. IEEE journal of translational engineering in health and medicine 5 (2017), 1–11.
  • Tombaugh and McIntyre (1992) Tom N Tombaugh and Nancy J McIntyre. 1992. The mini-mental state examination: a comprehensive review. J. of the Ame. Geriatrics Soc. 40, 9 (1992), 922–935.
  • Wang et al. (2016) Qi Wang, Liang Zhan, Paul M Thompson, Hiroko H Dodge, and Jiayu Zhou. 2016. Discriminative fusion of multiple brain networks for early mild cognitive impairment detection. In ISBI. IEEE, 568–572.
  • Wen et al. (2016) Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562 (2016).
  • Zhan et al. (2015) Liang Zhan, Yashu Liu, Yalin Wang, Jiayu Zhou, Neda Jahanshad, Jieping Ye, and Paul Matthew Thompson. 2015. Boosting brain connectome classification accuracy in Alzheimer's disease using higher-order singular value decomposition. Frontiers in neuroscience 9 (2015), 257.