Measuring Conversational Productivity in Child Forensic Interviews

06/08/2018 · Victor Ardulov, et al. · University of Southern California

Child Forensic Interviewing (FI) presents a challenge for effective information retrieval and decision making. The high stakes associated with the process demand that expert legal interviewers establish an effective channel of communication and elicit substantive knowledge from the child-client while minimizing the potential for re-experiencing trauma. As a first step toward computationally modeling and producing quality spoken interviewing strategies and a generalized understanding of interview dynamics, we propose a novel methodology to computationally model effectiveness criteria, applying summarization and topic modeling techniques to objectively measure and rank the responsiveness and conversational productivity of a child during FI. We score information retrieval by constructing an agenda to represent general topics of interest, measuring the alignment of a given response with that agenda, and leveraging lexical entrainment to capture responsiveness. For comparison, we present our methods alongside traditional evaluation metrics and discuss the use of prior information for generating situational awareness.


1 Introduction

A Forensic Interview (FI) with a child involves a legal expert navigating a semi-structured interaction with the objective of eliciting substantive and detailed information regarding abuse or neglect that the child might have witnessed or been a victim of. Given the potential criminal underpinnings, risks, and consequences associated with these spoken interviews, it is critical that they are conducted in a way that evokes genuine information from a possible victim. We explore the role of speech and language processing in providing supporting analytical tools for this domain.

Since FI is conducted with a potential victim of criminal maltreatment, it runs the risk of emotionally impacting the child (e.g., having the victim re-experience trauma). Establishing a sense of trust and comfort with the child strongly impacts the outcome of the interview [1, 2]. FI with children is further challenged by the fact that a child's age, cognitive development, and state may impact their ability to provide a complete and accurate response. For instance, [3] demonstrated that sexually-abused children were more likely to deny the abuse during their first line of questioning. This places tremendous importance on developing rapport-building strategies that can adequately capture the multi-faceted complexity of the objective and its constraints.

In order to effectively research and develop best practices for FI, an objective method for assessing the quality of interactions is necessary. A number of contemporary studies [4, 5, 6] have evaluated interactions using a "productivity" score for child responses. The dominant metrics are utterance word count along with a variety of human-coded measures such as "richness" [7], which indicates the number of informative details provided [8]. In this paper, we address the limitations of these methods and present a novel approach that identifies momentary instances of verbal productivity and evaluates each utterance.

This work both builds upon existing NLP methods and presents new metrics to computationally model topics presented in goal-oriented dialogue, extending them to a mathematical formulation and framework for extracting the conversational "agenda" that represents the information the interviewer wishes to address. We present a number of scoring methods that leverage both a child's responsiveness to an immediate prompt and the alignment of utterances with the agenda. To the best of our knowledge, this is the first approach to scoring verbal recall productivity using computational metrics, and we believe it has the potential to strongly impact the field of FI by giving researchers an objective measure which they can optimize, score, and use to inform their decision making.

2 Background

FI of children is an important and active area of research in psychology and law. The situational context is high stakes, as children are frequently the victims of crimes perpetrated by their caretakers or legal guardians [9], creating a conflict for both interviewers and the child. Furthermore, children are often the only witnesses to the abuse of others [10], making their recall and the accounts they provide during interviews crucial to the legal process. Speech and language analysis performed in [8] and [11] details the effectiveness of different linguistic features for questions and shows that contemporary interviewing research relies on word count and human coding of new relevant information.

Word count is generally recorded as a metric of responsiveness during rapport building. Aside from establishing a comfortable setting, rapport-building phases are important for practicing narrative techniques with children [12]. However, Figure 1 shows that expressiveness in terms of per-turn average word count, and its variability per group, grows steadily with age. Generally, though, we see that most children exhibit relatively low variance in the number of words they express during a session. Thus, word count is more likely reflective of an individual's language usage and ability than an indicator of expressiveness or narrative structuring. By introducing a computational metric that captures responsiveness as a measure of topic expression, we hope to facilitate evaluation with higher fidelity. Furthermore, the ability to automate interview evaluation alleviates some of the need for human involvement in data creation and labeling, making the process more scalable, faster, and less subjective.

Figure 1: Expressiveness (in terms of average turn-level word count) of children across age groups collected over 527 Forensic Interview Transcripts
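For concreteness, the word-count baseline can be computed directly from a turn-segmented transcript. Below is a minimal sketch; the transcript schema and field names are illustrative assumptions, not the format of our corpus.

```python
# Minimal sketch of the word-count baseline: each child turn is scored
# by its raw token count. The transcript schema here is an assumption.

def word_count_scores(transcript):
    """Return the word count of every child response, in turn order."""
    return [len(turn["text"].split())
            for turn in transcript
            if turn["speaker"] == "child"]

# Toy usage:
transcript = [
    {"speaker": "interviewer", "text": "Tell me about your school."},
    {"speaker": "child", "text": "I like to play with my friends."},
]
print(word_count_scores(transcript))  # [7]
```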

The importance of social cues and rapport during an interview makes it similar in nature to psychotherapeutic (PT) dialogues such as motivational interviewing, where the interviewer (counselor) guides the subject (client) toward a goal through appropriate dialogue actions [13]. Both FI and PT require a foundation of trust, which is built over the course of the interaction and does not always align with the direct objectives of the dialogue. In psychotherapy, trust is used to promote behavioral change in the client, typically over a series of sessions spanning a long period of time. The legal setting surrounding FI, however, limits the engagement opportunities interviewers have, and is more focused on maximizing information elicitation from the subject without coercion.

Previous work on dialogue summarization shows success in tracking and modeling topics discussed in dyadic conversations. One source of inspiration for our work is the topic modeling performed in DiaSumm [14], which uses tf-idf and TextTiling [15] to find topics and their boundaries in the CALLHOME dataset. However, since we are tasked with scoring the relevance of a response, our approach uses only the interviewer's side of the dialogue to extract the topics.

The work presented in [16] takes advantage of knowledge graphs (e.g., Wikipedia) to improve the robustness of dialogue summarization and topic identification. However, within the specific application of child FI, the vocabulary used by the client is strongly influenced by the child's language development. As a result, semantic representations from these large knowledge graphs do not necessarily contain information that can improve the topic modeling of the dialogues.

Finally, both Question-Answering Systems (QASs) and Machine Dialogue Systems (MDSs) provide frameworks for evaluating machine-generated responses. Many flavors of QAS exist, both natural-language-understanding based and structured-query based, but both are assessed on whether the response was correct with reference to a ground truth [17]. MDSs are generally evaluated for "human-like" linguistic features using word perplexity as a first pass, followed by human analysis of the generated responses [18, 19]. When evaluating goal-oriented dialogue systems such as [20], the response is scored by the final objective measure of the system's ability to outperform a human in a negotiation. Ultimately, both domains offer relevant methodologies for scoring a dialogue, but neither presents a specific criterion applicable to the context of child FI.

In juxtaposition, this paper presents a scoring metric that evaluates dialogue interactions with neither a known ground truth to elicit nor a known objective metric being maximized through the interaction. The methodologies presented in Sections 3.1 and 3.2 address the limitations of human-coding scalability and word-count applicability. Our results demonstrate that, despite these challenges, a computationally motivated metric generates robust and informative values indicative of interview effectiveness.

3 Methodology

Let us introduce some notation to begin. An interview is denoted as a sequence of questions ($q_t$) and response phrases ($r_t$), indexed by the turn $t$ at which they occur. An interviewer has a vocabulary $V$, which is the collection of unique $n$-grams used by the interviewer over the session. We arbitrarily chose $n = 3$ and exclude stop words, as they by definition do not carry meaning.
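As an illustration, a minimal sketch of the vocabulary construction is given below; the tokenizer and stop-word list are placeholder assumptions, since the preprocessing pipeline is not specified here.

```python
# Sketch of building the interviewer vocabulary V: unique n-grams
# (n <= 3) over the interviewer's questions, with stop words removed.
# Tokenization and the stop-word list are simplifying assumptions.

STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "you", "did"}

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(questions, max_n=3):
    """Collect the unique n-grams (up to max_n) used by the interviewer."""
    vocab = set()
    for question in questions:
        tokens = [t for t in question.lower().split() if t not in STOP_WORDS]
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return sorted(vocab)
```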

3.1 Agenda

We define an agenda $A$ for an interview session in the following way:

$$A = \big[\, \mathrm{tf}(w_1),\ \mathrm{tf}(w_2),\ \ldots,\ \mathrm{tf}(w_{|V|}) \,\big],$$

where $\mathrm{tf}(\cdot)$ is a function that constructs a vector of term frequencies, $\mathrm{tf}(w_i)$ being the frequency of word $w_i \in V$ across the interviewer's questions.

We construct a vector $\rho_t$ to represent a response phrase $r_t$ within the interviewer's vocabulary $V$ in the same way, by counting the occurrences in $r_t$ of each $w_i \in V$.

From this we present our first productivity score ($P_A$) of a given response as:

$$P_A(t) = A \cdot \rho_t.$$

We will refer to this as the Agenda productivity, and justify that $A$ is a representation of the topics most relevant to the dialogue. By taking the dot product we measure how much of the agenda is expressed in the response at time $t$. This does not discredit the word count; rather, it applies less weight to utterances that are not relevant to the topics on which we believe information is needed.
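A minimal sketch of the agenda and the resulting score $P_A(t)$, under the reconstruction above (term-frequency vectors over the interviewer vocabulary, combined by a dot product). Unigram counting is used for brevity in place of the full $n$-gram vocabulary.

```python
# Sketch of the agenda A and the agenda productivity P_A(t):
# A is the term-frequency vector of all interviewer questions over V,
# and P_A(t) is its dot product with the response's tf vector rho_t.
# Unigram counting is a simplification; the vocabulary extends to
# n-grams with n <= 3.
from collections import Counter

def tf_vector(utterances, vocab):
    """Term-frequency vector of a list of utterances over a fixed vocab."""
    counts = Counter(
        token for u in utterances for token in u.lower().split()
    )
    return [counts[w] for w in vocab]

def agenda_productivity(agenda, response, vocab):
    """P_A(t) = A . rho_t for a single response utterance."""
    rho = tf_vector([response], vocab)
    return sum(a * r for a, r in zip(agenda, rho))

# The agenda is computed once per session:
#   agenda = tf_vector(interviewer_questions, vocab)
```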

3.2 Responsiveness

Observe that the agenda $A$ is constructed over the entire session. During a dialogue, however, the agenda is revealed over time, implying that early interactions may inherently receive low scores because not enough of the agenda has been revealed to the subject for them to respond meaningfully. Simultaneously, there is a desire to assign productivity to an elicited response if the child responded directly to a prompt from the interviewer. We use this as inspiration to construct a term that rewards responsiveness (the immediate address of a prompted agenda topic). Let us then construct the rolling agenda $\tilde{A}_t$, a measure of the agenda that has been revealed up to turn $t$:

$$\tilde{A}_t = \sum_{\tau \le t} \gamma^{\,t - \tau}\, \mathrm{tf}(q_\tau),$$

where $\mathrm{tf}(q_\tau)$ is the term-frequency vector of question $q_\tau$ over $V$. This allows us to treat $\tilde{A}_t$ as a representation of recently evoked words, using $\gamma \in (0, 1)$ to represent a discount on words or topics that were brought up further in the past. Alignment with or reuse of these words can be observed as lexical entrainment or responsiveness to the question, which are indicators of trust between the interviewer and the client. Entrainment refers to instances when conversational partners begin to use the same vocabulary and match each other's speaking patterns [21]. We define an instance of entrainment or responsiveness as the reuse of vocabulary from a previous point in time:

$$P_R(t) = \tilde{A}_t \cdot \rho_t.$$

Then, to capture both of the desired scores, we construct the following combined productivity measure:

$$P(t) = \alpha\, P_A(t) + \beta\, P_R(t),$$

where $\alpha$ and $\beta$ allow us to flexibly leverage the complementary importance of the skills captured by each sub-metric.
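The rolling agenda admits a simple recursive update, $\tilde{A}_t = \gamma \tilde{A}_{t-1} + \mathrm{tf}(q_t)$, which the sketch below implements. The discount value $\gamma$ and the weights $\alpha = \beta = 0.5$ mirror Table 2; any normalization applied before combining the sub-metrics is an assumption here.

```python
# Sketch of the rolling agenda, responsiveness P_R(t), and the combined
# score. gamma discounts topics raised further in the past; alpha and
# beta weight the two sub-metrics (both 0.5 in Table 2). Whether and
# how sub-scores are normalized before combining is an assumption.

def update_rolling_agenda(rolling, question_tf, gamma=0.9):
    """A~_t = gamma * A~_{t-1} + tf(q_t), elementwise over V."""
    return [gamma * a + q for a, q in zip(rolling, question_tf)]

def responsiveness(rolling, response_tf):
    """P_R(t) = A~_t . rho_t."""
    return sum(a * r for a, r in zip(rolling, response_tf))

def combined_productivity(p_agenda, p_responsive, alpha=0.5, beta=0.5):
    """P(t) = alpha * P_A(t) + beta * P_R(t)."""
    return alpha * p_agenda + beta * p_responsive
```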

Figure 2: Age distribution across 527 sessions

4 Results and Discussion

Below, we apply the considered productivity scoring functions across a set of 527 child forensic interviews. Each data point represents a single interview between an expert legal interviewer and a unique child (i.e., each interview is with a different child). The children interviewed are aged between 2 and 17 years old, with the exact distribution shown in Figure 2. The data were collected as part of a study on the development of interviewing practices and contain examples of various interviewing strategies. To validate our methods, we present samples randomly selected from the dataset, highlighting the ability of our method to extract important concepts and the differences observed across four different scoring metrics: word count, agenda ($P_A$) scoring, responsiveness ($P_R$) scoring, and combined responsiveness/agenda ($P$) scoring.

4.1 Agenda and Responsiveness

Agenda 1 | Agenda 2 | Agenda 3 | Agenda 4 | Agenda 5
bathroom | mister | cousin | gonna | really
gonna | pinched | al | mommy | understand
outside | kids | cousin al | aunt | gonna
touched | thank | touched | touched | sometimes
garage | gonna | private | touching | times
uncle | mom | talking | michael | important
sitting | wrong | doors | clothes | clothes
clothes | kid | outside | peepee | kids
fix | id | gonna | pants | mom
mom | really | legs | grandma | touched
Table 1: Top 10 weighted words from 5 agendas

Table 1 shows examples of agendas constructed using our methods with simple stop-word filtering. The extraction clearly identifies important concepts relevant to potential episodes of abuse that the subject experienced or witnessed. In particular, we note that these words were extracted purely from the lexicon of the interviewer and do not directly model the responses of the child. This is justified by the assumption that, in an interview setting, a concept repeated frequently in the questions is indicative of a broader interest in that topic.
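Columns like those in Table 1 can be read directly off each session's agenda vector. A minimal sketch, assuming the agenda and vocabulary from Section 3.1:

```python
# Sketch of producing a Table 1 column: the k highest-weighted
# vocabulary items in a session's agenda vector.

def top_agenda_words(agenda, vocab, k=10):
    """Return the k vocab entries with the largest agenda weights."""
    ranked = sorted(zip(vocab, agenda), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]
```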

Utterance Excerpt | Word Count | Agenda ($P_A$) | Responsive ($P_R$) | Responsiveness/Agenda ($P$)
"I like to play with my friends. I have a new student …" | 35 (1) | 0.00 (128) | 0.00 (129) | 0.00 (128)
"Outside in the bathroom…" (child reveals details) | 9 (26) | 27.00 (1) | 0.16 (3) | 0.23 (2)
"Over my skirt clothes, but outside he got me under my clothes." | 12 (17) | 12.00 (3) | 0.37 (1) | 0.40 (1)
Table 2: Comparison of the productivity scores produced by different methods and their relative ranks (1 being the highest). The comparison demonstrates the ability to score substantive information from utterances. For Responsiveness/Agenda, the hyper-parameters $\alpha$ and $\beta$ are both set to 0.5.

A closer exploration of utterance-level interactions and their scoring can be found in Table 2, which exhibits the differences and similarities between the scoring metrics. Comparing each method's rating of the others' top-rated utterances informs us of the features to which each criterion is sensitive.

When using word count, a response to the interviewer asking the child to talk about their experience at school is rated highly, despite not being substantive to the overall interview. While the response contains many words (and thus information) and can be seen as a demonstration of trust, it does not directly offer any clues relevant to the suspected abuse the child might have experienced. In contrast to word count, our scoring metrics assign higher scores to utterances that contain information relevant to more targeted questions.

This is not to say that asking about the child's schooling is unproductive for the interview. Rather, the information explored in that specific utterance is not indicative of the overall session's effectiveness. The use of the agenda allows us to assign higher productivity to utterances in which the child reveals strong cooperation with the goals of the interviewer. Taken in conjunction with the possibility of value-iteration methods, this allows a decision-theoretic model to assign value to rapport-building dialogue actions, optimizing for the desired objective of building responsive communication and eliciting information.

4.2 Signal Sparsity

Figure 3: Various proposed productivity scores over the course of a single session: agenda (top left), responsiveness (top right), combined (bottom left), and word count (bottom right). Each score is normalized with respect to its maximum.

By comparing the differences in signal sparsity shown in Figure 3, it can be seen that the word-count metric, while highly abundant, is also highly volatile, suggesting many hidden underlying influences. The signal produced by the productivity metric, while sparser, can be interpreted as an information-relevance filter convolved with the lexical information provided by the child. This implies that sparsity will not hinder the study of productivity; instead, it allows us to isolate the effects of differing strategies specifically on the information-retrieval criterion.

Metric | r
Word Count | 0.46
Agenda | 0.26
Responsive | 0.24
Responsive/Agenda | 0.25
Table 3: Pearson correlation (r) indicating the strength of the relationship between each productivity metric and the age of the child. All correlations were statistically significant.

4.3 Age-Related Analysis

For comparison with Figure 1, Figures 4 and 5 show the resulting dependency between productivity scores and child age. Moreover, Table 3 reveals the difference in Pearson scores to be statistically significant, implying a weaker correlation between age and the proposed productivity metrics and further suggesting that the metrics are more robust to a specific child's language ability and development. From this we extrapolate that these metrics can evaluate productivity adaptively for each individual and capture a more general set of interaction characteristics.
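The correlations in Table 3 can be reproduced with a standard Pearson test; a sketch using scipy follows, where the per-session aggregation of utterance scores (here, a mean) is an assumption.

```python
# Sketch of the age analysis behind Table 3: Pearson correlation
# between child age and a per-session productivity summary. Averaging
# per-utterance scores into one session value is an assumption.
from statistics import mean

from scipy.stats import pearsonr

def age_correlation(ages, per_session_scores):
    """ages[i] pairs with the utterance scores of session i."""
    session_means = [mean(scores) for scores in per_session_scores]
    r, p_value = pearsonr(ages, session_means)
    return r, p_value
```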

Figure 4: Distribution of scores and variances by age using only the Agenda criterion
Figure 5: Distribution of scores and variances by age using Responsiveness scoring methods.

5 Conclusions

Our experiments and evaluations demonstrate the usefulness of topic modeling for productivity assessment. We further suggest that the value captured by measuring responsiveness is equivalent to a time-dilated matching of vocabulary usage, corresponding to the lexical entrainment of local interactions, a known indicator of trust in dyadic interaction [22]. A number of promising examples show that these methods significantly improve productivity measurement for child FI while also automating the process. Furthermore, the results demonstrate that this can be achieved using a rather simple trigram model.

By introducing the hyper-parameters $\alpha$ and $\beta$ to the proposed productivity measure, we allow the system to maintain flexibility for different criteria without sacrificing consistency of scoring across specific examples. For the first time, FI researchers will be able to rigorously assign a value to a response within the context of a conversation. Aside from simply re-evaluating prior work, these techniques allow interviewers to construct "gold standard" agendas prior to the interview. Using these prepared agendas, interviewers can apply situational hypotheses and prior knowledge to improve and inform the specific policy or strategy they employ.

More broadly, these techniques can be used to model similar goal-oriented dialogues such as negotiation and question answering in an effort to make information discovery a larger part of the dialogue process.

In conclusion, our results suggest that computational models of these interviews are better informed by the proposed measures of productivity. This creates a clear criterion that can be evaluated and optimized, providing real-time updates and adaptive scoring metrics.

6 Acknowledgements

We would like to thank our collaborators at the Child Interviewing Lab at the Gould School of Law for their collection and sharing of the data presented in this paper.

References

  • [1] K. J. Sternberg, M. E. Lamb, I. Hershkowitz, L. Yudilevitch, Y. Orbach, P. W. Esplin, and M. Hovav, “Effects of introductory style on children’s abilities to describe experiences of sexual abuse,” Child Abuse & Neglect, vol. 21, no. 11, pp. 1133–1146, 1997.
  • [2] M. E. Lamb, Y. Orbach, K. J. Sternberg, J. Aldridge, S. Pearson, H. L. Stewart, P. W. Esplin, and L. Bowler, “Use of a structured investigative protocol enhances the quality of investigative interviews with alleged victims of child sexual abuse in Britain,” Applied Cognitive Psychology, vol. 23, no. 4, pp. 449–467, 2009.
  • [3] T. D. Lyon, “False denials: Overcoming methodological biases in abuse disclosure research,” Child sexual abuse: Disclosure, delay and denial, pp. 41–62, 2007.
  • [4] E. C. Ahern, S. N. Stolzenberg, and T. D. Lyon, “Do prosecutors use interview instructions or build rapport with child witnesses?” Behavioral sciences & the law, vol. 33, no. 4, pp. 476–492, 2015.
  • [5] E. A. Price, E. C. Ahern, and M. E. Lamb, “Rapport-building in investigative interviews of alleged child sexual abuse victims,” Applied Cognitive Psychology, vol. 30, no. 5, pp. 743–749, 2016.
  • [6] V. Talwar, K. Hubbard, C. Saykaly, K. Lee, R. Lindsay, and N. Bala, “Does parental coaching affect children’s false reports? comparing verbal markers of deception,” Behavioral sciences & the law, vol. 36, no. 1, pp. 84–97, 2018.
  • [7] M. E. Lamb, “Effects of investigative utterance types on israeli children’s responses,” International Journal of Behavioral Development, vol. 19, no. 3, pp. 627–638, 1996. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/016502596385721
  • [8] E. C. Ahern, S. J. Andrews, S. N. Stolzenberg, and T. D. Lyon, “The Productivity of Wh- Prompts in Child Forensic Interviews,” Journal of Interpersonal Violence, p. 088626051562108, 2015. [Online]. Available: http://journals.sagepub.com/doi/10.1177/0886260515621084
  • [9] L. Radford, S. Corral, C. Bradley, H. Fisher, C. Bassett, N. Howat, and S. Collishaw, “Child abuse and neglect in the UK today,” 2011.
  • [10] M. E. Lamb, K. J. Sternberg, Y. Orbach, I. Hershkowitz, and D. Horowitz, “Differences between accounts provided by witnesses and alleged victims of child sexual abuse,” Child Abuse and Neglect, vol. 27, no. 9, pp. 1019–1031, 2003.
  • [11] J. Z. Klemfuss, J. A. Quas, and T. D. Lyon, “Attorneys’ Questions and Children’s Productivity in Child Sexual Abuse Criminal Trials,” HHS Public Access, vol. 51, no. 1, pp. 87–100, 2015.
  • [12] G. D. Anderson, J. N. Anderson, and J. F. Gilgun, “The influence of narrative practice techniques on child behaviors in forensic interviews,” Journal of Child Sexual Abuse, vol. 23, no. 6, pp. 615–634, 2014. [Online]. Available: https://doi.org/10.1080/10538712.2014.932878
  • [13] S. P. Lord, D. Can, M. Yi, R. Marin, C. W. Dunn, Z. E. Imel, P. Georgiou, S. Narayanan, M. Steyvers, and D. C. Atkins, “Advancing methods for reliably assessing motivational interviewing fidelity using the motivational interviewing skills code,” Journal of Substance Abuse Treatment, vol. 49, pp. 50–57, 2015.
  • [14] K. Zechner and A. Waibel, “DiaSumm: Flexible summarization of spontaneous dialogues in unrestricted domains,” Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 968–974, 2000.
  • [15] M. A. Hearst, “TextTiling: Segmenting text into multi-paragraph subtopic passages,” Computational Linguistics, vol. 23, no. 1, pp. 33–64, 1997.
  • [16] Y. Noh, J. W. Son, and S. B. Park, “Keyword extraction from dialogue sentences using semantic and topical relatedness,” Lecture Notes in Computer Science, vol. 8226 LNCS, no. PART 1, pp. 129–136, 2013.
  • [17] A. Bouziane, D. Bouchiha, N. Doumi, and M. Malki, “Question Answering Systems: Survey and Trends,” Procedia Computer Science, vol. 73, no. Awict, pp. 366–375, 2015. [Online]. Available: http://dx.doi.org/10.1016/j.procs.2015.12.005
  • [18] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau, “Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models,” Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 3776–3784, 2016.
  • [19] X. Li, Y.-N. Chen, L. Li, J. Gao, and A. Celikyilmaz, “End-to-End Task-Completion Neural Dialogue Systems,” 2017. [Online]. Available: http://arxiv.org/abs/1703.01008
  • [20] M. Lewis, D. Yarats, Y. N. Dauphin, D. Parikh, and D. Batra, “Deal or No Deal? End-to-End Learning for Negotiation Dialogues,” 2017. [Online]. Available: http://arxiv.org/abs/1706.05125
  • [21] S. E. Brennan, “Lexical Entrainment in Spontaneous Dialog,” Proceedings, 1996 International Symposium on Spoken Dialogue, ISSD-96, pp. 41–44, 1996.
  • [22] L. E. Scissors, A. J. Gill, K. Geraghty, and D. Gergle, “In CMC we trust: the role of similarity,” 27th international conference on Human factors in computing systems - CHI 09, pp. 527–536, 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1518701.1518783