Slices of Attention in Asynchronous Video Job Interviews

09/19/2019 ∙ by Léo Hemamou, et al. ∙ EASYRECRUE ∙ Télécom ParisTech

The impact of non verbal behaviour in a hiring decision remains an open question. Investigating this question is important, as it could provide a better understanding of how to train candidates for job interviews and make recruiters aware of influential non verbal behaviour. This research has recently been accelerated by the development of tools for the automatic analysis of social signals, and by the emergence of machine learning methods. However, these studies are still mainly based on hand engineered features, which limits the discovery of influential social signals. On the other hand, deep learning methods are a promising tool to discover complex patterns without the need for feature engineering. In this paper, we focus on studying influential non verbal social signals in asynchronous job video interviews that are discovered by deep learning methods. We use a previously published deep learning system that aims at inferring the hirability of a candidate with regard to a sequence of interview questions. One particularity of this system is the use of attention mechanisms, which aim at identifying the relevant parts of an answer. Thus, information at a fine-grained temporal level can be extracted using global (interview-level) annotations on hirability. While most deep learning systems use attention mechanisms only to offer a quick visualization of slices where a rise of attention occurs, we perform an in-depth analysis to understand what happens during these moments. First, we propose a methodology to automatically extract slices where there is a rise of attention (attention slices). Second, we study the content of attention slices by comparing them with randomly sampled slices. Finally, we show that they bear significantly more information for hirability than randomly sampled slices.




I Introduction

The procedure of personnel selection includes gathering data about the potential candidates, for example, in a job interview [2012TheSelection]. Research in Affective Computing can be helpful in many ways with respect to job interviews; for example, virtual recruiters can help candidates train their social skills and rehearse [Hoque2013MACHb]. Automatic processing can also help recruiters assess candidates. Additionally, it can help researchers and recruiters understand the evaluation process that recruiters follow when assessing a candidate. Initially conducted face to face or by phone, job interviews are now often conducted through online video conferencing systems or asynchronous video recordings. The asynchronous video interview is an emerging tool now offered by several companies to meet the need for initial assessment in personnel selection. The procedure is as follows: the candidate connects to a web platform and answers a sequence of questions predefined by the recruiter while recording a video of themselves with a webcam, smartphone or tablet. Later, recruiters connect to the same platform, watch the candidate’s answers, rate the answers and then decide whether they want to invite the candidate to a face-to-face interview.

Researchers are already developing systems for automatically predicting hirability based on non verbal cues of candidates in asynchronous video interviews [Chen2017, Nambiar2017AutomaticInterviews]. In this context and in addition to new legislative constraints (General Data Protection Regulation), such automatic systems require interpretability and transparency. With these systems, candidates will be able to improve their non verbal behavioral strategy, and recruiters will be able to assess these decision support models. These models could even help recruiters understand their own biases.

Classical approaches in social computing consist of building machine learning models and interpreting the importance of the features [Naim2018AutomatedPerformance, RaoS.B2017AutomaticStudy, Wortwein2015MultimodalAssessment]. These approaches are not able to bring forward and stress the effect of unexpected and influential social cues. We previously proposed a deep learning model trained only with recruiter’s decision and videos [Hemamou2019HireNetInterviews]. Our model is able to consider temporality and influential slices of video-based asynchronous interviews, due to the use of an attention mechanism running on top of a recurrent neural network. A sequence of features is processed by a recurrent neural network, and the attention mechanism aims at learning a different weight for each time step to enhance the performance of the classification task. Overall, such techniques could be useful for understanding human behaviour, as they aim to separate task relevant time steps in a sequence from irrelevant time steps [Yu2017TemporallyContent]. However, research in this area is limited to studying and highlighting just a few examples of peaks of attention curves [Yu2017TemporallyContent, Martins2016FromClassification]. Hence, a consistent validation is needed in order to ascertain the usefulness of the system’s output for predicting hirability as rated by recruiters.

With this in mind, this article describes three experiments we conducted to understand whether slices of video interviews highlighted by the attention model do carry information that is useful to recruiters. In section IV, we propose a methodology to automatically extract thin slices where attention values are high. In section V, we test whether the non verbal behavior occurring during these slices is different from behavior occurring in randomly picked slices. Finally, in section VI, we evaluate whether the extracted slices are more informative with regard to the hirability of a candidate.

Fig. 1: Example of attention curve and salient moments detected with peaks in HireNet.

II Related Works

II-A Job interview and Non Verbal Behaviour

Non verbal visual and audio cues have been studied in order to predict interview performance [FORBES1980NonverbalInterviews], anxiety [Feiler2016BehavioralAnxiety], personality of the candidates [Degroot2009CanInterviews] or deception [Schneider2015CuesInterview]. Numerous visual cues such as physical attractiveness, hand gestures, smiling, eye contact, nodding, head movement, body orientation, facial cues and leg movements have been used throughout experiments. For example, torso movement, face touching, leg fidgeting [Feiler2016BehavioralAnxiety], neutral expression and less smiling [Gifford1985NonverbalJudgments.] have been found to correlate negatively with interview performance, whereas eye contact, hand gestures [Feiler2016BehavioralAnxiety] and head movement [Schneider2015CuesInterview] correlate positively with interview performance. Moreover, these cues could have a different impact on interviewer evaluations depending on the interview structure, the job position (blue collar vs white collar) or the setting of the interview (such as telephone, computer-mediated video chat and asynchronous video interview) [Frauendorfer2015TheInterview].

Putting aside efficiency, annotating every segment of a video is time consuming. One common approach to deal with the task of annotation is to annotate only a part of the job interview. In fact, it has been shown that, using only a short amount of information, people can correctly infer personal characteristics, traits or states of an individual [Murphy2015ReliabilityInteractions, Carney2007AImpressions]. This approach is called thin slice analysis and has already been used in the study of social interactions [Murphy2015ReliabilityInteractions], first impressions [Carney2007AImpressions], public speaking [Chollet2017AssessingBehavior], and job interviews [Nguyen2015IMinute]. Another advantage of this method is that it highlights brief non verbal behavior with respect to perceived impressions. Nonetheless, the duration and sampling strategy for thin slices remain an open question. Previous studies focus on sampling thin slices randomly [Chollet2017AssessingBehavior], using the structure of the job interview (slices based on questions and answers) [Nguyen2015IMinute], or at the beginning and end of the interactions [Degroot2009CanInterviews]. Automatic methods based on social signal processing could give way to a better selection of thin slices and their duration, by selecting regions that carry more information.

II-B Automatic methods for understanding human behavior in job interviews

Recent advances drastically reduce time spent in manually coding behavioral cues. Tools are now available to automatically code vocal [Eyben2016TheComputing] or visual [Baltrusaitis2018OpenFaceToolkit] cues. Recent studies use social signal processing and machine learning to understand the links between non verbal cues and hirability. These studies have been applied to different job interview settings: face to face interviews [Nguyen2014HireBehavior, Nguyen2015IMinute], asynchronous video interviews [Chen2017] and computer-mediated video chat [RaoS.B2017AutomaticStudy].

Among the traits investigated in job interviews (communication skills [RaoS.B2017AutomaticStudy], personality [Chen2017], etc.), hirability remains the most studied one. Usually, two methods are used to understand which extracted features are important for hirability: correlation analysis and feature importance analysis conducted on a trained machine learning model. However, feature importance analysis is highly dependent on the machine learning pipeline. To the best of our knowledge, only traditional machine learning (SVM, Lasso, Ridge, etc.) has been used so far. In a previous work, we investigated the use of deep learning techniques, and established their superiority in terms of predictive capabilities [Hemamou2019HireNetInterviews]. However, no in-depth analysis of the feedback returned by the model was conducted.

II-C Neural Networks and explainability

Neural networks are able to find more statistical patterns than traditional machine learning methods such as SVMs, logistic regressions or Random Forests. Moreover, specific architectures such as recurrent neural networks allow for modeling temporality by managing sequences. However, the freedom they have to construct intermediate representations comes at the cost of an extreme opacity. This opacity hinders their usability for critical applications such as healthcare, justice, or human resources. Therefore, several researchers have tried to propose methods to better explain these networks. First, the visualization of hidden states has been explored to better understand the intermediate representations automatically built by the networks, especially in computer vision. A second approach, called knowledge distillation, consists of learning an interpretable model from an already trained complex neural network [Liu2019ImprovingDistillation]. A third method consists of explaining predictions for specific instances (as opposed to explaining the whole model). Some of these attempts build a local boundary for these predictions [Ribeiro2016WhyClassifier], use sensitivity methods, or analyze integrated gradients for image analysis [Sundararajan2017AxiomaticNetworks]. Finally, attention mechanisms have recently gained popularity for enhancing performance and interpretability. Specifically in the Social Computing area, the use of attention mechanisms for rapport detection [Yu2017TemporallyContent] or for the evaluation of job interview performance in asynchronous video interviews [Hemamou2019HireNetInterviews] has been proposed to extract fine grained information at the temporal level using only coarse annotation at the interview level. However, most of these studies restrict the analysis of attention mechanisms to the display of examples and do not conduct an in-depth analysis. Moreover, the validity of attention curves as an explanation has recently been called into question [Jain2019AttentionExplanation].

III Experimental Setup

III-A Dataset

As our goal is to evaluate and assess the relevance of attention mechanisms that are already trained, we use the same database that we previously collected [Hemamou2019HireNetInterviews]. This database contains real French asynchronous video interviews of 7938 candidates applying for 475 sales positions. Each interview for a specific position has the same number of questions, predefined upstream by the recruiter. Once the candidates finish answering the set of predefined questions on the web platform, recruiters and managers can connect to this platform, watch these answers, and evaluate the candidate. They can like, dislike or shortlist candidates, evaluate them on predefined criteria, or write comments. Based on this information, candidates who have been liked or shortlisted have been labelled “hirable”; otherwise they are labelled “not hirable”. If candidates received different annotations from multiple recruiters, a majority vote was taken. In case of a tie, the candidate is considered “hirable”. To the best of our knowledge, this database is the one with the highest number of real applicants assessed by real practitioners for a real position. We extracted verbal content using an automatic speech recognition tool (Google API). These asynchronous video interviews have been recorded in the wild from various devices, leading to a wide range of setups. Under these conditions, technical problems can occur, such as videos without audio, illumination problems in videos, or failures of the automatic speech recognition. Descriptive statistics for each modality are available in Table I. The dataset cannot be made available to the public due to high privacy constraints.

Modality                         Text           Audio     Video
Train set                        6350           6034      5706
Validation set                   794            754       687
Test set                         794            755       702
Questions per interview (mean)   5.05           5.10      5.01
Total length                     3.82 M words   557.7 h   508.8 h
Length per question (mean)       95.2 words     52.19 s   51.54 s
Hirable label                    45.0 %         45.5 %    45.4 %
TABLE I: Number of candidates in each set and overall statistics of the dataset.

III-B HireNet

In a previous article, we proposed HireNet, an attention neural network to infer hirability from structured video interviews. HireNet [Hemamou2019HireNetInterviews] was conceived to represent a sequence of questions and their answers, which themselves contain a sequence of social signals. In the following sections, we focus only on the low level encoder of our model, which aims to detect salient social signals. This encoder is a bidirectional Gated Recurrent Unit (GRU) [cho2014learning] which encodes information from a sequence of low level descriptors. This encoder is followed by an attention mechanism that weights each time step differently according to its importance. We aim to validate the usefulness of attention mechanisms to automatically extract the most useful slices to predict hirability. For the following study, we define an attention slice as a slice selected according to the attention curve. For the sake of simplicity, we decided to focus only on visual features in the first question. The first question is highly linked to self presentation tactics, and initial impressions play an important role in the variance of interview scores [Swider2016InitialOutcomes]. Moreover, our preliminary inspection of attention curves has shown that attention peaks appear more frequently for the visual modality than for other modalities. These visual features are the position and orientation of the head, and continuous and categorical facial action unit activations, which have been extracted using OpenFace [Baltrusaitis2018OpenFaceToolkit]. Values were smoothed with a time window of 0.5 s and an overlap of 0.25 s before being fed to our model. This duration is frequently used in the Social Computing literature [Varni2018ComputationalInteractions] and we validated it for our corpus by annotating the duration of social signals in a set of videos.
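The windowing step described above can be illustrated with a minimal sketch. It assumes that smoothing means averaging each feature over the window and that the frame rate is constant; neither detail is stated here, and the names `smooth`, `fps`, `window_s` and `overlap_s` are hypothetical.

```python
from statistics import fmean


def smooth(frames, fps, window_s=0.5, overlap_s=0.25):
    """Average one per-frame feature over sliding windows.

    frames: list of floats, one value per video frame.
    fps: frame rate of the feature stream (assumed constant).
    Returns one mean value per window (assumption: mean pooling).
    """
    win = max(1, round(window_s * fps))                # frames per window
    hop = max(1, round((window_s - overlap_s) * fps))  # frames between window starts
    return [fmean(frames[i:i + win])
            for i in range(0, len(frames) - win + 1, hop)]


# Two seconds of a toy feature sampled at 4 frames per second.
signal = [0, 0, 1, 1, 2, 2, 3, 3]
smoothed = smooth(signal, fps=4)
```

With a 0.5 s window and a 0.25 s overlap, consecutive windows start 0.25 s apart, so each value fed to the model would summarize half a second of behaviour.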

We trained our model to achieve the best results for the mean of the F1 scores of the positive and negative classes, rather than only for the positive class as we did in [Hemamou2019HireNetInterviews]. We call this average Mean F1. As neural networks are subject to various sources of variability, such as random weight initialization, stochastic gradient descent or dropout, we chose to train five different instances of the model and then averaged the attention values. That way, we aim to capture the more general behaviour of attention mechanisms. Mean performance and confidence intervals on the test set are reported in Table II.

Model     F1 Positive Class   F1 Negative Class   Mean F1
HireNet   0.607 ± 0.023       0.628 ± 0.013       0.618 ± 0.008
TABLE II: Performance of our model on the test set for the hirability prediction task
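As an illustration of the two quantities used in this subsection, the following sketch computes Mean F1 from binary predictions and averages attention curves across model instances. The function names and the toy labels are hypothetical; the five trained instances are replaced here by two short made-up curves.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_f1(y_true, y_pred):
    """Return (F1 of class 1, F1 of class 0, their average)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    f1_pos = f1(tp, fp, fn)
    # For the negative class the roles are swapped: tn acts as tp, fn as fp, fp as fn.
    f1_neg = f1(tn, fn, fp)
    return f1_pos, f1_neg, (f1_pos + f1_neg) / 2


def average_curves(curves):
    """Average several attention curves time step by time step."""
    return [sum(values) / len(values) for values in zip(*curves)]


f1_pos, f1_neg, mean = mean_f1([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
averaged = average_curves([[0.1, 0.9], [0.3, 0.7]])
```

Averaging the two class-wise F1 scores gives equal weight to both classes, so a model cannot score well by favouring only the positive class.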

IV Do attention curves actually expose distinguishable peaks?

IV-A Methodology: Extraction of attention slices by unsupervised outlier detection

Attention curves mostly consist of noisy fluctuations with some high value peaks [Yu2017TemporallyContent][Hemamou2019HireNetInterviews]. A typical example is the red curve in figure 2. The first step of our methodology consists of filtering attention curves containing peaks and then extracting the regions where attention rises (attention slices). In order to achieve this, we use and adapt an unsupervised outlier detection method already proposed in another study on attention [KimInterpretableAttention]. We sample the time axis by randomly drawing time steps according to the distribution given by the attention curve. Thus, points with higher attention values have a higher chance of being selected. An example of this sampling process is available in figure 2. Once this sampling is done, we use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [Ester1996ANoise], an unsupervised density based algorithm, which aims to find regions where the density of drawn points is higher. This method proved to be efficient because it copes with the noisy values of attention curves, and because neither the number nor the extent of the regions (their duration, in our special case of time series) needs to be specified. A typical result of this algorithm is depicted by blue boxes in figure 2.
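The two steps above (attention-weighted sampling, then density-based clustering) can be sketched as follows. This is not the implementation used in the study: the clustering function is a simplified one-dimensional stand-in for DBSCAN that keeps core points only, and the values of `n_draws`, `eps` and `min_samples`, as well as the toy attention curve, are assumptions chosen for illustration.

```python
import random


def sample_time_steps(attention, n_draws, rng):
    """Draw time-step indices with probability proportional to attention."""
    return rng.choices(range(len(attention)), weights=attention, k=n_draws)


def dense_regions(points, eps, min_samples):
    """Simplified 1-D density clustering in the spirit of DBSCAN.

    A drawn point is a core point if at least `min_samples` drawn points
    (itself included) lie within `eps` of it. Core points closer than `eps`
    to the previous core point are merged into the same region.
    Border points, which DBSCAN would also attach, are ignored here.
    Returns a list of (start, end) time-step pairs.
    """
    pts = sorted(points)
    core = [p for p in pts
            if sum(abs(p - q) <= eps for q in pts) >= min_samples]
    regions = []
    for p in core:
        if regions and p - regions[-1][1] <= eps:
            regions[-1][1] = p           # extend the current region
        else:
            regions.append([p, p])       # open a new region
    return [tuple(r) for r in regions]


# Toy attention curve: low noise everywhere, one peak on time steps 18..22.
rng = random.Random(0)
attention = [0.001] * 40
for t in range(18, 23):
    attention[t] = 1.0

draws = sample_time_steps(attention, n_draws=200, rng=rng)
regions = dense_regions(draws, eps=1, min_samples=10)
```

Curves for which no region is returned would be the ones filtered out as containing no peak, and the bounds of each returned region give the start and end of an attention slice without fixing their number or duration in advance.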

Fig. 2: Attention slice extraction.
The Attention curve is in red, the histogram of points drawn is the result of the sampling procedure, and the detected peaks are highlighted by blue boxes.

IV-B Results and descriptive statistics about extracted peaks

Table III provides a summary of the data serving as a basis for our study in terms of answers containing peaks. Some attention curves from candidate answers do not have peaks; some candidates may not display any particularly important moment during the answer to the first question. Figures 3 and 4 describe how long the peak with the largest amplitude lasts and when it occurs during an answer. It is interesting to note that the duration of the important slices extracted by the attention mechanism follows a distribution very similar to that of the duration of facial expressions, which typically last between 0.5 and 4 s [Matsumoto2011EvidenceEmotion]. Moreover, these slices seem to occur more often at the beginning and at the end of an answer. Such cues could indicate that non verbal behaviours occurring at the beginning (turn taking) and at the end of the answer (turn giving) have a strong impact on the recruiter’s evaluation, as in other face to face interactions [Cassell2001Non-verbalStructure, GOODRICH1979Face-to-FaceTheory].

Set                    Percentage of answers containing peaks
Train and Validation   63.8% (3644 answers kept)
Test                   57.4% (403 answers kept)

TABLE III: Descriptive table of the number of answers containing peaks

Fig. 3: Histogram of the duration of attention slices

Fig. 4: Histogram of the starting time of attention slices relative to the total duration of the answer

V Are social signals during attention slices different from those in random slices?

V-A Method: Supervised classification between attention slices and random slices

In this section we study the relationship between the values fed to our model and the attention slices. Attention values could heavily depend on the context and on the model’s memory, and depend very little on the time frame they point to. Our model uses contextual attention learned on top of a bidirectional GRU. A GRU is a sequence modelling component that outputs vectors that depend on the current time step as well as on its previous output. We hypothesize that a rise in attention is due to a change of behaviour within an answer.
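To make the notion of attention weights concrete, here is a minimal sketch of one common formulation of contextual attention, assumed here rather than taken from this section: each encoder output h_t is projected as u_t = tanh(W h_t + b), scored against a learned context vector, and the scores are normalized with a softmax. The parameters `weight`, `bias` and `context` would be learned in practice; the values below are made up.

```python
import math


def contextual_attention(hidden_states, weight, bias, context):
    """Return one attention weight per time step (weights sum to 1).

    Assumed formulation: u_t = tanh(W h_t + b), score_t = u_t . context,
    alpha = softmax(score).
    """
    scores = []
    for h in hidden_states:
        u = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
             for row, b_i in zip(weight, bias)]
        scores.append(sum(u_i * c_i for u_i, c_i in zip(u, context)))
    top = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three time steps of a toy 2-dimensional encoder output.
hidden_states = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical learned projection
bias = [0.0, 0.0]
context = [1.0, 0.0]                         # hypothetical learned context vector
alphas = contextual_attention(hidden_states, weight, bias, context)
```

Because each h_t produced by a bidirectional GRU already mixes past and future inputs, a high weight at time step t does not by itself guarantee that the behaviour at t caused it, which is the question the following experiment addresses.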

To ensure that attention mostly stems from what is happening in the concerned time steps, we construct a binary classification task. For this task, we take as one class the most important attention slice (the slice containing the peak with the highest amplitude) extracted from each candidate answer. As the other class, we take four moments of the same duration, sampled within the candidate’s answer according to a distribution that depends on the average of the output values of the attention mechanisms over the candidate’s answer. Through this sampling, we aim to select moments of varying importance, and not only the least important ones, while still avoiding the most important moments. As our goal is to understand whether attention slices are different, we decided to use traditional classifiers for which a methodology to detect important features is well established. Thus, the classifiers we used for this task are Lasso (a linear and transparent model) and Random Forest (a non-linear model). As these classifiers take a fixed-size vector as input, we summarize the features over the time window of each selected moment using the following functions: mean, mean of positive gradients and mean of negative gradients.

We use these functions as an attempt at capturing temporal dependencies while keeping an explainable set of features. As the attention mechanisms are trained on top of GRUs, they capture temporal variations. Our gradient functions have been used successfully in a previous behavioral classification work [Ryoo2015PooledVideos]. These functions are applied to the same feature set used in subsection III-B to train HireNet; the only difference is a preprocessing step of Z-normalization computed over the whole answer. For this experiment, we keep the same training, development, and test sets as in [Hemamou2019HireNetInterviews] to prevent any data leakage.
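The summary functions can be sketched as below. This is one plausible reading, under the assumption that a gradient is the difference between consecutive values and that a slice with no positive (or negative) gradient yields 0; the names `z_normalize` and `summarize` are hypothetical.

```python
from statistics import fmean, pstdev


def z_normalize(values):
    """Z-normalize one feature over the whole answer."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def summarize(window):
    """Turn a variable-length slice of one feature into three fixed statistics."""
    grads = [b - a for a, b in zip(window, window[1:])]   # first differences
    pos = [g for g in grads if g > 0]
    neg = [g for g in grads if g < 0]
    return {
        "mean": fmean(window),
        "mean_pos_grad": fmean(pos) if pos else 0.0,      # assumption: 0 when no rise
        "mean_neg_grad": fmean(neg) if neg else 0.0,      # assumption: 0 when no fall
    }


normalized = z_normalize([1.0, 2.0, 3.0])
features = summarize([1.0, 2.0, 4.0, 3.0, 0.0])
```

Applying `summarize` to each visual feature of a slice and concatenating the results gives the fixed-size vector expected by Lasso and Random Forest, whatever the duration of the slice.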

V-B Results and analysis

Model             F1 Positive Class   F1 Negative Class   Mean F1
Random Baseline   0.286               0.614               0.450
Majority Class    0                   0.888               0.444
Lasso             0.812               0.955               0.884
Random Forest     0.760               0.945               0.852
TABLE IV: Classification results between random slices and attention slices

As shown in Table IV, the classifiers’ performance is significantly above the random baseline, which indicates that, despite the influence of sequence modelling and the use of context information, the importance of a moment is still mainly defined by the events occurring in it. It shows that the specific moments where peaks of attention occur are distinguishable from other slices of the same answer.
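The two baselines in Table IV can be checked by hand from the class ratio of the task (one attention slice for four random slices). The sketch below assumes that the random baseline guesses each class with probability 0.5; it gives expected values, which are close to, but not necessarily identical to, the empirical figures reported in the table.

```python
def f1(precision, recall):
    """F1 score from precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


pos_rate = 1 / 5   # one attention slice for four random slices

# Majority class: always predict "random slice".
majority_pos = 0.0                                        # no positive is ever predicted
majority_neg = f1(precision=1 - pos_rate, recall=1.0)
majority_mean = (majority_pos + majority_neg) / 2

# Uniform random guess (assumption): half of each class is recovered,
# and precision equals the class prior.
random_pos = f1(precision=pos_rate, recall=0.5)
random_neg = f1(precision=1 - pos_rate, recall=0.5)
random_mean = (random_pos + random_neg) / 2
```

Both baselines stay below 0.46 in Mean F1, which is the reference against which the scores of Lasso and Random Forest should be read.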

V-C Non verbal features importance analysis

Group Lasso Random Forest Permutation
Positive coefficients Negative coefficients
Lower Face AU20, AU23, AU17 AU26,AU12,AU17, AU20,AU10, AU25 AU20, AU26, AU23 ,AU17, AU25, AU23, AU17, AU15, AU12, AU14
Upper Face AU7, AU2, AU4, AU2 , AU1 AU2, AU7 AU4, AU2
Blink and Gaze AU45 AU45 AU45, AU45, gaze_angle
Position and rotation of the head , , ,
Confidence of OpenFace confidence, confidence

Features are annotated with their rank position and with markers standing respectively for the mean of positive and negative gradients. Bold indicates that a feature is significantly different from random slices on the test set, based on a two-tailed t-test.

TABLE V: Feature Importance Analysis

An experiment on feature importance has also been carried out in order to highlight the features that contribute the most to identifying attention slices. Such an analysis provides useful knowledge about what causes a slice to be selected. To obtain this feature analysis, we inspect the coefficients of the Lasso model and rank them according to their magnitude. For the Random Forest, we run a permutation importance analysis using the Boruta package [Kursa2010FeaturePackage]. The top twenty features of both methods are reported in Table V. Blinking (AU45), lip stretcher (AU20), jaw drop (AU26) and lip tightener (AU23) are considered by both analyses to be the four most important features. Based on the sign of the coefficients of the Lasso model, we notice that attention slices are induced by: i) eyes closed longer than usual, ii) the activation of lip stretcher and lip tightener, iii) the non-activation of jaw drop. These cues could indicate that moments when a candidate is not talking (absence of jaw drop) or when they display social cues of anxiety (lip stretcher and lip tightener) are considered more important by the attention mechanism [Feiler2016BehavioralAnxiety]. Frames with chin raiser (AU17) are also considered important. In addition, outer brow raiser (AU2) and brow lowerer (AU4) appear in both feature importance analyses. The coefficients of Lasso and the directions of the gradients suggest that moments when candidates raise their outer brow and keep it raised are also judged more important. Another interesting feature is the depth position of the head. This analysis seems to indicate that back and forth movements could also be detected as important moments. Finally, it is interesting to note that the confidence of OpenFace and the negative gradients of OpenFace’s confidence are selected by the Random Forest permutation analysis.
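The idea behind permutation importance can be sketched in a few lines: shuffle one feature at a time and measure how much the classifier's score drops. This is plain permutation importance, not the shadow-feature procedure of the Boruta package, and the toy model, data and feature meanings below are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model predicts the right label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_repeats, rng):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    base = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)                           # break the link with the label
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(base - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier: 'attention slice' when feature 0 is high."""
    return int(row[0] > 0.5)


# Feature 0 could stand for a blink intensity, feature 1 for an unrelated cue.
rows = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.3], [0.1, 0.9], [0.7, 0.5], [0.3, 0.2]]
labels = [1, 1, 0, 0, 1, 0]
importances = permutation_importance(toy_model, rows, labels,
                                     n_repeats=20, rng=random.Random(0))
```

A feature that the model ignores gets an importance of exactly zero, whereas a feature it relies on produces a positive drop; ranking features by this drop yields the kind of ordering reported for the Random Forest.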

VI Are attention slices more informative with regard to hirability than random slices?

Thin slices          integrated AUC
                     Random Forest*   Lasso*          SVM Linear*     SVM RBF*
Random thin slices   0.545 ± 0.005    0.517 ± 0.005   0.518 ± 0.005   0.528 ± 0.005
Attention slices     0.554 ± 0.003    0.550 ± 0.003   0.543 ± 0.004   0.537 ± 0.003
* Statistical significance is based on a two-tailed t-test.
TABLE VI: Results of the classification task for hirability prediction from attention slices and random slices

VI-A Method: Supervised classification of hirability based on random slices or attention slices

We intend to evaluate whether the moments highlighted by the attention mechanism carry more useful information than random moments. The following procedure aims at testing whether the highlighted moments have superior predictive capabilities compared to the rest of the interview.

We constructed a classification task based on only one slice of the candidate’s answer. We ran two instances of this task: the first one uses the most important moment for each candidate as judged by the attention mechanism, while the second one uses the same sampling as in section V-A. We used the same features as in section V-A. As the input slices are different from the ones required by our model (classification of structured asynchronous video interviews), we chose to experiment with non-sequential algorithms. We divide our algorithms into 3 sets. The first set is composed only of Lasso, an L1 regularized linear classifier trained with the same loss as HireNet: a binary cross-entropy. This first set has processing capabilities inferior to those of HireNet. In fact, in spite of their complexity, the GRUs composing our hierarchical model process each input with only one non-linearity. Consequently, HireNet is capable of drawing 3 linear separators, 1 for each hierarchy level and 1 for the final dense layer, and of adding up sequential elements in a learnable fashion. The second set comprises only the linear SVM. It is an algorithm with strictly inferior processing capabilities compared to HireNet, trained with a different loss function. The third set comprises SVM with a Radial Basis Function (RBF) kernel and Random Forest, two algorithms with processing capabilities unavailable in HireNet, and with loss functions different from that of HireNet. We chose the Area Under the Curve (AUC) as evaluation metric, as it has the advantage of not requiring any threshold and is suited to comparing different models. For each of the algorithms used, we performed a bootstrapping procedure as follows: we trained 100 instances, each on a subset of the training set sampled with replacement. We then obtained a set of scores that allowed us to calculate confidence intervals for our results and to get a sense of their statistical significance.
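The bootstrapping procedure can be sketched as follows. The AUC is computed as the probability that a positive example is scored above a negative one; the confidence interval uses a normal approximation, which is an assumption since the exact formula is not given here. The toy data, the `fit` and `evaluate` helpers and the single-feature "model" are hypothetical placeholders for the actual classifiers.

```python
import random
from statistics import fmean, stdev


def auc(labels, scores):
    """Probability that a positive is scored above a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_scores(train, fit, evaluate, n_models, rng):
    """Train n_models instances, each on a resample drawn with replacement."""
    return [evaluate(fit(rng.choices(train, k=len(train))))
            for _ in range(n_models)]


def confidence_interval(scores, z=1.96):
    """Mean and half-width of a normal-approximation interval (assumption)."""
    return fmean(scores), z * stdev(scores) / len(scores) ** 0.5


# Toy data: (single feature value, hirability label).
train = [(0.9, 1), (0.8, 1), (0.7, 1), (0.4, 1),
         (0.6, 0), (0.3, 0), (0.2, 0), (0.1, 0)]
test = [(0.85, 1), (0.65, 1), (0.45, 1), (0.5, 0), (0.35, 0), (0.15, 0)]


def fit(sample):
    """Hypothetical one-parameter model: the sign and size of the class gap."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    slope = (fmean(pos) if pos else 0.0) - (fmean(neg) if neg else 0.0)
    return lambda x: slope * x


def evaluate(model):
    """AUC of a fitted model on the held-out toy test set."""
    return auc([y for _, y in test], [model(x) for x, _ in test])


scores = bootstrap_scores(train, fit, evaluate, n_models=100, rng=random.Random(0))
mean_auc, half_width = confidence_interval(scores)
```

Running the same procedure once with attention slices and once with random slices gives two sets of 100 scores, whose means and intervals can then be compared with a two-tailed t-test.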

VI-B Results and discussion

Results are reported in Table VI. We observe that, based only on short slices of 0.5 s to 4 s, the prediction of hirability is above chance. For all the classifiers, the results show statistically significant differences in predictive performance between attention slices and random slices. We can note that the benefit of using attention slices is clearer (larger performance gap) for linear classifiers than for non-linear classifiers. This highlights an important consideration to keep in mind when using attention mechanisms: the attention slices are selected with regard to the learner they are fed to. As shown by our results, the measured importance of thin slices varies with the processing capabilities of the classifier.

VII Conclusion, Limits and Future works

In this paper, we established that moments with peaks of attention are different from randomly picked slices. We described this difference in terms of the input visual social signals. The visual cues seem to relate to anxiety (activation of lip stretcher and lip tightener), blinking and pauses (non-activation of jaw drop). Attention slices are more likely to occur during turn taking (at the beginning of the answer) and turn giving (at the end of the answer), as in real face to face interaction [GOODRICH1979Face-to-FaceTheory, Cassell2001Non-verbalStructure]. We also studied the predictive value of the selected moments in comparison to random moments, and consequently put into perspective the use of the expression “important moment” to qualify an interview slice. In future work, we would like to investigate different types of attention mechanisms [Martins2016FromClassification] in our hirability prediction model, and a larger range of classifiers for the study of attention slices. Attention values highlight the importance of moments, but do not include information about whether they have a positive or negative impact on recruiters’ decisions. We aim in our future work to distinguish attention slices that have a positive impact from those that have a negative impact. We plan to expand our work to other questions and modalities, and to use more than the most important peaks. Finally, as our approach is based on a learned model (e.g., HireNet), one research direction is to improve it in terms of performance and bias control. The next steps of our work will also be dedicated to the design of a procedure to quantify the proportion of important moments. In that sense, we plan to conduct an annotation task and a user study, in order to: i) quantify the aforementioned proportion; ii) study links between macro cues and micro cues in the light of the attention slices spotted by our model; iii) build an interface that provides useful feedback for candidates, and higher decision transparency for recruiters.


This work was supported by the company EASYRECRUE. We would like to thank Jeremy Langlais and Amandine Reitz for their support and their help. We would also like to thank Erin Douglas for proofreading the article.