Characterizing Hirability via Personality and Behavior

06/22/2020 ∙ by Harshit Malik, et al. ∙ Indian Institute of Technology Ropar, University of Canberra

While personality traits have been extensively modeled as behavioral constructs, we model job hirability as a personality construct. On the First Impressions Candidate Screening (FICS) dataset, we examine relationships among personality and hirability measures. Modeling hirability as a discrete/continuous variable with the big-five personality traits as predictors, we utilize (a) apparent personality annotations, and (b) personality estimates obtained via audio, visual and textual cues for hirability prediction (HP). We also examine the efficacy of a two-step HP process involving (1) personality estimation from multimodal behavioral cues, followed by (2) HP from personality estimates. Interesting results from experiments performed on ≈ 5000 FICS videos are as follows. (1) For each of the text, audio and visual modalities, HP via the above two-step process is more effective than directly predicting from behavioral cues. Superior results are achieved when hirability is modeled as a continuous vis-à-vis categorical variable. (2) Among visual cues, eye and bodily information achieve performance comparable to face cues for predicting personality and hirability. (3) Explanatory analyses reveal the impact of multimodal behavior on personality impressions; e.g., Conscientiousness impressions are impacted by the use of cuss words (verbal behavior), and eye movements (non-verbal behavior), confirming prior observations.




1. Introduction

Behavioral cues such as eye movement, gestural, facial, gazing and cognitive patterns have been employed in many human-centered applications such as mental and emotion state prediction (Parekh et al., 2018; Shukla et al., 2020), personality trait estimation (Subramanian et al., 2013; Batrinca et al., 2012), depression detection (Cummins et al., 2011), privacy-preserving gender prediction (Bilalpur et al., 2017) and cognitive load estimation (Bilalpur et al., 2018; Lukanov et al., 2016) over the past decade. Recently, there has been considerable interest in mining multimodal behavioral cues for predicting the outcome of job interviews (Naim et al., 2018; Gucluturk et al., 2018; Escalante et al., 2020). Given the large number of applications received by top companies on a daily basis (13), there has been an increased push for employing artificial hiring agents (AHAs) to recruit candidates; the rationale is that AHAs assessing video résumés along with traditional ones can undertake recruitment in early stages, while trained recruiters focus their energies on assessing applicants’ tangible and intangible skills in the later rounds.

In order to make the recruitment process transparent and fool-proof, AHAs need to justify their decisions (alternatively, make recommendations to candidates with sound reasoning), termed explainability in machine learning parlance. A handful of works have employed both verbal and non-verbal behavioral cues to predict a candidate’s hirability, i.e., the suitability of a candidate to be invited for interview later. Hirability prediction has been modeled either as a binary classification (suitable/unsuitable) or regression (measure of suitability, e.g., on a 1–5 scale) problem by these works.

The general consensus among social psychologists is that personality traits shape human behavior and influence a wide range of life outcomes; therefore, personality can conversely be viewed as a behavioral construct. We additionally model hirability as a personality construct. While not positing this relation explicitly, hirability prediction (HP) works (Gucluturk et al., 2018; Escalante et al., 2020) have adopted the above rationale; e.g., the authors of (Escalante et al., 2020) observe that the apparent personality trait annotations are highly predictive of hirability scores for a linear model. Likewise, an explanatory decision tree denoting (binary) hirability in terms of categorical big-five trait predictors is presented. A recent work (Vinciarelli et al., 2019) notes that one’s empathy quotient (EQ, denoting the drive to empathize) and systemizing quotient (SQ, drive to analyze) significantly influence career choices; EQ is in turn associated with Extraversion and Agreeableness (Haas et al., 2015).

We posit hirability as a function of the Openness (O), Conscientiousness (C), Extraversion (E), Agreeableness (A) and Neuroticism (N) measures, also known as the big-five personality traits. As in Fig. 1, we predict hirability (modeled as either a continuous/categorical variable) from behavioral cues as a two-step process: in the first step, OCEAN personality measures are either derived from manual ratings, or estimated from audio, visual and textual behavioral cues. HP from OCEAN measures is performed in the second step. Apart from facilitating explainability, we note that estimating hirability via this two-step process achieves better results than directly predicting from low-level behavioral cues. Overall, this work makes the following research contributions:

  • This work expressly explores the connections between personality traits and hirability. While one’s personality may not directly determine his/her profession, prior works have noted the link between personality traits and career choices (Vinciarelli et al., 2019). For recruiters, personality assessment in the early interview stages would help identify candidates who sync with the job requirements and company culture (Costa, ). While past works (Gucluturk et al., 2018; Escalante et al., 2020) have presented ‘proof-of-concept’ results to show connections between personality and hirability annotations on the First Impressions Candidate Screening dataset (Escalante et al., 2020), we extensively explore relations between behavioral cues, personality traits and hirability.

  • With multiple modalities, we show that the two-stage hirability prediction framework performs better than end-to-end HP from behavioral features. This is particularly surprising as end-to-end prediction is known to be less error-prone, and has fueled the use of deep neural networks for pattern recognition problems.

  • Further to (2), predicting hirability exclusively via the OCEAN personality traits enables better explainability than a minimally interpretable ‘black-box’ model involving high-dimensional behavioral inputs.

  • We specifically experimented with ≈ 5000 FICS videos (4009 training + validation, 998 test) whose interview ratings are outside of the range [0.4,0.6], which can be construed as the ‘hirability gray area’. The primary objective behind this design is to explain multimodal predictions relating to high/low hirability and apparent personality wherever possible. Explanatory analyses such as discovering the most informative word stems, and highlighting the most informative visual features reveal some interesting patterns; e.g., impressions of Conscientiousness, which considerably influences hirability, are impacted by both cuss words (verbal behavior) and eye movements (non-verbal behavior), confirming prior findings (Jay and Janschewitz, ; Hoppe et al., 2018).

Figure 1. Study Overview: We posit a significant correlation between a person’s suitability for a vocation (termed hirability) and his/her personality traits. To this end, OCEAN personality measures are either derived from first-impression annotations, or predicted from textual, auditory and visual behavioral cues. Continuous/categorical HP is then achieved from OCEAN measures, and explanations of both hirability and OCEAN personality predictions are attempted.
Figure 2. (a) Boxplots denoting distributions of the OCEAN personality trait and the Interview (In) measures from the First Impressions dataset. Train data (8000 videos) are depicted in yellow, and test data (2000 videos) in purple. Inverse of the N trait is denoted as ES. (b) Heatmap depicting correlations among these six attributes. (c) values obtained for the best linear regression model involving 1–5 personality trait predictors. Best viewed in color and under zoom.
Figure 3. Decision trees obtained when hirability is modeled as a continuous (left), and a categorical (right) variable with the (S)elect and (R)eject classes for the sampled FICS videos. I is estimated from continuous OCEAN scores in both cases.

2. Related Work

We expressly focus on (a) trait estimation from behavior, and (b) explainable HP, while performing the literature survey.

2.1. Trait prediction from behavioral cues

There has long been consensus among social psychologists that personality traits shape human behavior and influence a large number of our life outcomes. Therefore, the design of human-centered intelligent systems has primarily focused on the inverse problem: that of employing behavioral cues (typically audio and visual) to deduce attributes such as the big-five personality traits (Subramanian et al., 2010, 2013; Hoppe et al., 2018), and traits highly correlated with personality such as depression (Cummins et al., 2011) and stress (Finnerty et al., 2016).

Recently, hirability prediction (HP) has been attempted by a number of researchers (Naim et al., 2018; Gucluturk et al., 2018; Escalante et al., 2020) from multimodal behavioral cues. Fool-proof HP would enable large organizations to employ artificial hiring agents (AHAs) and effectively reach out to the vast number of applicants contacting them on a daily basis. HP algorithms have typically modeled hirability (or interview variable I) as an adjunct to the OCEAN personality traits, thereby predicting IOCEAN traits from multimodal behavior. However, we posit a strong connection between positional requirements and personality traits, as recruiters would typically look for certain traits in candidates reflective of the organization’s culture and values. Moreover, recent studies (Vinciarelli et al., 2019) have proposed a connection between the empathy quotient psychometric, relating to Conscientiousness and Agreeableness, and one’s career choices. Given these recent findings, this work explicitly explores the connection between hirability and the big-five OCEAN personality traits. Our experiments show that estimating hirability from OCEAN measures is more effective than HP performed directly from multimodal behavior.

2.2. Explainable hirability prediction

To ensure transparent and fool-proof recruitment, AHAs need to be capable of justifying their decisions/recommendations, termed explainability in machine learning parlance. The handful of works that have examined HP from behavior have essentially focused on isolating behavioral correlates of the IOCEAN traits from quantitative results. Two recent works on HP (Gucluturk et al., 2018; Escalante et al., 2020) have loosely explored explainability in HP. Specifically, (Gucluturk et al., 2018) explains hirability predictions based on personality annotations, while (Escalante et al., 2020) shows typical faces reflective of apparent traits. Differently, we show (a) how candidates’ verbal behavior impacts their apparent traits, and (b) what deep neural networks focus on, given the candidate’s face image or their portrait with the face blurred, by way of explaining trait predictions.

2.3. Inferences from literature survey

Summarizing prior HP works, we note that (1) HP has been attempted from behavioral measures, but not from personality estimates obtained via behavioral measures; we posit a strong correlation between personality and hirability measures, and hypothesize that HP would be more effective and explainable if modeled as an exclusive function of the OCEAN traits; the effectiveness of performing HP from OCEAN measures is confirmed by our experiments. (2) Very limited material pertaining to explainability of IOCEAN impressions is available; we explicitly perform explanatory analyses to show how language and visual cues influence trait impressions, particularly the Conscientiousness trait.

3. Overview of the FICS dataset

Figure 4. Pearson correlations between visual behavioral measures and IOCEAN annotations for 4009 sampled FICS training videos. Non-significant correlations are crossed out.

This section is designed to provide readers with an overview of the First Impressions Candidate Screening (FICS) dataset, and serves as a prelude for the forthcoming sections. Interested readers may refer to (Escalante et al., 2020; Gucluturk et al., 2018) for further details.

The FICS video dataset comprises 10000 videos (6000 training, 2000 validation and 2000 testing), and was designed with the objective of developing AHAs to make decisions/recommendations based on multimedia CVs (Escalante et al., 2020). All videos contain labels for apparent OCEAN personality traits (reflecting first impressions of a human observer viewing the CV), and a hirability/interview trait, indicating whether the video candidate should be invited for a job interview. OCEAN and interview (I) scores are denoted within the [0,1] range. Since Neuroticism (N) is a negative trait, N scores are replaced by Emotional Stability (ES) scores in FICS, and therefore, the terms N and ES are used interchangeably from hereon even if they strictly refer to opposite traits.

Fig. 2(a) shows the FICS rating distributions. For experimental purposes, we combined the training and validation videos (8000 in total). Roughly similar training and test distributions can be noted for the I, C and E traits from Fig. 2(a). Annotation distributions for all traits are roughly Gaussian, and 70% of A scores fall within one standard deviation from the mean implying a ‘tight’ clustering, while the ‘loosest’ clustering is noted for the ES trait, with 67% samples falling within the same range. In terms of inter-quartile range (IQR), denoting the difference between the 75th and 25th percentiles, A has the lowest IQR of 0.18, while C has the highest IQR of 0.22. The I score has an IQR of 0.21, and can be seen to be highly correlated with the OCEAN traits from Fig. 2(b). Finally, a linear regression model with OCEAN measures predicting the I score (Fig. 2(c)) shows N as the single-best predictor of I scores, and O as the worst.

We set out to explore if hirability predictions could be explained, at least for the high and low hirability samples, and therefore, we ignored videos with I scores between [0.4,0.6] for our analysis. All our experiments are therefore performed on the sampled FICS dataset, with 4009 training samples (2134 +ve and 1875 -ve), and 998 test samples (544 +ve and 444 -ve), aggregating 5007 out of the 10K original videos.
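The sampling rule described above can be sketched as follows. This is a minimal illustration, assuming I scores are available as a mapping from a hypothetical video identifier to a value in [0,1], and assuming that scores above 0.6 are labeled +ve and scores below 0.4 are labeled -ve; the function and variable names are illustrative and not taken from the original work.

```python
def sample_outside_gray_area(scores, low=0.4, high=0.6):
    """Keep videos whose I score lies outside [low, high].

    Returns (video_id, score, label) tuples, where label is +1 for high
    hirability and -1 for low hirability (labeling is an assumption).
    """
    sampled = []
    for video_id, score in scores.items():
        if score < low:
            sampled.append((video_id, score, -1))
        elif score > high:
            sampled.append((video_id, score, +1))
        # Scores inside [low, high] fall in the gray area and are skipped.
    return sampled


# Hypothetical I scores for five videos.
toy_scores = {"v1": 0.25, "v2": 0.45, "v3": 0.60, "v4": 0.72, "v5": 0.39}
sampled = sample_outside_gray_area(toy_scores)
```

Here the boundary values 0.4 and 0.6 are treated as part of the gray area, which is one possible reading of the closed interval [0.4,0.6].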

In subsequent sections, we present results where the I trait is modeled as a continuous/categorical variable, and the OCEAN predictor variables are also modeled as continuous/categorical. We predict both the I and OCEAN scores from multimodal behavioral measures, and show that estimating I from OCEAN estimates is more effective than direct prediction from behavioral cues. To this end, we define the regression and classification performance metric as the I score estimation accuracy, Acc = 1 − MAE, where MAE denotes the mean absolute error over the test set; this evaluation metric is also employed in (Gucluturk et al., 2018).
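As a small worked illustration of the evaluation metric, the sketch below computes an accuracy of the form 1 − MAE over a test set. It assumes that scores and predictions lie in [0,1] (or are 0/1 labels for classification); the function name is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Accuracy defined as 1 minus the mean absolute error (MAE)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return 1.0 - mae


# Continuous (regression) case: MAE = (0.1 + 0.2) / 2 = 0.15
acc_regression = accuracy([0.2, 0.8], [0.3, 0.6])

# Categorical case with 0/1 labels: 1 - MAE equals the fraction of correct labels.
acc_classification = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
```

With 0/1 labels the absolute error of each sample is either 0 or 1, so the same formula reduces to the usual proportion of correctly classified samples.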

Fig. 3 illustrates the computed decision trees when I is predicted as a continuous and as a categorical variable from OCEAN annotations. Both decision trees obtained for the sampled FICS dataset place minimal emphasis on the O trait as an I predictor, similar to the linear regression model in Fig. 2(c). The regression and classification decision trees achieve an accuracy of 0.96 and 0.99 respectively on the sampled FICS test set. We will compare all models presented in the next section with the above annotation-based benchmarks.

To provide a flavor of how behavioral measures affect I and OCEAN impressions, Fig. 4 presents correlations among visual behavioral cues and the IOCEAN traits based on Openface (Baltrusaitis et al., 2018) outputs. FICS videos are 15s long, and upon dividing each video into non-overlapping 1s thin slices (Subramanian et al., 2013), we computed statistics for: motion of 68 facial landmarks, gaze direction vector, head pan, head tilt, and the proportion of time for which the candidates’ eyes are pointing towards the camera/viewer, over all 1s thin slices.

Focusing only on significant correlations > 0.1 in magnitude, one can observe the following: (a) consistently high facial landmark movement over all thin slices is +vely correlated with the O, E and ES traits (highly expressive candidates are open-minded, extroverted and emotionally balanced); (b) consistently high head-tilt motion is +vely correlated with all traits (head nodding is viewed as +ve behavior), and (c) intermittent landmark motion is -vely correlated with Conscientiousness (unprepared candidates are more likely to exhibit awkward/sudden facial movements). Evidently, IOCEAN annotations can be explained by visually examining the candidate’s non-verbal behavior in a fine-grained manner.
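The thin-slice analysis can be illustrated with the sketch below. It is only an assumption-laden example: the exact statistics computed per video are not recoverable from the text, so the mean and standard deviation of per-slice averages are used as stand-ins, significance testing is omitted, and all names are hypothetical.

```python
import math
from statistics import mean, pstdev


def thin_slice_stats(signal, slice_len):
    """Split a per-frame signal into non-overlapping slices of slice_len frames.

    Returns (mean, standard deviation) of the per-slice averages; a trailing
    partial slice is dropped.
    """
    slices = [signal[i:i + slice_len]
              for i in range(0, len(signal) - slice_len + 1, slice_len)]
    slice_means = [mean(s) for s in slices]
    return mean(slice_means), pstdev(slice_means)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Toy per-frame landmark-motion signal with 2 frames per slice.
avg_motion, motion_spread = thin_slice_stats([1, 1, 3, 3], slice_len=2)

# Toy per-video statistics against hypothetical trait annotations.
r = pearson([0.1, 0.4, 0.7], [0.2, 0.5, 0.8])
```

In practice one such statistic would be computed per video and then correlated with each of the six IOCEAN annotations across all sampled training videos.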

4. Behavioral Cues to Hirability

This section (a) examines the utility of various language (verbal), auditory and visual cues for HP, (b) compares HP from behavioral cues vis-à-vis the two-step process of personality estimation from behavior followed by HP from OCEAN estimates, and (c) attempts to explain prediction patterns relating to personality and hirability.

4.1. Verbal (Textual) Cues

As the FICS dataset is accompanied by transcriptions of the candidate videos (Escalante et al., 2020; Gucluturk et al., 2018), we examined the impact of candidates’ language on their apparent OCEAN and I scores via 4009 training videos and 998 test videos.

4.1.1. Experimental Settings

Videos having I score < 0.4 were considered as -ve samples, while videos with I score > 0.6 were considered as +ve samples for classification. For both continuous (regression) and categorical prediction (classification) of I scores, continuous OCEAN estimates derived from textual cues were used. The following feature extraction and regressor/classifier frameworks were examined.

Bag of Words (BoW) feature extraction: As in (Gucluturk et al., 2018), we adopted the BoW approach for text analyses. From video transcripts, stopwords were removed and we used four word categories (adjectives, adverbs, verbs and nouns) to construct our vocabulary. The 5000 most frequently appearing words were selected as features; each transcript is therefore denoted by a 5000-D vector, which was input to the following algorithms.
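A minimal sketch of this kind of BoW pipeline is given below. It makes several simplifying assumptions: the stopword list is a tiny illustrative one, the part-of-speech filtering step is omitted because it needs an external tagger, and raw term counts are used as feature values (the original weighting is not stated). All names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"the", "a", "and", "i", "to", "of", "is", "in", "my"}


def tokenize(text):
    """Lowercase the text, split it into word tokens and drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


def build_vocabulary(transcripts, top_k):
    """Return the top_k most frequent tokens over all transcripts."""
    counts = Counter(tok for text in transcripts for tok in tokenize(text))
    return [word for word, _ in counts.most_common(top_k)]


def vectorize(text, vocabulary):
    """Represent a transcript as a count vector over the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


transcripts = [
    "I love to discuss my hobbies",
    "My hobbies keep me healthy",
    "I discuss fashion and hobbies",
]
vocabulary = build_vocabulary(transcripts, top_k=2)
vector = vectorize("hobbies and more hobbies to discuss", vocabulary)
```

With `top_k` set to 5000 and a real transcript collection, each transcript would map to a 5000-D vector of the kind described above.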

Regressor/classifier frameworks: For regression, we employed the random forest (RF) and Support Vector Regressor (SVR), while for classification, we used the (a) Naive Bayes (NB) classifier provided by NLTK, (b) Binomial Naive Bayes (B-NB), (c) Logistic Regression (LR; listed as LC in Table 1), (d) Support Vector Classifier (SVC) and (e) AWD-LSTM, the stochastic gradient descent-based long short-term memory pipeline provided by FastAI.

4.1.2. Quantitative and Qualitative Results

Table 1 presents regression and classification results on the IOCEAN traits. Table 2 presents continuous/categorical I score prediction from the regression-based OCEAN estimates in Table 1. Furthermore, we found the top 10 most informative word stems for each trait via the NB classifier (Table 3). Informative word stems were identified as follows: we computed the relative likelihood of selection (S) vs rejection (R) given a stem via importance weights (IW), expressed as a ratio of the two likelihoods. Therefore, an IW of 10 implies that the selection likelihood of a candidate using the word stem is 10 times higher than that of one who does not use the stem. In short, stems with +ve IWs positively impact trait impressions, while stems with -ve IWs elicit a negative trait impression in the observer. +ve and -ve stems corresponding to the IOCEAN traits are respectively coded in green and red in Table 3.
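To make the idea of an importance weight concrete, the sketch below computes a signed likelihood ratio for a word stem. It is an illustration under stated assumptions rather than the procedure used by the authors: per-class stem rates are estimated as smoothed document frequencies, and a ratio below 1 is reported as a negative weight (the negated inverse ratio) to mimic the sign convention in Table 3. The function name, the smoothing constant and the toy data are all hypothetical.

```python
def importance_weight(stem, selected_docs, rejected_docs, alpha=0.5):
    """Signed likelihood ratio of a stem between selected and rejected classes.

    Each document is a set of stems. alpha is an additive smoothing constant
    that avoids division by zero for stems unseen in one class.
    """
    def smoothed_rate(docs):
        hits = sum(1 for doc in docs if stem in doc)
        return (hits + alpha) / (len(docs) + 2 * alpha)

    ratio = smoothed_rate(selected_docs) / smoothed_rate(rejected_docs)
    # Assumed sign convention: stems favoring rejection get a negative weight.
    return ratio if ratio >= 1.0 else -1.0 / ratio


# Toy stemmed transcripts for selected (S) and rejected (R) candidates.
selected = [{"discuss", "achiev"}, {"discuss"}, {"hobbi"}, {"discuss", "diet"}]
rejected = [{"dead"}, {"dead", "maintain"}, {"discuss"}, {"limit"}]

iw_discuss = importance_weight("discuss", selected, rejected)  # favors selection
iw_dead = importance_weight("dead", selected, rejected)        # favors rejection
```

In this toy example, "discuss" appears in three of four selected transcripts and one of four rejected ones, giving a smoothed ratio of 0.7 / 0.3; "dead" appears only in rejected transcripts, so its weight is negative.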

Table 1. Quantitative IOCEAN prediction from textual cues.
Model      I      O      C      E      A      N
Regression
RF         0.817  0.852  0.839  0.840  0.858  0.832
SVR        0.837  0.863  0.853  0.851  0.868  0.848
Classification
NB         0.632  0.618  0.638  0.605  0.617  0.639
B-NB       0.620  0.650  0.618  0.602  0.605  0.653
LC         0.606  0.610  0.602  0.592  0.578  0.643
SVC        0.665  0.651  0.664  0.631  0.647  0.676
AWD-LSTM   0.624  0.627  0.634  0.612  0.624  0.637

Table 2. HP from OCEAN measures (textual cues).
Regression      0.847  0.849  -
Classification  -      0.652  0.657

Table 3. Exemplar +ve (green) and -ve (red) word stems for the IOCEAN traits. IWs specified in brackets.
I      lucki (6.7)       dead (-6.5)      achiev (6.1)    discuss (6.1)
O      perfectli (-6.6)  limit (6.6)      young (6.6)     knowledg (-5.8)
C      fuck (-13.0)      healthy (10.6)   diet (8.2)      dead (-6.2)
E      address (-7.4)    discuss (6.6)    hobbi (6.0)     fashion (6.0)
A      mention (8.2)     discuss (6.1)    maintain (-5.8) monitor (5.5)
ES(N)  lucki (6.7)       dead (-6.5)      discuss (6.1)   maintain (-5.8)

4.1.3. Discussion

From Tables 1, 2 and 3, we make the following remarks. (a) Continuous IOCEAN estimates are more effectively predicted by all models as compared to categorical values, as per the Acc values for regression and classification in Table 1. (b) Continuous I score prediction from textual features (max Acc of 0.837) is less effective than predicting from estimated OCEAN measures (max Acc = 0.849) as per Tables 1 and 2. (c) Most importantly, intuitive connections between word stems and traits are noted via IWs. E.g., stems such as achiev, lucki and discuss are seen as +ve with respect to hirability, while dead is deemed -ve. Use of dead is also seen as a sign of anxiety, conveying high Neuroticism.

The word discuss is seen as +ve in the context of Agreeableness and Extraversion, while hobbi and fashion also convey an impression of high Extraversion consistent with Ashton’s theory that extraverts engage in attractive social activities (Ashton et al., 2002). Conscientiousness impressions, characterized by diligence and uprightness, are negatively impacted by the use of cuss words (Jay and Janschewitz, ), and positively impacted by the use of words such as healthy and diet related to well-being. Overall, while examining verbal behavior requires the generation of transcripts which is tedious/challenging, our experiments reveal the utility of such an exercise, as word choices impact both trait and hirability impressions.

Table 4. Description of extracted audio features.
MFCCs          Form a representation where frequency bands are not linear but distributed on the mel-scale
Energy         Squared-sum of signal values, normalized by the frame length
ZCR            Zero crossing rate of the signal within a particular frame
Tempo          Beats per minute
Sp. flatness   Measure to quantify noise-like trait of a sound spectrum
Sp. bandwidth  p’th-order spectral bandwidth, default p = 2
Sp. roll-off   Frequency below which 90% spectrum is concentrated
Sp. contrast   For each sub-band, compare mean energy of top quantile with mean of bottom quantile
Tonnetz        Tonal centroid features

4.2. Auditory cues

4.2.1. Feature extraction

For predicting IOCEAN traits from audio, we extracted low-level speech signal statistics via the Librosa library, and audio spectrograms. Librosa features were fed to a random forest (RF), while speech spectrograms were fed to a VGG11 (CNN) for regression/classification as in Table 5. A total of 56 audio statistics, including those for 20 MFCC coefficients (Table 4), were employed for analysis.
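Two of the simpler descriptors in Table 4 can be written out directly, which may help clarify what the low-level statistics measure. The sketch below is a plain-Python illustration and not Librosa's implementation; in particular, normalizing the zero crossing count by the number of adjacent sample pairs is an assumption, and the function names are hypothetical.

```python
def frame_energy(frame):
    """Squared sum of the signal values, normalized by the frame length."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# A rapidly alternating frame versus a constant frame.
noisy_frame = [1.0, -1.0, 1.0, -1.0]
flat_frame = [0.5, 0.5, 0.5, 0.5]

noisy_energy, noisy_zcr = frame_energy(noisy_frame), zero_crossing_rate(noisy_frame)
flat_energy, flat_zcr = frame_energy(flat_frame), zero_crossing_rate(flat_frame)
```

The alternating frame has a sign change between every pair of neighboring samples, while the constant frame has none, which is why ZCR is often read as a rough indicator of noisiness.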

4.2.2. Experimental Settings

For IOCEAN estimation, we considered the IOCEAN traits as both continuous and categorical; regression and classification results are coded as (R) and (C) respectively in Table 5. As a second step, we predicted continuous/categorical I scores from continuous/discrete OCEAN estimates (Table 6). We also adopted the thin-slice approach (as in Sec. 3) for audio analysis, aggregating 1s Librosa statistics over 2–15 second time-windows to predict continuous IOCEAN measures (Fig. 5). Results in Tables 5 and 6 correspond to 15s time windows (equal to the length of FICS videos).
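The time-window aggregation can be sketched as below. This is an illustrative reading only: it assumes that each video is represented by one feature vector per second, that a window covers the first few seconds of the clip, and that aggregation is a simple mean; none of these choices is confirmed by the text, and the names are hypothetical.

```python
from statistics import mean


def aggregate_window(per_second_features, window_seconds):
    """Average per-second feature vectors over the first window_seconds seconds."""
    if not 1 <= window_seconds <= len(per_second_features):
        raise ValueError("window must lie between 1s and the clip length")
    rows = per_second_features[:window_seconds]
    # zip(*rows) iterates over feature dimensions (columns).
    return [mean(column) for column in zip(*rows)]


# Toy clip: 3 seconds, 2 features per second.
clip = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
two_second_summary = aggregate_window(clip, 2)
full_clip_summary = aggregate_window(clip, 3)
```

Repeating this for window lengths from 2s to 15s and training a regressor on each summary would yield an accuracy-versus-window curve of the kind shown in Fig. 5.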

Table 5. Audio performance for IOCEAN estimation.
Model    I       O       C       E       A       N
RF (R)   0.8783  0.8916  0.8863  0.8819  0.8901  0.8803
CNN (R)  0.8799  0.8835  0.8809  0.8773  0.8880  0.8768
RF (C)   0.8116  0.7766  0.8036  0.7786  0.7725  0.8066
CNN (C)  0.7565  0.7455  0.7345  0.7445  0.7385  0.7615

Table 6. HP from OCEAN measures (audio cues). Labels R and C denote continuous/categorical OCEAN estimates.
Regression        Classification
RF (R)  CNN (R)   RF (R)  CNN (R)  RF (C)  CNN (C)
0.8946  0.8821    0.8156  0.7876   0.8235  0.7912
Figure 5. IOCEAN prediction from Librosa features with varying time windows.

4.2.3. Results and Discussion

We make the following remarks from our experimental results. (1) As with text analysis, continuous IOCEAN prediction is better achieved (max Acc = 0.8916) than discrete (max Acc = 0.8116). (2) Consistent with text-based results, better prediction of I scores is achieved from continuous OCEAN estimates (max Acc = 0.8946), than from audio features (max Acc = 0.8799). (3) The time-window varying experiment was designed to verify if 15s of audio data is indeed necessary for accurate IOCEAN prediction. From Fig. 5, we note that the Acc results saturate beyond 6s, reflecting that reliable trait estimation is achievable upon observing only short behavioral episodes, and conveying that a full 15s window is not needed for audio-based trait estimation. Overall, the O and A traits are best reflected by audio features, while Interview scores are not well predicted via Librosa statistics.

4.3. Visual Analysis

Non-verbal behavior cues, especially visual, have been extensively employed for human-centered applications earlier (Subramanian et al., 2010; Cummins et al., 2011; Subramanian et al., 2013; Finnerty et al., 2016). This is due to the fact that visual behaviors such as gazing, facial emotions and movements, and body movements convey a significant amount of informative and communicative cues during social interactions. Especially during interview sessions, visual behaviors can convey a lot of information (is the candidate calm or emotional when facing a tough situation?) to the interviewer.

Figure 6. Inputs to the visual model include the cropped face image (left), cropped eye region (center) and face-blurred portrait to examine the influence of holistic body movements for trait prediction.

Given the critical contribution of visual behavior to IOCEAN prediction, we opted to examine multiple visual cues different from prior HP works (Escalante et al., 2020; Gucluturk et al., 2018; Naim et al., 2018). Instead of examining only facial cues for trait prediction, we also proceeded to examine the eye and the body movements; we therefore additionally input an eye-crop and a body-crop with the face blurred (Fig. 6) to the prediction frameworks, to evaluate the contribution of eye and body movements towards IOCEAN prediction. The face and eye-crops are obtained via Openface (Baltrusaitis et al., 2018), while the face-blurred body-crop is obtained by smoothing the facial region in the video frame using a Gaussian filter, so that the facial details are not apparent to the observer.

4.3.1. Experimental settings

We considered the following prediction models in our experiments.


2D-CNN: A 19-layered VGG model, which processes 2D frame information, was used. The VGG output layer was removed, and two hidden fully-connected layers with 512 and 64 neurons respectively were added along with an output layer involving 6 neurons (one neuron each for the IOCEAN traits). Mean squared error (MSE) for regression, and binary cross-entropy (BCE) loss for classification were used during training on a single, representative frame from the video sequence, with a learning rate of 1e-4 and a batch size of 64.

3D-CNN: An 18-layered ResNet-3D model, with pre-trained weights for human activity recognition, was used. The 3D-CNN model took inputs from 16 uniformly spaced visual frames sampled from the video. The ResNet-3D output layer was removed, and two hidden fully-connected layers with 128 and 32 neurons respectively were added instead, along with a final output layer of 6 neurons. The 16 stacked frames are re-sized prior to input. Mean squared error (MSE) loss was used during training (the 3D-CNN was employed only for regression), with a learning rate of 1e-4 and a batch size of 32.


LRCNN: A Long-term Recurrent Convolutional Neural Network with a pre-trained ResNet-50 encoder and a single-layer LSTM decoder. This model takes 40 uniformly-spaced video frames as input; the encoder CNN learns 512-D features for each frame, which are fed to the LSTM decoder across different time frames. The 512-D LSTM output is fed into a linear layer of size 256, which is then connected to the final layer composed of 6 neurons. L1-loss was used for model training, with the learning rate for the pre-trained ResNet set to 1e-6, and varying between 1e-4 and 1e-5 for other layers. The Adam optimizer was used to train the LRCNN.
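Both temporal models consume a fixed number of uniformly spaced frames (16 for the 3D-CNN, 40 for the recurrent model). One simple way to pick such frames is sketched below; it assumes that the first and last frames of the clip are always included, and the frame counts used in the example (e.g. 450 frames for a 15s clip at 30 frames per second) are hypothetical, since the frame rate is not stated.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Return num_samples frame indices spread evenly over [0, total_frames - 1]."""
    if not 1 <= num_samples <= total_frames:
        raise ValueError("num_samples must be between 1 and total_frames")
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


# 16 frames for a 3D-CNN style input, from a toy 31-frame clip.
indices_16 = uniform_frame_indices(31, 16)

# 40 frames for a recurrent model, from a hypothetical 450-frame clip.
indices_40 = uniform_frame_indices(450, 40)
```

The selected frames would then be cropped (face, eye or face-blurred body), resized and stacked before being passed to the network.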

Table 7. IOCEAN Regression from visual cues: 2D, 3D and LRC refer to 2D-CNN, 3D-CNN and LRCNN. Codes F, E and B denote facial, eye and body cues.
   2D(F)  2D(B)  2D(E)  3D(F)  3D(B)  LR(F)  LR(B)  LR(E)
I  0.897  0.909  0.869  0.910  0.903  0.902  0.896  0.891
O  0.897  0.903  0.881  0.903  0.903  0.896  0.896  0.887
C  0.895  0.909  0.876  0.904  0.900  0.901  0.894  0.889
E  0.894  0.898  0.874  0.908  0.897  0.895  0.892  0.886
A  0.893  0.902  0.878  0.904  0.902  0.900  0.896  0.892
N  0.889  0.896  0.867  0.902  0.895  0.892  0.885  0.882

Table 8. IOCEAN Classification from visual cues: 2D, 3D and LRC refer to 2D-CNN, 3D-CNN and LRCNN respectively. Codes F, E and B denote facial, eye and body cues.
   2DC (F)  2DC (B)  2DC (E)  LRC (F)  LRC (E)
I  0.7856   0.8287   0.7101   0.8106   0.7934
O  0.7525   0.7826   0.7011   0.7675   0.7512
C  0.7776   0.8267   0.7101   0.7996   0.7853
E  0.7545   0.7745   0.6911   0.7916   0.7733
A  0.7295   0.7796   0.6670   0.7605   0.7442
N  0.7766   0.8307   0.7081   0.8036   0.7944

Table 9. HP from continuous OCEAN estimates. 2D, 3D and LR refer to 2D-CNN, 3D-CNN and LRCNN. F, E and B codes in brackets stand for facial, eye and body cues. R/C codes denote continuous/categorical HP.
2D(FR)  2D(BR)  2D(ER)  3D(FR)  3D(BR)  LR(FR)  LR(BR)  LR(ER)
0.90    0.91    0.87    0.91    0.91    0.91    0.91    0.90
2D(FC)  2D(BC)  2D(EC)  3D(FC)  3D(BC)  LR(FC)  LR(BC)  LR(EC)
0.82    0.85    0.74    0.87    0.86    0.85    0.84    0.83

4.3.2. Results & Discussion

Tables 7, 8 and 9 present trait predictions from the multiple visual cues. From Table 7, which estimates continuous IOCEAN values from the face, eye and body cues, we make the following remarks: (1) In terms of general predictive power, the 3D-CNN is more potent than the 2D-CNN and LRCNN frameworks. The highest Acc values are often observed with the 3D-CNN, while the 2D-CNN and LRCNN perform slightly worse. (2) An interesting finding is that the eye and body cues achieve performance comparable to the face cue. This is particularly important as it opens up the possibility of AHAs being able to examine video CVs and make reasonable trait-related decisions while honoring the candidate’s privacy (processing only a mid-to-low resolution image of the eye, or blurring the face, will render the facial information unusable as a biometric). (3) The face cue is nevertheless critical, and produces the best prediction for the Interview trait. (4) Among OCEAN traits, C and A are the two best-predicted traits from visual cues.

Focusing on IOCEAN classification results in Table 8, in line with the text and audio-based results, considerably lower Acc values than regression are noted for classification. Interestingly, body cues produce the best categorical IOCEAN estimates, and achieve considerably better performance than face or eye cues. This result indicates that a fine-grained visual examination of the candidate’s behavior may not be necessary to make a coarse-grained decision (i.e., suitable or unsuitable) regarding the candidate’s hirability; a distant examination could still be adequate. Among IOCEAN traits, N is predicted best based on body cues by the 2D-CNN, which is revealing as the N trait is associated with anxiety, which may manifest via body-fidgeting, etc.

Examining Table 9, which presents continuous/categorical HP from OCEAN estimates, we again note that is achieved for all conditions (second table row), except with the 2D-CNN employing eye information. The best prediction of categorical I labels (Acc = 0.87) is achieved when continuous OCEAN scores are estimated employing facial information; this implies that reasonable coarse-grained hirability decisions are possible even when only accurate OCEAN estimates are available to the AHA in lieu of a multimedia CV.

4.3.3. Explaining visual Predictions

While the above inferences may be logically derived from experimental results, we explored whether the visual predictions can also be explained. Prior works (Gucluturk et al., 2018; Escalante et al., 2020) show some visual correlates of the IOCEAN traits without explicitly showing where their predictive models are looking. In contrast, we employed the Grad-CAM algorithm (Selvaraju et al., 2019) to highlight image regions deemed important for a trait prediction. With Grad-CAM, gradients of the IOCEAN output neurons are used to compute a weighted sum of the convolutional layer output maps; the resulting attention maps depict where the network looks in order to predict the trait. We generated attention maps for the IOCEAN traits highlighting important facial and body cues (Figures 7 and 8).
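
The weighted-sum step of Grad-CAM can be illustrated with a small, framework-free sketch. It assumes the convolutional feature maps and the gradients of one trait output with respect to them have already been extracted from the network; the function name `grad_cam` and the toy values are hypothetical, and the final rescaling to [0, 1] is a common display convention rather than something stated in the text.

```python
def grad_cam(feature_maps, gradients):
    """Combine conv feature maps into one attention map, Grad-CAM style.

    feature_maps[k][i][j]: activation of channel k at location (i, j).
    gradients[k][i][j]: gradient of the trait output w.r.t. that activation.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])

    # Channel weight = gradient averaged over all spatial locations.
    weights = [
        sum(sum(row) for row in grad_k) / (height * width)
        for grad_k in gradients
    ]

    # Weighted sum of the channels, followed by ReLU so that only
    # locations pushing the trait score up are kept.
    cam = [
        [
            max(0.0, sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Rescale to [0, 1] for display as a heat map (skipped if all zeros).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: two 2x2 channels with hand-picked gradients (hypothetical values).
maps = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
attention = grad_cam(maps, grads)
```

In this toy case the second channel receives a negative weight, so the locations it activates are suppressed by the ReLU and the map peaks where the first channel is strongest.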

Fig. 7 shows Grad-CAM outputs for a high and a low trait exemplar. One can note that the attention maps relate to the eye and the mouth regions for the IOCEAN traits, which are likely to be of interest to a human interviewer as well. Conscientiousness is one (possible) exception where attention is more localized to the eyes. Conscientiousness is associated with sincerity and uprightness, and is traditionally gauged from eye-movement cues (Hoppe et al., 2018). Conversely, when the face is blurred so as to make the facial cues indecipherable (Fig. 8), the attention maps are focused around the neck region, hand movements and clothing. When the face is represented as a blob, the neck region becomes important as it determines the relative orientation between the face and body. These visual explanations cumulatively convey the importance of eye and mouth movements, hand gestures and attire for HP.

Figure 7. Exemplar Grad-CAM outputs for a person eliciting high trait scores (top) and low trait scores (bottom). Eyes are the primary cue for eliciting apparent Conscientiousness impressions, while other traits are influenced by holistic facial structure and facial emotions. Best viewed in color.
Figure 8. Exemplar Grad-CAM outputs on blurred face portraits for a person eliciting high trait scores (top) and low trait scores (bottom). Attention maps indicate a focus on the neck region, which determines the relative orientation between the face and body, hand gestures and clothing. Best viewed in color.

5. Discussion & Conclusion

At the outset, the objectives of this work were two-fold: (1) to explicitly and rigorously explore the correlations between hirability and the OCEAN personality traits, given that this dependence has been exploited earlier in a limited way (Gucluturk et al., 2018; Escalante et al., 2020), and (2) to provide explanations supporting IOCEAN predictions made by the multimodal behavioral models. Based on the experimental results, we conclude that this work has substantially achieved both objectives.

With respect to (1), we note that continuous/categorical HP from OCEAN estimates, which are in turn obtained from audio, visual and verbal behaviors, is more effective than predicting hirability directly from behavioral measures. While this may seem surprising, we believe that this result is a consequence of designing a simple HP model with only the OCEAN traits as predictors, rather than a ‘black-box’ model with high-dimensional inputs but limited interpretability.
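
The second step of this two-step process can be sketched as a small regression from five trait estimates to an interview score. The example below is only illustrative: it assumes an ordinary least-squares model, synthetic OCEAN estimates in place of the output of a real behavioral model, a made-up generating rule `true_w`, and a 0.5 cut-off for the categorical label. None of these choices are taken from the experiments.

```python
import random


def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X'X) w = X'y."""
    rows = [x + [1.0] for x in X]  # append a bias term to every sample
    n = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, ocean):
    """Linear score from five trait values plus the bias weight."""
    return sum(w * v for w, v in zip(weights, ocean + [1.0]))


# Synthetic stand-ins for step 1: OCEAN estimates in [0, 1] for 50 candidates.
rng = random.Random(0)
ocean_estimates = [[rng.random() for _ in range(5)] for _ in range(50)]
# Hypothetical generating rule, used only to create toy interview scores.
true_w = [0.10, 0.30, 0.20, 0.25, -0.15, 0.20]
interview = [predict(true_w, o) for o in ocean_estimates]

# Step 2: regress the interview score on the five trait estimates.
w = fit_linear(ocean_estimates, interview)
continuous_hp = [predict(w, o) for o in ocean_estimates]
# Categorical HP by thresholding; the 0.5 cut-off is an assumption.
categorical_hp = [score >= 0.5 for score in continuous_hp]
```

With only six parameters, each fitted weight can be read directly as the contribution of one trait to the predicted score, which is the interpretability advantage over a high-dimensional black-box model.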

Regarding (2), we note that all considered modalities and features provide some explanations towards IOCEAN prediction. With respect to text, we found that IWs of word stems are highly informative; e.g., use of the word dead negatively impacts hiring impressions, and conveys anxiety (an indicator of Neuroticism). The word stems hobbi and fashion convey a high level of Extraversion. Apparent Conscientiousness is negatively impacted by cuss words, but positively by words relating to well-being. While audio-related explanations are not explicitly presented, we note from Figure 5 that IOCEAN predictions saturate beyond 6 s time-windows, implying that brief behavioral episodes suffice for reliable trait prediction.

Visual cues are also highly informative, as confirmed by both quantitative and qualitative results. Quantitative results show that the eye and body cues achieve IOCEAN prediction comparable to face cues. This is a useful result, as processing facial information incapable of revealing identity would assuage candidates’ privacy concerns. That body cues can achieve high accuracy on categorical IOCEAN prediction implies that fine-grained behavioral analytics may not be necessary for making coarse-grained decisions. Also, Table 9 conveys that coarse hiring decisions are possible solely based on a candidate’s OCEAN estimates. Grad-CAM visualizations show the influence of eye and mouth movements, hand movements and attire on hirability.

Limitations of this study include (a) experiments on only the sampled FICS dataset involving 5K videos, with a clear-cut distinction between high and low-hirability observations; this was nevertheless designed to elicit predictive explanations, and (b) experiments on only the FICS dataset. Future work will focus on validating, extending and generalizing current results via experimentation on multiple datasets.


  • M. Ashton, K. Lee, and S. Paunonen (2002) What is the central feature of extraversion? social attention versus reward sensitivity. Journal of personality and social psychology 83, pp. 245–52.
  • T. Baltrusaitis, A. Zadeh, Y. Lim, and L. Morency (2018) OpenFace 2.0: facial behavior analysis toolkit. pp. 59–66.
  • L. Batrinca, B. Lepri, N. Mana, and F. Pianesi (2012) Multimodal recognition of personality traits in human-computer collaborative tasks. In International Conference on Multimodal Interaction, ICMI ’12, New York, NY, USA, pp. 39–46. ISBN 9781450314671.
  • M. Bilalpur, M. Kankanhalli, S. Winkler, and R. Subramanian (2018) EEG-based evaluation of cognitive workload induced by acoustic parameters for data sonification. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, ICMI ’18, New York, NY, USA, pp. 315–323. ISBN 9781450356923.
  • M. Bilalpur, S. M. Kia, M. Chawla, T. Chua, and R. Subramanian (2017) Gender and emotion recognition with implicit user signals. In International Conference on Multimodal Interaction, ICMI ’17, New York, NY, USA, pp. 379–387. ISBN 9781450355438.
  • [6] D. Costa 5 Reasons why you should use personality assessment in recruitment.
  • N. Cummins, J. Epps, M. Breakspear, and R. Goecke (2011) An investigation of depressed speech detection: features and normalization. pp. 2997–3000.
  • J. H. Escalante, M. Madadi, S. Ayache, E. Viegas, F. Gurpinar, S. A. Wicaksana, C. Liem, A. J. V. M. Gerven, V. R. Lier, H. Kaya, A. A. Salah, S. Escalera, Y. Gucluturk, U. Guclu, X. Baro, I. Guyon, and C. S. J. Jacques (2020) Modeling, recognizing, and explaining apparent personality from videos. IEEE Transactions on Affective Computing, pp. 1–1.
  • A. N. Finnerty, S. Muralidhar, L. S. Nguyen, F. Pianesi, and D. Gatica-Perez (2016) Stressful first impressions in job interviews. In ACM International Conference on Multimodal Interaction, ICMI ’16, New York, NY, USA, pp. 325–332. ISBN 9781450345569.
  • Y. Gucluturk, U. Guclu, X. Baro, H. J. Escalante, I. Guyon, S. Escalera, M. A. J. van Gerven, and R. van Lier (2018) Multimodal first impression analysis with deep residual networks. IEEE Transactions on Affective Computing 9 (3), pp. 316–329. ISSN 1949-3045.
  • B. W. Haas, M. Brook, L. Remillard, A. Ishak, I. W. Anderson, and M. M. Filkowski (2015) I know how you feel: the warm-altruistic personality profile and the empathic brain. PLOS ONE 10 (3), pp. 1–15.
  • S. Hoppe, T. Loetscher, S. A. Morey, and A. Bulling (2018) Eye movements during everyday behavior predict personality traits. Frontiers in human neuroscience 12 (105).
  • [13] IBM artificial intelligence can predict with 95
  • [14] T. Jay and K. Janschewitz The science of swearing.
  • K. Lukanov, H. A. Maior, and M. L. Wilson (2016) Using fnirs in usability testing: understanding the effect of web form layout on mental workload. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, New York, NY, USA, pp. 4011–4016. ISBN 9781450333627.
  • I. Naim, Md. I. Tanveer, D. Gildea, and M. E. Hoque (2018) Automated analysis and prediction of job interview performance. IEEE Transactions on Affective Computing 9 (2), pp. 191–204.
  • V. Parekh, P. S. Foong, S. Zhao, and R. Subramanian (2018) AVEID: automatic video system for measuring engagement in dementia. In 23rd International Conference on Intelligent User Interfaces, IUI ’18, New York, NY, USA, pp. 409–413. ISBN 9781450349451.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2019) Grad-cam: visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision 128 (2), pp. 336–359. ISSN 1573-1405.
  • A. Shukla, S. S. Gullapuram, H. Katti, M. Kankanhalli, S. Winkler, and R. Subramanian (2020) Recognition of advertisement emotions with application to computational advertising. IEEE Transactions on Affective Computing, pp. 1–1.
  • R. Subramanian, J. Staiano, K. Kalimeri, N. Sebe, and F. Pianesi (2010) Putting the pieces together: multimodal analysis of social attention in meetings. In International Conference on Multimedia, MM ’10, New York, NY, USA, pp. 659–662. ISBN 9781605589336.
  • R. Subramanian, Y. Yan, J. Staiano, O. Lanz, and N. Sebe (2013) On the relationship between head pose, social attention and personality prediction for unstructured and dynamic group interactions. In International Conference on Multimodal Interaction, New York, NY, USA, pp. 3–10.
  • A. Vinciarelli, W. Riviera, F. Dalmasso, S. Raue, and C. Abeyratna (2019) What do prospective students want? an observational study of preferences about subject of study in higher education. In Innovations in Big Data Mining and Embedded Knowledge, A. Esposito, A. M. Esposito, and L. C. Jain (Eds.), Intelligent Systems Reference Library, Vol. 159, pp. 83–97.