Feature Learning from Spectrograms for Assessment of Personality Traits

10/04/2016 ∙ by Marc-André Carbonneau, et al.

Several methods have recently been proposed to analyze speech and automatically infer the personality of the speaker. These methods often rely on prosodic and other hand-crafted speech processing features extracted with off-the-shelf toolboxes. To achieve high accuracy, numerous features are typically extracted using complex and highly parameterized algorithms. In this paper, a new method based on feature learning and spectrogram analysis is proposed to simplify the feature extraction process while maintaining a high level of accuracy. The proposed method learns a dictionary of discriminant features from patches extracted in the spectrogram representations of training speech segments. Each speech segment is then encoded using the dictionary, and the resulting feature set is used to perform classification of personality traits. Experiments indicate that the proposed method achieves state-of-the-art results with a significant reduction in complexity when compared to the most recent reference methods. The number of features and the difficulties linked to the feature extraction process are greatly reduced, as only one type of descriptor is used, whose 6 parameters can be tuned automatically. In contrast, the simplest reference method uses 4 types of descriptors to which 6 functionals are applied, resulting in over 20 parameters to be tuned.




1 Introduction

People spontaneously infer the personality of others from a wide range of cues. These cues may be visual, like facial expressions or posture, and may also be aural, like intonations, choice of words or voice timbre. This assessment of personality traits naturally influences the way we interact with each other [1]. The method proposed in this paper aims at performing this assessment automatically.

Being able to accurately predict the personality of an interlocutor is an important step toward better human-machine interactions. For example, people attribute personality traits to machines and interact differently with them depending on this perceived personality. For instance, extroverted people will interact longer with robots they perceive as extroverted [2]. Detecting and understanding a person’s personality would enable a machine to adapt its behavior to the user. It can also be used in e-learning applications by giving appreciative feedback on the personality projected by users to help them improve their leadership or sales skills.

In the literature, five personality traits (the Big-Five) corresponding to psychological phenomena are observable regardless of the situation and culture: openness, conscientiousness, extroversion, agreeableness and neuroticism [3]. These traits influence the way people act and speak. For instance, in [4] a correlation is established between openness and neuroticism and the probability of maintaining a blog. The choice of words by a subject based on his/her personality traits has also been studied in informal texts [5], conversations [6] and on social media [7].

In the 2012 edition of the Interspeech competition on paralinguistics, one of the challenges was personality trait assessment from speech. This has motivated the proposal of several methods for this task. The baseline systems for the competition were designed using SVM and random forest (RF) classifiers trained with 6125-dimensional feature vectors [8]. They performed particularly well, and only two contestants were able to surpass their performance on the test set. It was observed that increasing the number of features tends to increase recognition performance [8], thus large feature sets were extracted in the hope of capturing more of the relevant discriminant information. Some of the features were redundant or non-informative, which motivated some contestants to use feature selection on the set of 6125 features [9, 10, 11]. The winners of the competition [12] added 21760 spectral features to the baseline feature set before performing selection.

Since 2012, the 6125-dimensional feature set of the Interspeech competition baseline system has grown even larger. In 2015, it had increased to 6373 dimensions [13]. Many of these features are statistics on the usual prosody features such as pitch, formants and energy, as well as more complex features, such as log harmonics-to-noise ratio, harmonicity and psycho-acoustic spectral sharpness. All of these application-specific feature extraction techniques require a fair amount of knowledge and experience in speech processing to tune their parameters, select thresholds, pre-process data, etc. Moreover, results may vary significantly from one implementation to another, which limits the reproducibility of the experiments.

For instance, the RASTA [14] algorithm was used to extract several features in the challenge. It involves the selection of filter coefficients, non-linear compression and expansion functions and their respective parameters [14]. Also, several features related to local extrema were used, such as first and second order statistics on inter-maxima distance. The detection of these extrema necessitates a peak selection algorithm which must be tuned to achieve a high level of performance [15].

Many practitioners use software tools to extract prosody features, which accelerates the design of recognition solutions. However, even if these tools contain complete implementations of feature extraction algorithms, expertise in speech processing is required to configure the many parameters and options of each module. For instance, in Praat [16] there are 5 different methods for pitch extraction, each with 3 to 9 parameters to be set. In openSMILE [17], one must choose between the cPitchACF object (4 parameters) and the cPitchShs object (9 parameters) to extract pitch, which in turn must be configured. The user may also use a pitch smoother, where four more parameters must be set. There are even more parameters to consider when extracting formants.

Aside from the complexity and variability of these feature extraction procedures, the use of large feature sets reduces the generalization capability of pattern recognition algorithms. Indeed, larger feature spaces are subject to problems associated with the curse of dimensionality [19]. The exponential growth of the search space increases the amount of data needed to obtain a statistically significant representation of the data. Generally, in affective computing contexts, data is limited because collection is costly, which calls for more compact data representations. Moreover, smaller feature sets are desirable because they allow for faster training and classification.

The difficulties described above have been discussed by several researchers in the affective speech recognition community. The CEICES initiative attempted to create a standardized set of features for emotion recognition in speech [20]. The proposed set is a combination of 381 acoustic and lexical features selected from a pool of 4024 features that the authors have successfully used in their previous research. While the collection of features was standardized, the implementation of the feature extraction algorithms was not. Recently, another attempt has been made to reduce the size of the feature collection used for automatic voice analysis [18]. A minimal number of descriptors were selected based on theoretical and empirical evidence. While the minimal and extended sets are compact (62 and 88 features respectively), several different algorithms are used for the extraction of the descriptors. Each algorithm requires expertise and a careful parametrization. (The feature set has been made publicly available through the openSMILE toolkit [17].)

In this paper, a method inspired by the recent developments in feature learning and image classification is proposed to reduce the burden of these design choices for automatic assessment of personality traits. The temporal speech signals are translated into spectrogram images. Small sub-images, called patches, are densely extracted from these spectrogram images, and used during training to learn a feature dictionary yielding a sparse representation. The dictionary is used to encode each of the local patches. Each spectrogram is thus represented as a collection of encoded patches, which are pooled to create a histogram representation of the entire spectrogram. These histograms are used to train a classifier. During testing, a new speech signal is represented by a histogram, using the same dictionary, before classification.

The proposed method of representation, which is based on local patches, makes it possible to capture para-linguistic information compactly. Because it encodes raw parts of the spectrogram images, the representation is richer than methods which characterize speech signals with statistics on the whole signal [21, 18, 8]. For instance, these methods use the mean, the standard deviation, kurtosis, min and max of the pitch or of spectrum and cepstrum bins, which discards the relevant cues for personality assessment that the local shape of the signal contains. Moreover, when compared to these methods, the proposed method has fewer parameters, which can be more easily tuned using standard automatic hyper-parameter optimization techniques (e.g. cross-validation). Finally, the method inherits the robustness to deformation and noise of local image recognition methods applied to spectrogram analysis [22, 23].

In essence, the proposed method leverages the power of representation inherent to sparse modeling, which learns features from the data. This approach generally leads to a high level of accuracy [24]. The dimensionality of the feature vectors needed for this level of performance is reduced by an order of magnitude when compared to the number of features used in the Interspeech challenges. Moreover, only one method is used for feature extraction, which limits the number of parameters needing careful tuning. Finally, the proposed technique does not necessitate a feature selection stage, which is usually time-consuming during training.

The proposed method is compared to 6 reference methods on the SSPNet Speaker Personality Corpus used in the Interspeech 2012 competition. As stated in the postmortem report of the challenge published in 2015 [15], research in automated recognition of speaker traits is still active, and still requires much exploration to isolate suitable features and models for this task. In this regard, the novel technique proposed in this paper aims to provide a simpler alternative for extraction of a compact set of features that achieve state-of-the-art results.

The rest of the paper is organized as follows: The next section provides background information on feature learning in the context of speech analysis. Section 3 describes the proposed method. Section 4 presents the experimental data, protocol and reference methods. The results are analyzed in Section 5.1.

2 Feature Learning for Speech Analysis

Feature learning algorithms extract relevant features themselves, instead of relying on hand-engineered representations, which are time-consuming to obtain and often sub-optimal. Feature learning has been used in several speech analysis applications. Some methods use deep neural networks, which intrinsically learn features, to perform automatic speech recognition (ASR) [25, 26]. These systems are not suitable for personality trait recognition because they analyze local time series (e.g. a phoneme), and fail to capture the global information in a speech segment. Deep learning has also been used for automatic emotion recognition. In [27], a neural network learns a feature representation, not from the raw signal, but from a set of prosodic, spectral and video features. In [28], utterances were represented using a sparse auto-encoder to perform transfer learning in an emotion recognition task. In [29], base features were learned using ICA on spectrograms. After a feature selection process, the selected features were combined at a higher hierarchical level, using non-negative sparse coding. These feature combinations were used with an HMM to perform ASR.

Feature learning can be performed on several types of signal representation. When a speech signal is represented as a spectrogram (i.e. the concatenation in time of windowed Discrete Fourier Transforms (DFTs)), it can be analyzed through image processing. It has been demonstrated by neuroscientists that the same parts of the brain can be used to process both visual and audio signals [30]. This has motivated several researchers to investigate the application of image recognition techniques to spectrograms to analyze and recognize sound and speech signals. For example, histograms of oriented gradients (HOG) were used to perform word recognition [31]. In [32], spectrogram amplitudes are quantized and mapped into a color-coded image. Color distributions are then characterized and analyzed. This method is inspired by content-based image retrieval methods [33]. In [23], spectrograms and cochleograms are divided into frequency sub-bands and analyzed as visual textures using gray-tone spatial dependence matrix features [34] alongside cepstral features. Audio spectrograms were employed with a convolutional deep Bayesian network, typically used for image recognition, to perform speaker identification and gender classification [35]. The representation achieved a higher recognition performance when compared to MFCC and raw spectrograms. Gabor functions (sinusoids tapered by a decaying exponential) were found to be good models of receptive fields in the human visual cortex [36]. This has motivated several authors to apply log-Gabor filter banks to spectrograms [37, 38] to analyze paralinguistics.

A popular paradigm for image analysis is to extract features locally (instead of globally) from salient regions of an image, called patches. The set of patches is used to represent an entire image. This type of approach, often called bag-of-words, has been successfully applied in numerous contexts for recognition in images [39, 40] and video [41, 42]. Using local features in image recognition may lead to an increased robustness to intra-class variation, deformation, view-point, illumination and occlusion [43]. When working with spectrograms, it translates to an increased robustness to noise [22, 32]. In [44], the SIFT descriptor was used to detect and encode key-points in spectrogram images of musical pieces to perform genre classification. Schutte proposed a deformable part-based model of local spatio-temporal features in speech recognition [22]. The method improved recognition performance over the HMM baseline system, especially in the presence of noise.

Local-based methods in image recognition often exploit a set of predefined bases for decomposition such as wavelets, wedgelets and bandlets [45]. However, it has been shown that learning the basis directly on the data leads to a higher level of accuracy in several applications such as signal reconstruction [46], image classification [47] and image reconstruction [48]. Based on these results, several recently proposed spectrogram analysis methods learn representations on training data in order to benefit from the improved performance. For instance, in [49] the spectrograms are segmented at different scales, and each segment is encoded as the most resembling word in a dictionary learned using the k-means algorithm. In [50] the spectrograms of musical instruments are interpreted as visual textures. Sounds are represented by a vector encoding the resemblance between the spectrogram and a randomly constituted dictionary.

In the aforementioned dictionary-based methods, local descriptors are associated with the most representative code-word in the dictionary. Some algorithms use sparse coding to perform this association and learn a representation [46, 51]. Sparse coding is a type of feature learning which expresses a signal using a small number of basis vectors from a learned set, usually called a dictionary. Experiments have shown that encoding audio and visual signals using a sparse decomposition can lead to a high level of accuracy for various tasks such as speaker, gender and phoneme recognition [35]. Also, it was shown that a learned sparse representation of audio signals is akin to the early mammalian auditory system [52]. This is why several recent methods use sparse coding to learn the dictionary and encode signals.

Figure 1: Block diagram of the proposed system for the prediction of a personality trait. The upper part illustrates the operations performed during training. The lower part illustrates the sequence of operations performed to process an input speech sequence during testing.
Figure 2: Example of spectrogram extracted from a speech file in the SSPNet database. White indicates high values while black indicates low values.

In the context of personality assessment from speech, paralinguistic cues have to be analyzed globally. Methods used in other speech analysis applications, such as ASR, fail to capture this global information. Existing methods for personality recognition capture global information using statistical operators on low-level features. Unfortunately, this results in a high-dimensional representation, which is prone to the curse of dimensionality, and requires a fair amount of signal processing expertise to extract the low-level features. The proposed method represents a complete speech segment as an image and then uses image recognition techniques, and can thus perform global analysis. Moreover, it uses a feature learning approach, which reduces the burden associated with feature engineering, yields a compact representation, and leads to increased recognition performance.

3 Proposed Feature Learning Method

This section presents a new method for predicting personality traits in speech based on spectrogram analysis and feature learning. The main stages of the proposed method are depicted in Figure 1. Specific details regarding our proposed solution for feature extraction, classification and dictionary learning are described in the next sections. The upper part is the pipeline for training. First, for each speech segment $F$ in the data set, a spectrogram $\mathbf{S}$ is extracted by applying a Fourier transform on a sliding window, yielding a 2-dimensional matrix. Small sub-matrices, called patches, are then uniformly extracted from all the spectrogram matrices in the training set. A dictionary is learned from these patches and, at the same time, the patches are encoded as sparse vectors called code-words. A single $k$-dimensional feature vector representation $\mathbf{h}$ is obtained for each training speech sample by pooling together all code-words extracted from it. A two-class support vector machine (SVM) classifier is trained using these feature vectors for each personality trait.

The lower part is the pipeline used during testing to predict a personality trait. As in training, patches are extracted from the spectrograms. Each patch is encoded using the previously learned dictionary. The resulting code-words are then pooled to create a feature vector that is fed to a 2-class classifier to obtain a label indicating which end of the spectrum of a specific personality trait the speech segment corresponds to.

3.1 Feature Extraction

Given a speech segment $F$, the spectrogram $\mathbf{S}$ is the concatenation in time of its windowed DFTs:

$$\mathbf{S} = [\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_T] \tag{1}$$

where $\mathbf{s}_t$ is a column vector containing the absolute amplitude of the DFT frequency bins and $T$ is the number of DFTs extracted from the signal. The absolute amplitude is favored over the log-amplitude, as it has been shown to yield better results for spectrogram image classification in [32] and in our own experiments. The spectrograms are normalized: each frequency bin is divided by the maximum amplitude value contained in a time frame. This process results in a 2-D matrix which can be analyzed as a grey-scale image. An example of a spectrogram extracted from the SSPNet Speaker Personality Corpus is illustrated in Figure 2.
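The spectrogram construction described above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the 128-sample Hamming window and 75% overlap (a hop of 32 samples) are taken from Section 4, while the one-sided spectrum (65 bins), the naive DFT, the function names and the reading of the normalization as "divide each frame by its own maximum" are assumptions made for the example.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrogram(signal, win_len=128, hop=32):
    """Return the spectrogram as a list of frames (time-major).

    Each frame holds the absolute DFT amplitude for bins 0..win_len//2
    (one-sided spectrum, an assumption of this sketch).
    """
    window = hamming(win_len)
    n_bins = win_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(win_len)]
        frame = []
        for k in range(n_bins):
            # Naive O(N^2) DFT; an FFT would be used in practice.
            acc = sum(chunk[i] * cmath.exp(-2j * math.pi * k * i / win_len)
                      for i in range(win_len))
            frame.append(abs(acc))
        frames.append(frame)
    return frames


def normalize_frames(frames):
    """Divide every bin of a frame by that frame's maximum amplitude."""
    normalized = []
    for frame in frames:
        peak = max(frame)
        normalized.append([v / peak if peak > 0 else 0.0 for v in frame])
    return normalized


# Toy input: 0.1 s of a 1 kHz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(800)]
S = normalize_frames(magnitude_spectrogram(tone))
```

With these settings the toy tone produces 22 frames of 65 bins each, and the energy concentrates in the bin closest to 1 kHz (bin 16, since each bin spans 62.5 Hz).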

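From the matrix $\mathbf{S}$, small patches, or sub-images, of $p \times p$ pixels are extracted at regular intervals. A vector representation of each patch ($\mathbf{x}_i \in \mathbb{R}^{m}$, with $m = p^2$) is obtained by concatenating the values of all its pixels. The vector $\mathbf{x}_i$ is encoded into $\boldsymbol{\alpha}_i$ using a previously learned dictionary $\mathbf{D}$ containing $k$ atoms (more details in Section 3.3). These atoms are basis vectors that are used to reconstruct the patches. The code-vector $\boldsymbol{\alpha}_i$ corresponding to the patch $\mathbf{x}_i$ is obtained by solving

$$\boldsymbol{\alpha}_i = \arg\min_{\boldsymbol{\alpha} \in \mathbb{R}^{k}} \frac{1}{2}\|\mathbf{x}_i - \mathbf{D}\boldsymbol{\alpha}\|_2^2 + \lambda \|\boldsymbol{\alpha}\|_1 \tag{2}$$

using the LARS-Lasso algorithm [53]. The loss function has two terms, each encoding an optimization objective, and $\lambda$ is a parameter used to adjust the relative importance of the two terms. The first term is the quadratic reconstruction error, while in the second term, the $\ell_1$ norm of the code vector is used to enforce sparseness. Once a code $\boldsymbol{\alpha}_i$ is obtained for each patch $\mathbf{x}_i$, the absolute values of all the codes are summed to obtain a histogram $\mathbf{h}$ describing the entire spectrogram $\mathbf{S}$:

$$\mathbf{h} = \sum_{i=1}^{n} |\boldsymbol{\alpha}_i| \tag{3}$$

where the absolute value is taken element-wise and $n$ is the number of patches extracted from $\mathbf{S}$. These histograms represent the distribution of patches over speech segments. It is thus possible to directly compare segments of different length.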
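To make the encoding and pooling steps concrete, the following sketch encodes patch vectors against a fixed dictionary and sums the absolute codes into a histogram. It assumes the usual Lasso objective (half the squared reconstruction error plus a weighted $\ell_1$ penalty) and, for brevity, solves it with plain iterative soft-thresholding (ISTA) rather than the LARS-Lasso algorithm used in the paper. The function names and the tiny orthonormal dictionary are purely illustrative.

```python
def soft_threshold(value, threshold):
    """Shrink a scalar toward zero by `threshold` (proximal step of the l1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def encode_patch(x, atoms, lam, n_iter=200):
    """Approximately minimise 0.5 * ||x - D a||^2 + lam * ||a||_1 with ISTA.

    x     : patch vector of length m
    atoms : dictionary given as a list of k atoms, each of length m
    NOTE: ISTA is a stand-in for LARS-Lasso; both target the same objective.
    """
    k, m = len(atoms), len(x)
    # The squared Frobenius norm upper-bounds the Lipschitz constant of the gradient.
    lipschitz = sum(v * v for atom in atoms for v in atom) or 1.0
    code = [0.0] * k
    for _ in range(n_iter):
        recon = [sum(code[j] * atoms[j][i] for j in range(k)) for i in range(m)]
        resid = [recon[i] - x[i] for i in range(m)]
        grad = [sum(atoms[j][i] * resid[i] for i in range(m)) for j in range(k)]
        code = [soft_threshold(code[j] - grad[j] / lipschitz, lam / lipschitz)
                for j in range(k)]
    return code


def pool_codes(codes):
    """Sum the absolute values of all codes into one k-bin histogram."""
    k = len(codes[0])
    return [sum(abs(code[j]) for code in codes) for j in range(k)]


# Toy example with an orthonormal 2-atom dictionary.
atoms = [[1.0, 0.0], [0.0, 1.0]]
code = encode_patch([1.0, 0.2], atoms, lam=0.5)
histogram = pool_codes([code, [-0.2, 0.3]])
```

For an orthonormal dictionary the Lasso solution is simply the soft-thresholded projection of the patch onto each atom, so the first toy patch is encoded as approximately (0.5, 0): the weak second component is suppressed by the sparsity penalty.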

3.2 Classification

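The speech segments are represented by histograms, and thus an appropriate distance measure should be employed. Several distance measures have been proposed to compare histograms. In this paper’s implementation, the $\chi^2$ distance is used because it showed competitive performance for visual bag-of-words histograms [43]. The $\chi^2$ distance is given by:

$$\chi^2(\mathbf{h}, \mathbf{y}) = \frac{1}{2}\sum_{i=1}^{k} \frac{(h_i - y_i)^2}{h_i + y_i} \tag{4}$$

where $h_i$ and $y_i$ are the bins of histograms $\mathbf{h}$ and $\mathbf{y}$, and $k$ corresponds to the number of words in the dictionary.

In this paper, $\chi^2$ is used in an SVM framework with an exponential kernel [54]:

$$K(\mathbf{h}, \mathbf{y}) = \exp\left(-\frac{1}{\gamma}\,\chi^2(\mathbf{h}, \mathbf{y})\right) \tag{5}$$

where the parameter $\gamma$ controls the kernel size.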
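A histogram distance and exponential kernel of this kind can be sketched as follows. The sketch assumes the common chi-square form with a factor of 1/2 and a kernel of the form exp(-distance / gamma); the constant factor and the way the kernel parameter enters are assumptions of this example, as are the function names and the small `eps` guard against empty bins.

```python
import math


def chi2_distance(h, y, eps=1e-12):
    """Chi-square distance between two non-negative histograms of equal length.

    Bins where both histograms are (numerically) empty are skipped to avoid 0/0.
    """
    total = 0.0
    for h_i, y_i in zip(h, y):
        denom = h_i + y_i
        if denom > eps:
            total += (h_i - y_i) ** 2 / denom
    return 0.5 * total


def chi2_kernel(h, y, gamma):
    """Exponential chi-square kernel; gamma acts as the kernel width (assumed form)."""
    return math.exp(-chi2_distance(h, y) / gamma)


# Two histograms with disjoint support are at distance 1; identical ones at 0.
d_disjoint = chi2_distance([1.0, 0.0], [0.0, 1.0])
k_same = chi2_kernel([0.4, 0.6], [0.4, 0.6], gamma=1.0)
k_disjoint = chi2_kernel([1.0, 0.0], [0.0, 1.0], gamma=1.0)
```

In practice the kernel values for all pairs of training histograms can be collected in a Gram matrix and supplied to an SVM solver that accepts precomputed kernels, as LIBSVM does.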

While the implementation of this paper employs the $\chi^2$ distance and an SVM classifier, the proposed method is not bound to these choices, and other distance functions and classifiers can be used.

3.3 Dictionary Learning

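The objective of the dictionary learning phase is to generate a representative dictionary $\mathbf{D}$ given the matrix $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n] \in \mathbb{R}^{m \times n}$ containing $n$ patch vectors extracted from the training set. Generally, for image classification tasks, best results are obtained with over-complete ($k > m$) dictionaries [55].

A dictionary $\mathbf{D}$ of $k$ atoms and sparse code-words $\mathbf{A} = [\boldsymbol{\alpha}_1, \dots, \boldsymbol{\alpha}_n]$ can be obtained by minimizing the following loss function:

$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A}} \sum_{i=1}^{n} \left( \frac{1}{2}\|\mathbf{x}_i - \mathbf{D}\boldsymbol{\alpha}_i\|_2^2 + \lambda \|\boldsymbol{\alpha}_i\|_1 \right) \tag{6}$$

In this equation, $\lambda$ is the same as in (2) and is used to adjust the weight of the sparseness term in the loss equation. The convex set:

$$\mathcal{C} = \left\{ \mathbf{D} \in \mathbb{R}^{m \times k} : \|\mathbf{d}_j\|_2 \le 1 \ \text{and}\ \mathbf{d}_j \ge 0,\ \forall j = 1, \dots, k \right\} \tag{7}$$

enforces two constraints. The first is used to restrict the magnitude of the dictionary atoms. The second is used to make sure each element of each atom in the dictionary is non-negative. Since the spectrogram is non-negative, better results are obtained by enforcing this constraint. The joint optimization of $\mathbf{D}$ and $\mathbf{A}$ is not convex. However, if one of them is fixed, the problem becomes convex in the other. Thus, a common strategy is to alternate between updating $\mathbf{D}$ while $\mathbf{A}$ is fixed and updating $\mathbf{A}$ while $\mathbf{D}$ is fixed until a stopping criterion is met [56].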
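The two constraints on the dictionary atoms can be illustrated with a small projection routine of the kind that could be applied after each dictionary update in an alternating scheme. This is a hedged sketch: it assumes the magnitude constraint is a unit bound on the Euclidean norm of each atom, and it is not meant to describe how the SPAMS toolbox enforces the constraints internally. The function names are hypothetical.

```python
import math


def project_atom(atom):
    """Project one atom onto {d : d >= 0 element-wise and ||d||_2 <= 1}.

    Negative entries are clipped to zero, then the atom is rescaled only if
    its norm still exceeds 1 (assumed form of the magnitude constraint).
    """
    clipped = [max(v, 0.0) for v in atom]
    norm = math.sqrt(sum(v * v for v in clipped))
    if norm > 1.0:
        clipped = [v / norm for v in clipped]
    return clipped


def project_dictionary(atoms):
    """Apply the projection to every atom of the dictionary."""
    return [project_atom(atom) for atom in atoms]


def dictionary_loss(patches, atoms, codes, lam):
    """Sum over patches of 0.5 * squared reconstruction error + lam * l1 norm of the code."""
    total = 0.0
    for x, code in zip(patches, codes):
        recon = [sum(code[j] * atoms[j][i] for j in range(len(atoms)))
                 for i in range(len(x))]
        total += 0.5 * sum((x[i] - recon[i]) ** 2 for i in range(len(x)))
        total += lam * sum(abs(c) for c in code)
    return total


projected = project_dictionary([[3.0, -1.0, 4.0], [0.1, 0.2, 0.0]])
loss = dictionary_loss([[1.0, 0.0]], [[1.0, 0.0]], [[0.5]], lam=0.5)
```

In the toy call, the first atom is clipped to (3, 0, 4) and rescaled to unit norm, while the second atom already satisfies both constraints and is left unchanged.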

Figure 3 shows an example of dictionary atoms learned using the above described procedure. The audio files from the SSPNet Speaker Personality Corpus were used to learn the atoms. The same dictionary can be used for all traits.

Figure 3: Example of patches from a 100 word dictionary created with sparse coding.

4 Experimental Methodology

The SSPNet Speaker Personality corpus [21] is the largest and most recent data set for personality trait assessment from speech. It consists of 640 audio clips randomly extracted from French news bulletins in Switzerland. All clips have been sampled at 8 kHz and most of the clips are 10 seconds long, but some are shorter. Each clip contains only one of the 322 different speakers. Eleven judges performed annotation on each clip by completing the BFI-10 personality assessment questionnaire [57]. From the questionnaire, a score is computed for each of the Big-Five personality traits. Precautions were taken to avoid sequence and tiredness effects in the annotation process. The judges did not understand French and therefore were not influenced by linguistic cues. In [21], the assessments of the judges were considered as positive if the score was greater than 0 and negative otherwise. The labeling scheme was refined for the competition [8]. In this case, an assessment was considered positive if the score given by a judge was higher than the average score given by this particular judge for the trait. In both cases, the final label for an instance was obtained by a majority vote from all of the 11 judges. Preliminary experiments showed a 12% difference in accuracy between the two labeling schemes. The results reported in this paper were obtained using the competition’s labeling scheme.
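The competition's labeling scheme described above (per-judge thresholding at the judge's own average, followed by a majority vote) can be sketched as follows. The data layout, function names and the three-judge toy scores are assumptions made for illustration; the corpus itself uses 11 judges, so ties cannot occur.

```python
def judge_labels(scores_by_judge):
    """Binarise scores per judge for one trait.

    scores_by_judge[j][c] is the score given by judge j to clip c.
    An assessment is positive when it is higher than that judge's own mean.
    """
    labels = []
    for scores in scores_by_judge:
        mean = sum(scores) / len(scores)
        labels.append([score > mean for score in scores])
    return labels


def majority_vote(labels_by_judge):
    """Final label per clip: positive if more than half of the judges voted positive."""
    n_judges = len(labels_by_judge)
    n_clips = len(labels_by_judge[0])
    return [2 * sum(labels_by_judge[j][c] for j in range(n_judges)) > n_judges
            for c in range(n_clips)]


# Toy example: 3 judges rating 3 clips on a single trait.
scores = [
    [1, 2, 3],    # judge mean 2 -> only clip 2 is positive
    [0, 0, 3],    # judge mean 1 -> only clip 2 is positive
    [-2, 4, 1],   # judge mean 1 -> only clip 1 is positive
]
final_labels = majority_vote(judge_labels(scores))
```

Thresholding at each judge's own mean removes systematic differences in how generously individual judges use the scale before the votes are combined.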

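Unweighted average recall (UAR, %) is reported per trait (O, C, E, A, N) and on average (Avg.); the last four columns give the number of features, descriptors, functionals and parameters.

| Algorithm | O | C | E | A | N | Avg. | Features | Descriptors | Functionals | Parameters |
|---|---|---|---|---|---|---|---|---|---|---|
| Mohammadi & Vinciarelli (LR) [21] | 56.1 | 69.6 | 72.4 | 55.7 | 67.4 | 64.2 | 24 | 4 | 6 | 20 |
| Mohammadi & Vinciarelli (SVM) [21] | 57.7 | 68.0 | 74.3 | 57.4 | 65.5 | 64.6 | 24 | 4 | 6 | 20 |
| SSPNet Challenge Baseline (SVM) [8] | 58.7 | 69.2 | 74.5 | 62.2 | 69.0 | 66.7 | 6125 | 21 | 39 | 200 |
| SSPNet Challenge Baseline (RF) [8] | 52.9 | 69.0 | 77.5 | 60.1 | 68.2 | 65.5 | 6125 | 21 | 39 | 200 |
| GeMAPS (SVM) [18] | 56.3 | 72.2 | 74.9 | 61.9 | 68.9 | 66.8 | 62 | 13 | 10 | 100 |
| eGeMAPS (SVM) [18] | 53.7 | 72.5 | 75.1 | 62.0 | 66.6 | 66.0 | 88 | 16 | 12 | 100 |
| Proposed Method | 56.3 | 68.3 | 75.2 | 64.9 | 70.8 | 67.1 | 200-800 | 1 | 1 | 6 |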
Table 1: Performance on the SSPNet Speaker Personality corpus and parameter complexity of the methods.

The metric used to compare accuracy is the unweighted average recall (UAR), which is the same as in the competition. The UAR is the mean of the per-class recalls, and thus is unaffected by class imbalance. To assess performance, a 3-fold cross-validation procedure was used to limit the effect of sampling-induced variance in the results. Precautions were taken to make sure that all samples belonging to the same speaker are grouped in the same fold. Sampling-induced variance effects were observed in the Interspeech 2012 challenge. The results obtained for the conscientiousness trait with the development partition are significantly lower than the results obtained with the test partition. For instance, the UAR of the baseline method using SVM increased from 74.5% in training to 80.1% in testing [8]. The same phenomenon was observed with the random forest classifier (74.9% to 79.1%). This suggests that the test data may have been easier to classify than the average data. This hypothesis is supported by the fact that the results obtained using a cross-validation procedure in [21] were also closer to 70% than 80%.

Nested cross-validation [58] was used to optimize the hyper-parameters for all classifiers and the dictionary learning parameters (dictionary size and $\lambda$). In nested cross-validation, an outer cross-validation loop (3 folds) is used to obtain the final test results, and an inner loop (5 folds) is used to find the best hyper-parameters via grid search. Hyper-parameter optimization is thus performed for each of the 3 test folds separately.
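The evaluation metric and the speaker-grouped fold assignment can be sketched as follows. The UAR function follows the definition given in this section; the greedy fold-balancing strategy and all names are assumptions of this sketch, since the exact fold construction procedure is not detailed in the text.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


def speaker_folds(speakers, n_folds=3):
    """Assign a fold index to each clip so that all clips of a speaker share a fold.

    Speakers are placed greedily, largest first, into the currently smallest fold
    (a hypothetical balancing heuristic).
    """
    counts = {}
    for speaker in speakers:
        counts[speaker] = counts.get(speaker, 0) + 1
    fold_sizes = [0] * n_folds
    fold_of = {}
    for speaker in sorted(counts, key=lambda s: (-counts[s], str(s))):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of[speaker] = fold
        fold_sizes[fold] += counts[speaker]
    return [fold_of[speaker] for speaker in speakers]


# A classifier that always predicts the majority class: 80% accuracy, 50% UAR.
uar = unweighted_average_recall([1, 1, 1, 1, 0], [1, 1, 1, 1, 1])

clip_speakers = ["a", "a", "b", "c", "c", "c", "d"]
folds = speaker_folds(clip_speakers)
```

The toy prediction shows why UAR is preferred over accuracy on imbalanced data: always predicting the majority class yields 80% accuracy but only 50% UAR, the chance level for two classes.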

For the proposed method, the spectrograms were extracted using an STFT with a 128-sample-wide Hamming window. There was a 75% overlap between two successive windows. The extracted patches were 16×16 pixels, thus yielding 256-dimensional feature vectors. A new patch was extracted every 8 time steps and every 4 frequency bins. All of these parameters were selected based on prior experiments. An importance weighting scheme was used to deal with class imbalance [59]. This was achieved by attributing different misclassification costs in the SVM hinge loss function to the target classes. The cost for the positive class was multiplied by a factor corresponding to the class imbalance ratio. The SPAMS toolbox [60] was used for dictionary learning and encoding, and LIBSVM [61] was used for the SVM implementation.
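The patch sampling grid implied by these settings can be sketched as follows. The patch size and steps come from the text; the time-major layout of the spectrogram, the function names and the assumption of a one-sided spectrum with 65 frequency bins are illustrative choices.

```python
def patch_grid(n_frames, n_bins, size=16, step_t=8, step_f=4):
    """Number of patch positions along time and frequency for a dense grid."""
    n_t = (n_frames - size) // step_t + 1 if n_frames >= size else 0
    n_f = (n_bins - size) // step_f + 1 if n_bins >= size else 0
    return n_t, n_f


def extract_patches(spectrogram, size=16, step_t=8, step_f=4):
    """Return flattened size x size patches from a time-major spectrogram.

    spectrogram[t][f] is the normalised amplitude of bin f in frame t.
    """
    n_frames, n_bins = len(spectrogram), len(spectrogram[0])
    patches = []
    for t0 in range(0, n_frames - size + 1, step_t):
        for f0 in range(0, n_bins - size + 1, step_f):
            patches.append([spectrogram[t][f]
                            for t in range(t0, t0 + size)
                            for f in range(f0, f0 + size)])
    return patches


# Toy spectrogram: 32 frames by 20 bins.
toy = [[float(t + f) for f in range(20)] for t in range(32)]
toy_patches = extract_patches(toy)

# A 10 s clip at 8 kHz with a 128-sample window and a 32-sample hop.
n_frames_10s = (10 * 8000 - 128) // 32 + 1
grid_10s = patch_grid(n_frames_10s, n_bins=65)  # 65 bins is an assumption
```

Under these assumptions a full 10-second clip yields 2497 frames and a grid of 311 by 13 positions, i.e. about four thousand 256-dimensional patches per clip.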

Three reference methods were selected to compare performance. The methods were chosen because they are well documented and can be reproduced without ambiguity. The first method was proposed by Mohammadi & Vinciarelli in [21]. Prosody features were extracted using Praat [16], the same software used in the original paper. The low-level features extracted were pitch, the first two formants, energy of speech, and length of voiced and unvoiced segments. The features were extracted using 40 ms long windows at 10 ms time steps. The features were whitened based on means and standard deviations estimated on the training folds. Four statistical properties were then estimated from the 6 prosody measures, yielding a 24-dimensional feature vector for each speech file. The statistical features were the minimum, maximum, mean and the entropy of the differences between consecutive feature values. As in [21], an SVM and a logistic regression (LR) were used for classification. The logistic regression implementation of the MATLAB Statistics and Machine Learning Toolbox was used. For the SVM, the LIBSVM implementation was used with the linear and the RBF kernels.
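The whitening step mentioned above, where statistics are estimated on the training folds only and then applied to held-out data, can be sketched as follows. The use of the population standard deviation, the handling of constant features and the function names are assumptions of this example.

```python
import math


def fit_whitening(train_rows):
    """Estimate per-feature mean and standard deviation on the training folds only."""
    n, d = len(train_rows), len(train_rows[0])
    means = [sum(row[j] for row in train_rows) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in train_rows) / n)
            for j in range(d)]
    return means, stds


def apply_whitening(rows, means, stds):
    """Centre and scale each feature; constant features are mapped to 0."""
    return [[(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(len(row))]
            for row in rows]


train = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
means, stds = fit_whitening(train)
test_whitened = apply_whitening([[2.0, 12.0]], means, stds)
train_whitened = apply_whitening(train, means, stds)
```

Estimating the statistics on the training folds alone keeps information from the test fold out of the preprocessing, which matters when results are reported with cross-validation.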

The second method is the baseline used in the Interspeech 2012 speaker trait challenge [8]. The 6125 low-level features were extracted using the openSMILE software [17] with the preset named after the challenge. The features were whitened based on means and standard deviations estimated on the training folds. For the linear SVM, the LIBSVM implementation [61] was used, which performs sequential minimal optimization, the optimization algorithm used in the challenge baseline. The use of a Gaussian kernel was also explored but did not yield better results. For the random forest (RF) classifier, the MATLAB implementation from the Statistics and Machine Learning Toolbox was used. This method was selected because it yields state-of-the-art performance. Only 2 of the methods proposed in the challenge outperformed the baseline, with a UAR margin of 0.1% for [62] and of 1% for [12], which is not significant.

The third and most recent benchmark method uses the features prescribed in the Geneva minimalistic acoustic parameter set (GeMAPS) [18]. The minimalistic set can be extended (eGeMAPS) by including MFCC coefficients, spectral flux and additional formant descriptors. The features were extracted using the preset supplied in openSMILE. Classification was achieved by a linear SVM using the LIBSVM implementation. The hyper-parameters were optimized in the same way as for the Interspeech method. This method was selected because, like the proposed method, it is intended to reduce the complexity of the feature extraction stage in paralinguistic problems.

5 Results

5.1 Accuracy

The performance of the proposed and baseline methods on the SSPNet Speaker Personality data set is reported in Table 1. The best average UAR was obtained using the proposed method. However, the results obtained when using the challenge features and GeMAPS with an SVM classifier are comparable. The method proposed by Mohammadi and Vinciarelli yields slightly lower accuracy than the other methods, although the difference in performance in most cases is small and may be insignificant. Particularities in the data set and the type of classifier, as well as its implementation, are most likely the reason for these variations in performance. For instance, using the same features and a different classifier, the SSPNet challenge baseline [8] obtains a UAR of 58.7% (SVM) and 52.9% (RF) for the openness trait. The difference in UAR is not due to the power of the data representation, but to the behavior of the classifiers in this particular case.

There are, however, differences between the four representations. For instance, the proposed method is not well adapted to represent either pitch or speech rate. Estimating the pitch is difficult because once the patches are extracted, their location is discarded. In contrast, all reference methods explicitly extract pitch and compute statistics on the measure. Speech rate is also difficult to represent with the proposed method since patches encode local information while speech rate is more of a global measure. All reference methods capture speech rate better because they extract statistics on the length and proportion of voiced and unvoiced segments. This slightly impedes the proposed method for the recognition of the openness trait, for which pitch and speech rate have been identified as markers [6, 63]. It could explain the 2.4% and 1.4% differences between the proposed and reference methods using SVM. However, these two markers are also indicative of neuroticism [6], and the proposed method performs better on this class because it can capture voice timbre and prosody more efficiently than the other methods. Instead of indirectly measuring it with formants, pitch and spectral features, the proposed method uses raw chunks of the sound spectrogram as representation, and thus might capture the information with more fidelity than binned descriptors and statistics over whole sound files.

5.2 Complexity

While accuracy is generally similar for all methods, the main advantage of the proposed method is the significant reduction in implementation complexity. Compared to the baseline of the Interspeech challenge, the feature set used in the proposed method is much smaller (at most 800 features instead of 6125). Smaller feature sets are desirable because they reduce algorithmic complexity and are less subject to problems associated with the curse of dimensionality. Moreover, the amount of human expert intervention required differs across the four methods. In the proposed method, only 1 feature extraction algorithm was used instead of 4 for [21], more than 10 for GeMAPS and over 20 in [8]. In addition, in all reference methods, a set of functionals was applied to the extracted features. Some of these functionals were simple measures like mean, min/max and standard deviation, but others were more complex and parametrizable. For instance, functionals relying on peak distance need a peak detector that has to be fine-tuned. These feature extraction algorithms require parametrization which must be performed by a signal processing expert.

The time complexity of the proposed method is slightly higher than that of some of the other types of descriptors. This is due to the optimization steps of the dictionary learning and encoding procedures. It takes under 3 seconds to analyze a 10-second speech segment using an unoptimized MATLAB implementation. Finally, one could argue that more memory is required with the proposed method as it needs to store the dictionary. However, an 800-word dictionary of 16×16 pixel patches requires storing around 1.6 MB when using the double-precision floating-point format, which is a manageable footprint on modern computers.
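The memory estimate can be checked with a few lines of arithmetic. The Python sketch below assumes 16 × 16 pixel patches (256 values per dictionary word) and 8 bytes per double-precision value; the variable names are illustrative.

```python
n_words = 800            # dictionary size reported above
patch_height = 16        # assumed patch height in pixels
patch_width = 16         # assumed patch width in pixels
bytes_per_value = 8      # double-precision floating point

values_per_word = patch_height * patch_width
dictionary_bytes = n_words * values_per_word * bytes_per_value
dictionary_megabytes = dictionary_bytes / 1e6

print(f"{dictionary_bytes} bytes = {dictionary_megabytes:.2f} MB")
```

This gives 1,638,400 bytes, in line with the figure of around 1.6 MB.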

6 Conclusion

This paper presents a new method for automated assessment of personality traits in speech. Speech segments are represented using spectrograms and feature learning. The proposed representation is compact and is obtained using a single algorithm that requires minimal expert intervention compared to reference methods. Experiments conducted on the SSPNet data set indicate that the proposed method yields the same level of accuracy as state-of-the-art methods in paralinguistics that employ more complex representations, while remaining simpler to use.

As explained in Section 5.1, the method is not properly equipped to capture pitch and speech rate. Research should be conducted to include these signal characteristics in the representation. In addition, experiments on different paralinguistic problems should be conducted to validate the applicability of the proposed method in different contexts. Experiments should also be conducted in which the sparse dictionary learning and classifier algorithms used in our implementation are replaced by other methods enforcing group sparsity and discrimination.

References

  • [1] J. S. Uleman, L. S. Newman, and G. B. Moskowitz, “People as flexible interpreters: Evidence and issues from spontaneous trait inference,” Adv. Exp. Soc. Psy., vol. 28, pp. 211–280, 1996.
  • [2] A. Tapus and M. J. Mataric, “Socially Assistive Robots: The Link between Personality, Empathy, Physiological Signals, and Task Performance.” in Assoc. for the Adv. of Artif. Int. Spring Symp., 2008.
  • [3] J. M. Digman, “The curious history of the five-factor model.” The Five-Factor Model of Personality, p. 20, 1996.
  • [4] R. E. Guadagno, B. M. Okdie, and C. A. Eno, “Who blogs? Personality predictors of blogging,” Computers in Human Behavior, vol. 24, no. 5, pp. 1993–2004, Sep. 2008.
  • [5] S. Argamon, S. Dhawle, M. Koppel, and J. W. Pennebaker, “Lexical predictors of personality type,” in Joint Annu. Meeting of the Interface and the Classification Soc. of North America, 2005.
  • [6] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore, “Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text,” J. Artif. Int. Res., vol. 30, no. 1, pp. 457–500, Nov. 2007.
  • [7] L. Qiu, H. Lin, J. Ramsay, and F. Yang, “You are what you tweet: Personality expression and perception on Twitter,” J. of Res. in Personality, vol. 46, no. 6, pp. 710–718, Dec. 2012.
  • [8] A. Vinciarelli, F. Burkhardt, R. V. Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, and B. Weiss, “The Interspeech 2012 Speaker Trait Challenge,” in Proc. Annu. Conf. of the Int. Speech Commun. Assoc., Portland, USA, 2012.
  • [9] C. Chastagnol and L. Devillers, “Personality Traits Detection Using a Parallelized Modified SFFS Algorithm,” in Proc. Annu. Conf. of the Int. Speech Commun. Assoc., Portland, USA, 2012.
  • [10] D. Wu, “Genetic algorithm based feature selection for speaker trait classification,” in Proc. Annu. Conf. of the Int. Speech Commun. Assoc., Portland, USA, 2012.
  • [11] J. Pohjalainen, S. Kadioglu, and O. Räsänen, “Feature Selection for Speaker Traits,” in Proc. Annu. Conf. of the Int. Speech Commun. Assoc., Portland, USA, 2012.
  • [12] V. Ivanov and X. Chen, “Modulation Spectrum Analysis for Speaker Personality Trait Recognition,” in Proc. Annu. Conf. of the Int. Speech Commun. Assoc., Portland, USA, 2012.
  • [13] S. Steidl, A. Batliner, S. Hantke, J. R. Orozco-Arroyave, Y. Zhang, and F. Weninger, “The Interspeech 2015 Computational Paralinguistics Challenge: Nativeness, Parkinson’s & Eating Condition,” in Proc. Annu. Conf. of the Int. Speech Commun. Assoc., Dresden, Germany, 2015.
  • [14] H. Hermansky and N. Morgan, “Rasta processing of speech,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, Oct. 1994.
  • [15] B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, and B. Weiss, “A Survey on perceived speaker traits: Personality, likability, pathology, and the first challenge,” Computer Speech & Lang., vol. 29, no. 1, pp. 100–131, Jan. 2015.
  • [16] P. Boersma and D. Weenink, “Praat: doing phonetics by computer [computer program],” 2001.
  • [17] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent Developments in openSMILE, the Munich Open-source Multimedia Feature Extractor,” in Proc. of the ACM Int. Conf. on Multimedia. ACM, 2013, pp. 835–838.
  • [18] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, “The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing,” IEEE Trans. Affective Computing, vol. 7, no. 2, pp. 190–202, Apr. 2016.
  • [19] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, USA: Springer, 2006.
  • [20] A. Batliner, S. Steidl, B. Schuller, D. Seppi, K. Laskowski, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and Others, “Combining efforts for improving automatic classification of emotional user states,” Ljubljana, Slovenia, 2006.
  • [21] G. Mohammadi and A. Vinciarelli, “Automatic Personality Perception: Prediction of Trait Attribution Based on Prosodic Features,” IEEE Trans. Affective Computing, vol. 3, no. 3, pp. 273–284, Jul. 2012.
  • [22] K. T. Schutte, “Parts-based Models and Local Features for Automatic Speech Recognition,” Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, USA, 2009.
  • [23] R. V. Sharan and T. J. Moir, “Subband Time-Frequency Image Texture Features for Robust Audio Surveillance,” IEEE Trans. Inf. Forens. Security, vol. 10, no. 12, pp. 2605–2615, Dec. 2015.
  • [24] R. B. Grosse, R. Raina, H. Kwong, and A. Y. Ng, “Shift-Invariance Sparse Coding for Audio Classification,” 23rd Conf. Uncertain. Artif. Intell., 2007.
  • [25] N. Morgan, “Deep and wide: Multiple layers in automatic speech recognition,” IEEE Trans. on Audio, Speech, and Lang. Proces., vol. 20, no. 1, pp. 7–13, Jan. 2012.
  • [26] A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Trans. on Audio, Speech, and Lang. Proces., vol. 20, no. 1, pp. 14–22, Jan. 2012.
  • [27] Y. Kim, H. Lee, and E. M. Provost, “Deep learning for robust feature generation in audiovisual emotion recognition,” in 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Proces., May 2013.
  • [28] J. Deng, Z. Zhang, E. Marchi, and B. Schuller, “Sparse autoencoder-based feature transfer learning for speech emotion recognition,” in Proc. Humaine Assoc. Conf. Affective Comp. and Intel. Interaction (ACII), Sep. 2013.
  • [29] M. Heckmann, X. Domont, F. Joublin, and C. Goerick, “A hierarchical framework for spectro-temporal feature extraction,” Speech Commun., vol. 53, no. 5, pp. 736–752, May 2011.
  • [30] L. von Melchner, S. L. Pallas, and M. Sur, “Visual behaviour mediated by retinal projections directed to the auditory pathway,” Nature, vol. 404, no. 6780, pp. 871–876, Apr. 2000.
  • [31] T. Muroi, R. Takashima, T. Takiguchi, and Y. Ariki, “Gradient-based acoustic features for speech recognition,” in Proc. Int. Symp. on Intelligent Signal Process. and Commun. Systems, Jan. 2009.
  • [32] J. W. Dennis, “Sound Event Recognition in Unstructured Environments using Spectrogram Image Processing,” Ph.D. dissertation, Nanyang Technological University, 2014.
  • [33] J.-L. Shih and L.-H. Chen, “Colour image retrieval based on primitives of colour moments,” IEE Proc. - Vision, Image and Signal Process., vol. 149, pp. 370–376, Dec. 2002.
  • [34] R. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,” IEEE Trans. Syst. Man Cybern., vol. SMC-3, no. 6, pp. 610–621, Nov. 1973.
  • [35] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Adv. in Neur. Inf. Proces. Syst. 22, 2009.
  • [36] S. Marčelja, “Mathematical description of the responses of simple cortical cells,” J. Opt. Soc. Am., vol. 70, no. 11, pp. 1297–1300, Nov. 1980.
  • [37] Y. Gu, E. Postma, and H.-X. Lin, “Vocal Emotion Recognition with Log-Gabor Filters,” in Proc. of the 5th Int. Workshop on Audio/Visual Emotion Challenge, New York, USA, 2015.
  • [38] H. Buisman and E. Postma, “BNAIC: The log-gabor method: Speech classification using spectrogram image analysis,” in Proc. Annu. Conf. of the Int. Speech Commun. Assoc., Portland, USA, 2012.
  • [39] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in IEEE Conf. on Computer Vis. and Patt. Recog., Jun. 2007.
  • [40] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Int. Workshop on Statistical Learning in Computer Vision, ser. ECCV, Oct. 2004.
  • [41] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in IEEE Conf. on Computer Vis. and Patt. Recog., 2008.
  • [42] M.-A. Carbonneau, A. J. Raymond, E. Granger, and G. Gagnon, “Real-time visual play-break detection in sport events using a context descriptor,” in Proc. IEEE Int. Symp. on Circuits and Systems, ser. ISCAS, May 2015.
  • [43] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study,” Int. J. of Computer Vision, vol. 73, no. 2, pp. 213–238, 2006.
  • [44] T. Matsui, M. Goto, J.-P. Vert, and Y. Uchiyama, “Gradient-based musical feature extraction based on scale-invariant feature transform,” in Signal Process. Conf., 2011 19th European, Aug. 2011.
  • [45] S. Mallat, A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way, 3rd ed. Academic Press, 2008.
  • [46] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.
  • [47] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learning from unlabeled data,” in Proc. Int. Conf. on Mach. Learning, ser. ICML. ACM, 2007.
  • [48] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
  • [49] R. F. Lyon, “Machine Hearing: An Emerging Field [Exploratory DSP],” Signal Process. Magazine, IEEE, vol. 27, no. 5, pp. 131–139, Sep. 2010.
  • [50] G. Yu and J.-J. Slotine, “Audio classification from time-frequency texture,” in Acoustics, Speech and Signal Process., 2009. ICASSP 2009. IEEE Int. Conf. on, Apr. 2009.
  • [51] G. Peyré, “Sparse Modeling of Textures,” J. of Math. Imaging and Vision, vol. 34, no. 1, pp. 17–31, 2009.
  • [52] E. C. Smith and M. S. Lewicki, “Efficient auditory coding,” Nature, vol. 439, no. 7079, pp. 978–982, Feb. 2006.
  • [53] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Ann. Statist., vol. 32, no. 2, pp. 407–499, 2004.
  • [54] O. Chapelle, P. Haffner, and V. N. Vapnik, “Support vector machines for histogram-based image classification,” IEEE Trans. Neural Network, vol. 10, no. 5, pp. 1055–1064, Sep. 1999.
  • [55] I. Tosic and P. Frossard, “Dictionary Learning,” Signal Process. Magazine, IEEE, vol. 28, no. 2, pp. 27–38, Mar. 2011.
  • [56] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Adv. in Neural Inf. Proces. Syst., 2006.
  • [57] B. Rammstedt and O. P. John, “Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German,” J. of Res. in Personality, vol. 41, no. 1, pp. 203–212, Feb. 2007.
  • [58] M. Stone, “Cross-Validatory Choice and Assessment of Statistical Predictions,” J. of the Royal Statistical Soc. Series B (Methodological), vol. 36, no. 2, pp. 111–147, 1974.
  • [59] A. Rosenberg, “Classifying Skewed Data: Importance Weighting to Optimize Average Recall,” in Proc. Annu. Conf. of the Int. Speech Commun. Assoc., Portland, USA, 2012.
  • [60] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online Dictionary Learning for Sparse Coding,” in Proc. Int. Conf. on Mach. Learning, ser. ICML. New York, USA: ACM, 2009.
  • [61] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, May 2011.
  • [62] C. Montacié and M.-J. Caraty, “Pitch and Intonation Contribution to Speakers’ Traits Classification,” in Proc. Annu. Conf. of the Int. Speech Commun. Assoc., Portland, USA, 2012.
  • [63] D. W. Addington, “The relationship of selected vocal characteristics to personality perception,” Speech Monographs, vol. 35, no. 4, pp. 492–503, 1968.