Sentence level estimation of psycholinguistic norms using joint multidimensional annotations

05/20/2020 ∙ by Anil Ramakrishna, et al. ∙ University of Southern California

Psycholinguistic normatives represent various affective and mental constructs using numeric scores and are used in a variety of applications in natural language processing. They are commonly used at the sentence level, the scores of which are estimated by extrapolating word level scores using simple aggregation strategies, which may not always be optimal. In this work, we present a novel approach to estimate the psycholinguistic norms at sentence level. We apply a multidimensional annotation fusion model on annotations at the word level to estimate a parameter which captures relationships between different norms. We then use this parameter at sentence level to estimate the norms. We evaluate our approach by predicting sentence level scores for various normative dimensions and compare with standard word aggregation schemes.




1 Introduction

Psycholinguistic norms are numeric ratings assigned to linguistic cues such as words or sentences to measure various psychological constructs. Examples include dimensions such as valence, arousal, and dominance which are used to analyze the affective state of the author (of the spoken or written text), along with norms of higher order mental constructs such as concreteness and imagability which have been associated with improvements in learning [paivio1968concreteness]. The ease of computing the norms has enabled their application in a variety of tasks in natural language processing such as information retrieval [tanaka2013estimating], sentiment analysis [nielsen2011new], text based personality prediction [mairesse2007using] and opinion mining. The norms are typically annotated at the word level by psychologists who provide numeric scores to a curated list of seed words, which are then extrapolated to a larger vocabulary using either semantic relationships such as synonymy and hyponymy or using word occurrence based contextual similarity [malandrakis2015therapy].

Most applications of psycholinguistic norms in NLP use sentence or document level scores, but manual annotation of the norms at these levels is difficult and not straightforward to generalize. In these cases, estimation of sentence level norms is done by aggregating the word level scores using simple averaging [ramakrishna2017linguistic, malandrakis2015therapy], or by using distribution statistics of the word level scores [gibson2015predicting]. However, such strategies do not account for the non-trivial dependencies of sentence level semantics on the words, and may not be accurate at estimating the norms at the sentence level. In this work, we propose a new approach to estimate sentence level norms using inferred relationships between different dimensions along with partial annotations of the sentence level norms.

Annotation of the normatives at the sentence level is a challenging task when compared to word level annotations since it involves evaluating the underlying semantics of the sentence in the abstract space of the corresponding dimension, with some dimensions being more difficult than others. For example, imagability, a measure of how easy it is to create a mental image of the input word or sentence, is more difficult to annotate at the sentence level than at the word level. On the other hand, norms such as valence are comparatively easy to annotate even at the sentence level. We use this observation along with the parameters learned from a joint annotation fusion model at word level to predict norms at sentence level.

Annotations are typically performed online using crowdsourcing platforms such as Amazon Mechanical Turk (MTurk), which connect researchers with inexpensive workers from across the globe and provide easy scalability. Annotations are collected from several workers over a large number of instances, often on several related dimensions. These are then combined to obtain estimates for the label of interest, typically using aggregation techniques such as simple averaging or majority voting, or using more nuanced aggregation models which assume a structure for the annotators’ behavior [raykar2010learning]. The annotation dimensions are usually modeled independently of each other, but a few recent publications have explored joint modeling of the dimensions and have highlighted the benefits of this approach [Ramakrishna2016, ramakrishna2020joint]. These models assume a joint relationship between the dimensions being annotated, and estimate model parameters that capture this relationship for each annotator, which can be used in estimating the sentence level normatives. Specifically, we can use model parameters learned at the word level to estimate the norms at the sentence level using partial sentence level annotations.

We use the model presented in [ramakrishna2020joint], in which the authors assume a matrix factorization model to capture the annotators’ behavior: the annotations are assumed to be based on a linear transformation of the underlying label vector. Parameters of this model include an annotator specific linear transformation matrix, which captures the individual contributions of each dimension to the annotation output. In our work, we assume that the annotator specific relationships between the dimensions captured by this matrix are comparable at both word and sentence levels. We collect word level annotations on valence, arousal and dominance and train the joint global annotation model from [ramakrishna2020joint] to estimate the annotator parameters, including the transformation matrix; we then apply the word level estimates of this matrix to sentence level ratings from the same set of annotators. To predict sentence level scores of a given normative dimension, we make use of partial annotations on the remaining dimensions along with the transformation matrix. Our proposed approach shows improved performance in predicting the sentence level norms when compared to various word level aggregation strategies.

The rest of the paper is organized as follows. In Section 2, we expand on the joint multidimensional annotation model and detail our data annotation approach in Section 3, followed by experiments in Section 4 and results in Section 5 before concluding in Section 6.

Figure 1: Proposed model, relating the set of features for each data point, the latent label for each dimension, and the rating provided by each annotator. The feature and rating vectors (shaded) are observed variables, while the label vector is latent. Each instance has an associated set of annotator ratings.

2 Joint multidimensional annotation model

The annotation model is represented in plate notation in Figure 1. In this model, the underlying label vector for each data instance is defined by a linear regression model on the instance’s feature vector, as shown in Equation 1. Each annotator is assumed to apply a linear transformation to this label vector, using an annotator specific matrix, to produce the annotation vector, as shown in Equation 2.

Each annotator has their own linear transformation matrix. Each annotation dimension value for a given annotator is defined as a weighted average of the label vector, with weights given by the corresponding row of that annotator’s transformation matrix.
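The two modeling assumptions described above can be sketched compactly as follows. The notation is assumed here for illustration and may differ from the authors’ own symbols: $x_m$ is the feature vector of instance $m$, $a^*_m$ its latent label vector over $J$ dimensions, $a^k_m$ the annotation vector from annotator $k$, $\Theta$ the regression weights, and $F_k$ the annotator specific transformation matrix. The Gaussian form of the noise terms $\eta_m$ and $\epsilon^k_m$, and their variance parameters, are likewise assumptions.

```latex
\begin{align}
a^*_m &= \Theta^\top x_m + \eta_m,
  & \eta_m &\sim \mathcal{N}(0, \sigma^2 I) \\
a^k_m &= F_k\, a^*_m + \epsilon^k_m,
  & \epsilon^k_m &\sim \mathcal{N}(0, \tau_k^2 I)
\end{align}
```

Under this form, the $j$-th entry of $a^k_m$ is the inner product of the $j$-th row of $F_k$ with $a^*_m$, so off-diagonal entries of $F_k$ describe how much one dimension leaks into an annotator’s rating of another.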

In this model, the feature vector corresponding to each instance is assumed to be available, along with the annotations, while the label vectors are assumed hidden, as shown in Figure 1. We use the EM algorithm from [ramakrishna2020joint] to estimate the parameters, listed below for ease of exposition. Detailed derivations for the update equations below can be found in [ramakrishna2020joint].

We use Maximum Likelihood Estimation (MLE) to estimate the model parameters, in which we maximize the model likelihood shown below in Equation 3.


Optimizing the above objective is non-trivial due to the presence of the integral within the log function. To address this, we use the well known Expectation Maximization algorithm [dempster1977maximum], which uses Jensen’s inequality to derive a lower bound (shown below in Equation 4) on the objective based on current parameter estimates, by computing the expectation with respect to the conditional distribution of the latent label vectors given the observed data.


This is followed by parameter estimation using maximization. The alternating expectation and maximization steps form the iterations of the EM algorithm.
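The Jensen step can be sketched generically for a single instance. The symbols are assumed for illustration rather than taken from the original: $a$ stands for the observed annotations, $x$ for the features, $a^*$ for the latent labels, $\Phi$ for all model parameters, and $q$ for any distribution over $a^*$ with entropy $H(q)$.

```latex
\begin{align}
\log p(a \mid x; \Phi)
  &= \log \int q(a^*)\, \frac{p(a, a^* \mid x; \Phi)}{q(a^*)}\, da^* \\
  &\ge \int q(a^*)\, \log \frac{p(a, a^* \mid x; \Phi)}{q(a^*)}\, da^* \\
  &= \mathbb{E}_{q}\!\left[\log p(a, a^* \mid x; \Phi)\right] + H(q)
\end{align}
```

The bound holds with equality when $q$ is the posterior of $a^*$ under the current parameters, which is why the expectation step uses that conditional distribution; the maximization step then only needs to optimize the expectation term, since $H(q)$ does not depend on $\Phi$.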

Figure 2: Performance of proposed and baseline models in predicting sentence level norms. Results show Concordance Correlation Coefficient (CCC) and Pearson correlation values between the various estimates and the reference expert ratings on the EmoBank corpus. The estimates of the proposed model for valence and arousal are superior while those for dominance are poor; subsequent analysis shows poor human inter-annotator agreement for dominance ratings as a possible reason. See also Figure 3. Panels: (a) CCC; (b) Pearson’s correlation.

2.1 EM algorithm

Initialization: The model is initialized by assigning the mean of the annotations for each data instance as the estimate of its latent label vector. Given this, the initial parameters are estimated using the update equations of the maximization step.

E-step: We compute the expected value of the latent label vector with respect to its conditional distribution given the observed data and the current parameters, which is taken as its soft estimate for each data instance.

M-step: Given the soft estimate of the latent label vector, parameter estimates are computed by maximizing Equation 4. The update equations for this step are derived in [ramakrishna2020joint].

Termination: We terminate the algorithm when the change in model log-likelihood from the previous iteration falls below a preset threshold.
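The four steps above can be illustrated on a deliberately simplified toy problem. The sketch below is not the model of [ramakrishna2020joint]: it assumes a single annotation dimension, no features, one scalar weight per annotator, a shared noise variance, and a Gaussian prior on the latent label, and it tracks posterior second moments in the E-step as standard EM requires. All function names, the synthetic data, and the threshold `tol=1e-6` are hypothetical choices made for illustration.

```python
import math
import random


def log_likelihood(ratings, f, mu, sigma2, tau2):
    """Marginal log-likelihood of the ratings with latent labels integrated out."""
    k = len(f)
    ff = sum(fk * fk for fk in f)
    denom = tau2 + sigma2 * ff
    # Covariance is sigma2 * f f^T + tau2 * I; use closed forms for det and inverse.
    log_det = (k - 1) * math.log(tau2) + math.log(denom)
    total = 0.0
    for row in ratings:
        resid = [a - fk * mu for a, fk in zip(row, f)]
        rr = sum(r * r for r in resid)
        fr = sum(fk * r for fk, r in zip(f, resid))
        quad = (rr - sigma2 * fr * fr / denom) / tau2
        total -= 0.5 * (k * math.log(2 * math.pi) + log_det + quad)
    return total


def m_step(ratings, e, s):
    """Update parameters from first (e) and second (s) moments of the labels."""
    m, k = len(ratings), len(ratings[0])
    mu = sum(e) / m
    sigma2 = sum(si - 2 * mu * ei + mu * mu for ei, si in zip(e, s)) / m
    s_total = sum(s)
    f = [sum(row[j] * ei for row, ei in zip(ratings, e)) / s_total for j in range(k)]
    sq_err = sum(row[j] ** 2 - 2 * f[j] * row[j] * ei + f[j] ** 2 * si
                 for row, ei, si in zip(ratings, e, s) for j in range(k))
    return f, mu, sigma2, sq_err / (m * k)


def e_step(ratings, f, mu, sigma2, tau2):
    """Posterior mean and second moment of each latent label."""
    var = 1.0 / (1.0 / sigma2 + sum(fk * fk for fk in f) / tau2)
    e = [var * (mu / sigma2 + sum(fk * a for fk, a in zip(f, row)) / tau2)
         for row in ratings]
    s = [ei * ei + var for ei in e]
    return e, s


def run_em(ratings, tol=1e-6, max_iter=1000):
    k = len(ratings[0])
    # Initialization: per-instance mean of the annotations as the label estimate.
    e = [sum(row) / k for row in ratings]
    s = [ei * ei for ei in e]
    history = []
    for _ in range(max_iter):
        params = m_step(ratings, e, s)
        history.append(log_likelihood(ratings, *params))
        # Termination: stop once the log-likelihood change drops below tol.
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        e, s = e_step(ratings, *params)
    return params, e, history


# Synthetic data: 200 instances rated by 4 hypothetical annotators.
random.seed(0)
true_f = [1.0, 0.8, 1.2, 0.9]
labels = [random.gauss(3.0, 1.0) for _ in range(200)]
ratings = [[fk * a + random.gauss(0.0, 0.3) for fk in true_f] for a in labels]
(f_hat, mu_hat, sigma2_hat, tau2_hat), soft_labels, history = run_em(ratings)
```

Because each iteration is a valid EM update for this toy model, the values in `history` should never decrease, which is a useful sanity check when adapting the loop to a richer model.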

3 Data annotation

We performed two sets of experiments, collecting word and sentence level annotations on specific dimensions in each. In the first experiment (which we refer to as VAD from now on), we collected annotations on the affective norms of valence, arousal and dominance using MTurk for words sampled from [warriner2013norms]. This corpus was chosen because it provides expert ratings on valence, arousal and dominance for nearly 14,000 English words. Annotators were asked to provide numeric ratings from 1 to 5 (inclusive) for each dimension, on assignments consisting of a set of 20 words. In total, we collected 20 annotations each on a set of 200 words randomly sampled from [warriner2013norms]. Instructions for the annotation assignments included definitions along with examples for each of the dimensions being annotated. After filtering incomplete and noisy submissions, we retained for the subsequent sentence level annotation task only those annotators who provided ratings for at least 100 words, to ensure sufficient training data.

Sentence level annotations were collected on sentences from the EmoBank corpus [buechel2017emobank], which includes expert ratings on valence, arousal and dominance for 10,000 English sentences. Twenty-one annotators from the word level annotation task described above were invited to provide labels for 100 sentences randomly sampled from this corpus. The assignments were presented in a similar fashion to the word level annotations, with each assignment including 10 sentences and the workers providing numeric ratings for valence, arousal and dominance for each sentence. We use the annotator specific parameters estimated at the word level to predict the norms at sentence level using the approach described in the next section.

In our second experiment (which we refer to as IGP from now on), we collected word and sentence level annotations on three new psycholinguistic normative dimensions: imagability, which measures the degree of the stimulus’ proclivity to create a mental picture; genderladenness, which measures the degree of masculine or feminine association evoked by the stimulus; and pleasantness, which measures the degree of pleasant feelings associated with the stimulus. We used the same words and sentences used in our previous experiment for annotations on valence, arousal and dominance. Since we do not have expert ratings for pleasantness, imagability and genderladenness, we use the strategy followed in the AVEC 2018 challenge to evaluate model performance.

4 Experiments

Given annotator parameters estimated at the word level, we use partial annotator ratings at the sentence level to predict the remaining norms. For example, in the VAD experiment, while predicting sentence level scores of valence, we use the sentence level annotator ratings on arousal and dominance along with the word level parameter matrix. The use of partial annotations enables us to predict sentence level norms on challenging psycholinguistic dimensions using ratings on dimensions which may be easier to annotate.

Figure 3: Performance of the best annotators for each dimension (but over all instances) in our dataset and annotator average when compared with expert ratings from the EmoBank corpus. Panels: (a) Pearson’s correlation; (b) mean squared error.

Figure 4: MSE of the proposed and baseline models in predicting imagability, genderladenness and pleasantness. Panels: (a) imagability; (b) genderladenness; (c) pleasantness.

In both our experiments, we make use of the IID Gaussian noise assumption in Equation 2, which reduces the task of predicting the sentence level norm for a chosen dimension to the linear regression problem of Equation 5. Rows of each annotator’s transformation matrix are treated as features of the regression model, with the sentence’s latent label vector as the regression parameter. Given the sentence level partial annotations (each annotation vector with the entry for the dimension to predict removed) and the transformation matrices, the regression parameter vector can be estimated using normal equations or gradient descent. For each dimension within a given experiment, we use Equation 5 to estimate the sentence level normatives. The features used in both our experiments were 300 dimensional GloVe embeddings [pennington2014glove] at word level, which were aggregated using simple averaging at sentence level.
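A minimal sketch of this regression step, under one reading of the description: for a single sentence, every observed rating of a non-target dimension contributes one regression row (the matching row of that annotator’s word level transformation matrix), the unknown coefficient vector is the sentence’s latent label vector, and the entry for the held-out dimension is the prediction. The function names, the example matrices, and the noise-free ratings are hypothetical, and the system is solved with normal equations in plain Python to stay dependency-free.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: held-out dimension not identifiable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def predict_held_out(transforms, ratings, d):
    """Estimate the latent label vector of one sentence and return its d-th entry.

    transforms: one J x J matrix per annotator (estimated at the word level).
    ratings: one length-J list per annotator; the entry at index d is ignored.
    """
    n_dims = len(transforms[0])
    rows, targets = [], []
    for f_k, a_k in zip(transforms, ratings):
        for j in range(n_dims):
            if j != d:  # only the partial annotations are observed
                rows.append(f_k[j])
                targets.append(a_k[j])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(n_dims)]
           for p in range(n_dims)]
    xty = [sum(r[p] * t for r, t in zip(rows, targets)) for p in range(n_dims)]
    label = solve_linear_system(xtx, xty)
    return label[d], label


# Hypothetical example with J = 3 dimensions and 3 annotators.
transforms = [
    [[1.0, 0.2, 0.1], [0.3, 0.9, 0.1], [0.2, 0.1, 1.1]],
    [[0.9, 0.1, 0.3], [0.1, 1.0, 0.2], [0.4, 0.2, 0.8]],
    [[1.1, 0.3, 0.2], [0.2, 0.8, 0.3], [0.1, 0.3, 1.0]],
]
true_label = [3.0, 2.0, 4.0]
d = 0  # dimension to predict
ratings = [[None if j == d else sum(w * v for w, v in zip(f_k[j], true_label))
            for j in range(3)] for f_k in transforms]
predicted, label = predict_held_out(transforms, ratings, d)
```

The held-out dimension is only recoverable through the off-diagonal entries of the transformation matrices: if every matrix were diagonal, the column for the target dimension would be all zeros in the stacked rows and the solver would report a singular system.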

In the VAD experiment, we compare the predicted dimensions with expert ratings from the EmoBank corpus, which acts as our reference to evaluate model performance. For baselines, we compute different aggregations of word level normative scores after filtering out non-content words, as is common in the literature [ramakrishna2017linguistic]. Word level scores for the norms were computed using the approach described in [malandrakis2015therapy]. We used the unweighted average, maximum, minimum and sum of the word level norms as the baseline aggregation functions.
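The baseline aggregation functions can be sketched as below. The word scores, the stopword list used to approximate non-content word filtering, the whitespace tokenizer, and the choice to skip words missing from the lexicon are all illustrative assumptions rather than details from the original.

```python
def aggregate_word_norms(sentence, word_norms, stopwords):
    """Return mean, max, min and sum of word level scores for one sentence."""
    tokens = [t.strip(".,!?;:\"'()").lower() for t in sentence.split()]
    # Keep content words that have a score in the lexicon.
    scores = [word_norms[t] for t in tokens
              if t and t not in stopwords and t in word_norms]
    if not scores:
        return None  # no scored content words in this sentence
    return {
        "mean": sum(scores) / len(scores),
        "max": max(scores),
        "min": min(scores),
        "sum": sum(scores),
    }


# Hypothetical valence lexicon and stopword list.
valence = {"happy": 4.5, "dog": 4.0, "storm": 2.0}
stopwords = {"the", "from", "a", "an", "of"}
baseline = aggregate_word_norms("The happy dog ran from the storm.", valence, stopwords)
```

Note that the sum grows with sentence length while the other three aggregates do not, so the four baselines can rank sentences quite differently.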

In the IGP experiment, we train linear regression models using predictions from the proposed model and directly compare the training set error with baselines. Low training error implies higher learnability (due to better correlations with the features) of the predicted signal and serves as a crude proxy for quality. For baselines, we use training error from labels obtained by simple averaging of word level normative scores, and sentence level average of annotations.
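The training error proxy can be sketched with a one-feature least squares fit. This is a simplification for illustration: the feature values and the two candidate label sets below are hypothetical, and a single scalar feature stands in for the averaged embedding features.

```python
def fit_simple_ols(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def training_mse(xs, ys):
    """Mean squared error of the fitted line on the data it was trained on."""
    slope, intercept = fit_simple_ols(xs, ys)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Two hypothetical label sets for the same four sentences.
features = [0.0, 1.0, 2.0, 3.0]
labels_a = [1.0, 3.0, 5.0, 7.0]   # exactly linear in the feature
labels_b = [1.0, 4.0, 4.0, 7.0]   # noisier relationship
mse_a = training_mse(features, labels_a)
mse_b = training_mse(features, labels_b)
```

Under the proxy described above, `labels_a` would be preferred because the regression model fits them with lower training error; the same `fit_simple_ols` output could also be scored on a disjoint set of sentences instead of the training set.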

We use the Concordance Correlation Coefficient ($\rho_c$) [lawrence1989concordance] and Pearson’s correlation coefficient ($\rho$) as evaluation metrics. $\rho_c$ measures any departure from the concordance line (the line passing through the origin at a 45° angle). Hence it is sensitive to rotations or rescaling in the predicted values. Given two samples $x$ and $y$, the sample concordance coefficient is defined as shown below.

$$\hat{\rho}_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}$$

where $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations, and $s_{xy}$ is the sample covariance.
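Both metrics are short enough to compute directly. The sketch below uses the standard definitions with population (1/n) moments; the function names and the example samples are illustrative.

```python
import math


def _moments(x, y):
    """Means, variances and covariance of two equal-length samples (1/n form)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return mean_x, mean_y, var_x, var_y, cov


def pearson(x, y):
    """Pearson correlation: invariant to shifting or rescaling either sample."""
    _, _, var_x, var_y, cov = _moments(x, y)
    return cov / math.sqrt(var_x * var_y)


def ccc(x, y):
    """Concordance correlation: penalizes any departure from the line y = x."""
    mean_x, mean_y, var_x, var_y, cov = _moments(x, y)
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


# A constant offset leaves Pearson's correlation at 1 but lowers the CCC.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [v + 1.0 for v in reference]
r = pearson(reference, shifted)
c = ccc(reference, shifted)
```

In this example `r` stays at 1 while `c` drops to 0.8, which illustrates why a gap between the two metrics points to a systematic shift, rotation or rescaling in the predictions rather than to random error.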

5 Results

5.1 VAD

Figure 2 shows the performance of the proposed model along with the different baselines. As seen from the figure, the proposed model outperforms the baselines in predicting valence and arousal in both evaluation metrics, suggesting the efficacy of the approach. By using partial ratings at the sentence level together with the annotator specific transformation matrix, which captures relationships between the dimensions, the proposed approach seems to outperform the baseline word aggregation schemes in these two dimensions. Performance in CCC appears to be lower than in Pearson’s correlation, suggesting the presence of a rotation in the predicted values. This can be attributed to the unidentifiability commonly observed in matrix factorization models such as the annotation fusion model of [ramakrishna2020joint]. Common solutions to address this involve assuming a suitable prior on the transformation matrix, which may lead to better sentence level estimates.

Model performance on dominance, on the other hand, is considerably lower in both metrics. To investigate the reason for this, we examined the performance of the best possible annotator for each dimension in this experiment and compared their ratings with the expert ratings from the EmoBank corpus in Figure 3. For dominance, we notice very low correlation and high MSE between our best annotators and the experts, suggesting high disagreement for this dimension. This may be due to differing definitions and/or interpretations of dominance between the two sets of annotators.

5.2 IGP

In our second experiment, we use model training error as a proxy for evaluating prediction quality since we do not have expert ratings. Figure 4 shows the training error for the proposed model when compared with two baselines. The proposed model shows the lowest training error in predicting imagability, while its performance is relatively worse for genderladenness and pleasantness, suggesting a relatively stronger dependency of imagability on the other dimensions.

6 Conclusion

We presented a novel computational approach to estimate sentence level psycholinguistic norms using joint multidimensional annotation fusion. We evaluated our approach by predicting sentence level normatives on various dimensions in two different experiments, and showed improvements in specific cases. Future work includes evaluating the model on more abstract psycholinguistic dimensions such as concreteness and dominance. The primary challenge there lies in obtaining expert ratings on these dimensions at the sentence level to evaluate the model predictions. Recently, alternate schemes to evaluate a model in the absence of a reliable ground truth or reference have been proposed, such as the evaluation strategy used in the AVEC 2018 challenge [ringeval2018avec]. The challenge organizers proposed a scheme where annotation fusion models are evaluated by training regression models on the labels predicted by the fusion models and then evaluating these regression models on a disjoint test set.