Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality

05/07/2021 ∙ by Adithya V Ganesan, et al. ∙ Stony Brook University

In human-level NLP tasks, such as predicting mental health, personality, or demographics, the number of observations is often smaller than the standard 768+ hidden state sizes of each layer within modern transformer-based language models, limiting the ability to effectively leverage transformers. Here, we provide a systematic study of the role of dimension reduction methods (principal components analysis, factorization techniques, or multi-layer auto-encoders), as well as of the dimensionality of embedding vectors and of sample size, in predictive performance. We first find that fine-tuning large models with a limited amount of data poses a significant difficulty, which can be overcome with a pre-trained dimension reduction regime. RoBERTa consistently achieves top performance in human-level tasks, with PCA giving a benefit over other reduction methods by better handling users who write longer texts. Finally, we observe that a majority of the tasks achieve results comparable to the best performance with just 1/12 of the embedding dimensions.




1 Introduction

Transformer based language models (LMs) have quickly become the foundation for accurately approaching many tasks in natural language processing Vaswani et al. (2017); Devlin et al. (2019). Their success is owed to their ability to capture both syntactic and semantic information (Tenney et al., 2019), modeled over large, deep attention-based networks (transformers) with hidden state sizes on the order of 1000 over 10s of layers Liu et al. (2019); Gururangan et al. (2020). In total, such models typically have from hundreds of millions (Devlin et al., 2019) to a few billion parameters (Raffel et al., 2020). However, the size of such models presents a challenge for tasks involving small numbers of observations, such as the growing number of tasks focused on human-level NLP.

Human-level NLP tasks, rooted in computational social science, focus on making predictions about people from their language use patterns. Some of the more common tasks include age and gender prediction Sap et al. (2014); Morgan-Lopez et al. (2017), personality Park et al. (2015); Lynn et al. (2020), and mental health prediction Coppersmith et al. (2014); Guntuku et al. (2017); Lynn et al. (2018). Such tasks present an interesting challenge for the NLP community to model the people behind the language rather than the language itself, and the social scientific community has begun to see success of such approaches as an alternative or supplement to standard psychological assessment techniques like questionnaires Kern et al. (2016); Eichstaedt et al. (2018). Generally, such work is helping to embed NLP in a greater social and human context Hovy and Spruit (2016); Lynn et al. (2019).

Despite the simultaneous growth of both (1) the use of transformers and (2) human-level NLP, the effective merging of transformers for human-level tasks has received little attention. In a recent human-level shared task on mental health, most participants did not utilize transformers Zirikly et al. (2019). A central challenge for their utilization in such scenarios is that the number of training examples (i.e. sample size) is often only hundreds while the parameters for such deep models are in the hundreds of millions. For example, recent human-level NLP shared tasks focused on mental health have had Milne et al. (2016), Lynn et al. (2018) and Zirikly et al. (2019) training examples. Such sizes all but rule out the increasingly popular approach of fine-tuning transformers, whereby all of a model's millions of parameters are allowed to be updated toward the specific task one is trying to achieve Devlin et al. (2019); Mayfield and Black (2020). Recent research not only highlights the difficulty of fine-tuning with few samples (Jiang et al., 2020) but also shows that fine-tuning becomes unreliable even with thousands of training examples Mosbach et al. (2020).

On the other hand, some of the common transformer-based approaches of deriving contextual embeddings from the top layers of a pre-trained model Devlin et al. (2019); Clark et al. (2019) still leave one with approximately as many embedding dimensions as training examples. In fact, in one of the few successful cases of using transformers for a human-level task, further dimensionality reduction was used to avoid overfitting Matero et al. (2019), but an empirical understanding of the application of transformers for human-level tasks (which models are best, and the relationship between embedding dimensions, sample size, and accuracy) has yet to be established.

In this work, we empirically explore strategies to effectively utilize transformer-based LMs for relatively small sample-size human-level tasks. We provide the first systematic comparison of the most widely used transformer models for demographic, personality, and mental health prediction tasks. Then, we consider the role of dimension reduction to address the challenge of applying such models on small sample sizes, yielding a suggested minimum number of dimensions necessary given a sample size for each of demographic, personality, and mental health tasks. [Footnote 1: Dimension reduction techniques can also be pre-trained, leveraging larger sets of unlabeled data.] While it is suspected that transformer LMs contain more dimensions than necessary for document- or word-level NLP Li and Eisner (2019); Bao and Qiao (2019), this represents the first study on transformer dimensionality for human-level tasks.

2 Related Work

Recently, NLP has taken to human-level predictive tasks using increasingly sophisticated techniques. The most common approaches use n-grams and LDA Blei et al. (2003) to model a person's language and behaviors Resnik et al. (2013); Kern et al. (2016). Other approaches utilize word embeddings Mikolov et al. (2013); Pennington et al. (2014) and, more recently, contextual word representations Ambalavanan et al. (2019).

Our work is inspired by one of the top performing systems at a recent mental health prediction shared task Zirikly et al. (2019) that utilized transformer-based contextualized word embeddings fed through a non-negative matrix factorization to reduce dimensionality Matero et al. (2019). While the approach seems reasonable for addressing the dimensionality challenge in using transformers, many critical questions remain unanswered: (a) Which type of transformer model is best? (b) Would fine-tuning have worked instead? and (c) Does such an approach generalize to other human-level tasks? Most of the time, one does not have the luxury of a shared task for the problem at hand to determine a best approach. Here, we look across many human-level tasks, some of which have the luxury of relatively large sample sizes (in the thousands) from which to establish upper bounds, and ultimately draw generalizable information on how to approach a human-level task given its domain (demographic, personality, mental health) and sample size.

Our work also falls in line with a rising trend in AI and NLP to quantify the number of dimensions necessary. While this has not been considered for human-level tasks, it has been explored in other domains. The post processing algorithm (Mu and Viswanath, 2018) for static word embeddings, motivated by the power law distribution of maximum explained variance and the domination of the mean vector, turned out to be very effective in making these embeddings more discriminative. The analysis of contextual embedding models (Ethayarajh, 2019) suggests that static embeddings contribute less than 5% of the explained variance, yet the contribution of the mean vector starts dominating when contextual embedding models are used for human-level tasks. This is an effect of averaging the message embeddings to form user representations in human-level tasks. This further motivates the need to process these contextual embeddings into more discriminative features.

Lastly, our work weighs into the discussion on which type of model is best for producing effective contextual embeddings. A majority of the models fall under two broad categories based on how they are pre-trained: auto-encoder (AE) and auto-regressive (AR) models. We compare AE and AR style LMs using two widely used models from each category with a comparable number of parameters. From the experiments involving BERT, RoBERTa (Liu et al., 2019), XLNet Yang et al. (2019) and GPT-2 (Radford et al., 2019), we find that AE based models perform better than AR style models (with comparable model sizes), and RoBERTa is the best choice amongst these four widely used models.

3 Data & Tasks

We evaluate approaches over 7 human-level tasks spanning demographics, mental health, and personality prediction. The 3 datasets used for these tasks are described below.

FB-Demogs. (age, gen, ope, ext)

One of our goals was to leverage one of the largest human-level datasets in order to evaluate over subsamples of varying sizes. For this, we used the Facebook demographic and personality dataset of Kosinski et al. (2013). The data was collected from approximately 71k consenting participants who shared Facebook posts along with demographic and personality scores from Jan-2009 through Oct-2011. The users in this sample had written at least 1000 words and had selected English as their primary language. Age (age) was self-reported and limited to those 65 years or younger (data beyond this age becomes very sparse), as in Sap et al. (2014). Gender (gen) was only provided as a limited single binary, male-female classification.

Personality was derived from the Big 5 personality traits questionnaires, including both extraversion (ext, one's tendency to be energized by social interaction) and openness (ope, one's tendency to be open to new ideas) (Schwartz et al., 2013). Disattenuated Pearson correlation was used to measure the performance of these two personality prediction tasks. [Footnote 2: Disattenuated Pearson correlation helps account for the error of the measurement instrument (Kosinski et al., 2013; Murphy and Davidshofer, 1988). The reliabilities used follow Lynn et al. (2020).]
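As an illustration of the metric, the sketch below applies the standard correction for attenuation, dividing the observed Pearson r by the square root of the product of the two reliabilities. The function names and the reliability values (0.8 and 0.9) are placeholders chosen for this example, not the values used in the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def disattenuated_r(r: float, reliability_x: float, reliability_y: float) -> float:
    """Correct an observed correlation for measurement error in both variables."""
    return r / math.sqrt(reliability_x * reliability_y)


# Hypothetical predictions and questionnaire scores for five users.
predicted = [0.1, 0.4, 0.35, 0.8, 0.6]
observed = [0.2, 0.3, 0.5, 0.7, 0.4]
r = pearson_r(predicted, observed)

# Placeholder reliabilities (assumed for illustration only).
r_dis = disattenuated_r(r, reliability_x=0.8, reliability_y=0.9)
print(round(r, 3), round(r_dis, 3))
```

Because both reliabilities are at most 1, the disattenuated value is never smaller in magnitude than the raw correlation.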

CLPsych-2018. (bsag, gen2)

The CLPsych 2018 shared task (Lynn et al., 2018) consisted of sub-tasks aimed at early prediction of mental health scores (depression, anxiety and BSAG score) based on language. [Footnote 3: The Bristol Social Adjustment Guide (Ghodsian, 1977) contains twelve sub-scales that measure different aspects of childhood behavior.] The data for this shared task (Power and Elliott, 2005) comprised English essays written by 11 year old students along with their gender (gen2) and income classes. There were 9217 students' essays for training and 1000 for testing. The average word count in an essay was less than 200. Each essay was annotated with the student's psychological health measure, BSAG (when 11 years old), and distress scores at ages 23, 33, 42 and 50. This task used disattenuated Pearson correlation as the metric.

CLPsych-2019. (sui)

This 2019 shared task (Zirikly et al., 2019) comprised 3 sub-tasks for predicting the suicide risk level of Reddit users. The data included a history of user posts on r/SuicideWatch (SW), a subreddit dedicated to those wanting to seek outside help for processing their current state of emotions. Their posts on other subreddits (NonSuicideWatch) were also collected. The users were annotated with one of 4 risk levels (none, low, moderate and severe risk) based on their history of posts. In total this task spans 496 users in training and 125 in testing. We focused on Task A, predicting the suicide risk of a user by evaluating their (English) posts across SW, measured via macro-F1.
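For readers less familiar with the metric, macro-F1 averages the per-class F1 scores so that each of the four risk levels counts equally regardless of how many users fall into it. The following is a minimal sketch with made-up labels; the function name and the example predictions are illustrative only.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical risk labels for six users.
truth = ["none", "low", "moderate", "severe", "severe", "none"]
preds = ["none", "low", "severe", "severe", "moderate", "none"]
print(macro_f1(truth, preds))  # 0.625
```

In this toy case the per-class F1 scores are 1.0 (none), 1.0 (low), 0.0 (moderate) and 0.5 (severe), so a single badly handled class pulls the average down noticeably.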

                           Sap et al.     Lynn et al.      Zirikly et al.
                           (Facebook)     (CLPsych 2018)   (CLPsych 2019)
Pre-training users         56,764         9,217            496
Max. task-training users   10,000         9,217            496
Test users                 5,000          1,000            125
Table 1: Summary of the datasets. The first row is the number of users available for pre-training the dimension reduction model; the second row is the maximum number of users available for task training. For CLPsych 2018 and CLPsych 2019, this is the same sample as the pre-training data. For Facebook, a disjoint set of 10k users was available for task training. The third row is the number of test users, which is always a disjoint set of users from the pre-training and task training samples.

4 Methods

Here we discuss how we utilized representations from transformers, our approaches to dimensionality reduction, and our technique for robust evaluation using bootstrapped sampling.

4.1 Transformer Representations

The message representation from a layer was attained by averaging the token embeddings of that layer. The second to last layer representation of all the messages was then averaged to produce a 768 dimensional feature for each user. [Footnote 4: The second to last layer was chosen owing to its consistent performance in capturing semantic and syntactic structures Jawahar et al. (2019).] These user representations are reduced to lower dimensions as described in the following paragraphs. To consider a variety of transformer LM architectures, we explored two popular auto-encoder (BERT and RoBERTa) and two auto-regressive (XLNet and GPT-2) transformer-based models.
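The two-stage averaging described above (tokens to message, then messages to user) can be sketched as follows. This is a toy illustration on plain Python lists with 2-dimensional vectors; in practice the vectors would be 768-dimensional hidden states from the second to last layer, and the function names here are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def user_representation(messages: List[List[Vector]]) -> Vector:
    """messages[m][t] is the layer embedding of token t in message m."""
    # Step 1: average token embeddings to get one vector per message.
    message_vectors = [mean_vector(tokens) for tokens in messages]
    # Step 2: average message vectors to get one vector per user.
    return mean_vector(message_vectors)


toy_user = [
    [[1.0, 0.0], [3.0, 2.0]],  # message 1 (two tokens) averages to [2.0, 1.0]
    [[4.0, 5.0]],              # message 2 (one token) stays [4.0, 5.0]
]
print(user_representation(toy_user))  # [3.0, 3.0]
```

Note that averaging per message first gives every message equal weight, so a long message does not dominate the user vector just because it has more tokens.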

For fine-tuning evaluations, we used the transformer based model that performs best across the majority of our task suite. Transformers are typically trained on single messages or pairs of messages at a time. Since we are tuning towards a human-level task, we label each of a user's messages with the user's human-level attribute and treat it as a standard document-level task (Morales et al., 2019). Since we are interested in relative differences in performance, we limit each user to at most 20 randomly sampled messages (approximately the median number of messages) to save compute time for the fine-tuning experiments.

Algorithm 1 Dimension Reduction and Evaluation
Notation: hidden size; function to train dimension reduction; linear model; loss function (logistic loss for classification and L2 loss for regression); learning rate; number of iterations (100).
Data: pre-training embeddings; task training embeddings; test embeddings; outcome for train set; outcome for test set.
Procedure: an outer for-loop that draws a training sample on each pass and contains an inner for-loop.

4.2 Dimension Reduction

We explore singular value decomposition-based methods such as principal components analysis (PCA) (Halko et al., 2011), non-negative matrix factorization (NMF) (Févotte and Idier, 2011) and factor analysis (FA), as well as a deep learning approach: multi-layer non-linear auto-encoders (NLAE) (Hinton and Salakhutdinov, 2006). We also considered the post processing algorithm (PPA) of word embeddings (Mu and Viswanath, 2018), which has shown effectiveness with PCA at the word level (Raunak et al., 2019). [Footnote 5: The 'D' value was set to .] Importantly, besides transformer LMs being pre-trained, so too can dimension reduction. Therefore, we distinguish: (1) learning the transformation from higher dimensions to lower dimensions (preferably on a large data sample from the same domain) and (2) applying the learned transformation (on the task's train/test set). For the first step, we used a separate set of about 56k unlabeled users in the case of FB-demog. [Footnote 6: These pre-trained dimension reduction models are made available.] For CLPsych-2018 and -2019 (where separate data from the exact domains was not readily available), we used the task training data to train the dimension reduction. Since variance explained in factor analysis typically follows a power law, these methods transformed the 768 original embedding dimensions down to k dimensions, with k in powers of 2: 16, 32, 64, 128, 256 or 512.
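The distinction between (1) learning the reduction on a large unlabeled sample and (2) applying it to the task's train and test sets can be sketched with PCA as below. This is a minimal sketch under stated assumptions: it uses an exact SVD from NumPy rather than the randomized solver of Halko et al. (2011), the data is random noise standing in for user embeddings, and `fit_pca` and `apply_pca` are hypothetical helper names.

```python
import numpy as np


def fit_pca(pretrain: np.ndarray, k: int):
    """Step 1: learn a k-dimensional projection from unlabeled embeddings."""
    mean = pretrain.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(pretrain - mean, full_matrices=False)
    return mean, vt[:k].T  # components has shape (original_dim, k)


def apply_pca(embeddings: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Step 2: apply the learned projection to new embeddings."""
    return (embeddings - mean) @ components


rng = np.random.default_rng(0)
pretrain = rng.normal(size=(1000, 768))   # stand-in for unlabeled user embeddings
task_train = rng.normal(size=(100, 768))  # stand-in for task training users
task_test = rng.normal(size=(50, 768))    # stand-in for test users

mean, components = fit_pca(pretrain, k=128)
train_k = apply_pca(task_train, mean, components)
test_k = apply_pca(task_test, mean, components)
print(train_k.shape, test_k.shape)  # (100, 128) (50, 128)
```

The key design point is that the mean and components come only from the pre-training sample, so the same fixed transformation is reused for every task sample size and never sees test users.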

samples  LM type  LM name   demographics           personality     mental health
100      AE       BERT      0.533  0.703  0.761    0.163  0.184    0.424  0.360
100      AE       RoBERTa   0.589  0.712  0.761    0.123  0.203    0.455  0.363
100      AR       XLNet     0.582  0.582  0.744    0.130  0.203    0.441  0.315
100      AR       GPT-2     0.517  0.584  0.624    0.082  0.157    0.397  0.349
500      AE       BERT      0.686  0.810  0.837    0.278  0.354    0.484  0.466
500      AE       RoBERTa   0.700  0.802  0.852    0.283  0.361    0.490  0.432
500      AR       XLNet     0.697  0.796  0.821    0.261  0.336    0.508  0.439
500      AR       GPT-2     0.646  0.756  0.762    0.211  0.280    0.481  0.397
Table 2: Comparison of the most commonly used auto-encoder (AE) and auto-regressive (AR) language models after reducing the 768 dimensions to 128 using NMF, trained on 100 and 500 samples for each task. The first column gives the number of samples used for training each task. Classification tasks (gen, gen2 and sui) were scored using macro-F1 (F1); the remaining regression tasks were scored using Pearson r or disattenuated Pearson r. AE models predominantly perform the best. RoBERTa and BERT show consistent performance, with the former performing the best in most tasks. The LMs in the table were base models (approx. 110M parameters).

4.3 Bootstrapped Sampling & Training

We systematically evaluate the role of training sample size versus the number of embedding dimensions (k) for human-level prediction tasks. The approach is described in Algorithm 1. For each training sample size, the task-specific train data (after dimension reduction) is sampled randomly (with replacement) to get ten training samples with that many users each. Small sample sizes simulate a low-data regime and were used to understand the relationship between sample size and the least number of dimensions required to perform the best. Bootstrapped sampling was done to arrive at a conservative estimate of performance. Each of the bootstrapped samples was used to train either an L2 penalized (ridge) regression model or a logistic regression model, for the regression and classification tasks respectively. The performance on the test set using models from each bootstrapped training sample was recorded in order to derive a mean and standard error for each combination of sample size and k, and for each task.
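The bootstrapped evaluation for a regression task can be sketched as follows. This is an illustrative sketch only: it uses a closed-form ridge solution in NumPy rather than the iterative training of Algorithm 1, the penalty `alpha=1.0` and the synthetic data are assumptions, and `ridge_fit` and `bootstrap_scores` are hypothetical names.

```python
import numpy as np


def ridge_fit(x: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Closed-form L2 penalized regression with an unpenalized intercept."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc = x - x_mean
    w = np.linalg.solve(xc.T @ xc + alpha * np.eye(x.shape[1]), xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def bootstrap_scores(x_train, y_train, x_test, y_test, n_users, n_samples=10, seed=0):
    """Mean and standard error of test Pearson r over bootstrapped training samples."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_samples):
        # Draw n_users training users with replacement.
        idx = rng.integers(0, len(x_train), size=n_users)
        w, b = ridge_fit(x_train[idx], y_train[idx])
        pred = x_test @ w + b
        scores.append(np.corrcoef(pred, y_test)[0, 1])
    scores = np.asarray(scores)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(n_samples)


# Synthetic stand-in for reduced embeddings (k = 16) and a continuous outcome.
rng = np.random.default_rng(1)
true_w = rng.normal(size=16)
x_train = rng.normal(size=(1000, 16))
y_train = x_train @ true_w + rng.normal(scale=0.5, size=1000)
x_test = rng.normal(size=(200, 16))
y_test = x_test @ true_w + rng.normal(scale=0.5, size=200)

mean_r, se_r = bootstrap_scores(x_train, y_train, x_test, y_test, n_users=100)
print(round(float(mean_r), 3), round(float(se_r), 4))
```

Repeating this call over a grid of sample sizes and values of k yields the per-cell means and standard errors that the rest of the analysis builds on.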

To summarize results over the many tasks and the possible values of sample size and k in a useful fashion, we propose a 'first k to peak (fkp)' metric. For each training sample size, this is the first observed k value for which the mean score is within the 95% confidence interval of the peak performance. This quantifies the minimum number of dimensions required for peak performance.
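A minimal sketch of the fkp computation for one task and one training sample size is given below. It assumes the 95% confidence interval of the peak is approximated as the peak mean minus 1.96 standard errors, which is an assumption of this example, and the scores in `results` are invented for illustration.

```python
from typing import Dict, Tuple


def first_k_to_peak(results: Dict[int, Tuple[float, float]], z: float = 1.96) -> int:
    """results maps k -> (mean score, standard error) for one task and sample size."""
    peak_k = max(results, key=lambda k: results[k][0])
    peak_mean, peak_se = results[peak_k]
    lower_bound = peak_mean - z * peak_se  # lower edge of the peak's interval
    for k in sorted(results):
        if results[k][0] >= lower_bound:
            return k
    return peak_k  # not reached: the peak itself always satisfies the bound


# Hypothetical mean scores and standard errors per number of dimensions.
results = {
    16: (0.300, 0.02),
    32: (0.360, 0.02),
    64: (0.395, 0.01),
    128: (0.400, 0.01),
    256: (0.398, 0.01),
}
print(first_k_to_peak(results))  # 64
```

Here the peak is at k=128 with a lower bound of about 0.380, and k=64 is the smallest k whose mean score clears that bound.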

5 Results

5.1 Best LM for Human-Level Tasks

We start by comparing transformer LMs, replicating the setup of one of the state-of-the-art systems for the CLPsych-2019 task in which embeddings were reduced from BERT-base to approximately 100 dimensions using NMF (Matero et al., 2019). Specifically, we used 128 dimensions (to stick with powers of 2 that we use throughout this work) as we explore the other LMs over multiple tasks (we will explore other dimensions next) and otherwise use the bootstrapped evaluation described in the method.

Table 2 shows the comparison of the four transformer LMs when varying the sample size between two low data regimes: 100 and 500. [Footnote 7: The performance of all transformer embeddings without any dimension reduction, along with smaller sized models, can be found in appendix section D.3.] RoBERTa and BERT were the best performing models in almost all the tasks, suggesting that auto-encoder based LMs are better than auto-regressive models for these human-level tasks. Further, RoBERTa performed better than BERT in the majority of cases. Since the number of model parameters is comparable, this may be attributable to RoBERTa's larger pre-training corpus, which includes more human discourse and larger vocabularies than BERT's.

samples  Method       Age    Gen
100      Fine-tuned   0.54   0.54
100      Pre-trained  0.56   0.63
500      Fine-tuned   0.64   0.60
500      Pre-trained  0.66   0.74
Table 3: Comparison of task specific fine-tuning of RoBERTa (top 2 layers) and pre-trained RoBERTa embeddings (second to last layer) for age and gender prediction tasks. Results are averaged across 5 trials, randomly sampling a number of users equal to the sample size (first column) from the Facebook data and reducing messages to a maximum of 20 per user.

pre-training users  samples  method    demographics           personality     mental health
56k*                100      PCA       0.650  0.747  0.777    0.189  0.248    0.466  0.392
56k*                100      PCA-PPA   0.517  0.715  0.729    0.173  0.176    0.183  0.358
56k*                100      FA        0.534  0.722  0.729    0.171  0.183    0.210  0.360
56k*                100      NMF       0.589  0.712  0.761    0.123  0.203    0.455  0.363
56k*                100      NLAE      0.654  0.744  0.782    0.188  0.263    0.447  0.367
56k*                500      PCA       0.729  0.821  0.856    0.346  0.384    0.514  0.416
56k*                500      PCA-PPA   0.707  0.814  0.849    0.317  0.349    0.337  0.415
56k*                500      FA        0.713  0.819  0.849    0.322  0.361    0.400  0.415
56k*                500      NMF       0.700  0.802  0.852    0.283  0.361    0.490  0.432
56k*                500      NLAE      0.725  0.820  0.843    0.340  0.394    0.485  0.409
500                 100      PCA       0.644  0.749  0.788    0.186  0.248    0.412  0.392
500                 100      NLAE      0.634  0.743  0.744    0.159  0.230    0.433  0.367
500                 500      PCA       0.726  0.819  0.850    0.344  0.382    0.509  0.416
500                 500      NLAE      0.715  0.798  0.811    0.312  0.360    0.490  0.409
Table 4: Comparison of different dimension reduction techniques applied to RoBERTa embeddings (penultimate layer), reduced down to 128 dimensions, with 100 and 500 training samples. (*) The number of user samples for pre-training the dimension reduction model was 56k, except for gen2 and bsag (which had 9k users) and sui (which had 496 users). PCA performs the best overall, and NLAE consistently performs about as well as PCA. With a uniform pre-training size (500), PCA performs better than NLAE.

5.2 Fine-Tuning Best LM

We next evaluate fine-tuning in these low data situations. [Footnote 8: As we are focused on readily available models, we consider substantial changes to the architecture or training as outside the scope of this systematic evaluation of existing techniques.] Utilizing RoBERTa, the best performing transformer from the previous experiments, we perform fine-tuning across the age and gender tasks. Following Sun et al. (2019); Mosbach et al. (2020), we freeze layers 0-9 and fine-tune layers 10 and 11. Even these top 2 layers alone of RoBERTa still result in a model that is updating tens of millions of parameters while being tuned to a dataset of hundreds of users and at most 10,000 messages.

In Table 3, results for age and gender are shown for sample sizes of both 100 and 500. For age, the average prediction across all of a user's messages was used as the user's prediction, and for gender the mode was used. Overall, we find that fine-tuning offers lower performance with increased overhead in both training time and modeling complexity (hyperparameter tuning, layer selection, etc.).
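The aggregation from message-level predictions to a user-level prediction can be sketched as below: the mean for age and the mode for gender. The user ids and prediction values are made up, and the tie-breaking behavior (Python's `statistics.mode` returns the first mode encountered, in Python 3.8 and later) is an assumption of this sketch rather than something the text specifies.

```python
from statistics import mean, mode
from typing import Dict, List


def aggregate_age(message_predictions: List[float]) -> float:
    """User-level age is the average of the per-message age predictions."""
    return mean(message_predictions)


def aggregate_gender(message_predictions: List[int]) -> int:
    """User-level gender is the most frequent per-message class prediction."""
    return mode(message_predictions)


# Hypothetical per-message predictions for two users.
age_by_message: Dict[str, List[float]] = {
    "user_a": [24.0, 30.0, 27.0],
    "user_b": [41.0, 39.0],
}
gender_by_message: Dict[str, List[int]] = {
    "user_a": [1, 0, 1],
    "user_b": [0, 0],
}

user_age = {u: aggregate_age(p) for u, p in age_by_message.items()}
user_gender = {u: aggregate_gender(p) for u, p in gender_by_message.items()}
print(user_age)     # {'user_a': 27.0, 'user_b': 40.0}
print(user_gender)  # {'user_a': 1, 'user_b': 0}
```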

We did robustness checks for hyper-parameters to offer more confidence that this result was not simply due to the fastidious nature of fine-tuning. The process is described in Appendix B, including an extensive exploration of hyper-parameters, which never resulted in improvements over the pre-trained setup. We are left to conclude that fine-tuning over such small user samples, at least with current typical techniques, is not able to produce results on par with using transformers to produce pre-trained embeddings.

5.3 Best Reduction technique for Human-Level Tasks

We evaluated the reduction techniques in the low data regime by comparing their performance on the downstream tasks across 100 and 500 training samples. As described in the methods, techniques including PCA, NMF and FA, along with NLAE, were applied to reduce the 768 dimensional RoBERTa embeddings to 128 features. The results in Table 4 show that PCA and NLAE perform most consistently, with PCA having the best scores in the majority of tasks. NLAE's performance appears dependent on the amount of data available during pre-training. This is evident from the results in Table 4 where the pre-training sample size was set to a uniform value (500) and tested for all the tasks with the training sample size set to 100 and 500. Thus, PCA appears more reliable, generalizing better from small samples.

Figure 1: Comparison of performance for all regression tasks (age, ext, ope and bsag) over varying training sample sizes and numbers of dimensions (k). Results vary by task, but predominantly, performance at k=64 is better than the performance without any reduction. The reduced features almost always perform as well as or better than the original embeddings.

5.4 Performance by Sample Size and Dimensions

Now that we have found that (1) RoBERTa generally performed best, (2) pre-training worked better than fine-tuning, and (3) PCA was most consistently best for dimension reduction (often doing better than the full dimensions), we can systematically evaluate model performance as a function of training sample size and number of dimensions (k) over tasks spanning demographics, personality, and mental health. We exponentially increase k from 16 to 512, recognizing that variance explained decreases exponentially with dimension (Mu and Viswanath, 2018). The performance is also compared with that of using the RoBERTa embeddings without any reduction.

Figure 1 compares the scores at reduced dimensions for age, ext, ope and bsag. These charts depict the experiments in a typical low data regime. Lower dimensional representations performed comparably to the peak performance with only a fraction of the features across the largest number of tasks, and with just 1/12 of the features for the majority of tasks. Charts exploring other ranges of values and the remaining tasks can be found in appendix D.1.

5.5 Least Number of Dimensions Required

samples  demographics (3 tasks)  personality (2 tasks)  mental health (2 tasks)
50       16                      16                     16
100      128                     16                     22
200      512                     32                     45
500      768                     64                     64
1000     768                     90                     64
Table 5: First k to peak (fkp) for each set of tasks: the least value of k that performed statistically equivalent to the best performing setup (peak). The integer shown is the exponential median of the set of tasks. This table summarizes comprehensive testing, and we suggest that its results, fkp, can be used as a recommendation for the number of dimensions to use given a task domain and training set size.

Lastly, we devise an experiment motivated by the question of how many dimensions are necessary to achieve top results, given a limited sample size. Specifically, we define 'first k to peak' (fkp) as the least value of k that produces an accuracy equivalent to the peak performance. A 95% confidence interval was computed for the best score (peak) for each task and each training sample size based on bootstrapped resamples, and fkp was the least number of dimensions where this threshold was passed.

Our goal is that such results can provide a systematic guide for making such modeling decisions in future human-level NLP tasks, where such an experiment (which relies on resampling over larger amounts of training data) is typically not feasible. Table 5 shows the fkp over all of the training sample sizes. The exponential median (med) in the table is calculated as follows:

$$\mathrm{med}(x) = 2^{\mathrm{Median}(\log_2 x)}$$
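As a worked illustration, the sketch below assumes the exponential median is 2 raised to the median of the base-2 logarithms of the per-task fkp values. This reading is an assumption, though it is consistent with Table 5 entries such as 22, 45 and 90, which fall between adjacent powers of 2 when a two-task set disagrees. The function name is hypothetical.

```python
import math
from statistics import median
from typing import Iterable


def exponential_median(values: Iterable[float]) -> float:
    """Median taken in log2 space, then mapped back to the original scale."""
    return 2 ** median(math.log2(v) for v in values)


# Two tasks whose fkp values are adjacent powers of 2.
print(int(exponential_median([16, 32])))   # 22
print(int(exponential_median([32, 64])))   # 45
print(int(exponential_median([64, 128])))  # 90

# With three tasks the result is simply the middle value.
print(exponential_median([16, 128, 512]))  # 128.0
```

Taking the median in log space matches the choice of k in powers of 2: for an even number of tasks it returns the geometric rather than the arithmetic midpoint of the two middle values.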

The fkp results suggest that having more training samples available yields the ability to leverage more dimensions, but the degree to which this holds depends on the task. In fact, utilizing all the embedding dimensions was only effective for demographic prediction tasks. The other two sets of tasks benefited from reduction, often needing only a small fraction of the original second to last transformer layer dimensions.

6 Error Analysis

Here, we seek to better understand why using pre-trained models worked better than fine-tuning, and the differences between using PCA and NMF components in the low sample setting.

Association with lower error              LIWC variables
Fine-tuned model has lower error          Informal, Netspeak, Negemo, Swear, Anger
Pre-trained model has lower error         Affiliation, Social, We, They, Family, Function, Drives, Prep, Focuspast, Quant
Table 6: Top LIWC variables having negative and positive correlations with the difference in the absolute error of the pre-trained model and the fine-tuned model for age prediction (Benjamini-Hochberg FDR corrected). This suggests that the fine-tuned models have lower error than the pre-trained model when the language is informal and contains more affect words.

Pre-trained vs Fine-tuned.

We looked at categories of language from LIWC Tausczik and Pennebaker (2010) that correlated with the difference in the absolute error of the pre-trained and fine-tuned models in age prediction. Table 6 suggests that the pre-trained model is better at handling users whose language conforms to formal rules, while fine-tuning helps in learning a better representation of affect words and captures informal language well. Furthermore, these LIWC variables are also known to be associated with age (Schwartz et al., 2013). Additional analysis comparing these two models is available in appendix E.1.
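Because many LIWC categories are tested at once, the correlations in Table 6 are reported under a Benjamini-Hochberg false discovery rate correction. A minimal sketch of that procedure follows; the significance level `alpha=0.05` and the p-values are placeholders for illustration, since the threshold used for Table 6 is not recoverable from the text here.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it is significant under BH FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is under its rank-scaled threshold.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    # Everything ranked at or below that rank is declared significant.
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            significant[i] = True
    return significant


# Hypothetical p-values for four LIWC categories.
p_values = [0.001, 0.02, 0.03, 0.2]
print(benjamini_hochberg(p_values))  # [True, True, True, False]
```

With four tests the thresholds are 0.0125, 0.025, 0.0375 and 0.05 in rank order, so the first three p-values pass and the last does not.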

Figure 2: Comparison of the absolute error of NMF and PCA against the average number of 1-grams per message. While both models appear to perform very similarly when the texts are short or of average length, PCA is better at handling longer texts. The errors diverge as the length of the texts increases.


Figure 2 suggests that PCA is better at handling longer text sequences than NMF (> 55 one grams on avg) when trained with less data. This choice wouldn’t make much difference when used for Tweet-like short texts, but the errors diverge rapidly for longer samples. We also see that PCA is better at capturing information from these texts that have higher predictive power in downstream tasks. This is discussed in appendix E.2 along with other interesting findings involving the comparison of PCA and the pre-trained model in E.3.

7 Discussion

Ethical Consideration.

We used existing datasets that were either collected with participant consent (FB and CLPsych 2018) or public data with identifiers removed and collected in a non-intrusive manner (CLPsych 2019). All procedures were reviewed and approved by both our institutional review board as well as the IRB of the creators of the data set.

Our work can be seen as part of the growing body of interdisciplinary research intended to understand human attributes associated with language, aiming towards applications that can improve human life, such as producing better mental health assessments that could ultimately save lives. However, at this stage, our models are not intended to be used in practice for mental health care nor for publicly labeling individuals with mental health, personality, or demographic scores. Even when the point comes where such models are ready for testing in clinical settings, this should only be done with oversight from professionals in mental health care to establish the failure modes and their rates (e.g. false-positives leading to incorrect treatment or false-negatives leading to missed care; increased inaccuracies due to evolving language; disparities in failure modes by demographics). Malicious use possibilities for which this work is not intended include targeting advertising to individuals using language-based psychology scores, which could present harmful content to those suffering from mental health conditions.

We intend that the results of our empirical study are used to inform fellow researchers in computational linguistics and psychology on how to better utilize contextual embeddings towards the goal of improving psychological and mental health assessments. Mental health conditions, such as depression, are widespread and many suffering from such conditions are under-served with only 13 - 49% receiving minimally adequate treatment  (Kessler et al., 2003; Wang et al., 2005). Marginalized populations, such as those with low income or minorities, are especially under-served Saraceno et al. (2007). Such populations are well represented in social media Center (2021) and with this technology developed largely over social media and predominantly using self-reported labels from users (i.e., rather than annotator-perceived labels that sometimes introduce bias Sap et al. (2019); Flekova et al. (2016)), we do not expect that marginalized populations are more likely to hit failure modes. Still, tests for error disparities Shah et al. (2020) should be carried out in conjunction with clinical researchers before this technology is deployed. We believe this technology offers the potential to broaden the coverage of mental health care to such populations where resources are currently limited.

Future assessments built on the learnings of this work, and in conjunction with clinical mental health researchers, could help the under-served by both better classifying one’s condition as well as identifying an ideal treatment. Any applications to human subjects should consider the ethical implications, undergo human subjects review, and the predictions made by the model should not be shared with the individuals without consulting the experts.


Each dataset brings its own unique selection biases across groups of people, which is one reason we tested across many datasets covering a variety of human demographics. Most notably, the FB dataset is skewed young and is geographically focused on residents within the United States. The CLPsych-2018 dataset is a representative sample of citizens of the United Kingdom, all born in the same week, and the CLPsych-2019 dataset was further limited primarily to those posting in a suicide-related forum Zirikly et al. (2019). Further, tokenization techniques can also impact language model performance Bostrom and Durrett (2020). To avoid oversimplification of complex human attributes, in line with psychological research Haslam et al. (2012), all outcomes were kept in their most dimensional form; e.g., personality scores were kept as real values rather than divided into bins, and the CLPsych-2019 risk levels were kept at 4 levels to yield gradation in assessments as justified by Zirikly et al. (2019).

8 Conclusion

We provide the first empirical evaluation of the effectiveness of contextual embeddings as a function of dimensionality and sample size for human-level prediction tasks. Multiple human-level tasks, along with many of the most popular language model techniques, were systematically evaluated in conjunction with dimension reduction techniques to derive optimal setups for the low-sample regimes characteristic of many human-level tasks.

We first show that fine-tuning transformer LMs in low-data scenarios yields worse performance than using pre-trained models. We then show that reducing the dimensions of contextual embeddings can improve performance, and while past work used non-negative matrix factorization Matero et al. (2019), we note that PCA gives the most reliable improvement. Auto-encoder based transformer language models gave better performance, on average, than their auto-regressive contemporaries of comparable sizes. We find optimized versions of BERT, specifically RoBERTa, to yield the best results.

Finally, we find that many human-level tasks can be achieved with a fraction, often as little as 1/12, of the total transformer hidden-state size without sacrificing significant accuracy. Generally, using fewer dimensions also reduces variance in model performance, in line with traditional bias-variance trade-offs, and thus increases the chance of generalizing to new populations. Further, it can aid in explainability, especially when considering that these dimension reduction models can be pre-trained and standardized, and thus compared across problem sets and studies.


  • Ambalavanan et al. (2019) Ashwin Karthik Ambalavanan, Pranjali Dileep Jagtap, Soumya Adhya, and Murthy Devarakonda. 2019. Using contextual representations for suicide risk assessment from Internet forums. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 172–176, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Bao and Qiao (2019) Xingce Bao and Qianqian Qiao. 2019. Transfer learning from pre-trained BERT for pronoun resolution. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 82–88, Florence, Italy. Association for Computational Linguistics.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
  • Bostrom and Durrett (2020) Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617–4624, Online. Association for Computational Linguistics.
  • Center (2021) Pew Research Center. 2021. Social media fact sheet.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
  • Coppersmith et al. (2014) Glen Coppersmith, Mark Dredze, and Craig Harman. 2014. Quantifying mental health signals in Twitter. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 51–60, Baltimore, Maryland, USA. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Eichstaedt et al. (2018) J. Eichstaedt, R. Smith, R. Merchant, Lyle H. Ungar, Patrick Crutchley, Daniel Preotiuc-Pietro, D. Asch, and H. A. Schwartz. 2018. Facebook language predicts depression in medical records. Proceedings of the National Academy of Sciences of the United States of America, 115:11203 – 11208.
  • Ethayarajh (2019) Kawin Ethayarajh. 2019. How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.
  • Févotte and Idier (2011) Cédric Févotte and Jérôme Idier. 2011. Algorithms for nonnegative matrix factorization with the β-divergence. Neural computation, 23(9):2421–2456.
  • Flekova et al. (2016) Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preoţiuc-Pietro. 2016. Analyzing biases in human perception of user age and gender from text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 843–854, Berlin, Germany. Association for Computational Linguistics.
  • Ghodsian (1977) M Ghodsian. 1977. Children’s behaviour and the bsag: some theoretical and statistical considerations. British Journal of Social and Clinical Psychology, 16(1):23–28.
  • Guntuku et al. (2017) Sharath Chandra Guntuku, David B Yaden, Margaret L Kern, Lyle H Ungar, and Johannes C Eichstaedt. 2017. Detecting depression and mental illness on social media: an integrative review. Current Opinion in Behavioral Sciences, 18:43–49.
  • Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
  • Halko et al. (2011) Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288.
  • Haslam et al. (2012) Nick Haslam, Elise Holland, and Peter Kuppens. 2012. Categories versus dimensions in personality and psychopathology: a quantitative review of taxometric research. Psychological medicine, 42(5):903.
  • Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
  • Hovy and Spruit (2016) Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598.
  • Jawahar et al. (2019) Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.
  • Jiang et al. (2020) Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2177–2190, Online. Association for Computational Linguistics.
  • Kern et al. (2014) Margaret L Kern, Johannes C Eichstaedt, H Andrew Schwartz, Lukasz Dziurzynski, Lyle H Ungar, David J Stillwell, Michal Kosinski, Stephanie M Ramones, and Martin EP Seligman. 2014. The online social self: An open vocabulary approach to personality. Assessment, 21(2):158–169.
  • Kern et al. (2016) Margaret L Kern, Gregory Park, Johannes C Eichstaedt, H Andrew Schwartz, Maarten Sap, Laura K Smith, and Lyle H Ungar. 2016. Gaining insights from social media language: Methodologies and challenges. Psychological methods, 21(4):507.
  • Kessler et al. (2003) Ronald C Kessler, Patricia Berglund, Olga Demler, Robert Jin, Doreen Koretz, Kathleen R Merikangas, A John Rush, Ellen E Walters, and Philip S Wang. 2003. The epidemiology of major depressive disorder: results from the national comorbidity survey replication (ncs-r). Jama, 289(23):3095–3105.
  • Kosinski et al. (2013) M. Kosinski, D. Stillwell, and T. Graepel. 2013. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110:5802 – 5805.
  • Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations.
  • Li and Eisner (2019) Xiang Lisa Li and Jason Eisner. 2019. Specializing word embeddings (for parsing) by information bottleneck. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2744–2754, Hong Kong, China. Association for Computational Linguistics.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Lynn et al. (2020) Veronica Lynn, Niranjan Balasubramanian, and H. Andrew Schwartz. 2020. Hierarchical modeling for user personality prediction: The role of message-level attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5306–5316, Online. Association for Computational Linguistics.
  • Lynn et al. (2019) Veronica Lynn, Salvatore Giorgi, Niranjan Balasubramanian, and H Andrew Schwartz. 2019. Tweet classification without the tweet: An empirical examination of user versus document attributes. In Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, pages 18–28.
  • Lynn et al. (2018) Veronica Lynn, Alissa Goodman, Kate Niederhoffer, Kate Loveys, Philip Resnik, and H. Andrew Schwartz. 2018. CLPsych 2018 shared task: Predicting current and future psychological health from childhood essays. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 37–46, New Orleans, LA. Association for Computational Linguistics.
  • Matero et al. (2019) Matthew Matero, Akash Idnani, Youngseo Son, Salvatore Giorgi, Huy Vu, Mohammad Zamani, Parth Limbachiya, Sharath Chandra Guntuku, and H. Andrew Schwartz. 2019. Suicide risk assessment with multi-level dual-context language and BERT. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 39–44, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Mayfield and Black (2020) Elijah Mayfield and Alan W Black. 2020. Should you fine-tune BERT for automated essay scoring? In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 151–162, Seattle, WA, USA → Online. Association for Computational Linguistics.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Milne et al. (2016) David N. Milne, Glen Pink, Ben Hachey, and Rafael A. Calvo. 2016. CLPsych 2016 shared task: Triaging content in online peer-support forums. In Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology, pages 118–127, San Diego, CA, USA. Association for Computational Linguistics.
  • Morales et al. (2019) Michelle Morales, Prajjalita Dey, Thomas Theisen, Danny Belitz, and Natalia Chernova. 2019. An investigation of deep learning systems for suicide risk assessment. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 177–181, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Morgan-Lopez et al. (2017) Antonio A Morgan-Lopez, Annice E Kim, Robert F Chew, and Paul Ruddle. 2017. Predicting age groups of twitter users based on language and metadata features. PloS one, 12(8):e0183537.
  • Mosbach et al. (2020) Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2020. On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884.
  • Mu and Viswanath (2018) Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
  • Murphy and Davidshofer (1988) K. Murphy and C. Davidshofer. 1988. Psychological testing: Principles and applications.
  • Park et al. (2015) Gregory Park, H. A. Schwartz, J. Eichstaedt, M. Kern, M. Kosinski, D. Stillwell, Lyle H. Ungar, and M. Seligman. 2015. Automatic personality assessment through social media language. Journal of personality and social psychology, 108 6:934–52.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Power and Elliott (2005) Chris Power and Jane Elliott. 2005. Cohort profile: 1958 British birth cohort (National Child Development Study). International Journal of Epidemiology, 35(1):34–41.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Raunak et al. (2019) Vikas Raunak, Vivek Gupta, and Florian Metze. 2019. Effective dimensionality reduction for word embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 235–243, Florence, Italy. Association for Computational Linguistics.
  • Resnik et al. (2013) Philip Resnik, Anderson Garron, and Rebecca Resnik. 2013. Using topic modeling to improve prediction of neuroticism and depression in college students. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1348–1353, Seattle, Washington, USA. Association for Computational Linguistics.
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Sap et al. (2019) Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.
  • Sap et al. (2014) Maarten Sap, Greg Park, Johannes C Eichstaedt, Margaret L Kern, David J Stillwell, Michal Kosinski, Lyle H Ungar, and H Andrew Schwartz. 2014. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Saraceno et al. (2007) Benedetto Saraceno, Mark van Ommeren, Rajaie Batniji, Alex Cohen, Oye Gureje, John Mahoney, Devi Sridhar, and Chris Underhill. 2007. Barriers to improvement of mental health services in low-income and middle-income countries. The Lancet, 370(9593):1164–1174.
  • Schwartz et al. (2013) H Andrew Schwartz, Johannes C Eichstaedt, Margaret L Kern, Lukasz Dziurzynski, Stephanie M Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin EP Seligman, et al. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one, 8(9):e73791.
  • Schwartz et al. (2017) H. Andrew Schwartz, Salvatore Giorgi, Maarten Sap, Patrick Crutchley, Lyle Ungar, and Johannes Eichstaedt. 2017. DLATK: Differential language analysis ToolKit. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 55–60, Copenhagen, Denmark. Association for Computational Linguistics.
  • Shah et al. (2020) Deven Santosh Shah, H. Andrew Schwartz, and Dirk Hovy. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.
  • Sun et al. (2019) Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune bert for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.
  • Tausczik and Pennebaker (2010) Y. Tausczik and J. Pennebaker. 2010. The psychological meaning of words: Liwc and computerized text analysis methods. Journal of Language and Social Psychology, 29:24 – 54.
  • Tenney et al. (2019) Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Wang et al. (2005) Philip S Wang, Michael Lane, Mark Olfson, Harold A Pincus, Kenneth B Wells, and Ronald C Kessler. 2005. Twelve-month use of mental health services in the united states: results from the national comorbidity survey replication. Archives of general psychiatry, 62(6):629–640.
  • Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5753–5763.
  • Zirikly et al. (2019) Ayah Zirikly, Philip Resnik, Özlem Uzuner, and Kristy Hollingshead. 2019. CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 24–33, Minneapolis, Minnesota. Association for Computational Linguistics.


Appendix A Experimental Setup


All the experiments were implemented using Python, DLATK (Schwartz et al., 2017), HuggingFace Transformers Wolf et al. (2019), and PyTorch (Paszke et al., 2019). The environments were instantiated with a seed value of 42, except for fine-tuning, which used 1337. Code to reproduce all results is available on our GitHub page: https://www.github.com/adithya8/ContextualEmbeddingDR/


The deep learning models, such as the stacked transformers and the NLAE, were run on a single GPU. The batch size was set as a function of the GPU memory and the model size (the space occupied by the model), both in bytes, together with the number of trainable parameters during fine-tuning, the number of layers of embeddings required, the number of dimensions in the hidden state, and the maximum number of tokens (after tokenization) in any batch. We carried out the experiments with 1 NVIDIA Titan Xp GPU, which has around 12 GB of memory. All the other methods were implemented on CPU.

Figure A1: Depiction of the dimension reduction method. Transformer embeddings of domain data (users’ embeddings; the generation of user embeddings is explained in detail under Methods) are used to pre-train a dimension reduction model that transforms the embeddings down to k dimensions. This step is followed by applying the learned reduction model to the task’s train and test data embeddings. The reduced train features are then bootstrap sampled to produce 10 sets of users for training task-specific models. All 10 task-specific models are evaluated on the reduced test features during task evaluation. The mean and standard deviation of the task-specific metric are collected.
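The procedure in Figure A1 can be sketched roughly as follows. This is a minimal illustration rather than the released code: it assumes PCA as the reduction model, a closed-form ridge regression as the task-specific model, Pearson r as the metric, and synthetic low-rank data standing in for user embeddings. The function names, `k=8`, and the sample sizes are all hypothetical choices.

```python
import numpy as np

def fit_pca(X, k):
    """Learn a k-dimensional PCA projection from domain embeddings X (n x d)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def apply_pca(X, mean, components):
    """Project embeddings onto the previously learned components."""
    return (X - mean) @ components.T

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept (assumed task model)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def bootstrap_evaluate(domain, train_X, train_y, test_X, test_y,
                       k, n_sample, n_sets=10, seed=42):
    rng = np.random.default_rng(seed)
    mean, comps = fit_pca(domain, k)            # pre-train reduction on domain data
    tr = apply_pca(train_X, mean, comps)        # apply it to task train features
    te = apply_pca(test_X, mean, comps)         # ... and to task test features
    scores = []
    for _ in range(n_sets):                     # 10 bootstrap training sets
        idx = rng.choice(len(tr), size=n_sample, replace=True)
        w, b = fit_ridge(tr[idx], train_y[idx])
        scores.append(np.corrcoef(te @ w + b, test_y)[0, 1])  # Pearson r on test set
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic stand-in for user embeddings: 8 latent factors mixed into 64 dimensions.
_rng = np.random.default_rng(0)
_mixing, _weights = _rng.normal(size=(8, 64)), _rng.normal(size=8)

def make_users(n):
    z = _rng.normal(size=(n, 8))
    X = z @ _mixing + 0.1 * _rng.normal(size=(n, 64))
    y = z @ _weights + 0.1 * _rng.normal(size=n)
    return X, y

domain_X, _ = make_users(500)
train_X, train_y = make_users(200)
test_X, test_y = make_users(100)
mean_r, sd_r = bootstrap_evaluate(domain_X, train_X, train_y, test_X, test_y,
                                  k=8, n_sample=50)
```

Because the reduction model is fit only on domain data, the same projection can be reused across tasks; only the small task-specific model is refit on each bootstrap sample.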

Appendix B Model Details

NLAE architecture.

The model architecture for the non-linear auto-encoders in Table 4 was a twin network that takes inputs of 768 dimensions, reduces them to 128 dimensions through 2 layers, and reconstructs the original 768-dimensional representation with 2 further layers. This architecture was chosen to balance enabling non-linear associations against keeping the total number of parameters low, given the low sample size context.
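A network of this shape might look like the following PyTorch sketch. Only the 768 and 128 dimensions and the two-layer encoder and decoder come from the description above; the intermediate width (`hidden_dim=448`), the ReLU activations, and the class name are assumptions made for illustration.

```python
import torch
from torch import nn

class NLAE(nn.Module):
    """Non-linear auto-encoder sketch: 768 -> hidden -> 128 -> hidden -> 768."""

    def __init__(self, in_dim=768, hidden_dim=448, code_dim=128):
        super().__init__()
        # hidden_dim and the ReLU activations are illustrative assumptions.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )

    def forward(self, x):
        code = self.encoder(x)            # reduced 128-dimensional representation
        return self.decoder(code), code   # reconstruction and code

model = NLAE()
batch = torch.randn(4, 768)
reconstruction, code = model(batch)
loss = nn.functional.mse_loss(reconstruction, batch)  # reconstruction objective
```

After training, only the encoder output would be used as the reduced feature set for the downstream task model.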

NLAE Training.

The data for domain pre-training of dimension reduction was split into 2 sets for NLAE alone: training and validation sets. 90% of the domain data was randomly sampled for training the NLAE and the remaining 10% of pre-training data was used to validate hyper-parameters after every epoch. This model was trained with an objective to minimise the reconstruction mean squared loss over multiple epochs. It was trained until the validation loss increased over 3 consecutive epochs. AdamW was the optimizer used with the learning rate set to 0.001. This took around 30-40 epochs depending upon the dataset.
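The split and stopping rule described above can be illustrated with a small helper. This is a sketch under one reading of the stopping criterion (the validation loss rises in 3 consecutive epochs, each compared with the epoch before it); the function names and the example loss values are hypothetical.

```python
import random

def split_domain_data(n_items, train_frac=0.9, seed=42):
    """Randomly split item indices into training (90%) and validation (10%) sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_items * train_frac))
    return indices[:cut], indices[cut:]

def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops, or None if it never stops.

    Training stops once the validation loss has increased for `patience`
    consecutive epochs, each compared with the epoch before it.
    """
    consecutive_increases = 0
    for epoch in range(1, len(val_losses)):
        if val_losses[epoch] > val_losses[epoch - 1]:
            consecutive_increases += 1
        else:
            consecutive_increases = 0
        if consecutive_increases == patience:
            return epoch + 1
    return None

train_idx, val_idx = split_domain_data(1000)
# Made-up validation losses: one isolated rise, then three rises in a row.
losses = [1.0, 0.8, 0.7, 0.72, 0.69, 0.70, 0.71, 0.73, 0.9]
stop_at = stopping_epoch(losses)
```

In practice one would typically keep the weights from the epoch with the lowest validation loss rather than those from the stopping epoch; the text does not say which was done.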


In our fine-tuning configuration, we freeze all but the top 2 layers of the best LM to prevent overfitting and vanishing gradients at the lower layers Sun et al. (2019); Mosbach et al. (2020). We also apply early stopping (with the patience varied between 3 and 6 depending upon the task). Other hyperparameters for this experiment include L2-regularization (in the form of weight decay on the AdamW optimizer, set to 1), dropout set to 0.3, batch size set to 10, learning rate initialized to 5e-5, and the number of epochs set to a maximum of 15, which early stopping limited to between 5 and 10 depending on the task and the early stopping patience.

We arrived at these hyperparameter values after an extensive search. The weight decay parameter was searched in [100, 0.01], dropout within [0.1, 0.5], and learning rate between [5e-4, 5e-5].
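The configuration above can be summarized in a short sketch. The hyperparameter values are those reported in this appendix; the `FineTuneConfig` and `is_trainable` names are hypothetical, and the rule assumes parameter names of the form `encoder.layer.<i>.` (as used by HuggingFace BERT-style models). Treating embeddings as frozen and any task head as trainable is likewise an assumption.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    trainable_top_layers: int = 2     # all lower layers are frozen
    weight_decay: float = 1.0         # L2 regularization via AdamW
    dropout: float = 0.3
    batch_size: int = 10
    learning_rate: float = 5e-5
    max_epochs: int = 15
    early_stopping_patience: int = 3  # varied between 3 and 6 by task

_LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")

def is_trainable(param_name, num_layers, top_k=2):
    """Decide whether a parameter stays trainable when only the top_k layers are unfrozen."""
    match = _LAYER_RE.search(param_name)
    if match is None:
        # Assumption: embeddings are frozen, anything else (e.g. a task head) is trained.
        return "embeddings" not in param_name
    return int(match.group(1)) >= num_layers - top_k

config = FineTuneConfig()
names = [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.9.output.dense.weight",
    "roberta.encoder.layer.10.output.dense.weight",
    "roberta.encoder.layer.11.output.dense.weight",
    "classifier.dense.weight",
]
trainable = [n for n in names
             if is_trainable(n, num_layers=12, top_k=config.trainable_top_layers)]
```

With a real model, the same predicate could be applied while iterating over `model.named_parameters()` and setting `requires_grad` accordingly.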

Appendix C Data

Due to human subjects privacy constraints, most of the data cannot be publicly distributed, but they are available from the original data owners upon request for research purposes (e.g., the CLPsych-2018 and CLPsych-2019 shared tasks).

samples  LM type  LM name  demographics         personality   mental health
100      AE       BERT     0.615  0.754  0.758  0.176  0.225  0.457  0.400
100      AE       RoBERTa  0.649  0.753  0.788  0.167  0.213  0.443  0.381
100      AR       XLNet    0.625  0.698  0.755  0.144  0.152  0.457  0.357
100      AR       GPT-2    0.579  0.708  0.681  0.090  0.110  0.361  0.335
500      AE       BERT     0.721  0.831  0.849  0.332  0.395  0.507  0.489
500      AE       RoBERTa  0.737  0.830  0.859  0.331  0.382  0.519  0.447
500      AR       XLNet    0.715  0.810  0.828  0.314  0.364  0.506  0.424
500      AR       GPT-2    0.693  0.794  0.790  0.242  0.307  0.508  0.371
Table A1: Comparison of various auto-encoder (AE) and auto-regressive (AR) language models trained on 100 and 500 samples for each task using all the dimensions of transformer embeddings. RoBERTa and BERT show consistent performance.

Appendix D Additional Results

D.1 Results on higher training sample sizes

From Figure A2, we can see that reduction still helps in the majority of tasks at higher training sample sizes. As expected, the performance starts to plateau at higher sample sizes, and this is visibly consistent across most tasks. With the exception of age and gender prediction using Facebook data, all the other tasks benefit from reduction.

D.2 Results on classification tasks

Figure A3 compares the performance of reduced dimensions in the low sample size scenario in classification tasks. Except for a few values in gender prediction using the Facebook data, all the other tasks benefit from reduction in achieving the best performance.

D.3 LM comparison for no reduction & smaller models

Table A1 compares the performance of the language models without applying any dimension reduction to the embeddings. Table A2 compares the performance of the best transformer models with that of smaller models (and a distilled version) after reducing the second-to-last layer representation to 128 dimensions.

Figure A2: Performance recorded for reduced dimensions for all tasks at higher training sample sizes. Reduction continues to help in achieving the best performance in personality and mental-health tasks. The ’fkp’ is observed to shift to a higher value, due to the rise in performance of no reduction and the reduction of standard error.
Figure A3: Comparison of performance in the gen, gen2 and sui tasks for training sample sizes varying between 50 and 1000.
samples  LM name        demographics         personality   mental health
100      BERT           0.533  0.703  0.761  0.163  0.184  0.424  0.360
100      RoBERTa        0.589  0.712  0.761  0.123  0.203  0.455  0.363
100      DistilRoBERTa  0.568  0.640  0.731  0.130  0.207  0.446  0.355
100      ALBERT         0.525  0.689  0.710  0.111  0.218  0.413  0.355
500      BERT           0.686  0.810  0.837  0.278  0.354  0.484  0.466
500      RoBERTa        0.700  0.802  0.852  0.283  0.361  0.490  0.432
500      DistilRoBERTa  0.687  0.796  0.826  0.246  0.346  0.503  0.410
500      ALBERT         0.668  0.792  0.799  0.237  0.337  0.453  0.385
Table A2: Comparison of the best performing auto-encoder models with smaller LMs (like ALBERT (Lan et al., 2019) and DistilRoBERTa (Sanh et al., 2019)) after reduction to 128 dimensions. These results suggest that reducing the larger counterparts produces better results than reducing these smaller LMs’ representations.

D.4 Least dimensions required: higher training sample sizes

The ’fkp’ plateaus as the number of training samples grows, as seen in Table A3.

samples  demographics  personality  mental health
2000     768           90           64
5000     768           181          64
10000    768           181          64
Table A3: First k to peak for each set of tasks: the least value of k that performed statistically equivalent to the best performing setup (peak). The integer shown is the exponential median of the set of tasks.
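One plausible reading of the "exponential median" is a median taken in log space and mapped back to the original scale, truncated to an integer. Under that assumption, values such as 90 and 181 arise naturally when a set contains an even number of tasks whose per-task values are neighbouring powers of two. The per-task values below are hypothetical and chosen only to illustrate the arithmetic.

```python
import math
import statistics

def exponential_median(ks):
    """Median of the values in log space, mapped back to the original scale."""
    return math.exp(statistics.median([math.log(k) for k in ks]))

# Hypothetical per-task first-k-to-peak values for a two-task set.
fkp_two_tasks_a = int(exponential_median([64, 128]))    # sqrt(64 * 128)  ~ 90.5
fkp_two_tasks_b = int(exponential_median([128, 256]))   # sqrt(128 * 256) ~ 181.0
# With an odd number of tasks the result is simply the middle value.
fkp_three_tasks = round(exponential_median([64, 64, 128]))
```

For two values the log-space median is their geometric mean, which is why it falls between the powers of two used as candidate dimensions.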
Figure A4: The absolute error in age prediction for the fine-tuned model is higher than pre-trained models for users with short messages. Fine-tuned models have smaller errors for users with longer messages.

Appendix E Additional Analysis

E.1 Pre-trained vs Fine-Tuned models

We also find that the fine-tuned model does not perform better than the pre-trained model for users with typical message lengths, but it is better at handling longer sequences after being trained on the tasks’ data. This is evident from the graphs in Figure A4.

E.2 PCA vs NMF

From Figure A5, we can see that LIWC variables like ARTICLE, INSIGHT, PERCEPT (perceptual process), and COGPROC (cognitive process) correlate negatively with the difference in absolute error of PCA and NMF. These variables also happen to have a higher correlation with the openness scores (Schwartz et al., 2013). We also see that characteristics typical of an open person, like interest in arts, music, and writing (Kern et al., 2014), appear in the word clouds.

Figure A5: The word clouds of the LIWC variables (left) and the 1-grams (right) having negative correlation with the difference in the absolute error of PCA and NMF in Openness prediction, with Benjamini-Hochberg FDR correction. We can see that LIWC variables and 1-grams more indicative of a person exhibiting more openness are better captured by the PCA model than by NMF.
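The quantity being correlated can be made concrete with a small sketch. It assumes the difference is taken as the PCA error minus the NMF error per user, so that a negative correlation with a feature means PCA has the lower error for users who use that feature more. All numbers are made up for illustration, and the significance testing and FDR correction are omitted.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def abs_error_difference(y_true, pred_a, pred_b):
    """Per-user |error of model A| minus |error of model B|."""
    return [abs(t - a) - abs(t - b) for t, a, b in zip(y_true, pred_a, pred_b)]

# Toy data: relative usage of one language feature and openness scores per user.
feature_usage = [0.0, 0.1, 0.2, 0.3, 0.4]
openness      = [3.0, 3.5, 4.0, 4.5, 5.0]
pred_pca      = [3.2, 3.6, 4.1, 4.4, 5.1]   # error stays small throughout
pred_nmf      = [3.1, 3.3, 3.6, 3.9, 4.2]   # error grows with feature usage

diff = abs_error_difference(openness, pred_pca, pred_nmf)
r = pearson_r(feature_usage, diff)   # strongly negative for this toy data
```

In this toy case the NMF error grows with feature usage while the PCA error does not, so the difference falls as usage rises and the correlation is negative.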

The divergence of the absolute errors of NMF and PCA is seen in the bsag and ope tasks as well. From the graphs in Figure A6, we can see that the sequence length at which this behavior appears is close to the value previously observed in the age and ext tasks.

Figure A6: Comparison of the absolute error of NMF and PCA with the average number of 1-grams per message. We see that the absolute error of NMF models starts diverging at longer text sequences for the bsag and the ope tasks as well.
Figure A7: Terms having negative (left) and positive (right) correlations with the difference in the absolute error of the PCA and pre-trained model in age prediction, with Benjamini-Hochberg FDR correction. The error of the PCA model is lower than that of the pre-trained models when messages contain more slang, affect words, and social media abbreviations.

E.3 PCA vs Pre-trained

PCA models overall perform better than the pre-trained model in the low-sample regime, and from Figure A7 we can see that PCA captures slang, affect, and standard social media abbreviations better than the pre-trained models. The task-specific linear layer is better able to capture social media language with fewer dimensions (reduced by PCA) than from the original 768 features produced by the pre-trained models.