1.1 Data description
A total of 339 participants, of whom 170 were patients with a schizophrenia spectrum disorder, 22 were diagnosed with depression and 147 were healthy controls, were interviewed by a research group of the University Medical Center Utrecht. The interview questions were designed to elicit semi-free speech about general experiences. The interviewers were trained to avoid health-related topics so that the language produced by the participants would be more generalisable, irrespective of diagnosis or absence thereof. The raw, digitally recorded audio from each interview was normalized to an average sound pressure level of 60 dB. The openSMILE audio processing framework [eyben_opensmile_2010] [eyben2015geneva] was used to extract 94 speech parameters for each audio file; a list of these parameters can be found in table 7.1. A subset of each audio file was manually transcribed by trained transcribers according to the CHAT [macwhinney_wagner_2010] transcription format.
1.2 Aim of thesis
Currently, the state of the art for the classification of psychiatric illness is audio-based. This thesis aims to design and evaluate a state-of-the-art text classification network for this challenge. The hypothesis is that a well-designed text-based approach can compete strongly with the state-of-the-art audio-based approaches. Dutch natural language models are limited by the scarcity of pre-trained monolingual NLP models; as a result, they capture long-range semantic dependencies over sentences poorly. To address this issue, this thesis presents belabBERT, a new Dutch language model extending the RoBERTa [liu2019roberta] architecture. belabBERT is trained on a large Dutch corpus (+32GB) of web-crawled texts. After evaluating the strength of text-based classification, this thesis briefly explores an extension of the framework to hybrid text- and audio-based classification. The goal of this hybrid framework is to show the principle of hybridisation with a very basic audio classification network. The overall goal is to lay the foundations for hybrid psychiatric illness classification by showing that the new text-based classification is already a strong stand-alone solution.
2.1 Text analysis
In the field of text analysis there is a wide variety of approaches, ranging from finding characteristic patterns in the syntactic representation of text by tagging parts of speech, to representing words as mathematical objects which together form a semantic space. The latter approach has seen a rapid rise across various linguistic problems. In a meta-analysis of eighteen studies in which semantic space models are used in psychiatry and neurology, the authors [de_boer_clinical_2018] conclude that analyzing full sentences is more effective than analyzing single words. The best performing models used word2vec [mikolov_efficient_2013], which uses word embeddings to represent sequences of words and can be used to analyse text. However, word2vec lacks the ability to analyze full sentences or longer-range dependencies.
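To make the idea of a semantic space concrete, the following minimal sketch compares toy word vectors with cosine similarity, a common closeness measure in such spaces. The three-dimensional vectors and the example words are invented for illustration only; real word2vec embeddings are learned from a corpus and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only.
embeddings = {
    "dokter": [0.9, 0.1, 0.2],
    "arts": [0.8, 0.2, 0.3],
    "fiets": [0.1, 0.9, 0.1],
}

related = cosine_similarity(embeddings["dokter"], embeddings["arts"])
unrelated = cosine_similarity(embeddings["dokter"], embeddings["fiets"])
```

In this toy space the two words for a physician end up closer to each other than to the word for a bicycle, which is the property that semantic space models exploit.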
Current NLP research is dominated by bidirectional transformer models such as BERT [devlin_bert_2019]. Transformer models use word embeddings as input, similar to word2vec; however, they can handle longer input sequences and the relations within these sequences. This ability, combined with the attention mechanism described in the well-known "Attention Is All You Need" paper [vaswani2017attention], enables BERT to find long-range dependencies in text, leading to more robust language models. All top 10 submissions for the GLUE benchmark [wang2018glue] make use of BERT models, which makes a BERT model a natural choice as the text analysis model for our task. Figure 2.1 shows a BERT architecture for sentence classification. The original BERT release included a model pre-trained on a large quantity of multilingual data. However, since Google open-sourced the BERT architecture, a multitude of new models have been made available, including monolingual models constructed for tasks in specific languages [martin2019camembert][Kuratov2019AdaptationOD][virtanen2019multilingual][Antoun2020AraBERTTM]. A comparison of monolingual and multilingual BERT model performance [nozza_what_2020] on various tasks showed that monolingual BERT models outperform multilingual models on every task. Table 2.1 shows a short summary of the evaluation performed by Nozza et al.
|Task||Metric||Avg. Monolingual BERT||Avg. Multilingual BERT||Diff|
|Sentiment Analysis||Accuracy||90.17 %||83.80 %||6.37 %|
|Text Classification||Accuracy||88.96 %||85.22 %||3.75 %|
For the Dutch language, the top performing models are RobBERT [delobelle_robbert_2020] and BERTje [de_vries_bertje_2019]. RobBERT is a BERT model that uses a different set of hyperparameters, as described by Yinhan Liu et al. [liu2019roberta]; this model architecture is dubbed RoBERTa. BERTje is more traditional in the sense that its pretraining hyperparameters follow those described in the original BERT publication. Table 2.2 provides a short overview of these models.
|Model name||Pretrain corpus||Tokenizer type||Acc Sentiment analysis|
|belabBERT||Common Crawl Dutch (non-shuffled)||BytePairEncoding||95.92 %|
|RobBERT||Common Crawl Dutch (shuffled)||BytePairEncoding||94.42 %|
|BERTje||Mixed (Books, Wikipedia, etc)||Wordpiece||93.00 %|
* to be verified
2.2 Audio classification
As highlighted in the introduction, the field of computational audio analysis is well established. Most researchers extract speech parameters from raw audio and base their classification on these. Speech parameters reflect important brain functions such as motor speed (articulation), emotional status (prosody), cognitive functioning (correct use of grammar, vocabulary scope) and social behavior (timbre matching). Pause length and percentage of pauses were found to be highly correlated with psychotic symptoms [cohen_psychiatric_2013]. Marmar et al. identified several Mel-frequency cepstral coefficients (MFCC) which are highly indicative of depression [marmar_speech-based_2019]. The features described in these papers can be quantitatively extracted from speech samples. We assume these features are also indicative for our classification task, as both patient groups are included in our dataset.
3.1 Data preprocessing
Of the 339 interviews, 141 were transcribed: 76 from psychotic, 6 from depressive and 59 from healthy participants. Transcripts were transformed from the CHAT format to flat text. Since we are dealing with privacy-sensitive information, we took measures to mitigate the risk of leaking sensitive information. For audio, we only analyse parameters that were derived from the raw audio and that do not include any content. For the transcripts, we replaced all transcripts with their tokenized versions and only performed calculations on these. In order to create more examples, the full tokenized transcripts were split into chunks of 220 tokens and into chunks of 505 tokens, resulting in two transcript datasets per tokenizer. Table 3.1 shows the number of samples after chunking. The resulting datasets were split into an 80% training set, a 10% validation set and a 10% test set, keeping the ratios among participants of the original dataset.
|Dataset ID||Chunk size||Psychotic||Control||Depressive||Total|
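The chunking and splitting steps can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the thesis: the function names are hypothetical, a trailing remainder shorter than the chunk size is dropped, and the 80/10/10 split is stratified per label over whatever list of samples is passed in.

```python
import random
from collections import defaultdict


def chunk_tokens(token_ids, chunk_size):
    """Split a token-id list into consecutive chunks of exactly chunk_size.

    Assumption: a trailing remainder shorter than chunk_size is dropped.
    """
    n_full = len(token_ids) // chunk_size
    return [token_ids[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, keeping label ratios roughly equal."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Note that a smaller chunk size yields more samples from the same transcript: a transcript of 1000 tokens gives four chunks of 220 tokens but only one chunk of 505 tokens.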
3.2 Text classification
We hypothesize that a language model which is pretrained on data that resembles the data of its fine-tuning task (text classification of transcripts in our case) will perform better than general models. Our dataset consists of interview transcripts, and thus of conversational data. The problem is that RobBERT was pretrained on a shuffled version of the OSCAR web crawl corpus, which limits the range over which RobBERT can find relations between words. RobBERT also uses the RoBERTa base tokenizer, which was trained on an English corpus; we assumed this negatively affects the performance of RobBERT on downstream tasks. The previously referenced meta-analysis [de_boer_clinical_2018] recommends that future research look at models which are able to analyze larger groups of words, sentences to be specific. We therefore decided to train a RoBERTa-based Dutch language model from scratch on the non-shuffled OSCAR corpus [ortiz-suarez-etal-2020-monolingual], which consists of a set of monolingual corpora extracted from Common Crawl snapshots. We also trained a byte pair encoding tokenizer on the same corpus to produce the input tokens for belabBERT, alleviating potential problems in RobBERT regarding both the tokenizer and long-range dependencies. We use the original RoBERTa training parameters.
3.2.2 Fine tuning
In order to fine-tune belabBERT and RobBERT for the classification of text input, we implemented the classifier head as described in the BERT paper; a visualization can be found in figure 3.1. The output layer consists of 3 output neurons, one per class. In order to find the optimal hyperparameter set, we performed several runs with different configurations. The results chapter goes into more depth about the specifics of this process.
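Written out, a classification head of the kind described in the BERT paper is a single linear layer followed by a softmax over the final hidden state of the first input token. The symbols below are notation introduced here for illustration and do not appear elsewhere in this thesis: $h$ is that hidden state, $d$ its dimensionality, $W$ the weight matrix of the output layer, $y$ the one-hot target and $\hat{y}$ the predicted class distribution.

```latex
\begin{align}
  \hat{y} &= \operatorname{softmax}(W h),
    \qquad W \in \mathbb{R}^{3 \times d},\; h \in \mathbb{R}^{d} \\
  \mathcal{L} &= -\sum_{k=1}^{3} y_k \log \hat{y}_k
\end{align}
```

The three rows of $W$ correspond to the three output neurons, and $\mathcal{L}$ is the cross-entropy loss minimised during fine-tuning.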
3.3 Audio analysis
Related work in audio analysis for diagnostic purposes found that impressive results can be achieved using speech parameters only. Our dataset provides a pre-processed set of speech parameters for every audio interview. These were extracted using openSMILE and the eGeMAPS package [eyben_opensmile_2010]. Using this set of features, we use a simple neural network architecture consisting of three layers, the specifics of which can be seen in figure 3.2. The majority of research in this field focuses on more traditional machine learning techniques such as logistic regression or support vector machines. However, these are less resistant to noise in the data and thus require feature engineering before the parameters are processed. A notable weakness of feature engineering is that information is lost: because traditional machine learning techniques cope poorly with the noise that irrelevant features introduce, such features have to be discarded beforehand. Using a neural network enables us to use all extracted speech parameters as input and to learn automatically which features are relevant for each classification.
3.4 Hybrid model
We developed a hybrid model making use of both modalities (text and audio) and compared its performance to the single models. We assume this model improves the accuracy of the classification, since audio characteristics are not embedded in text data; e.g. variations in pitch can be highly indicative of depression [marmar_speech-based_2019], but this parameter is not present in text data. Similarly, coherence of grammar and semantic dependencies are indicative of the mental state of a person but are not found in the audio signal. There are multiple ways to combine models. As this thesis aims to present an initial proof of concept for hybridisation, we stick to a simple "late fusion" architecture with a fully-connected layer that maps the outputs of both models onto 3 outputs. After both models have been trained separately, their weights are frozen and the output layers of the separate models are used to generate inputs for the hybrid model. Figure 3.3 shows an overview of this combined model.
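The forward pass of such a late fusion layer can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this thesis: the function names are hypothetical, the weight matrix is hand-picked rather than learned, and the softmax at the end is an assumption about how the 3 outputs are turned into class probabilities.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(text_out, audio_out, weights, bias):
    """Fuse two 3-value model outputs with one fully connected layer.

    weights is a 3x6 matrix and bias a length-3 vector. In a trained model
    both would be learned while the two base models stay frozen.
    """
    fused_input = list(text_out) + list(audio_out)  # 6 input values
    logits = [
        sum(w * x for w, x in zip(row, fused_input)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical weights: the text output counts twice as much as the audio output.
WEIGHTS = [
    [2.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 0.0, 0.0, 1.0],
]
BIAS = [0.0, 0.0, 0.0]

probs = late_fusion([0.6, 0.3, 0.1], [0.2, 0.7, 0.1], WEIGHTS, BIAS)
```

With these example weights the fused prediction follows the text model when the two models disagree, which mirrors the design choice of treating the text classifier as the stronger stand-alone component.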
4.1 Experimental setup
All experiments were run on a high performance computing cluster. The language model belabBERT was trained on 16 Nvidia Titan RTX GPUs (24GB each) for a total of 60 hours. All other tasks were run on a single node containing 4 GPUs of the same specifications.
4.1.1 Pretraining corpus
For the pretraining of belabBERT we used the OSCAR corpus [ortiz-suarez-etal-2020-monolingual], which consists of a set of monolingual corpora extracted from Common Crawl snapshots. For this thesis a non-shuffled version of the Dutch corpus was made available, which consists of 41GB of raw text. This is in contrast with the corpus used for RobBERT, which is the shuffled and pre-cleaned version. By using a non-shuffled version, the sentence order of the corpus is preserved. We expect this property to enable belabBERT to learn long-range syntactic dependencies. On top of that, we performed a sequence of common preprocessing steps in order to better match the source of our interview transcript data. These preprocessing steps included fuzzy deduplication (i.e. removing lines with more than 90% overlap with other lines), removing non-textual data such as "https://" and excluding lines longer than 2000 words. This resulted in a total of 32GB of clean text, of which 10% was held out as a validation set to accurately measure overfitting.
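The preprocessing steps listed above can be sketched with the Python standard library. The sketch rests on several assumptions that the text does not specify: overlap is measured with the similarity ratio of `difflib.SequenceMatcher`, URLs are stripped from a line rather than the whole line being dropped, and every line is compared against all previously kept lines. The last choice is quadratic in the number of lines, so a corpus of this size would need a scalable alternative.

```python
import re
from difflib import SequenceMatcher

URL_PATTERN = re.compile(r"https?://\S+")


def clean_corpus(lines, max_words=2000, overlap_threshold=0.9):
    """Strip URLs, drop over-long lines, and drop near-duplicate lines.

    Assumption: two lines count as duplicates when their SequenceMatcher
    ratio exceeds overlap_threshold.
    """
    kept = []
    for line in lines:
        # Remove URLs and collapse the whitespace they leave behind.
        text = " ".join(URL_PATTERN.sub("", line).split())
        if not text or len(text.split()) > max_words:
            continue
        is_duplicate = any(
            SequenceMatcher(None, text, earlier).ratio() > overlap_threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept
```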
The language model belabBERT was created using Hugging Face’s Transformers library [Wolf2019HuggingFacesTS], a Python library which provides much of the boilerplate code for building BERT models. belabBERT uses the RoBERTa architecture [liu2019roberta]; unless otherwise specified, all parameters for the training of this model were kept at their defaults. The model and the code used are publicly available under an MIT open-source license on GitHub.
All other models used in this thesis (text classifier, audio classifier and hybrid classifier) were developed in Python using the PyTorch Lightning [falcon2019pytorch] framework. Hyperparameter optimization was performed using the Weights & Biases Sweeps system [wandb]. This process involves generating a large set of configurations based on pre-defined default parameter values and training the model with each of them. We picked the model with the lowest cross-entropy loss on the held-out validation set, assuming this model generalises best.
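The selection criterion can be illustrated with a small stand-alone sketch. It does not use the Weights & Biases API; the run names, learning rates and loss values are invented, and the helper functions are hypothetical.

```python
import math


def cross_entropy(probabilities, target_index):
    """Cross-entropy loss for one sample given predicted class probabilities."""
    return -math.log(probabilities[target_index])


def mean_validation_loss(predictions, targets):
    """Average cross-entropy over a validation set."""
    losses = [cross_entropy(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


def select_best_run(runs):
    """Return the run with the lowest validation loss."""
    return min(runs, key=lambda run: run["val_loss"])


# Hypothetical sweep results; the values are invented for illustration.
runs = [
    {"name": "run-a", "learning_rate": 1e-5, "val_loss": 0.91},
    {"name": "run-b", "learning_rate": 2e-5, "val_loss": 0.78},
    {"name": "run-c", "learning_rate": 3e-5, "val_loss": 0.84},
]
best = select_best_run(runs)
```

Selecting on validation loss rather than validation accuracy also rewards well-calibrated probabilities, since a confident wrong prediction is penalised more heavily than an uncertain one.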
4.2 Training configurations
To measure the effect of chunk size, we ran two separate analyses for each base model (belabBERT and RobBERT), one with a chunk size of 220 and one with a chunk size of 505. belabBERT uses a Dutch BPE tokenizer, which makes tokenization of our dataset more efficient than with the tokenizer that RobBERT inherits from RoBERTa. As a consequence, belabBERT produces fewer tokens for a Dutch text than RobBERT, which explains the skewed sizes of the training sets. Our default hyperparameters follow the GLUE fine-tuning parameters used in the original RoBERTa paper [liu2019roberta]. Subsection 4.2.3 shows the training configuration which was used for the hybrid model. This involves two neural networks which were trained separately: the first takes audio features as input, and the second is the fusion layer, which bases its output classification on 6 tensorized input values. In order to find the optimal set of hyperparameters, we trained each model 15 times. We show the parameter set of the run that reached the lowest cross-entropy validation loss. The results are presented in chapter 5.
We train belabBERT with the two chunk sizes, 505 and 220. We expect belabBERT to outperform RobBERT due to the nature of its pretraining corpus and its custom Dutch tokenizer.
chunk size 505
|Set||Psychotic||Depressed||Healthy||% Of total|
|Peak learning rate|
chunk size 220
|Set||Psychotic||Control||Depressed||% Of total|
|Peak learning rate|
In order to evaluate the performance of belabBERT, we compare it against the current Dutch state-of-the-art model RobBERT. The results of these experiments help us to better contextualize the results achieved by belabBERT.
chunk size 505
|Set||Psychotic||Control||Depressed||% of total|
|Peak learning rate|
chunk size 220
|Set||Psychotic||Control||Depressed||% Of total|
|Peak learning rate|
4.2.3 Extending to a hybrid model
The hybrid model builds on a separately trained audio classification network. In order to maximize the number of training samples available for the fusion, we trained the audio classifier on samples for which no transcript was available. The held-out test set of our audio classifier consists of all samples for which a transcript did exist. This ensures there is no overlap between the training data of the audio classifier and that of the text classifier.
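The partition described here can be sketched as a simple set operation. The integer participant IDs and the function name are hypothetical; only the counts of 339 interviews and 141 transcribed interviews come from the data description.

```python
def split_by_transcript(participant_ids, transcribed_ids):
    """Partition participants by whether a transcript exists for them."""
    transcribed = set(transcribed_ids)
    audio_train = [p for p in participant_ids if p not in transcribed]
    audio_test = [p for p in participant_ids if p in transcribed]
    return audio_train, audio_test


# Hypothetical IDs; which participants were transcribed is invented here.
all_ids = list(range(339))
transcribed_ids = all_ids[:141]
audio_train, audio_test = split_by_transcript(all_ids, transcribed_ids)
```

Under these counts, 198 interviews without a transcript remain for training and validating the audio classifier, and the 141 transcribed interviews form its held-out test set.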
The audio classification network uses categorical cross-entropy loss and Adam optimization [kingma2014adam] with and . Due to the inherently noisy nature of an audio signal and its extracted features, we use a default dropout rate of . The learning rate boundaries were found by performing an initial training run during which the learning rate increases linearly each epoch, as described by L. Smith [smith2017cyclical]. We picked the median learning rate of these bounds as our default learning rate.
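The learning rate procedure can be sketched as follows. The function names are hypothetical and no actual bound values are given, since they are not stated in the text. The sketch only covers the linearly increasing schedule and the choice of the midpoint; reading the two bounds off the resulting loss curve is left out.

```python
def linear_lr_schedule(lr_min, lr_max, n_epochs):
    """Learning rates rising linearly from lr_min to lr_max over n_epochs."""
    if n_epochs < 2:
        return [lr_min]
    step = (lr_max - lr_min) / (n_epochs - 1)
    return [lr_min + step * epoch for epoch in range(n_epochs)]


def midpoint_learning_rate(lower_bound, upper_bound):
    """Median of the two bounds, which is simply their midpoint."""
    return (lower_bound + upper_bound) / 2
```

For example, with illustrative bounds of 1e-4 and 1e-3, `midpoint_learning_rate` gives a default learning rate of 5.5e-4.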
|Set||Psychotic||Control||Depressed||% Of total|
We trained the hybrid classifier on the dataset of our best performing text classification network. It is important to remember that, due to the chunking of this dataset, multiple samples stem from a single patient, which is discussed in chapter 5. This explains the difference in the total number of samples between the audio classification and the hybrid classification. The train/validate/test dataset used for the hybrid classifier is shown in Table 4.3.
5.1 belabBERT and RobBERT
Table 5.1 shows that both experiments with belabBERT as base model outperform the current Dutch state-of-the-art model RobBERT. The top performing model uses a chunk size of 220 and achieves a classification accuracy of 75.68% on the test set and 71.18% on the validation set. The top performing model with RobBERT as base also uses a chunk size of 220 and reaches a classification accuracy of 69.06% on the test set and 69.64% on the validation set.
|Experiment||Validation accuracy||Test accuracy|
The results shown in table 5.1 support our initial hypothesis: belabBERT does indeed appear to benefit from its ability to capture long-range semantic dependencies. In both the 505 and the 220 chunk size experiments, belabBERT outperforms the current state-of-the-art language model RobBERT. belabBERT 220 has a limited recall for the depression label, but its precision is higher than expected.
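The difference between accuracy, precision and recall for a single label can be made concrete with a short sketch. The label lists below are invented toy data, not results from this thesis, and the function names are hypothetical.

```python
def per_label_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall)} computed from two label lists."""
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = (precision, recall)
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy data for illustration only.
y_true = ["psychotic"] * 4 + ["control"] * 4 + ["depressed"] * 2
y_pred = [
    "psychotic", "psychotic", "psychotic", "control",
    "control", "control", "control", "psychotic",
    "depressed", "psychotic",
]
metrics = per_label_metrics(y_true, y_pred, ["psychotic", "control", "depressed"])
```

In this toy example the depressed label is predicted only once and correctly, giving a precision of 1.0, while one of the two depressed samples is missed, giving a recall of 0.5. This is the same pattern of limited recall and high precision on a rare label.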
5.2 Extending to a hybrid model
In this section we present the results of the audio classification network and of the extension towards the hybrid classification network, which uses the best performing text classification network.
Table 5.3 shows that the audio classification network reached a classification accuracy of 65.96% on the test set and 80.05% on the validation set. Due to the small size of the validation set, we should not consider the latter result significant. We also observe in 5.2 that the network was not able to distinguish samples with the depressed label from the other labels based on its inputs.
|Validation accuracy||Test accuracy|
* validation set size was very small
Table 5.5 shows the classification accuracies for the hybrid classification network: it reaches an accuracy of 77.70% on the test set and 70.47% on the validation set.
|Validation accuracy||Test accuracy|
From our observations of the audio classification network, we can conclude that it does not perform well across all labels, although it performs relatively well on the healthy category. The extension towards the hybrid model, in which the classification is based on both text and audio, does however result in an improved classification accuracy.
From the results in table 5.1 we can conclude that our self-trained model belabBERT reaches a higher classification accuracy on the test set than the best performing RobBERT model. Furthermore, we observe that a smaller chunk size of 220 tokens leads to a significant accuracy gain for both base models. The small difference between the validation and test set accuracies shown in table 5.1 is a positive indicator that the classification accuracy is significant and representative of the capability of the model to categorize the given text samples. From the difference in classification accuracy between belabBERT and RobBERT we conclude that a BERT model using a specialized Dutch tokenizer and a pretraining corpus which resembles conversational data provides significant benefits on downstream classification tasks. On top of that, we conclude that using a smaller chunk size has a positive effect on the classification accuracy. Our brief exploration into the hybridisation of belabBERT with a very basic audio classification network has pushed its test set accuracy from 75.68% to 77.70%. From our observations of the classification metrics shown in table 5.6, the addition of an audio classification network next to the strong stand-alone text classification model leads to an overall better precision for all labels, on top of the higher classification accuracy. However, the lack of ’depressed’ samples in our dataset prevents us from drawing definitive conclusions about the relevance of our findings in this category.
In this thesis, we presented a strong text classification model which challenges the current state-of-the-art audio classification networks used for the classification of psychiatric illness. We introduced a new model, belabBERT, and showed that this language model, which is trained to capture long-range semantic dependencies over sentences in a Dutch corpus, outperforms the current state-of-the-art RobBERT model, as seen in table 5.1. We hypothesized that we could increase the size of our dataset by splitting the samples into chunks of a fixed length without losing classification accuracy; our results in table 5.1 support this approach. On top of that, we explored the possibilities for a hybrid network which uses both text and audio data as input for the classification of patients as psychotic, depressed or "healthy". Our results in section 5.2.1 indicate that this approach is able to improve the accuracy and precision of a stand-alone text classification network. Based on these observations, we can confirm our main hypothesis that a well-designed text-based approach can compete strongly with the state-of-the-art audio-based approaches for the classification of psychiatric illness.
6.2 Future work
This section discusses future work on enhancing belabBERT, enhancing the text-based classification of psychiatric illness, possible extensions of the proposed hybrid framework, and the interpretation and rationalisation of the text classification network.
Compared to BERT models of the same size, belabBERT still appears to be under-trained: the version used in this thesis has only seen 60% of the training data. Training belabBERT further could possibly increase its performance on all tasks.
In our text classification we already applied a chunking technique in order to generate more examples from a single interview sample. We observed that prediction accuracy increased when we decreased the chunk size. This raises the question of how even smaller chunk sizes affect the prediction accuracy. If smaller chunk sizes can be used, the number of training examples increases, which could make the model more robust.
While the hybrid model explored in this thesis uses pre-extracted audio parameters as input for a neural network, it would be interesting to apply newer audio analysis techniques, such as using raw audio as input for a neural network. The approach would be similar to speech recognition architectures [xiong2016achieving]; a major advantage would be that these architectures can find patterns over time, which makes it possible to discover new relations between input features. The hybrid model could also use other data sources, such as video, to generate a classification, which could possibly increase classification accuracy even more.
The interpretation and rationalisation of the predictions of neural networks is key to providing clinical relevance, not only in the practical domain of psychiatry but also for the theoretical understanding of the disorder and its symptoms. Transformer models like BERT are easily visualisable [coenen2019visualizing]; an extensive interpretation toolkit could provide researchers with better tools to discover new patterns in language that are highly indicative of a certain classification prediction, in turn leading to a greater understanding of the disorders.