Semantic Characteristics of Schizophrenic Speech

04/16/2019 ∙ by Kfir Bar, et al. ∙ Tel Aviv University ∙ College of Management

Natural language processing tools are used to automatically detect disturbances in the transcribed speech of Hebrew-speaking schizophrenia inpatients. We measure topic mutation over time and show that controls maintain more cohesive speech than inpatients. We also examine differences in how inpatients and controls use adjectives and adverbs to describe content words, and show that the modifiers used by controls are more common than those used by inpatients. We provide experimental results and show their potential for automatically detecting schizophrenia in patients solely by means of their speech patterns.




1 Introduction

Thought disorders are described as disturbances in the normal way of thinking. Bleuler (1991) originally considered thought disorders to be a speech impairment in schizophrenia patients, but nowadays there is agreement that thought disorders are also relevant to other clinical disorders, including pediatric neurobehavioral disorders like attention deficit hyperactivity disorder and high-functioning autism. They can even occur in normal populations, especially in people who have a high level of creativity. Bleuler focused mostly on “loosening of associations”, or derailment, a thought disorder characterized by the use of unrelated concepts in a conversation; in other words, a conversation lacking coherence. The Diagnostic and Statistical Manual of Mental Disorders (DSM-5) (American Psychiatric Association, 2013) lists disorganized speech as one of the criteria for diagnosing schizophrenia. Morice and Ingram (1982) showed that schizophrenics’ speech is built upon a different syntactic structure than that of normal controls, and that this difference increases over time. Andreasen (1979) suggested several definitions of linguistic and cognitive behaviors frequently observed in patients, which may be useful for thought-disorder evaluation. Among the definitions presented in that report are the following, which we address in this study:

Incoherence, also known as “word salad”, refers to speech that is incomprehensible at times due to multiple grammatical and semantic inaccuracies. In this paper, we focus mostly on the semantic inaccuracies, leaving grammatical issues for future investigation.

Derailment, also known as “loose associations”, happens when a speaker shifts among topics that are only remotely related, or are completely unrelated, to the previous ones.

Tangentiality occurs when an irrelevant, or just barely relevant, answer is provided for a given question.

We focus here on derailment; tangentiality, a closely related notion, has been addressed in several other studies.

One of the main data sources for diagnosing mental disorders is speech, typically collected during a psychiatric interview. Identifying signals that indicate the presence of thought disorders is often challenging and subjective, especially in patients who are not undergoing a psychotic episode at the time of the interview.

In this work, we focus on schizophrenia. We investigate a number of semantic characteristics of transcribed human speech, and propose a way to use them to measure disorganized speech. Natural-language processing software is used to automatically detect those characteristics, and we suggest a way of aggregating them in a meaningful way. We use transcribed interviews, collected from Hebrew-speaking schizophrenia inpatients at a mental health hospital and from a control group. About two thirds of the patients were identified as in schizophrenia remission at the time of the interview.

Following a few previous works (Iter et al., 2018; Bedi et al., 2015), we measure Andreasen’s derailment by calculating average semantic similarity between consecutive chunks of a running text to track topical mutations, and show the difference between patients and controls. For incoherence, we look at word modifiers, focusing on adjectives and adverbs, that subjects use to describe the same objects, and then learn the difference between the two groups. As a final step, we use those semantic characteristics in a classification setting and argue for their usability.

This work makes the following contributions:

  • We measure derailment in speech using word semantics, similar to Bedi et al. (2015), this time for Hebrew.

  • We explore a novel way of measuring one aspect of speech incoherence, by measuring how similar modifiers (adjectives and adverbs) are to ones used in a reference text to describe the same words.

  • Using these measures, we build a classifier for detecting schizophrenia on the basis of recorded interviews, which achieves 81.5% accuracy.

We proceed as follows: The next section reviews relevant previous work. In Section 3, we describe how we collected the data. Our main contributions are described in Section 4, followed by our conclusions in the final section.

2 Related Work

There is a large body of work that examines human-generated texts with the aim of learning how people who suffer from various mental-health disorders use language in different settings. For example, Al-Mosaiwi and Johnstone (2018) conducted a study in which they analyzed 63 web forums, some related to mental-health disorders and others used as controls. They ran their analysis with the well-known Linguistic Inquiry and Word Count tool (Pennebaker et al., 2015) to find absolutist words in free text. Overall, they discovered that anxiety, depression, and suicidal-ideation forums contained more absolutist words than control forums.

Recently, social media have become a vital source for learning about how people who suffer from mental-health disorders use language. Several studies collect relevant users from Twitter by considering users who intentionally write about their diagnosed mental-health disorders. For example, De Choudhury et al. (2013) and Tsugawa et al. (2015) study language characteristics of Twitter users who claim to suffer from clinical depression. Similarly, users who suffer from post-traumatic stress disorder are addressed in Coppersmith et al. (2014). Mitchell et al. (2015) analyze tweets posted by schizophrenics, and Coppersmith et al. (2016) investigate the language and emotions expressed by users who have previously attempted suicide. Coppersmith et al. (2015) work with users who suffer from a broad range of mental-health conditions and explore language differences between groups. Most of these works found a significant difference in the usage of some linguistic characteristics by the experimental group when compared to a control group. Furthermore, different levels of these linguistic characteristics are used as features for training a classifier to detect mental-health disorders prior to the report date.

Reddit has also been identified as a convenient source for collecting data for this goal. Losada and Crestani (2016) outline a methodology for collecting posts and comments of Reddit and Twitter users who suffer from depression. Similarly, a large dataset of Reddit users with depression, manually verified (by lay annotators for an explicit claim of diagnosis), has been released for public use (Yates et al., 2017). In that work, the authors employ a deep neural network on the raw text for detecting clinically depressed people ahead of time, achieving a 65% F1 score on an evaluation set.

A few caveats are in order when using social media for analyzing mental-health conditions. First, self-reporting of a mental-health disorder is not a popular course of action; the experimental group is thus drawn from a subgroup of the relevant population. Second, the controls, typically collected randomly “from the wild”, are not guaranteed to be free of mental-health disorders. Finally, social media posts are considered a different form of communication than ordinary speech. For all these reasons, in this work we use validated experimental and control groups in an interview setting.

Measuring various aspects of incoherence in schizophrenics using computational tools has been previously addressed in (Elvevåg et al., 2007; Bedi et al., 2015; Iter et al., 2018). Elvevåg et al. (2007) analyzed transcribed interviews of inpatients with schizophrenia to measure tangentiality. Moving along the patient’s response, they calculated the semantic similarity between text chunks of different sizes and the question that was asked by the interviewer. Semantic similarity was computed as the cosine similarity over latent semantic analysis (LSA) (Deerwester et al., 1990) vectors calculated for each word and summed across an entire chunk of words. They fitted a linear-regression line to represent the trend of the cosine-similarity values as one moves along the text. The slope of that line was used to measure how quickly the topic diverges from the original question. Overall, they were able to show a significant correlation between those values and a blind human evaluation of the same responses. Furthermore, as chunk size grows larger, the distinction between patients and controls becomes less prominent. One explanation could be the large number of functional and filler words, for which we typically do not have a good semantic representation. Iter et al. (2018) addressed this issue by cleaning the patients’ responses of all those words and expressions (e.g., uh, um, you know) prior to calculating the semantic scores. This gave a slight improvement, although measured over a relatively small set of participants. Instead of working with chunks of text, they worked with full sentences, and replaced LSA with modern techniques for sentence embeddings. Likewise, in our work, we use word embeddings instead of LSA.

Bedi et al. (2015) define coherence as an aggregation of the cosine similarity between pairs of consecutive sentences, each represented by the element-wise average of the individual words’ LSA vectors. They worked with a group of 34 youths at clinical high risk for psychosis, interviewed them quarterly for two and a half years, and transcribed their answers. Five of the 34 transitioned to psychosis. They used coherence scores, along with part-of-speech information, to automatically predict transition to psychosis with 100% accuracy.

The goal of all these works, including ours, is to detect disorganized speech automatically, in a more objective and reliable way. Inspired by the last three studies described above, we analyzed transcribed responses to 18 open questions given by inpatients with schizophrenia and by controls. Instead of using a dictionary to clean the text of filler words, as proposed by Iter et al. (2018), we take a deeper look into the syntactic roles the words play, and calculate semantic similarity over a filtered version of the text, each time using different sets of part-of-speech categories. We report on the results of two sets of experiments: (1) We measure derailment by calculating the semantic similarity of adjacent words of various part-of-speech categories. (2) We measure semantic coherence by looking at the choices of modifiers (adjectives, adverbs) used in responses by inpatients and controls, as compared to those used in ordinary discourse.

Generally speaking, not much is known about the role played by adjectives and adverbs in thought disorders. Modifiers are often not included in language tests, as they usually need to be presented together with the noun or verb they modify. Some previous work (Obrębska and Obrębski, 2007) has reported a significantly smaller number of adjectives used by schizophrenics. In the current study, we use computational tools to investigate the semantic relation between modifiers and the objects they describe, and its relation to speech incoherence.

3 Data Collection

We interviewed 51 men, aged 19–63, divided into control and patient groups, all speaking Hebrew as their mother tongue. The patient group comprised 24 inpatients at the Beer Yaakov Mental Health Center in Israel who were officially diagnosed with schizophrenia. The control group comprised 27 people, mainly recruited via an advertisement that we placed on social media. Most of the participants are single, with average-to-lower monthly income. Demographics for the two groups are presented in Table 1.

                 Control       Patients
N                27            24
Age, Mean (SD)   30.3 (8.26)   38.3 (10.43)
Edu., HS         68%           75%
Edu., Post HS    20%            4%
Loc., South      40%           20%
Loc., Center     44%           33%
M.S., Single     80%           95%
Income, Avg/low  84%           83%
Table 1: Demographics by group. Edu. = Education (HS = High School); Loc. = Location in Israel; M.S. = Marital Status.

Ethics statement: The institutional review board of the College of Management Academic Studies of Rishon LeZion, as well as of the Beer Yaakov–Ness Ziona Mental Health Center, approved these experiments, and informed consent was obtained for all subjects.

3.1 Interviews

Overall, the participants were asked 18 questions: 14 asked them to describe thematic-apperception-test (TAT) pictures, followed by 4 questions that required the participant to share some personal thoughts and emotions. Both the control and patient groups completed a demographic questionnaire. To monitor the mental-health condition of the control group, its members were requested to complete Beck’s Depression Inventory-II (BDI-II) and the State-Trait Anxiety Inventory (STAI). The patient group also completed the BDI-II, as well as a Hebrew translation (Katz et al., 2012) of the Positive and Negative Syndrome Scale-6 (PANSS-6, a shorter version of PANSS-30) questionnaire, in order to assess symptoms of psychosis (Østergaard et al., 2016). Scores for the two questionnaires were found to be highly correlated. Of the patient group, 66.7% were assigned a score below 14, a recommended preliminary threshold indicating schizophrenia remission.

The interviews were recorded and then manually transcribed by Hebrew-speaking students from our lab. The TAT pictures presented to participants during the interview were: 1, 2, 3BM, 4, 5, 6BM, 7GF, 8BM, 9BM, 12M, 13MF, 13B, 14, 3GF. Table 2 lists the questions that were presented to the participants during the interview. All the transcripts are written in Hebrew. Figure 1 shows average word counts by question, per group. Clearly, the patients spoke fewer words than the controls. The difference becomes less significant for the open-ended questions.

ID Question
1 Tell me as much as you can about your bar mitzvah.
2 What do you like to do, mostly?
3 What are the things that annoy you the most?
4 What would you like to do in the future?
Table 2: Four open questions asked during the interview.
Figure 1: Word counts per question.

3.2 Preprocessing

Hebrew being a highly-inflected language, we preprocessed the texts with the Ben-Gurion University Morphological Tagger Adler (2007), a context-sensitive morphological analyzer for Modern Hebrew. Given a running text, the tagger breaks the text into words and provides morphological information for every word, including the disambiguated part-of-speech tag and lemma. There were no specific instructions given to the transcribers for how to punctuate, which led to an inconsistency in the way punctuation was used in the transcriptions. We used the tags to clean up all punctuation marks by removing all tokens tagged as such.
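As a small illustration, the punctuation cleanup might look like this, assuming the tagger's output has been loaded as (token, POS tag, lemma) tuples; the tuple layout and the "PUNCT" tag name are our assumptions, not the BGU tagger's documented format.

```python
def strip_punctuation(tagged_tokens, punct_tag="PUNCT"):
    """Drop every token that the morphological tagger labeled as punctuation."""
    return [(token, pos, lemma)
            for token, pos, lemma in tagged_tokens
            if pos != punct_tag]

# Hypothetical tagger output for a short transcribed utterance.
tagged = [("hello", "INTJ", "hello"), (",", "PUNCT", ","),
          ("world", "NOUN", "world"), (".", "PUNCT", ".")]
print(strip_punctuation(tagged))
# [('hello', 'INTJ', 'hello'), ('world', 'NOUN', 'world')]
```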

4 Tools and Method

We report on two sets of experiments. In the first, we measure derailment by calculating the semantic similarity between adjacent words in running text. In the second set of experiments, we investigate the modifiers that the two groups use to describe specific nouns and verbs. As a final step, we measure the contribution of the semantic characteristics that we compute in the experiments, for automatic classification of schizophrenia.

4.1 Experiment 1: Measuring Derailment

We calculate, for each response, a single score that quantifies derailment.

Tools: To measure derailment, we calculate the semantic similarity of adjacent words in the answers provided by the participants during the interview. We use word embeddings to represent each word by a mathematical vector that captures its meaning. These vectors were created automatically by characterizing words by the surrounding contexts in which they are mentioned in a large corpus of documents. Specifically, we used the Hebrew pretrained vectors provided by fastText (Grave et al., 2018), which were created from Wikipedia, as well as from other content extracted from the web with Common Crawl. Overall, 97% of the words in our corpus exist in fastText. Hebrew words are inflected for person, number, and gender; prefixes and suffixes are added to indicate definiteness, conjunction, prepositions, and possessive forms. fastText, however, was trained on surface forms. Therefore, we work on the surface-form level. To measure semantic similarity between two words, we use the common cosine-similarity function, which calculates the cosine of the angle between the two corresponding vectors. The score ranges from −1 to 1, with 1 representing maximal similarity.
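As a concrete illustration, cosine similarity over dense vectors can be computed as follows (a minimal NumPy sketch; the fastText vectors themselves are assumed to have been loaded separately):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: 1 means maximal
    similarity, 0 means unrelated (orthogonal), -1 means opposite."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a
print(cosine_similarity(a, b))  # ~1.0
print(cosine_similarity(a, c))  # 0.0
```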

Method: (1) For each sufficiently long response, we retrieve the fastText vector of every word in the response. (2) For each word, we calculate the average pairwise cosine similarity between the word and the k words that follow it. The window width k is a parameter; we experimented with different values. (3) We take the average of all the individual cosine-similarity scores to form a single score for each response.
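A minimal sketch of this procedure, assuming the response's fastText vectors are already available as a list of NumPy arrays (the function and variable names are ours, not the authors'):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def derailment_score(vectors, k):
    """Steps (2) and (3): for each word, average its cosine similarity
    with the k words that follow it, then average over the response."""
    per_word = []
    for i, v in enumerate(vectors):
        window = vectors[i + 1 : i + 1 + k]
        if window:  # the final words have fewer than k followers
            per_word.append(np.mean([cosine(v, w) for w in window]))
    return float(np.mean(per_word))

# Toy example: three "word vectors"; with k=1 the score averages
# cos(v1, v2) = 1.0 and cos(v2, v3) = 0.0.
vs = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(derailment_score(vs, k=1))  # 0.5
```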

In this experiment, we consider only responses that are long enough to allow topic mutation to develop. Therefore, we use only the four questions from Table 2 for which the participants provided a relatively long response. Accordingly, we drop responses of fewer than 50 words. As mentioned above, we consider that the existence of some word types, like fillers and functional words, might introduce some noise, which might harm the calculation process. We would rather focus on words that convey real content. Therefore, we calculate scores separately using all words and using only content words, which we take to be nouns, verbs, adjectives, and adverbs. We detected a few types of text repetitions, which may bias the derailment score. One type is when a word is said twice or more for emphasis; for example, “quickly, quickly” (מהר מהר) (i.e. very quickly). To mitigate this bias, we keep only one word out of a pair of consecutive identical words. Another type is when a whole phrase is repeated; for example, “She’s in a big hurry; she’s in a big hurry” (היא ממהרת מאוד, היא ממהרת מאוד). Handling this problem is left for future work.
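The single-word repetition cleanup described above reduces to collapsing runs of consecutive identical tokens, for example:

```python
def collapse_repeats(words):
    """Keep only one word out of each run of consecutive identical words,
    so that emphasis repetitions ("quickly, quickly") count once."""
    out = []
    for w in words:
        if not out or out[-1] != w:
            out.append(w)
    return out

print(collapse_repeats(["she", "ran", "quickly", "quickly", "home"]))
# ['she', 'ran', 'quickly', 'home']
```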

We calculate derailment scores for the responses provided by all participants and compare the means of the two groups.

Results: When using all words, we could not detect a significant difference between patients and controls. However, when using content words only, patients scored lower on derailment than controls, for all window widths k, suggesting that focusing only on content words is the more robust approach for calculating derailment. This finding is consistent with previous work (Iter et al., 2018). Overall, coherence decreases as k increases. Table 3 summarizes the results. To confirm that the significance we see in the results is due to the diagnosis and not to other characteristics of the participants, we aggregated the same scores across the different age groups and education levels, regardless of diagnosis status; none of those results appeared to be significant. Figure 2 shows the trend of the average derailment score from Table 3 for different values of k. The left plot was produced for all word types, and the right plot using only content words. We clearly observe a slight increase of the entire control curve and a slight decrease of the patients’ curve when restricting to content words.

     All words                               Content words
 k   Control         Patients        t       Control         Patients        t
 1   0.270 (0.014)   0.257 (0.025)   2.004*  0.265 (0.019)   0.240 (0.020)   2.968*
 2   0.246 (0.017)   0.239 (0.025)   1.173   0.256 (0.018)   0.231 (0.025)   2.687*
 3   0.237 (0.017)   0.233 (0.025)   0.476   0.250 (0.018)   0.225 (0.026)   2.614*
 4   0.233 (0.018)   0.229 (0.025)   0.471   0.245 (0.018)   0.221 (0.026)   2.539*
 5   0.230 (0.017)   0.226 (0.026)   0.528   0.241 (0.018)   0.218 (0.023)   2.598*
Table 3: Results for Experiment 1, comparing average derailment scores of patients and controls for each window width k. Standard deviations are given in parentheses; the t columns hold the group-comparison test statistic, with * marking a significant difference.

Figure 2: Derailment scores for different values of the window width k. The left plot shows the results for all word types, and the right plot shows the results for content words only.

4.2 Experiment 2: Incoherence

In this experiment, we examine the way patients use adjectives and adverbs (hereafter, modifiers) to describe specific nouns and verbs, respectively. Our goal is to measure the difference between modifiers used by patients and the ones used by controls, when describing the same nouns and verbs. We suggest this as a tool for measuring incoherence in speech. For example, inspecting the responses for the first TAT image, we learn that patients typically use the adjectives “new” (חדש) and “good” (טוב) to modify the noun “violin” (כינור), while controls use the adjectives “old” (ישן), “sad” (עצוב), and “significant” (משמעותי).

Tools: To detect all noun-adjective and verb-adverb pairs in the responses, we use a dependency parser, which analyzes the grammatical structure of a sentence and builds links between “head” words and their modifiers. Specifically, we use YAP (More and Tsarfaty, 2016), a dependency parser for Modern Hebrew, and process each sentence individually. Among other things, YAP provides a word-dependency list, shaped as a list of tuples, each comprising a head word, a dependent word, and the dependency type. We use the relevant types (e.g., advmod, amod) to find all noun-adjective and verb-adverb pairs. For example, Figure 3 shows the dependencies returned by YAP for the input sentence “I ate a tasty candy” (אכלתי סוכריה טעימה). From this sentence we extract the noun “candy” (סוכריה), which is modified by the adjective “tasty” (טעימה).
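The pair-extraction step can be sketched as follows. The (head, dependent, relation) tuple layout is our assumption; adapting it to YAP's actual output format is left to the reader. The relation names amod (adjective modifying a noun) and advmod (adverb modifying a verb) are the ones named in the text.

```python
def extract_modifier_pairs(dependencies):
    """Collect (head, modifier) pairs from a word-dependency list.
    `dependencies` is assumed to be a list of (head, dependent, relation)
    tuples. amod links an adjective to its noun; advmod links an adverb
    to its verb; all other relations are ignored."""
    pairs = {"amod": [], "advmod": []}
    for head, dependent, relation in dependencies:
        if relation in pairs:
            pairs[relation].append((head, dependent))
    return pairs

# Hypothetical dependencies for "I ate a tasty candy".
deps = [("ate", "I", "nsubj"), ("ate", "candy", "obj"),
        ("candy", "tasty", "amod")]
print(extract_modifier_pairs(deps))
# {'amod': [('candy', 'tasty')], 'advmod': []}
```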

[Figure 3 diagram: dependency arcs link “ate” → “candy” (obj) and “candy” → “tasty” (amod).]

Figure 3: The dependencies returned by YAP for the sentence “(I ate) (a tasty) candy”. The parentheses delimit the translations for each of the three Hebrew words in the sentence.
Corpus          Description                                                                            # Documents   # Words
Doctors         Articles from the Doctors medical website                                                      239   187,938
Infomed         Question-and-answer discussions from the Infomed medical forum, Jan. 2006 – Sep. 2007          749   128,090
To Be Healthy   Articles and forum discussions from the To Be Healthy (L’Hiyot Bari, 2b-bari) website          137   112,839
HaAretz         News and articles from the HaAretz news website, 1991                                        4,920   250,399
Table 4: The external Hebrew corpora used to collect the modifiers typically used with nouns and verbs.

Method: To measure the difference between the modifiers that are used by patients and controls, we compare them to the modifiers that are commonly used to describe the same nouns and verbs. For example, given an answer with only one noun “violin” (כינור) that is modified by the adjective “sad” (עצוב), we calculate a score that reflects how similar the adjective “sad” is to adjectives that are typically used to describe a violin.

We take the following steps:

(1) We convert each sentence into a list of noun-adjective and verb-adverb pairs using YAP.

(2) To compare each modifier with the modifiers that are typically used to describe the same noun or verb, we use external corpora as a reference. These were taken from various sources reflecting the health domain we are working in; all were downloaded from the MILA Knowledge Center for Processing Hebrew. Table 4 lists the sources and the corresponding numbers of documents and words that they contain. Each document in these sources was processed in exactly the same way to find all noun-adjective and verb-adverb pairs.

(3) Given a list of noun-adjective and verb-adverb pairs of one response, we calculate the similarity score of every modifier that describes a specific noun or verb with the set of modifiers describing exactly the same noun or verb in the reference corpus. Looking at our example above, we would want to calculate a similarity score between the adjective “old” (ישן) and all the adjectives that are used to describe “violin” (כינור) in the reference corpus. Searching for instances of the same Hebrew word is challenging due to Hebrew’s rich morphology. Hebrew words are inflected for person, number, and gender; prefixes and suffixes are added to indicate definiteness, conjunction, various prepositions, and possessive forms. Therefore, we work on the lemma (base-form) level. Most vowels in Hebrew are not indicated in standard writing; therefore, Hebrew words tend to be ambiguous, and determining the correct lemma for a word is nontrivial. We use the lemmas provided by YAP.

Another challenge is how to compare a single modifier with a group of modifiers taken from the reference corpus. We take the fastText vectors of the modifiers extracted from the reference corpus and aggregate them into a single vector. Then, we take the cosine similarity between the modifier from the response and the aggregated vector of the modifiers from the reference corpus. As an aggregation function, we use the element-wise weighted average of the individual modifiers’ fastText vectors, with inverse-document-frequency (IDF) scores as weights, so as to give more weight to modifiers that describe the noun or verb more distinctively. We calculate IDF scores using the reference corpora. For this purpose, a “qualified” word is a noun or verb that has an IDF score and at least one modifier linked to it in either the control or patient corpus. Most of the nouns and verbs are non-qualified; we consider only qualified words in this investigation.

(4) For each response, we calculate two scores. The adjective-similarity score is the IDF-weighted average of the individual adjective scores calculated in the previous step; similarly, the adverb-similarity score is the IDF-weighted average of the individual adverb scores.

(5) To calculate a score on the participant level, we average the scores of all the individual responses provided by the participant.

The output of this process is a pair of scores, one for adjectives and one for adverbs, calculated for each participant. The higher a score is, the more similar the modifiers are to ones that are typically used to describe the same noun or verb.
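The IDF-weighted aggregation in step (3) can be sketched as follows (a NumPy sketch under our own naming; the fastText vectors and IDF weights are assumed to be precomputed):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def modifier_similarity(modifier_vec, reference_vecs, idf_weights):
    """Similarity between one response modifier and the IDF-weighted
    element-wise average of the reference-corpus modifiers that
    describe the same noun or verb."""
    reference = np.average(np.stack(reference_vecs), axis=0,
                           weights=np.asarray(idf_weights, dtype=float))
    return cosine(modifier_vec, reference)

# Toy example: the response modifier matches one of two equally
# weighted reference modifiers, so the aggregate is [0.5, 0.5].
score = modifier_similarity(np.array([1.0, 0.0]),
                            [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                            [1.0, 1.0])
print(round(score, 4))  # 0.7071
```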

Results: Table 5 summarizes the results. Overall, controls have significantly higher scores for both modifier types, indicating a higher agreement on modifiers by the controls and external writers.

      Control           Patients          t
Adj   0.5891 (0.0301)   0.5498 (0.0284)   4.7765***
Adv   0.6880 (0.0251)   0.6254 (0.0709)   4.2961***
Table 5: Results for Experiment 2. The numbers are average coherence scores across patients and controls (standard deviations in parentheses); *** marks a highly significant group difference.

There are more nouns and adjectives than verbs and adverbs, as summarized in Table 6. On average, participants use more adjectives to describe nouns than adverbs to describe verbs. Controls use about 0.61 adjectives per noun, while patients use 0.84. Similarly, patients use more adverbs per verb on average than controls do: about 0.42, compared with only 0.23 for controls. However, these differences are not significant.

Control Patients
Total Qual. Total Qual.
Nouns 934 226 242 90
Adjectives 573 371 204 127
Verbs 699 60 204 34
Adverbs 166 104 86 50
Table 6: Experiment 2: Counts of nouns, verbs, and their modifiers, across the two groups. Qual. = Qualified.

4.3 Classification

As a final step, we train several classifiers to distinguish between controls and patients. We represent each participant by the characteristics computed in the two experiments. Specifically, each subject is represented by the following: (1) the adjective- and adverb-similarity scores from Experiment 2; (2) derailment scores for the 5 window widths, using all words; and (3) derailment scores for the 5 window widths, using only content words. In total, we use 12 scores per subject. Each classifier was evaluated with 10-fold cross-validation over the 51 participants. For each classifier, we report the overall prediction accuracy, as well as precision and recall for predicting the patient group. The classification algorithms we tried are Random Forest (Breiman, 2001) and XGBoost (Chen and Guestrin, 2016), both based on decision trees, as well as linear support vector machines (SVM) (Cortes and Vapnik, 1995). Table 7 summarizes the results per classifier with respect to the different metrics.
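To make the evaluation protocol concrete, here is a minimal sketch of 10-fold cross-validation over a 51 x 12 feature matrix. A simple nearest-centroid rule stands in for the Random Forest, XGBoost, and SVM classifiers actually used; the data and all names below are illustrative placeholders, not the study's.

```python
import numpy as np

def ten_fold_accuracy(X, y, n_folds=10):
    """Cross-validated accuracy: split subjects into folds, fit class
    centroids on the rest, predict each held-out subject by the
    nearest centroid."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)
    correct = 0
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        centroids = {c: X[train][y[train] == c].mean(axis=0)
                     for c in np.unique(y[train])}
        for i in fold:
            pred = min(centroids,
                       key=lambda c: np.linalg.norm(X[i] - centroids[c]))
            correct += int(pred == y[i])
    return correct / len(y)

# Two well-separated synthetic groups (27 "controls", 24 "patients",
# 12 features each) are classified perfectly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(27, 12)),
               rng.normal(5.0, 0.1, size=(24, 12))])
y = np.array([0] * 27 + [1] * 24)
print(ten_fold_accuracy(X, y))  # 1.0
```

With scikit-learn installed, the same protocol applies unchanged to RandomForestClassifier or LinearSVC via cross_val_score.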

Classifier Acc. Prec. Recall
Random Forest 81.5% 91.3% 71.8%
XGBoost 80.5% 86.8% 73.1%
SVM 70.4% 72.1% 47.3%
Table 7: Classification results for each classifier.

We used the decision-tree-based classifiers to calculate the most important features, that is, the ones that have the greatest impact on prediction decisions. The most important features were found to be the two modifier-similarity scores from Experiment 2, as expected.

5 Conclusions

With the aim of detecting speech disturbances, we have analyzed transcribed Hebrew speech produced by schizophrenia inpatients and compared it with that of controls. We believe that speech produced during a psychiatric interview is a more reliable data source for detecting disturbances than are social media posts.

Generally speaking, we find that patients talk significantly less in interviews than controls do.

In one experiment, we use word embeddings to detect derailment, that is, when a speaker shifts to a topic that is not strongly related to previously discussed ones. The results show that controls have higher scores, indicating that they keep the topic more cohesive than patients do. These results are in line with previous studies on English (Bedi et al., 2015), which showed that schizophrenics have a lower score, calculated by a similar mathematical procedure.

In a second experiment, we examine the difference in how patients and controls use adjectives and adverbs to describe nouns and verbs, respectively. Our results show that the adjectives and adverbs that are used by the controls are more similar to the typical ones used to describe the same nouns and verbs. For now, we consider this difference as related to speech incoherence; however, we plan to continue investigating this direction in the near future, when more data become available.

Analyzing Hebrew is more challenging than analyzing English due to Hebrew’s rich morphology, as well as the absence of written vowels. In the first experiment, we work with fastText, which provides word embeddings at the surface-form level. In the second, we use lemmata rather than surface forms, so that we can find multiple instances of the same lexeme.

As we did not measure participants’ IQ, some of the results might, to a certain extent, be attributable to differences in intellect. Moreover, as can be seen in Table 1, about 20% of the control participants have some sort of post-high-school education, while most of the inpatients did not continue beyond high school. We plan to address these questions in follow-up work. Another limitation we are aware of relates to the classification results, as the number of participants used for training the classifiers is relatively small.

Overall, we found the semantic characteristics that we compute in this study to be beneficial for the task of detecting thought disorders in Hebrew speech. We plan to collect speech samples from more subjects, and to continue exploring additional semantic as well as grammatical textual characteristics to support the automatic detection of various mental disorders.


  • Adler (2007) Meni Adler. 2007. Hebrew Morphological Disambiguation: An Unsupervised Stochastic Word-based Approach. Ph.D. thesis, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
  • Al-Mosaiwi and Johnstone (2018) Mohammed Al-Mosaiwi and Tom Johnstone. 2018. In an absolute state: Elevated use of absolutist words is a marker specific to anxiety, depression, and suicidal ideation. Clinical Psychological Science, 6(4):529–542.
  • Andreasen (1979) Nancy C. Andreasen. 1979. Thought, language, and communication disorders: I. clinical assessment, definition of terms, and evaluation of their reliability. Archives of General Psychiatry, 36(12):1315–1321.
  • Association (2013) American Psychiatric Association. 2013. Diagnostic and Statistical Manual of Mental Disorders (5th ed.). Washington, DC.
  • Bedi et al. (2015) Gillinder Bedi, Facundo Carrillo, Guillermo A. Cecchi, Diego Fernández Slezak, Mariano Sigman, Natália B. Mota, Sidarta Ribeiro, Daniel C. Javitt, Mauro Copelli, and Cheryl M. Corcoran. 2015. Automated analysis of free speech predicts psychosis onset in high-risk youths. npj Schizophrenia, 1:15030.
  • Bleuler (1991) Eugen Bleuler. 1991. Dementia praecox oder Gruppe der Schizophrenien. In G. Aschaffenburg, editor, Handbuch der Psychiatrie, volume Spezieller Teil. 4. Abteilung, 1. Hälfte. Franz Deuticke, Leipzig.
  • Breiman (2001) Leo Breiman. 2001. Random forests. Mach. Learn., 45(1):5–32.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY. ACM.
  • Coppersmith et al. (2015) Glen Coppersmith, Mark Dredze, Craig Harman, and Kristy Hollingshead. 2015. From ADHD to SAD: Analyzing the language of mental health on Twitter through self-reported diagnoses. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 1–10.
  • Coppersmith et al. (2014) Glen Coppersmith, Craig Harman, and Mark Dredze. 2014. Measuring post traumatic stress disorder in Twitter. In ICWSM 2014.
  • Coppersmith et al. (2016) Glen Coppersmith, Kim Ngo, Ryan Leary, and Anthony Wood. 2016. Exploratory analysis of social media prior to a suicide attempt. In Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology, pages 106–117.
  • Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Mach. Learn., 20(3):273–297.
  • De Choudhury et al. (2013) Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013. Predicting depression via social media. ICWSM, 13:1–10.
  • Deerwester et al. (1990) Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
  • Elvevåg et al. (2007) Brita Elvevåg, Peter W. Foltz, Daniel R. Weinberger, and Terry E. Goldberg. 2007. Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia. Schizophrenia research, 93(1–3):304–316.
  • Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • Iter et al. (2018) Dan Iter, Jong Yoon, and Dan Jurafsky. 2018. Automatic detection of incoherent speech for diagnosing schizophrenia. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 136–146. Association for Computational Linguistics.
  • Katz et al. (2012) Gregory Katz, Leon Grunhaus, Shukrallah Deeb, Emi Shufman, Rachel Bar-Hamburger, and Rimona Durst. 2012. A comparative study of Arab and Jewish patients admitted for psychiatric hospitalization in Jerusalem: the demographic, psychopathologic aspects, and the drug abuse comorbidity. Comprehensive Psychiatry, 53(6):850–853.
  • Losada and Crestani (2016) David E. Losada and Fabio Crestani. 2016. A test collection for research on depression and language use. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 28–39. Springer.
  • Mitchell et al. (2015) Margaret Mitchell, Kristy Hollingshead, and Glen Coppersmith. 2015. Quantifying the language of schizophrenia in social media. In CLPsych@HLT-NAACL.
  • More and Tsarfaty (2016) Amir More and Reut Tsarfaty. 2016. Data-driven morphological analysis and disambiguation for morphologically rich languages and universal dependencies. In Proceedings of COLING 2016.
  • Morice and Ingram (1982) Rodney D. Morice and John C. L. Ingram. 1982. Language analysis in schizophrenia: Diagnostic implications. Australian & New Zealand Journal of Psychiatry, 16(2):11–21. PMID: 6957177.
  • Obrębska and Obrębski (2007) M Obrębska and T Obrębski. 2007. Lexical and grammatical analysis of schizophrenic patients’ language: A preliminary report. Psychology of Language and Communication, 11(1):63–72.
  • Østergaard et al. (2016) Soren Dinesen Østergaard, Ole Michael Lemming, Ole Mors, Christoph U. Correll, and Per Bech. 2016. PANSS-6: A brief rating scale for the measurement of severity in schizophrenia. Acta Psychiatrica Scandinavica, 133(6):436–444.
  • Pennebaker et al. (2015) James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn. 2015. The development and psychometric properties of LIWC2015. Technical report, University of Texas at Austin, Austin, TX.
  • Tsugawa et al. (2015) Sho Tsugawa, Yusuke Kikuchi, Fumio Kishino, Kosuke Nakajima, Yuichi Itoh, and Hiroyuki Ohsaki. 2015. Recognizing depression from Twitter activity. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3187–3196. ACM.
  • Yates et al. (2017) Andrew Yates, Arman Cohan, and Nazli Goharian. 2017. Depression and self-harm risk assessment in online forums. CoRR, abs/1709.01848.