LiSSS: A toy corpus of Literary Spanish Sentences Sentiment for Emotions Detection

05/17/2020 ∙ by Juan-Manuel Torres-Moreno, et al. ∙ Université d'Avignon et des Pays de Vaucluse 0

In this work we present a new, small corpus in the area of Computational Creativity (CC): the Literary Sentiment Sentence Spanish Corpus (LISSS). We designed this corpus of literary sentences to evaluate algorithms for sentiment classification and emotion detection. We built it by manually classifying its sentences into five emotions: Love, Fear, Happiness, Anger and Sadness/Pain. We also present some baseline classification algorithms applied to our corpus. The LISSS corpus will be available to the community as a free resource to evaluate or create CC algorithms.

1 Introduction

Research in Natural Language Processing (NLP) focused on the classification of emotions has used corpora made up of encyclopedic documents (mainly Wikipedia), journals (newspapers or magazines) or specialized documents (legal, scientific or technical) for the development and evaluation of models [torres2014, cunhaCSSMV11, sierra]. Studies of literary corpora have been systematically left aside, mainly because literary discourse is more complex than other genres. In this work we introduce a new literary emotion corpus in order to evaluate and validate NLP algorithms in literary emotion classification tasks.

This paper is structured as follows. In Section 2 we review some works related to the development and analysis of corpora. In Section 3 we describe the LISSS corpus. Section 4 presents the four baseline classification models and the learning corpus CitasIn, and Section 5 reports their respective results. Finally, in Section 6 we propose some ideas for future work before concluding.

2 Related work

Several corpora in Spanish have been built and made available to the scientific community [acllaw:2011]; however, few of them have been classified by categories of emotions. For example, the corpus SAB, composed of tweets in Spanish, was introduced in [corpusSAB]. The tweets express criticism of different commercial brands. The annotation was made considering the emotion perceived in each tweet. The corpus SAB consists of 4 548 tweets annotated with 8 predefined emotions: [Trust, Satisfaction, Happiness, Love, Fear, Disaffection, Sadness and Anger].

Another data set concerning tweets is the corpus TASS [corpusTASS]. It contains about 70 000 tweets classified using automatic methods into the following categories: [Positive, Negative, Neutral, None]. Tweets of the TASS corpus are related to different topics (Politics, Economy, Sport, Music, etc.). In [chen2014building], a global analysis (at word level) of emotion polarity is presented. The corpus employed is composed of several lexicons in 40 languages, including Spanish. The annotation considered the categories [Positive and Negative] with a score in the range between 0 and 1 (0 defines a word as negative and 1 as positive).

3 LISSS Corpus

Unlike the above-mentioned corpora, the corpus we introduce in this paper, Literary Sentiment Sentences in Spanish (LISSS), consists only of literary texts. This makes it particularly useful for studying algorithms for the automatic classification and generation of literary text. Moreover, five categories of emotions were defined for the classification, instead of a binary (positive-negative) scheme. This characteristic of LISSS could be useful for a more complete analysis.

3.1 Corpus structure

Our work was meticulous but on a modest scale. We decided to create a small controlled corpus, exclusively composed of literary phrases in Spanish selected from universal literature. The LISSS corpus was built manually using literary texts in Spanish from 140 authors, either Spanish-speaking or translated into Spanish from other languages (the full list of authors studied is available in the Annex). Each phrase in the corpus was read and manually classified using the following five categories:

  • Anger (A)

  • Love (L)

  • Fear (F)

  • Happiness (H)

  • Sadness or Pain (S)

Since the sentences or paragraphs (groups of sentences) could express two or more emotions, we allowed each sentence to be labelled with all the emotions it expresses. The corpus currently has 250 sentences: 50 for each emotion. Sentences were processed to create a text document encoded in UTF-8. Each line of the corpus contains the following fields:

  • ID

  • Sentence (or paragraph)

  • # Author

Each field is separated by a tab character. The ID field is composed of a sequential number (1,2,3,…) followed by a code (A, L, F, H, S) for each of the predefined emotions. If a sentence is ambiguous, it will have as many codes as categories it belongs to. Sentences were selected manually to maintain a balance between the five types of emotions.
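The line format described above can be read with a few lines of Python. This is only a minimal sketch, assuming the three tab-separated fields appear in the order ID, sentence, author and that the author field starts with a `#`; the names `Entry` and `parse_line` are hypothetical and not part of the corpus distribution.

```python
import re
from typing import List, NamedTuple

# Emotion codes as listed in Section 3.1.
EMOTIONS = {"A": "Anger", "L": "Love", "F": "Fear", "H": "Happiness", "S": "Sadness/Pain"}

# Assumed ID layout: a sequential number followed by one or more emotion codes.
ID_PATTERN = re.compile(r"^(\d+)([ALFHS]+)$")


class Entry(NamedTuple):
    number: int
    emotions: List[str]
    sentence: str
    author: str


def parse_line(line: str) -> Entry:
    """Split one corpus line into its ID, sentence and author fields."""
    ident, sentence, author = line.rstrip("\n").split("\t")
    match = ID_PATTERN.match(ident.strip())
    if match is None:
        raise ValueError(f"unexpected ID field: {ident!r}")
    number, codes = match.groups()
    return Entry(
        number=int(number),
        emotions=[EMOTIONS[code] for code in codes],
        sentence=sentence.strip(),
        author=author.lstrip("# ").strip(),  # drop the leading '#' marker
    )
```

An ambiguous sentence simply yields a list with more than one emotion, so a multi-label evaluation can iterate over `entry.emotions` without special cases.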

It should be noted that sentences are actually sometimes mini-paragraphs composed of several sentences. This was done to respect as much as possible the coherence of the idea expressed and the corresponding emotion. For example, sentence 24 of the emotion Love (L), by Lope de Vega:

80L  La raíz de todas las pasiones es el amor. De él nace la tristeza, el gozo,
     la alegría y la desesperación.       # Lope de Vega

is a two-sentence paragraph:

     La raíz de todas las pasiones es el amor.

     (The root of all passions is love.)

     De él nace la tristeza, el gozo, la alegría y la desesperación.

     (From it are born sadness, joy, happiness and despair.)

3.2 Characterization of LISSS corpus

The corpus contains 230 literary sentences (or paragraphs in a line). It was constituted manually from selected quotes, stories, novels and some poems.

The literary genre is heterogeneous. Sentences in general language, as well as those that were too short ( words) or too long ( words), were carefully avoided. The result is a complex and aesthetic vocabulary in which certain literary figures such as anaphora or metaphor can be observed. The characteristics of the LISSS corpus are shown in Table 1.

Emotion        Label  Sentences  Paragraphs  Words   Characters
Corpus LISSS          230        26          4 577   25 643
Anger          A      43         7           961     5 428
Love           L      43         8           941     5 166
Fear           F      46         5           940     5 173
Happiness      H      46         2           889     5 143
Sadness/Pain   S      41         7           846     4 728
Table 1: Corpus LISSS of literary sentences classified in 5 emotions.

About 10% of the LISSS corpus consists of mini-paragraphs composed of several sentences. The five existing classes are not completely homogeneous, given the existence of ambiguous sentences that belong to two or more emotions (Table 2).

Emotion        Label  Ambiguous sentences
Anger          A      8
Love           L      12
Fear           F      7
Happiness      H      4
Sadness/Pain   S      5
TOTAL                 36
Table 2: Corpus LISSS: Sentences with ambiguous emotions.

An example of this sort of sentences is the sentence labeled with the identifier 14AL, Anger (A) and Love (L):

14AL  Del amor al odio, solo hay más amor  # Mario Benedetti

(From love to anger, there’s only more love)

that belongs to both categories.

In total, 36 sentences (16%) of the LISSS corpus are labelled with more than one emotion. This ambiguity is mainly observed in the emotions Love and Sadness. Literary ambiguity represents a challenge to automatic classification methods.

The LISSS corpus has the advantage of being homogeneous in terms of genre by possessing only sentences considered as “literary sentences”, but it is heterogeneous in terms of emotions classes.

In other emotion corpora, the sentences may be in general language: sentences that often give fluency to the reading and provide the necessary relations between the ideas expressed in literary sentences. Likewise, corpora of tweets cannot be used for literary purposes due to the presence of noise and other special characters. Another advantage of the LISSS corpus is that noise (cut phrases, pasted words, wrong syntax, etc.) was avoided by repeated and careful reading.

However, the LISSS corpus has the disadvantage of its reduced size, which makes it unsuitable for training machine learning algorithms. This is to be expected, because the goal of the LISSS corpus is not to be used for learning, but for evaluation. The LISSS corpus is suitable for testing the quality and performance of such algorithms.

Version 0.250 of the LISSS corpus (14/05/2020) is available at the website http://juanmanuel.torres.free.fr/corpus/lisss/ under the GPL3 public licence.

4 Algorithms used in classification

In this section we present some classical classification methods applied to the LISSS corpus and their preliminary results.

4.1 Employed models

The LISSS corpus was tested with several classical classification algorithms available in the Weka libraries (https://www.cs.waikato.ac.nz/ml/weka/). In particular, we employed the following.

The J48 algorithm is Weka's implementation of the C4.5 decision tree learner proposed by Ross Quinlan [salzberg1994c4], itself an extension of the ID3 algorithm. It chooses the attributes on which to split using a criterion based on information entropy, and was considered for its high performance in classification tasks.
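To make the entropy-based criterion concrete, the following sketch computes entropy, information gain and the gain ratio used by C4.5-style trees for one candidate split. It is an illustration under simple assumptions (labels given as plain lists, one split at a time), not Weka's J48 code; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(labels, partition):
    """Entropy reduction obtained by splitting `labels` into the subsets in `partition`."""
    total = len(labels)
    remainder = sum(len(subset) / total * entropy(subset) for subset in partition if subset)
    return entropy(labels) - remainder


def gain_ratio(labels, partition):
    """Information gain normalized by the entropy of the subset sizes (split information)."""
    # Tag each instance with the index of the subset it falls into.
    membership = [i for i, subset in enumerate(partition) for _ in subset]
    split_info = entropy(membership)
    return information_gain(labels, partition) / split_info if split_info > 0 else 0.0
```

A split that separates the classes perfectly removes all the entropy of the parent node, while a split that leaves each subset as mixed as the parent has a gain of zero.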

We also decided to use the Naive Bayes model given its wide use in classification tasks. In particular, we used the Naive Bayes Multinomial model, based on the estimated frequency of terms, which allows a simple and efficient implementation for text classification [su2011large].
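The idea behind the multinomial model can be sketched in a few lines of plain Python. This is a didactic sketch with add-one (Laplace) smoothing, assumed here for simplicity; it is not the Weka implementation used in the experiments, and the class name `TinyMultinomialNB` is hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # number of documents per class
        self.term_counts = defaultdict(Counter)   # term frequencies per class
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
        self.vocabulary = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            # log P(class) + sum of log P(term | class) over known terms
            score = math.log(n_docs / total_docs)
            for term in tokens:
                if term in self.vocabulary:  # ignore terms never seen in training
                    score += math.log((counts[term] + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long sentences and large vocabularies.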

Finally, we tested a standard implementation of an SVM (https://weka.sourceforge.io/doc.stable/weka/classifiers/functions/LibSVM.html) to compare the performance of the models on the LISSS corpus.

The four algorithms used need a learning phase to produce a classification model. The learning phase must be done on a corpus independent of the test corpus. In our case we decided to build a learning corpus suitable for this task, adapting it to the five categories present in the LISSS corpus.

4.2 Learning corpus

For the training of the classification models, we built an ad hoc learning corpus. The CitasIn corpus is composed of texts in Spanish, mostly from the literary genre. A large number of documents belonging to different categories (https://citas.in/temas/), such as friendship, lovers, beauty, success, happiness, laughter, enmity, deception, anger and fear, were retrieved from a suitable website (all documents were downloaded, with the editor's authorisation, on 25 March 2020 from https://citas.in). These documents were classified into the five classes of the LISSS corpus through a manual mapping from their own categories (last column of Table 3).

The generated corpus has an adequate size for training machine learning models. The disadvantage of the corpus CitasIn is the presence of noise, given that it often contains supporting sentences (sentences with general language vocabulary) that do not belong to the literary genre. The characteristics of the corpus CitasIn are given in Table 3. The reader should have no problem reconstituting the corpus CitasIn using this correspondence between classes.

CitasIn        Sentences  Words      Chars       Words per  Categories (https://citas.in/temas/)
                                                 sentence
Corpus         53 351     1 903 214  11 113 527  35.6
Love           13 430     392 623    2 234 523   29.2       alma, amantes, amistad, amor,
                                                            belleza, beso, esperanza, pasion
Happiness      10 857     377 280    2 222 645   34.7       felicidad, amistad, diversión, sonrisa, motivación,
                                                            risa, victoria, exito, optimismo
Anger          9 556      384 211    2 253 546   40.2       egoismo, enemistad, engaño, envidia, venganza,
                                                            guerra, infierno, mentira, odio, muerte
Fear           8 917      355 976    2 093 918   39.9       necesidad, miedo, dolor, fracaso,
                                                            indecisión, problema, soledad, suicidio
Sadness/Pain   10 591     393 124    2 308 895   37.1       despedida, tristeza, pena, enfermedad, fracaso,
                                                            pérdida, sufrimiento, olvidando, llorar, lágrima
Table 3: Corpus CitasIn: Sentences from different categories grouped into 5 emotions
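The manual mapping of Table 3 can be written down as a simple lookup structure. The sketch below copies the category names from the last column of the table; the names `EMOTION_CATEGORIES` and `emotions_for` are hypothetical helpers, and how the original authors stored the mapping is not stated in the paper.

```python
# Category names taken from the last column of Table 3.
EMOTION_CATEGORIES = {
    "Love": ["alma", "amantes", "amistad", "amor",
             "belleza", "beso", "esperanza", "pasion"],
    "Happiness": ["felicidad", "amistad", "diversión", "sonrisa", "motivación",
                  "risa", "victoria", "exito", "optimismo"],
    "Anger": ["egoismo", "enemistad", "engaño", "envidia", "venganza",
              "guerra", "infierno", "mentira", "odio", "muerte"],
    "Fear": ["necesidad", "miedo", "dolor", "fracaso",
             "indecisión", "problema", "soledad", "suicidio"],
    "Sadness/Pain": ["despedida", "tristeza", "pena", "enfermedad", "fracaso",
                     "pérdida", "sufrimiento", "olvidando", "llorar", "lágrima"],
}


def emotions_for(category: str) -> list:
    """Return every emotion class to which a citas.in category is mapped."""
    return [emotion for emotion, categories in EMOTION_CATEGORIES.items()
            if category in categories]
```

Note that two categories ("amistad" and "fracaso") appear under two emotions in Table 3, so the lookup returns a list rather than a single class.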

We pre-processed the corpus CitasIn before the learning phase. The texts were encoded in UTF-8, and we removed special symbols as well as stop words using the Weka libraries and stop lists. We normalized the words by converting capital letters to lower case. A tokenization process was then applied using Weka's specific algorithms for the Spanish language. Finally, we removed from the CitasIn corpus every sentence that also appears in the LISSS corpus.
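The pre-processing steps described above can be approximated as follows. This is a rough sketch, not the Weka pipeline used in the paper: the stop list is a tiny illustrative sample, tokenization is a plain whitespace split, the steps are applied in a convenient order, and the function names are hypothetical.

```python
import re

# Tiny illustrative stop list; the experiments relied on Weka's stop lists instead.
STOP_WORDS = {"el", "la", "los", "las", "de", "del", "y", "es", "en", "un", "una", "que", "al"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip special symbols, tokenize on whitespace and drop stop words."""
    lowered = text.lower()
    cleaned = re.sub(r"[^\w\s]", " ", lowered)  # \w is Unicode-aware, so accents survive
    return [token for token in cleaned.split() if token not in stop_words]


def remove_overlap(training_sentences, evaluation_sentences):
    """Drop training sentences whose normalized form also occurs in the evaluation set."""
    seen = {tuple(preprocess(s)) for s in evaluation_sentences}
    return [s for s in training_sentences if tuple(preprocess(s)) not in seen]
```

Comparing normalized token sequences rather than raw strings makes the overlap check robust to differences in punctuation and capitalization between the two corpora.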

5 Baseline results and discussion

In this section we show the baseline results of our tests. The tests were executed with the four models presented in Section 4, using CitasIn as the learning corpus and the LISSS corpus (see Section 3) for evaluation. In all cases, there were 53 351 input instances (sentences) for training the systems. Our results also show the confusion matrix calculated for each algorithm.

5.1 Algorithm Trees J48

=== Summary ===

Correctly Classified Instances          146               58.4    %
Incorrectly Classified Instances        104               41.6    %
Total Number of Instances               250

=== Detailed Accuracy By Class ===

                 Precision  Recall   F-Measure  Class
                 0.478      0.880    0.620      Love
                 0.700      0.700    0.700      Happiness
                 0.615      0.640    0.627      Anger
                 0.500      0.240    0.324      Sadness
                 0.719      0.460    0.561      Fear
Weighted Avg.    0.602      0.584    0.566

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 44  2  1  2  1 |  a = Love
  6 35  3  2  4 |  b = Happiness
 11  1 32  5  1 |  c = Anger
 22  8  5 12  3 |  d = Sadness
  9  4 11  3 23 |  e = Fear

5.2 Algorithm Naive Bayes Multinomial Text

=== Summary ===

Correctly Classified Instances         128               51.2    %
Incorrectly Classified Instances       122               48.8    %
Total Number of Instances              250

=== Detailed Accuracy By Class ===

                 Precision  Recall   F-Measure  Class
                 0.338      0.880    0.489      Love
                 0.657      0.460    0.541      Happiness
                 0.629      0.440    0.518      Anger
                 0.737      0.280    0.406      Sadness
                 0.806      0.500    0.617      Fear
Weighted Avg.    0.633      0.512    0.514

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 44  4  0  2  0 |  a = Love
 24 23  1  0  2 |  b = Happiness
 24  2 22  0  2 |  c = Anger
 26  4  4 14  2 |  d = Sadness
 12  2  8  3 25 |  e = Fear

5.3 Algorithm Support Vector Machine

=== Summary ===

Correctly Classified Instances         124               49.6      %
Incorrectly Classified Instances       126               50.4      %
Total Number of Instances              250

=== Detailed Accuracy By Class ===

                 Precision  Recall   F-Measure  Class
                 0.304      0.760    0.434      Love
                 0.714      0.400    0.513      Happiness
                 0.653      0.640    0.646      Anger
                 0.471      0.160    0.239      Sadness
                 0.839      0.520    0.642      Fear
Weighted Avg.    0.596      0.496    0.495

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 38  2  2  5  3 |  a = Love
 29 20  1  0  0 |  b = Happiness
 14  1 32  1  2 |  c = Anger
 35  4  3  8  0 |  d = Sadness
  9  1 11  3 26 |  e = Fear

5.4 Algorithm Naive Bayes

=== Summary ===

Correctly Classified Instances          91               36.4    %
Incorrectly Classified Instances       159               63.6    %
Total Number of Instances              250

=== Detailed Accuracy By Class ===

                 Precision  Recall   F-Measure  Class
                 0.238      0.780    0.364      Love
                 0.645      0.400    0.494      Happiness
                 0.550      0.220    0.314      Anger
                 0.444      0.080    0.136      Sadness
                 0.654      0.340    0.447      Fear
Weighted Avg.    0.506      0.364    0.351



=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 39  3  2  1  5 |  a = Love
 25 20  0  1  4 |  b = Happiness
 37  2 11  0  0 |  c = Anger
 41  3  2  4  0 |  d = Sadness
 22  3  5  3 17 |  e = Fear

The best model in this task was the J48 tree algorithm, with a weighted average F-measure of 0.566 (the F-measure being the harmonic mean of Precision and Recall). This relatively low result shows the difficulty of classifying emotions in literary corpora.
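The per-class figures reported above follow directly from the confusion matrices. As a check, the sketch below recomputes precision, recall and F-measure from the J48 matrix of Section 5.1 (rows are true classes, columns are predicted classes); the function names are hypothetical and the code is independent of Weka.

```python
CLASSES = ["Love", "Happiness", "Anger", "Sadness", "Fear"]

# J48 confusion matrix from Section 5.1: rows = true class, columns = predicted class.
J48_MATRIX = [
    [44, 2, 1, 2, 1],
    [6, 35, 3, 2, 4],
    [11, 1, 32, 5, 1],
    [22, 8, 5, 12, 3],
    [9, 4, 11, 3, 23],
]


def per_class_scores(matrix):
    """Return (precision, recall, f_measure, support) for each class."""
    scores = []
    for i in range(len(matrix)):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denominator = precision + recall
        f_measure = 2 * precision * recall / denominator if denominator else 0.0
        scores.append((precision, recall, f_measure, actual))
    return scores


def weighted_f_measure(matrix):
    """Average of the per-class F-measures, weighted by class support."""
    scores = per_class_scores(matrix)
    total = sum(support for _, _, _, support in scores)
    return sum(f * support for _, _, f, support in scores) / total
```

For the class Love, 44 of the 92 sentences predicted as Love are correct, which gives the precision of 0.478 listed in the table, while 44 of the 50 true Love sentences are recovered, giving the recall of 0.880.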

We detected two main problems in the classification of this type of text: on the one hand, the richness of the lexicon used; on the other hand, ambiguity. Sentences that express multiple emotions cause errors in the methods used. In particular, the emotions Sadness and Love were confused even by the best algorithm, J48.

6 Conclusion and future work

In this article we have introduced a new toy literary corpus of emotions in Spanish, the LISSS corpus. The aim of this corpus is to test machine learning algorithms on a specialized corpus, not to train such algorithms. We have tested four classic classification algorithms on the LISSS corpus. The results obtained show that the classification of this type of text is a difficult exercise. The sentences often belong to two or more classes. The overlap between the vocabularies of the different classes prevents the methods from classifying this corpus correctly.

We think that automatic classifiers can be enriched through the integration of other modules, using characteristics of language, style or personality detection to achieve a better classification [We12, plastino2016fisica, Ed17, Si18, moreno_GEN].

Future work requires adding more sentences in order to enrich the corpus. The scientific community can contribute to developing this corpus, modify it or distribute it under the GPL3 license.

Acknowledgements

This work is funded by Consejo Nacional de Ciencia y Tecnología (Conacyt, Mexico), grant number 661101, and partially by the Université d’Avignon/Laboratoire Informatique d’Avignon (LIA), France. We thank the admin of the site https://cite.in for allowing us to use their literary quotation database for our experiments. The authors also thank Carlos-Emiliano González-Gallardo for his comments and invaluable corrections of this paper.

Annex: Author Listing of LISSS corpus version 0.230

Abraham Lincoln; Agatha Christie; Albert Camus; Albert Einstein; Albert Schweitzer; Aldous Huxley; Alphonse Daudet; Alphonse de Lamartine; Alphonse Karr; Amado Nervo; Anais Nim; Anatole France; Andrés Calamaro; Antoine de Saint-Exupéry; Arthur Schopenhaue; Baruch Spinoza; Benjamin Franklin; Bernard Le Bouvier de Fontenelle; Bertrand Russell; Blaise Pascal; Buda; Camilo José Cela; Cesare Pavese; Charles Baudelaire; Charles Bukowski; Charles Dickens; Charles Péguy; Charly García; Cleóbulo de Lindos; Denis Diderot; Dostoievsky; Edgar Allan Poe; Elbert Hubbard; Epicteto de Frigia; Ernest Hemingway; Ernesto Che Guevara; Ernesto Sábato; Federico García Lorca; Fiodor Dostoievski; F Nietzche; François de La Rochefoucauld; Gabriel García Márquez; George Bernard Shaw; George Orwell; George Sand; George Steiner; Giacomo Leopardi; Gilbert Keith Chesterton; Giordano Bruno; Goethe; G Patton; Graham Greene; Groucho Marx; Gustave Flaubert; Heinrich Heine; Henry Louis Mencken; Hermann Hesse; Honoré de Balzac; HP Lovecraft; Immanuel Kant; Isaac Asimov; Italo Calvino; Jacinto Benavente; Jaime Sabines; Jane Addams; Jean Cocteau; Jean-Jacques Rousseau; Jean Luc Goddard; Jean Paul Sartre; Jim Morrison; John F Kennedy; John Lennon; Jorge Luis Borges; José Ingenieros; José Marti; José Saramago; Juan Manuel Torres Moreno; Juan Ramón Jiménez; Juan Rulfo; Jules d’Aurevilly; Laura Esquivel; Leonardo Da Vinci; Lope de Vega; Lord Byron; Marcel Proust; Marie Curie; Mario Benedetti; Mark Twain; Marlene Dietrich; Marqués de Vauvenargues; Martin Luther King; Maximilien Robespierre; Máximo Gorki; Miguel de Cervantes; Miguel de Unamuno; Miguel Hernández; Milan Kundera; Molière; Montesquieu; Nelson Mandela; Nietzsche; Ogden Nash; Orson Welles; Oscar Wilde; Ovidio; Pablo Neruda; Paulo Coelho; Pedro Bonifacio Palacios Almafuerte; Pericles; Peter Alexander Ustinov; Pierre Corneille; Proverbio chino; Ray Loriga; René Descartes; R Tagore; Sabino Arana; Sadamm Hussein; Selma Lagerlöf; Séneca; Shakespeare; 
Simone de Beauvoir; Sir Francis Bacon; Sófocles; Solón; Stanisław Lem; Stephen Hawking; Sthendal; Susan Sontag; Tennessee Williams; Terencio; Tito Livio; Tupac Shakur; Ugo Foscolo; Victor Hugo; Vinicius de Moraes; Voltaire; William Blake; William Faulkner; William Nicholson; Woody Allen

References