Authorship Attribution in Bangla literature using Character-level CNN

01/11/2020 ∙ by Aisha Khatun, et al.

Characters are the smallest unit of text from which stylometric signals can be extracted to determine the author of a text. In this paper, we investigate the effectiveness of character-level signals in Authorship Attribution of Bangla Literature and show that the results are promising but improvable. The time and memory efficiency of the proposed model is much higher than that of its word-level counterparts but accuracy is 2-5 models. We compare various word-based models and show that the proposed model performs increasingly better with larger datasets. We also analyze the effect of pre-training character embeddings of the diverse Bangla character set in authorship attribution. It is seen that the performance is improved by up to 10 balancing them before training and compare the results.


I Introduction

Authorship attribution is generally concerned with the identification of the original author of a given text from a set of given authors. It has a wide range of applications including plagiarism detection, forensic linguistics, etc. Each author has a distinctive writing style that is exploited by statistical analysis to detect the author.

However, for the Bangla language, little work has been done in this area despite it being one of the most spoken languages. In traditional methods, texts are represented using independent features such as lexical n-grams or frequency-based representations. In this approach, words of similar context are likely to be mapped to unrelated vectors because the features are independent, so the semantic value of the words may be lost. Word embeddings, also generally known as distributed term representations, offer a solution to this problem by encoding semantic similarity from word co-occurrences. Chowdhury et al. [4] experimented with the effectiveness of word embeddings in authorship attribution for the Bangla language using various architectures.

Another type of embedding, which we analyze in this paper, is character embedding. Character-level CNNs were first introduced by Zhang et al. [28] for the text classification task. The empirical experiments of Ruder et al. [22] and Jozefowicz et al. [11] have shown character-level NLP to be very promising in various ways. Although it may seem that a character on its own does not carry any semantic value, Radford et al. [21] illustrate that character-level models can capture the semantic properties of text. Character-level models are also better at handling out-of-vocabulary words, misspellings, etc., and provide an open vocabulary. Another major advantage is that the embedding dimension can be as low as 16, unlike word embeddings, where the dimension can go up to 300 while the vocabulary is also huge. Character embedding therefore removes a bottleneck in training and offers a large advantage in computational complexity.

Our aim in this paper is to investigate how character embedding performs in the task of Authorship Attribution in the Bangla language. Bangla has numerous words with joint letters, which can be written in a few different forms. Moreover, there are some words with the same meaning but slightly different spellings. These inconsistencies are not recognized by word-level models, but character-level models can capture and relate words of this kind, making such models more appropriate for Bangla. We compare character embedding with word embedding based on our findings. Experiments with and without pre-trained embedding layers have also been carried out to show the effectiveness of the information captured in the embeddings. To the best of our knowledge, no previous work, analysis or investigation has yet been published on the effect of character embedding in Authorship Attribution of Bangla Literature. This paper is structured as follows:

  • Related Works - A background study of works relevant to this paper is provided in this section.

  • Corpus - The dataset used in our experiment is described in this section.

  • Methodology - The proposed architecture for our character embedding model along with the strategies used during the training phase of the neural networks are described in depth.

  • Experiments - Describes the evaluation process and the model setup for comparison.

  • Results and Discussion - Our findings along with results and possible reasons are presented in this part.

  • Conclusion - In the last section, some recommendations and scope for future research on this field are mentioned.

II Related Works

II-A On Authorship Attribution

Authorship Attribution has long been an important research topic. With increased anonymity on the internet and the ease of fraud, attributing authorship of writings has become crucial. Prior work covers varying degrees of feature selection [25], including advanced features such as local histograms [8]. Naive similarity-based models [12] and SVMs [16] have been explored. A semi-supervised approach to authorship attribution has also been taken [17]. State-of-the-art results were achieved by Ruder et al. [22] using character-level and multi-channel CNNs.

Compared to this body of work, very little has been done for the Bangla language, which lacked any high benchmarks until very recently. Das and Mitra [7] worked with a very small dataset of 36 documents and 3 authors to perform uni-gram and bi-gram feature-based classification. Chakraborty [3] worked with SVMs on 3 authors to achieve up to 84% accuracy. Phani et al. [19] also attempted to attribute 3 authors with machine learning. Das, Tasmim, and Ismail [6] used 4 contemporary authors and hand-crafted features such as word frequency, type-token ratio, counts of various POS, word/sentence lengths, etc. Hossain and Rahman [10] achieved 90.67% by using multiple features along with cosine similarity. Pal, Siddika, and Ismail [18] achieved 90.74% accuracy with 6 authors using an SVM on one feature. Multi-layered perceptrons were employed by Phani, Lahiri, and Biswas [20]. Impressive results were achieved very recently by Chowdhury et al. [4] using various word embeddings on a 6-author dataset. They demonstrated the effects of various architectures and word embeddings on authorship attribution and concluded that fastText's skip-gram used with a CNN tends to beat all other models in terms of accuracy. To the best of our knowledge, no work has been done on character-level classification in Bangla literature. The effects of Bangla alphabet complexity and language formulation on architectural design and character embedding learning remain largely unexplored.

II-B On Embedding

Embeddings are effectively mappings from various entities (character, word, sentence, etc) to continuous vector spaces in high dimensions. The relation among the numerical representations gives a semantic, syntactic and morphological meaning of the entities. These meanings are leveraged by machine learning techniques to find patterns in texts and thus perform various tasks such as classification.

II-B1 Word Embedding

Representing words in continuous vector spaces is considered one of the breakthroughs of NLP. Word embeddings are learned either as an embedding layer within a model or separately in an unsupervised manner. The unsupervised techniques include the Continuous Bag-of-Words (CBOW) and Skip-Gram models famously implemented by Word2Vec and fastText. There are also co-occurrence statistical methods such as GloVe. Santos et al. [24] used word embeddings with convolutional models, showing significant improvements over baseline methods. Word embeddings have been used to improve the performance of sentiment analysis [23]. Often pre-trained embeddings are used, or embeddings are learned for specific tasks such as tree-structured long short-term memory networks [26] and multi-perspective sentence similarity modeling [9]. Although words were initially used as the units of text, various works have started to break down words and work at subword and character levels. Wieting et al. [27] create subword embeddings from counts of character n-grams.

II-B2 Character Embedding

Character-level embeddings are used in various ways, either by themselves or to produce embeddings of higher levels, e.g., for words. Character embeddings have been employed in POS tagging [14], language modelling [14] and dependency parsing [1]. Character-RNNs have been used in machine translation, either to represent words [15] or to generate character-level translations [5]. Pure character-level classification was first explored using a CNN architecture [28]. Jozefowicz et al. [11] show that a character-level language model can significantly outperform state-of-the-art models. Their best performing model combines an LSTM with CNN input over the characters. Besides using only word or only character embeddings, ideas for combining them have also been introduced [13]. Attempts to learn character embeddings and reuse them as pre-trained embeddings have also been explored [2].

III Corpus

Because of the scarcity of standard datasets for authorship attribution, we built a custom web crawler to collect the data ourselves. We collected writings from an online Bangla e-library containing writings (e.g., novels, stories, series, etc.) of different authors. Table I shows the details of our dataset. With 13.4+ million words, our dataset is larger than the previously used Bangla datasets mentioned in Section II. The dataset was partitioned into documents of equal length, 750 words each. Various subsets of authors were chosen, and the dataset was truncated so that each author has the same number of samples.
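The partitioning and balancing steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the whitespace tokenization, and the choice to drop an incomplete final chunk are all assumptions. The word counts in Table I are multiples of 750 (for example, 351750 = 469 × 750), which appears consistent with fixed-length samples.

```python
def chunk_words(text, size=750):
    """Split a text into consecutive samples of exactly `size` words.

    Assumption: words are separated by whitespace and a trailing
    incomplete chunk is dropped.
    """
    words = text.split()
    n_full = len(words) // size
    return [words[i * size:(i + 1) * size] for i in range(n_full)]


def balance(samples_by_author):
    """Truncate every author's sample list to the smallest sample count."""
    n_min = min(len(samples) for samples in samples_by_author.values())
    return {author: samples[:n_min]
            for author, samples in samples_by_author.items()}


# Toy corpus with hypothetical author names and placeholder words.
corpus = {"author_a": "w " * 2300, "author_b": "w " * 1600}
samples = {author: chunk_words(text) for author, text in corpus.items()}
balanced = balance(samples)
```

Here `author_a` yields three full samples and `author_b` yields two, so both are truncated to two samples of 750 words each.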

The dataset from the paper[4] was also used. This dataset consists of 6 authors with 350 sample texts per author and total word count of 2.3+ million.

Author          Word count   Unique words
candidate 01        351750          44477
candidate 02        421500          62485
candidate 03        825000          53163
candidate 04        666000          84888
candidate 05        636750          67579
candidate 06        984000          78717
candidate 07        944250          89956
candidate 08       3388500         161893
candidate 09        357000          43864
candidate 10        786000          69182
candidate 11       1056000          69648
candidate 12       1472250         109230
candidate 13        698250          76071
candidate 14        581250          84311
TABLE I: Corpus details

For pre-training our model, we used another large corpus of Bangla Newspaper articles based on 6 topics. The topics were accident, crime, education, entertainment, environment, and sports. The dataset consists of 10564543 tokens.

IV Methodology

IV-A Proposed Architecture

Character-level CNNs can sufficiently replace word-level models for classification [28]. This means a CNN does not require knowledge of the syntactic or semantic structure of a language, which makes such approaches effectively language-independent, as the number of characters is limited. To this end, a CNN was used in this paper to perform author attribution. An elaborate set of experiments was performed on 3 different datasets to arrive at an architecture that successfully extracts the character-level features of any sample text. The same architecture was used to prepare the pre-trained character embeddings for classification tasks. The model is a deep neural network starting with 4 convolutional layers, each followed by a max-pooling layer of kernel size 3. As is standard in computer vision, the number of filters increases while the kernel size decreases from layer to layer. The kernel sizes are 7, 3, 1 and 1, and the numbers of filters are 64, 128, 256 and 256, respectively. Beneath all of these is an embedding layer where each character is represented as a vector whose length equals the alphabet size. The convolutional layers are followed by a fully connected layer of 512 nodes with ReLU activation and dropout. Finally, an output layer with softmax provides the classification probabilities. The Adam optimizer is used for optimization, with categorical cross-entropy as the loss function.
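To make the layer stack concrete, the sketch below traces sequence lengths and parameter counts through the four convolution and pooling stages using the kernel sizes and filter counts given above. Several details are assumptions, since the text does not state them: unpadded ("valid") convolutions with stride 1, pooling stride equal to the pool size, flattening before the dense layer, an input of 3000 characters with 253-dimensional vectors (the classification settings given in Section IV-C), and 6 output classes.

```python
def out_len(length, kernel, stride=1):
    """Output length of an unpadded 1-D convolution or pooling window."""
    return (length - kernel) // stride + 1


def char_cnn_summary(seq_len=3000, emb_dim=253,
                     kernels=(7, 3, 1, 1), filters=(64, 128, 256, 256),
                     pool=3, dense=512, n_classes=6):
    """Trace shapes and parameter counts for the described layer stack."""
    rows = []
    length, channels = seq_len, emb_dim
    for kernel, n_filters in zip(kernels, filters):
        # Convolution weights plus one bias per filter.
        params = kernel * channels * n_filters + n_filters
        length = out_len(length, kernel)             # convolution
        length = out_len(length, pool, stride=pool)  # max pooling
        channels = n_filters
        rows.append((kernel, n_filters, length, params))
    flat = length * channels                   # assumed flatten step
    dense_params = flat * dense + dense        # fully connected layer
    out_params = dense * n_classes + n_classes # softmax output layer
    return rows, flat, dense_params, out_params


rows, flat, dense_params, out_params = char_cnn_summary()
for kernel, n_filters, length, params in rows:
    print(f"kernel={kernel} filters={n_filters} "
          f"length_after_pool={length} params={params}")
print("flattened features:", flat)
```

Under these assumptions the sequence length shrinks from 3000 to 36 after the fourth pooling layer, so most of the parameters sit in the fully connected layer rather than in the convolutions.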

IV-B Character Embedding

Character embedding aims to turn characters into meaningful numerical representations in the form of vectors. These vectors may represent the correlation between different characters, or even between groups of characters, i.e., words, sentences, documents, etc. This concept can be leveraged to handle misspelled words, rare or new words, slang or emoticons. Character embeddings can also easily represent word variations such as drive, driving, drives, etc. Out-of-vocabulary words are no longer a bottleneck: the character set can be used to form any word, even one outside the vocabulary, in contrast to word embeddings, which simply ignore such words or have weak representations for rare words. In this way character embeddings generalize better than word embeddings.

Another significant improvement is the vocabulary size. Instead of a very large vocabulary of words, character embeddings have a fixed number of characters that is significantly smaller, which reduces model complexity and the number of parameters by a significant amount. Furthermore, characters can be represented with a small vector size (e.g., 16) and still be significantly informative, as opposed to word embeddings, which require vectors of size 100-300 for a decent model.

The simplest way to represent a character is to use a one-hot encoding. This requires the vector size to be the size of the alphabet. We used one-hot encoding as a baseline for comparison with pre-trained embeddings. Alternatively, one can randomly initialize the vectors, in which case they can be of any size, from as small as 16 to as large as 300. The size then becomes a hyperparameter for tuning.
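As an illustration of the one-hot baseline, the sketch below encodes a string character by character against a fixed alphabet. The toy five-character alphabet, the all-zero row for unknown characters, and the zero padding up to a maximum length are assumptions made for the example; the paper's alphabet has 253 characters.

```python
def build_index(alphabet):
    """Map each character of the alphabet to its column index."""
    return {ch: i for i, ch in enumerate(alphabet)}


def one_hot_encode(text, char_to_idx, max_len):
    """Encode `text` as a max_len x alphabet_size matrix of 0/1 values.

    Assumptions: text is truncated to max_len characters, characters
    outside the alphabet become all-zero rows, and short texts are
    padded with all-zero rows.
    """
    size = len(char_to_idx)
    rows = []
    for ch in text[:max_len]:
        row = [0] * size
        idx = char_to_idx.get(ch)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    rows.extend([0] * size for _ in range(max_len - len(rows)))
    return rows


# Toy alphabet of five code points: two Bangla letters, a vowel sign,
# a space, and one more letter.
char_to_idx = build_index("আমি ক")
encoded = one_hot_encode("কাম", char_to_idx, max_len=5)
```

Because Python strings iterate over Unicode code points, a Bangla vowel sign such as "ি" is treated as its own character here, which matches an alphabet that lists Bangla vowel symbols separately from letters.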

IV-C Training the model

The alphabet size, and therefore the embedding vector size, is 253. The 253 characters comprise English letters (capital and small) and digits, Bangla letters and digits, Bangla vowel symbols, and various punctuation marks and other symbols. For comparative training, two sets of embeddings were created for the character set: the first is one-hot encoding, and the second is pre-trained embeddings. The training was done in two phases, as described below:

IV-C1 Pre-training Embedding

To learn character embeddings, the architecture described above was used to classify the news dataset mentioned in Section III. This is in contrast to the usual ways of learning embeddings: no separate model was used [2] to learn them. Instead, an already available classification task on a reasonably large dataset learns character embeddings for its own purposes. These embeddings can then be used as initialization for the author attribution task, which has a smaller dataset than the former, giving it an initial boost. The model was trained with a learning rate of 0.001 and a decay of 0.0001. The maximum length of each text sample was set to 1000 and the batch size to 80. A dropout rate of 0.5 was used in the fully connected layer to prevent over-fitting. The embeddings thus learn some understanding of how the Bangla language works and provide a meaningful initialization for any classification task. They were then extracted and used for the task of authorship attribution.

IV-C2 Performing Classification

To perform the main task of author attribution and comparison, this training phase was run twice, once with each type of embedding mentioned above, i.e., one-hot and pre-trained. The fully connected layer was given a dropout probability of 0.7, the batch size was 128, and the maximum length of each text was set to 3000 characters. Everything else was kept the same. The classification was carried out on 2 author attribution datasets: one with 6 authors [4] and our dataset with a maximum of 14 authors. The larger dataset was trained with 6, 8, 10, 12 and 14 authors to analyze the effect of an increasing number of classes on the proposed model.

V Experiments

We evaluate the performance of the proposed architecture in terms of accuracy, with and without pre-trained character-level embeddings, and compare the two on the held-out dataset. We also examine how the character-level model compares with word-level models. All models are compared for an increasing number of authors (classes) on the corpus described above to assess their quality. To keep the dataset balanced, the number of samples per class was truncated to the minimum among the classes. We propose a model for word-level classification that is mostly similar to our Char-CNN model. The model used for performance analysis is as follows:

V-A Word Embedding Model

This model closely resembles the proposed Char-CNN model, except for a few differences that adapt it to word-level classification. The model has 2 convolutional layers with kernel sizes 7 and 3 and with 128 and 256 filters, respectively. Each convolutional layer is followed by a max-pooling layer. The model is initialized with pre-trained word embeddings from word2vec and fastText, in both CBOW and skip-gram versions. The convolutional layers are followed by an LSTM layer of 100 neurons and a fully connected layer of 512 nodes, both with dropout to prevent overfitting. Finally, a softmax layer provides the classification probabilities. The model is trained for 10 epochs with the Adam optimizer at a learning rate of 0.001 and a batch size of 32, with 750 words per sample as input. All word-level models have a vocabulary size of 60000 and a word embedding vector of size 300.

VI Results and Discussion

The accuracies achieved (in percent) on the test sets, with pre-trained embeddings at both the word and character levels, are summarized in Table II. Because the datasets were balanced, comparing accuracies is sufficient.

# of Authors      6 [4]   6      8      10      12      14
Samples/author    350     1100   931    849     562     469
Char-CNN          83      96     92     86      75      69
W2V (CBOW)        65.3    97     82.8   83.3    76.4    71.8
fastText (CBOW)   65      73     58     35.7    37.31   40.3
W2V (Skip)        79      94     91.1   85.4    82.2    78.6
fastText (Skip)   86      98     95.2   86.35   80.9    81.2
TABLE II: Performance comparison of different models with pre-trained embedding

The accuracies (in percent) of the proposed model with and without pre-trained character embeddings are summarized in Table III.

# of Authors           6 [4]   6      8     10    12    14
# of samples/class     350     1100   931   849   562   469
Pretrained Embedding   83      96     92    86    75    69
Not pretrained         71      95     82    83    66    59.5
TABLE III: pretrained vs non-pretrained comparison

Fig. 1: Accuracy of various models with increasing number of samples.

From the accuracy comparisons shown in Table II, we see that the skip-gram model implemented by fastText performs well on the given datasets. We can therefore infer that subword-level classification tends to extract a good amount of semantic information and style from the text. On the other hand, the word2vec models, which use entire words, perform worse. The character-level model performs reasonably well against the subword-level model as long as the dataset is big enough. As the number of authors increased, the number of samples per author decreased, making it difficult for the character-level model to collect enough information. With larger datasets, this model is expected to perform significantly better [28]. Figure 1 illustrates this: with a larger number of samples, the accuracy of the Char-CNN model rises steeply and becomes competitive with the other models.

In terms of the number of parameters, the character-level model is far more economical than its word-level counterparts. The embedding matrix for the word-level models is of size 300 * 60000. In contrast, the character embedding matrix is of size 253 * 253, given that we initially used one-hot vectors. This size can be reduced to as low as 253 * 16, as has been done in some research [11].

Another thing to consider is the time it takes to train the models. For the word embedding models, a pure CNN does not work satisfactorily, so an LSTM layer had to be added to bring sequential information into the model. This improves accuracy at the cost of longer training, around 15-20 minutes. The character-level model, on the other hand, works well with only convolutional layers and takes less than 2 minutes to train. This difference in training time is greatly magnified in large-scale cases, making the word-level model unfit for lightweight devices. As stated in [28], ConvNets with character embeddings can completely replace words and work even without any semantic meaning. This means that convolutional layers can extract whatever information is necessary for author attribution, given enough data.
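The embedding sizes quoted above can be checked with a few lines of arithmetic. The helper name is an assumption made for the example; the sizes themselves come from the text.

```python
def embedding_params(vocab_size, dim):
    """Number of weights in an embedding matrix of shape vocab_size x dim."""
    return vocab_size * dim


word_level = embedding_params(60000, 300)   # word-level models
char_one_hot = embedding_params(253, 253)   # character one-hot sized vectors
char_small = embedding_params(253, 16)      # reduced character vectors

print(word_level, char_one_hot, char_small)
print("word / char ratio:", word_level // char_one_hot)
```

This gives 18,000,000 weights for the word-level embedding against 64,009 for the 253-dimensional character embedding and 4,048 for the 16-dimensional one, so the word-level embedding is roughly 280 times larger than the character-level one used here.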
To illustrate the need for pre-trained character embeddings, we see from Table III that using a pre-trained embedding increases the accuracy across datasets and numbers of authors, regardless of the amount of data for each author. This shows that these naively learned embeddings contain valuable information that can be readily applied to various tasks in the language, including author attribution, and can improve performance by several percentage points. These numerical representations of characters contain information about the morphology and syntax of the language, among other things. Such embeddings can therefore be learned from any task and applied to other tasks as a form of transfer learning, provided the alphabet remains the same.
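The size of the improvement can be read directly off Table III. The short sketch below computes the gain in accuracy points for each setting; the variable names are illustrative, and the numbers are copied from the table.

```python
settings = ["6 [4]", "6", "8", "10", "12", "14"]
pretrained = [83, 96, 92, 86, 75, 69]
not_pretrained = [71, 95, 82, 83, 66, 59.5]

# Gain in accuracy (percentage points) from pre-training, per setting.
gains = [p - n for p, n in zip(pretrained, not_pretrained)]

for setting, gain in zip(settings, gains):
    print(f"{setting:>6} authors: +{gain} points")
```

The gains are 12, 1, 10, 3, 9 and 9.5 points, so pre-training helps in every setting, with the smallest benefit on the 6-author subset that has the most samples per author (1100).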

VII Conclusion

So far, no work has been done to evaluate the usefulness of character embeddings for classification tasks in the Bangla language. We attempt to fill this gap and compare character embeddings with word embeddings, showing that character embeddings perform almost as well as the best word embedding model. Beyond accuracy, character-level classification has the upper hand in terms of memory, time and number of parameters. Considering the small size of our datasets, we expect improved performance with larger datasets, as is the case for character-level ConvNets [28]. Such networks also work better with non-curated texts, which are hard for word-level embeddings to capture, and are thus more applicable to real-life scenarios. Furthermore, we analyzed the importance of pre-trained character embeddings for author attribution and showed that pre-training can result in better performance. Since very large corpora are not yet available for the Bangla language, we must come up with solutions that tackle attribution tasks sufficiently well even with little data. Our future work therefore includes combining character-level and word-level embeddings for the attribution task, in an attempt to combine the strengths of both types of embeddings. More advanced forms of transfer learning can also be performed by using language models in place of embeddings before classification. Language models and embeddings can also be combined to give greater generalization for the Bangla language.

References

  • [1] M. Ballesteros, C. Dyer, and N. A. Smith (2015) Improved transition-based parsing by modeling characters instead of words with lstms. arXiv preprint arXiv:1508.00657. Cited by: §II-B2.
  • [2] K. Cao and M. Rei (2016) A joint model for word embedding and word morphology. arXiv preprint arXiv:1606.02601. Cited by: §II-B2, §IV-C1.
  • [3] T. Chakraborty (2012) Authorship identification in bengali literature: a comparative analysis. arXiv preprint arXiv:1208.6268. Cited by: §II-A.
  • [4] H. A. Chowdhury, M. A. H. Imon, and M. S. Islam (2018) A comparative analysis of word embedding representations in authorship attribution of bengali literature. Cited by: §I, §II-A, §III, §IV-C2, TABLE II, TABLE III.
  • [5] J. Chung, K. Cho, and Y. Bengio (2016) A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147. Cited by: §II-B2.
  • [6] P. Das, R. Tasmim, and S. Ismail (2015) An experimental study of stylometry in bangla literature. Cited by: §II-A.
  • [7] S. Das and P. Mitra (2011) Author identification in bengali literary works. Cited by: §II-A.
  • [8] H. J. Escalante, T. Solorio, and M. Montes-y-Gómez (2011) Local histograms of character n-grams for authorship attribution. Cited by: §II-A.
  • [9] H. He, K. Gimpel, and J. Lin (2015) Multi-perspective sentence similarity modeling with convolutional neural networks. Cited by: §II-B1.
  • [10] M. T. Hossain, M. M. Rahman, S. Ismail, and M. S. Islam (2017) A stylometric analysis on bengali literature for authorship attribution. Cited by: §II-A.
  • [11] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu (2016) Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410. Cited by: §I, §II-B2, §VI.
  • [12] M. Koppel, J. Schler, and S. Argamon (2011) Authorship attribution in the wild. Language Resources and Evaluation. Cited by: §II-A.
  • [13] D. Liang, W. Xu, and Y. Zhao (2017) Combining word-level and character-level representations for relation classification of informal text. Cited by: §II-B2.
  • [14] W. Ling, T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso (2015) Finding function in form: compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096. Cited by: §II-B2.
  • [15] M. Luong and C. D. Manning (2016) Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788. Cited by: §II-B2.
  • [16] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song (2012) On the feasibility of internet-scale author identification. Cited by: §II-A.
  • [17] J. A. Nasir, N. Görnitz, and U. Brefeld (2014) An off-the-shelf approach to authorship attribution. Cited by: §II-A.
  • [18] U. Pal, A. S. Nipu, and S. Ismail (2017) A machine learning approach for stylometric analysis of bangla literature. Cited by: §II-A.
  • [19] S. Phani, S. Lahiri, and A. Biswas (2015) Authorship attribution in bengali language. Cited by: §II-A.
  • [20] S. Phani, S. Lahiri, and A. Biswas (2016) A machine learning approach for authorship attribution for bengali blogs. Cited by: §II-A.
  • [21] A. Radford, R. Jozefowicz, and I. Sutskever (2017) Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444. Cited by: §I.
  • [22] S. Ruder, P. Ghaffari, and J. G. Breslin (2016) Character-level and multi-channel convolutional neural networks for large-scale authorship attribution. arXiv preprint arXiv:1609.06686. Cited by: §I, §II-A.
  • [23] E. Rudkowsky, M. Haselmayer, M. Wastian, M. Jenny, Š. Emrich, and M. Sedlmair (2018) More than bags of words: sentiment analysis with word embeddings. Communication Methods and Measures. Cited by: §II-B1.
  • [24] I. Santos, N. Nedjah, and L. de Macedo Mourelle (2017) Sentiment analysis using convolutional neural network with fasttext embeddings. Cited by: §II-B1.
  • [25] E. Stamatatos (2009) A survey of modern authorship attribution methods. Journal of the American Society for information Science and Technology. Cited by: §II-A.
  • [26] K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075. Cited by: §II-B1.
  • [27] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu (2016) Charagram: embedding words and sentences via character n-grams. arXiv preprint arXiv:1607.02789. Cited by: §II-B1.
  • [28] X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. Cited by: §I, §II-B2, §IV-A, §VI, §VII.