Experiments with adversarial attacks on text genres

07/05/2021 · by Mikhail Lepekhin, et al.

Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks, including non-topical classification such as genre identification. However, these approaches are often not robust to minor alterations of the test texts. A related problem concerns topical biases in the training corpus: for example, the prevalence of words on a specific topic in a specific genre can trick the genre classifier into recognising any text on this topic as belonging to this genre. To mitigate the reliability problem, this paper investigates techniques for attacking genre classifiers in order to understand the limitations of the transformer models and to improve their performance. While simple text attacks, such as those based on word replacement using keywords extracted by tf-idf, are not capable of deceiving powerful models like XLM-RoBERTa, we show that embedding-based algorithms which replace some of the most “significant” words with similar words, such as TextFooler, can influence model predictions in a significant proportion of cases.


1 Introduction

Non-topical text classification covers a wide range of problems aimed at predicting a text property that is not directly connected to the text topic, for example, its genre, difficulty level, or the age or first language of its author. Unlike topical text classification, non-topical classification needs a model that predicts a label on the basis of the stylistic properties of a text. Automatic genre identification is one of the standard problems of non-topical text classification, as it is useful in many areas such as information retrieval, language teaching or basic linguistic research [Santini et al., 2010].

An early comparison of various datasets, models and linguistic features for genre classification [Sharoff et al., 2010] shows that traditional machine learning models, for example SVM, can be very accurate in genre classification on their native dataset, but suffer from dataset shift. Since then, many new approaches to text classification have emerged. In particular, BERT (Bidirectional Encoder Representations from Transformers) is an efficient pre-trained model based on the Transformer architecture [Devlin et al., 2018]. It achieves state-of-the-art results for various NLP tasks, including text classification. In this study we use XLM-RoBERTa [Conneau et al., 2019], an improved variant of BERT. It has the same architecture, but uses larger and more genre-diverse corpora and an updated pre-training procedure. In addition, XLM-RoBERTa is a multilingual model trained on Common Crawl data, whereas multilingual BERT was trained only on Wikipedia.
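As an illustration, a genre classification head on top of XLM-RoBERTa can be instantiated with the HuggingFace transformers library. This is only a minimal sketch under the assumption that the ten FTD labels of Table 1 are used in this order; the exact configuration of our classifier may differ.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Ten FTD genre labels (Table 1); the label order here is an assumption.
GENRES = ["Argument", "Fiction", "Instruction", "News", "Legal",
          "Personal", "Promotion", "Academic", "Information", "Review"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(GENRES))

# Texts longer than 512 subword tokens are truncated by the tokenizer.
inputs = tokenizer("Laws, contracts, T&C ...", truncation=True,
                   max_length=512, return_tensors="pt")
logits = model(**inputs).logits                     # shape: (1, 10)
predicted_genre = GENRES[logits.argmax(dim=-1).item()]
```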

One of the most significant problems in genre classification concerns topical shifts [Petrenz and Webber, 2010]. If a specific topic is more frequent for a specific genre in the training corpus, then many classification models become biased towards predicting this genre from the keywords of this topic. This is especially problematic in the case of data shift between the training and testing corpora [Petrenz and Webber, 2010]. For this reason, Petrenz and Webber check the reliability of their genre classifiers by testing on datasets from different domains, and we do the same in our study.

Genre label | Prototypes | FTD EN Train | FTD EN Val | FTD RU Train | FTD RU Val | Natural annotation EN Test | Natural annotation EN Sources | LJ RU Test
Argument | Expressing opinions, editorials | 276 | 77 | 207 | 77 | 400 | [Kiesel et al., 2019] | 481
Fiction | Novels, songs, film plots | 69 | 28 | 62 | 23 | 400 | BNC&Brown | 199
Instruction | Tutorials, FAQs, manuals | 141 | 50 | 59 | 17 | 400 | StackExchange | 384
News | Reporting newswires | 114 | 37 | 379 | 103 | 400 | Giga News | 1518
Legal | Laws, contracts, T&C | 56 | 17 | 69 | 13 | 400 | UK and US legal codes | 14
Personal | Diary entries, travel blogs | 72 | 19 | 126 | 49 | 400 | ICWSM | 513
Promotion | Adverts, promotional postings | 218 | 66 | 222 | 85 | 400 | promo sites | 68
Academic | Academic research papers | 59 | 23 | 144 | 49 | 400 | arxiv.org | 20
Information | Encyclopedic articles | 131 | 38 | 72 | 33 | 400 | Wikipedia | 171
Review | Product reviews | 48 | 22 | 107 | 34 | 400 | Amazon reviews | 185
Total | | 1184 | 377 | 1447 | 483 | 4000 | | 3553
Table 1: Training and testing corpora

There have been numerous attempts to attack NLP models by making minor changes to a text which lead to different predictions. An overview of different methods is presented in [Huq and Pervin, 2020]. These techniques help to reveal the flaws of NLP models and to find out which text features the models take into account. TextFooler [Jin et al., 2019] sorts the words of the texts under attack by their impact on the target class probability and tries to replace the most important words with their closest neighbours, with similarity defined as the dot product between the corresponding word embeddings. BertAttack [Li et al., 2020] uses a similar algorithm, but instead of word embeddings it relies on BERT token embeddings. Because of this, BertAttack processes whole words and subword tokens in different ways, trying to find suitable words to replace subword tokens.

Until now, there have been no reports of successful attempts at attacking genre classifiers, or non-topical classification in general, using neural methods, even though it is important to understand their reliability and to find ways of improving their robustness. In this study, we test two methods to attack text genre classifiers. The first method is based on swapping keywords extracted with tf-idf, while the second method applies a modified TextFooler algorithm. Moreover, we try to improve the performance of the original classifiers by adding a set of texts broken by TextFooler to the training corpus.

In this paper we perform the following steps to investigate attacking techniques and to improve the reliability of the genre classifier:

  1. training a baseline classifier using XLM-RoBERTa (Section 2);

  2. attacking the XLM-RoBERTa classifier by swapping topical keywords between the genres (Section 3.1);

  3. attacking the XLM-RoBERTa classifier with TextFooler (Section 3.2);

  4. performing targeted attacks on the XLM-RoBERTa classifier (Section 3.3);

  5. training a new XLM-RoBERTa classifier on the original training corpus combined with the successfully attacked texts (Section 3.4).

All code, data, and materials to fully reproduce the experiments are openly available at https://github.com/MikeLepekhin/TextGenresAttack.

2 Baseline

Genre Keywords
Argument united, nations, reconciliation, international, development, people, security, countries
Fiction said, would, one, could, little, man, came, like, went, upon
Instruction tap, device, screen, email, tab, select, settings, menu, contact, message
News said, million, committee, disarmament, kongo, report, program, also, budget, democratic
Legal shall, article, may, paragraph, court, person, order, department, party, state
Personal church, one, like, people, could, really, congo, time, years, would
Promotion viagra, cialis, online, writing, posted, service, levitra, business, buy, essay
Academic system, quantum, fault, data, software, image, node, faults, application, fig
Information committee, convention, parties, secretariat, iran, meeting, shall, mines, states, conference
Review google, home, new, like, star, paul, one, shoes, pro, art
Table 2: Examples of English keywords extracted with tf-idf
Replaced 10% 50% 100%
EN 14 (1.1%) 31 (2.5%) 196 (15.5%)
RU 22 (1.5%) 44 (3.0%) 148 (10.0%)
Table 3: Successful attacks with keyword replacement

2.1 Training data

For training the genre classifiers, we use existing FTD datasets in English and in Russian [Sharoff, 2018]. Each of them contains more than 1,500 texts from a wide range of sources annotated with 10 genre labels, see Table 1. The dataset is relatively balanced, with the most common categories being Argumentation and Promotion. To validate the success of the attacking models at the last stage (see the next section), we reserve a small portion of this dataset obtained by stratified sampling (the Val columns in Table 1), which is not used in the training and attacking pipelines.
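A minimal sketch of how such a stratified validation split can be produced with scikit-learn; the dummy texts and the split share are illustrative assumptions, and the actual Train/Val sizes are given in Table 1.

```python
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the FTD texts and their genre labels.
texts = ["text %d ..." % i for i in range(8)]
labels = ["News", "News", "News", "News",
          "Promotion", "Promotion", "Promotion", "Promotion"]

# Hold out a stratified validation portion that is never used in the
# training or attacking pipelines (the Val columns of Table 1).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)
```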

It is known that genre classifiers are often not robust when applied to a different corpus with the same labels [Sharoff et al., 2010], therefore we use independently produced test sets to simulate out-of-domain performance on large collections coming from a smaller number of sources. This makes them different from the training datasets, which came from a much wider range of sources.

For the Russian test set we use posts from LiveJournal, a social media platform popular in Russia. Each of these texts has been annotated by two assessors. Those texts for which the annotators did not agree have been adjudicated by an expert annotator. Since LiveJournal is a social media platform, the distribution of its texts significantly differs from the FTD corpora. It contains fewer Legal, Academic and Promotion texts and more News, Personal and Instruction texts (see Table 1).

As we lack an independent test set for English, we use “natural annotation” in the sense of collecting examples of texts for each genre from sources relatively homogeneous with respect to this genre, such as StackExchange, which mainly contains instructive texts, or Wikipedia, which mainly contains reference texts; see the Sources column in Table 1 for more details. Similar to social media data, natural annotation creates its own biases, for example, submissions to arxiv.org tend to be on topics in physics or computer science, so we are able to test predictions in the presence of such biases.

2.2 Training genre classifiers

We fine-tune the baseline XLM-RoBERTa classifiers following the same architecture as [Sun et al., 2019], using the training part of the FTD corpus for 10 epochs with the Adam optimiser and the hyperparameters used for fine-tuning in the original papers for several BERT-like models [Devlin et al., 2018, Liu et al., 2019].
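A minimal sketch of this fine-tuning loop, assuming a `train_dataset` that yields tokenised batches and the classifier instantiated in the Introduction; the learning rate shown is an assumption within the range recommended for BERT-like models, not a value reported in this paper.

```python
import torch
from torch.utils.data import DataLoader

# Assumes `train_dataset` yields dicts with input_ids, attention_mask and labels
# produced by the XLM-RoBERTa tokenizer, and `model` is the classifier sketched above.
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)   # lr is an assumption

model.train()
for epoch in range(10):             # 10 epochs, as described in Section 2.2
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy over the 10 genre labels
        loss.backward()
        optimizer.step()
```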

3 Genre attacks

The genre attack task is to make minimal alterations to a target text with the aim of changing the prediction of an existing classifier. If a test text can be altered so that the label predicted by the classifier changes, and if this can be achieved within a fixed limit of alterations, the text is counted as “broken”. We use both untargeted and targeted attacks:

untargeted

attacks that force the classifier to change its correct prediction on a test text to any incorrect label from our set of labels;

targeted

the opposite direction, in which we attack texts on which the classifier makes a mistake, making alterations that force the classifier to predict the correct label.

The genre attacks are organised as cross-validation, so that no information about the attacked texts leaks to the classifier: we randomly shuffle the training dataset of $N$ texts and make 5 iterations of the cross-validation mechanism. For every fold $i \in \{1,\dots,5\}$, the texts with indices from $\frac{(i-1)N}{5}$ to $\frac{iN}{5}$ are used as test texts to attack a classifier which has been trained on the remaining texts of the training corpus. Thus, we get five architecturally identical classifier models with slightly different weights, as well as a set of successfully attacked texts that we can use in our analysis below.
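The cross-validation attack setup can be sketched as follows; `train_genre_classifier` and `attack_with_textfooler` are hypothetical helpers wrapping the fine-tuning procedure of Section 2.2 and the attack of Section 3.2.

```python
from sklearn.model_selection import KFold

# texts, labels: the FTD training corpus; the two helpers below are hypothetical.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
broken_texts = []
for train_idx, attack_idx in kfold.split(texts):
    clf = train_genre_classifier([texts[i] for i in train_idx],
                                 [labels[i] for i in train_idx])
    # The held-out fold is attacked by a classifier that has never seen it.
    broken_texts += attack_with_textfooler(clf, [texts[i] for i in attack_idx])
```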

Original Attacked
As a Company Limited by Guarantee this charity is owned not by any shareholders but by its members. Only members can vote at Annual General Meetings to elect officers and Directors or become Directors of the charity. So if you would like to help us in this way, contributing at least £5 per year and in return receive regular updates and an invitation to the AGM please complete a membership form Company Membership Form Friends Membership Form There is also the option to make a monthly donation towards our work. As little as £2 a month can make a real difference to Emmaus Projects. As a Company Limited by Guarantee that charity is owned not by any shareholders but by its members. Only members can vote at Annual General Meetings to elect officers and Directors or become Directors of the charity. So if you would like to help us in this way, contributing at least £5 per year and in return receive regular updates and an invitation to the AGM please complete a membership form Company Membership Form Friends Membership Form There is also the option to make a monthly donation towards our work. As little as £2 a month can make a real difference to Emmaus Projects.
label: Promotion label: Argument
Table 4: Untargeted attack example with TextFooler

3.1 Untargeted attack by swapping topical keywords

First, we test a simple text attack generator which is based on replacing keywords extracted for each genre by keywords extracted for other genres. The keywords are defined by their tf-idf scores within the genre texts. Table 2 lists the most significant keywords according to the tf-idf score. Some keywords correspond to their genres quite reasonably, for example, those from Fiction or Legal texts. However, most genres have fairly topical keywords, which indicates the prevalence of specific topics in the training corpus. For example, the keyword lists show that both Argument and News contain a lot of texts about international politics, while many Instruction texts refer to Internet services or communication devices.
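A minimal sketch of how such per-genre keywords can be extracted with scikit-learn; treating each genre as one concatenated document is an assumption, and the scoring in our pipeline may differ in detail (for Russian, a Russian stop word list would be needed).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def genre_keywords(texts_by_genre, top_n=10):
    """Top-n words by tf-idf score per genre, treating each genre as one document."""
    genres = list(texts_by_genre)
    vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
    matrix = vectorizer.fit_transform(" ".join(texts_by_genre[g]) for g in genres)
    vocab = np.array(vectorizer.get_feature_names_out())
    return {g: vocab[np.argsort(matrix[i].toarray().ravel())[::-1][:top_n]].tolist()
            for i, g in enumerate(genres)}

keywords = genre_keywords({"Legal": ["The party shall comply with the order ..."],
                           "Fiction": ["He said he would come upon the little man ..."]})
```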

The attack generator then replaces a certain percentage of the keywords of a genre with keywords of a random other genre. We replace 10%, 50% and 100% of the keywords. Contrary to our expectations concerning the prevalence of topic-specific keywords, our XLM-R classifier is reasonably robust to these attacks on both English and Russian texts, as the rate of successfully broken texts is fairly low even when all tf-idf keywords are replaced, see Table 3.
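One possible reading of this replacement procedure as a sketch; the sampling details, such as how the donor genre is chosen for each keyword, are assumptions.

```python
import random
import re

def swap_keywords(text, own_genre, keywords_by_genre, share=0.5):
    """Replace a share of the genre's tf-idf keywords in `text` with
    keywords drawn from randomly chosen other genres."""
    own_keywords = keywords_by_genre[own_genre]
    other_genres = [g for g in keywords_by_genre if g != own_genre]
    n_swap = max(1, int(share * len(own_keywords)))
    for word in random.sample(own_keywords, k=n_swap):
        substitute = random.choice(keywords_by_genre[random.choice(other_genres)])
        text = re.sub(rf"\b{re.escape(word)}\b", substitute, text, flags=re.IGNORECASE)
    return text
```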

USE Language k=15 k=30 k=50
0.84 EN 416 (32.9%) 438 (34.7%) 453 (35.8%)
0.84 RU 686 (47.4%) 718 (49.6%) 744 (51.4%)
0.6 EN 424 (33.5%) 444 (35.1%) 457 (36.2%)
0.6 RU 687 (47.5%) 720 (49.8%) 744 (51.4%)
0 EN 424 (33.5%) 444 (35.1%) 457 (36.2%)
0 RU 687 (47.5%) 720 (49.8%) 744 (51.4%)
Table 5: Successful untargeted attacks with different USE thresholds
Genre Words
Argument people→residents (14), have→be (13), have→has (12), world→worldwide (8), be→have (8), social→societal (8), do→know (7), children→infants (7), people→individuals (7), nuclear→fissile (7)
Fiction had→has (12), had→have (10), will→wants (10), have→has (6), king→monarch (5), each→every (4), did→does (4), came→coming (4), come→happen (4), have→be (4)
Instruction do→know (18), will→wants (12), be→have (10), have→be (10), should→ought (10), click→clicking (6), choose→choices (5), based→inspired (4), try→trying (4), example→examples (4)
News will→want (13), has→maintains (7), has→have (6), be→have (5), will→wants (5), have→be (5), said→stating (5), year→olds (4), new→ny (4), week→days (4)
Legal be→have (22), shall→hereof (18), shall→howsoever (11), terms→terminology (8), order→ordering (8), person→somebody (8), conditions→situations (5), contract→agreement (5), agreement→agreed (5)
Personal life→lives (6), do→know (5), think→suppose (5), wanted→want (5), felt→knew (4), people→individuals (3), started→begin (3), went→going (3), design→styling (2), so→because (2)
Promotion be→have (6), new→ny (5), business→commerce (5), company→corporation (5), have→be (5), products→byproducts (4), opportunity→opportunities (4), help→aid (4), company→venture (4), model→models (4)
Academic scattering→scatter (8), have→be (5), findings→confirmatory (3), mathematical→dynamical (3), analysis→analyzed (3), show→showcase (3), idea→thought (3), computation→computing (3), be→have (3)
Information system→integrator (4), number→numbering (4), has→have (3), system→mechanism (3), each→every (3), had→has (2), person→someone (2), little→scant (2), astronomy→ephemeris (2), ehc→liga (2)
Review google→yahoo (3), quality→dependability (2), review→reassessment (2), synth→synths (2), movie→movies (1), company→corporation (1), rescue→rescued (1), get→got (1), engadget→wired (1)
Table 6: Most common English word pairs amended with untargeted TextFooler attack
Original Attacked
In addition to the internet connection, you should also try to have at least 100 MB of free space available on your drive when you install Titan Poker. In addition to the internet connection, you need also trying to have at least 100 MB of free space available on your drive when you install Titan Poker.
label: Instruction label: Review
Table 7: Example of deterioration of grammar in untargeted attack

3.2 Attacking with untargeted TextFooler

Genre English Russian
Argument 12.0 12.0
Fiction 19.0 12.0
Instruction 16.5 23.5
News reports 10.0 21.0
Legal 26.0 20.0
Personal 18.0 6.0
Promotion 12.0 25.0
Academic 11.0 18.0
Information 5.5 5.5
Review 3.0 3.5
Table 8: The median number of words per text for successful genre attacks

The original TextFooler algorithm has the following stages. First, we order the words of the attacked text (after excluding the stop words) in the descending order of their importance scores $I_w$, defined in the following way:

$$I_w = p_y(x) - p_y(x \setminus w),$$

where $y$ is the predicted label for the text $x$, $p_y(x)$ is the predicted probability of the genre $y$ for the text $x$, and $x \setminus w$ denotes the text $x$ with the word $w$ removed. The intuition behind the importance score is that removal of a more important word leads to a greater distortion of the predicted probability.

Then, for every word $w$ in the attacked text, the $k$ closest words are chosen by the value of the dot product of their embeddings with the embedding of the original word. These words are the candidates for replacing the original word. We iterate through the words in the order of their importance and try to replace each of them with one of the candidates that passes a set of filters. If such a replacement changes the predicted label, the attack on the text is considered successful. Otherwise, we continue to iterate through the list of candidates. If no candidate changes the prediction for the word $w$, we keep the candidate for which the classifier gives the minimal probability of the original class for the text with this replacement and move on to the next word. If we have iterated over all the words but the classifier still predicts the original label for the text, the attack is unsuccessful.
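A sketch of the importance-score computation; `predict_proba` is a hypothetical wrapper around the fine-tuned classifier that returns the probability of each genre.

```python
def word_importance(predict_proba, words, label):
    """Importance score I_w = p_y(x) - p_y(x without w): how much removing each
    word lowers the probability of the originally predicted genre y.
    `predict_proba` is a hypothetical wrapper returning {genre: probability}."""
    base = predict_proba(" ".join(words))[label]
    scores = []
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((i, w, base - predict_proba(reduced)[label]))
    # Words are attacked in descending order of importance.
    return sorted(scores, key=lambda item: item[2], reverse=True)
```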

The filters for choosing a suitable replacement can vary. First, we can require the candidate to keep the same part-of-speech tag (usually at the top level of tags, for example, NOUN→NOUN). Second, we can vary the lower threshold for the word similarity score of each candidate. In the original TextFooler algorithm, this threshold is fixed at 0.5. In our study the 20–80 percentile range for the embedding similarities between each word and its closest neighbour is 0.61–0.82 for English and 0.67–0.82 for Russian. If we take into account the top-15 most similar embeddings for each word embedding, the 20–80 percentile range for English is 0.49–0.66, for Russian it is 0.52–0.68. This limits the range of values for selecting the similarity threshold.
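A sketch of these two filters, assuming spaCy for coarse POS tags (the pipeline names are assumptions) and normalised word-embedding vectors for the similarity check.

```python
import spacy

# en_core_web_sm is an assumption; for Russian a pipeline such as
# ru_core_news_sm would be needed instead.
nlp = spacy.load("en_core_web_sm")

def keeps_coarse_pos(original, candidate):
    """Filter 1: the candidate must keep the coarse POS tag, e.g. NOUN→NOUN."""
    return nlp(original)[0].pos_ == nlp(candidate)[0].pos_

def similar_enough(orig_vec, cand_vec, threshold=0.5):
    """Filter 2: dot product of the (L2-normalised) embedding vectors must reach
    the threshold (0.5 in the original TextFooler; see the percentile ranges above)."""
    return float(orig_vec @ cand_vec) >= threshold
```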

Finally, to preserve the meaning and the grammatical correctness of the attacked texts, we estimate the similarity between the original sentence and its attacked version with the Universal Sentence Encoder [Cer et al., 2018]. The original TextFooler paper fixed the threshold for the minimal score at 0.84; we vary it in our study.
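A sketch of the USE filter; the TensorFlow Hub module shown is the English encoder, and a multilingual variant would be needed to score Russian sentence pairs.

```python
import numpy as np
import tensorflow_hub as hub

# English USE model; an assumption for illustration.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def use_similarity(original, attacked):
    a, b = use([original, attacked]).numpy()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A replacement is accepted only if use_similarity(...) >= threshold (0.84 by default).
```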

We also run experiments both with and without allowing replacement of stop words, and find that there is no big difference in the number of broken texts in either case. Furthermore, we experiment with various values of $k$ and of the minimal USE score to find out how they affect the number of attacked texts and the robustness of the XLM-RoBERTa model trained on them. Since the original TextFooler implementation in the TextAttack framework [Morris et al., 2020] does not contain embeddings for Russian, we use FastText embeddings for both English and Russian to make the experiments with the two languages comparable.

Table 5 shows that the number of successfully attacked texts is practically independent of the USE threshold when it varies from the default 0.84 down to 0, so this filter is not particularly useful for genre attacks. At the same time, the proportion of broken texts increases when more variants for attack are considered (a larger value of $k$, the number of nearest neighbours to consider).

Besides, TextFooler turns out to be more efficient at breaking the Russian texts, with about a 15% difference in the proportion of broken texts. However, we should note that TextFooler only attacks texts which the model classifies correctly. As the XLM-RoBERTa classifier performs better on the Russian texts, we attempt more attacks on Russian texts overall.

Our experiments with applying TextFooler to genre classification produced convincing replacements which preserved the meaning for both English and Russian. Table 4 shows an example of a text, successfully broken by this mechanism in our task. A replacement of just one word in a reasonably long text is able to change the prediction of the classifier. The words being replaced are typically not crucial for judging the genre. Table 6 lists the most common word replacement pairs with untargeted attacks for each genre for English.

We also found that preserving the grammar is trickier: Table 7 shows an example of an alteration that makes a text ungrammatical. A word-level replacement mechanism is not enough to keep the sentence grammatical, as replacing should with need in this example requires syntactic alterations, and filtering by the Universal Sentence Encoder scores is not enough to guarantee syntactic coherence. Table 6 also shows that many replacements do not keep the grammatical number, for example, model→models or example→examples, which is likely to lead to ungrammatical sentences. Also the replacements are not motivated, as they often go in both directions within the same genre, for example, have→be and be→have for Argument, or they have nothing to do with the genres at all. We can assume that the lack of motivated replacements comes from the instability of parameters in the transformer model, where small changes in the input texts lead to considerable changes in the output predictions.

Table 8 lists the results of the untargeted attack for both the English and Russian FTD corpora in terms of the number of words needed for a successful change of the genre prediction for a text. The most difficult genres to attack in English are Legal, Fiction and Personal blogs. This is likely because their training sets are less affected by topical biases. In contrast, Information and Review texts are easier to attack with fewer substitutions, while they are more affected by topical biases, such as politics, see also the keywords in Table 2 and the most salient replacement pairs in Table 6. However, the difference from the tf-idf replacements is that TextFooler tends to amend more frequent English verbs, while the words chosen by the tf-idf mechanism are more genre-specific. This confirms the observation that replacements in successful TextFooler attacks on genre classification are not motivated by genre if the original training set is biased.

3.3 Targeted attacks with TextFooler

For targeted attacks we use the same mechanism with TextFooler, but we choose the replacement candidate that maximises the probability of the true class.
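A sketch of the targeted candidate selection, reusing the hypothetical `predict_proba` wrapper from Section 3.2.

```python
def pick_targeted_replacement(predict_proba, words, index, candidates, true_label):
    """Targeted variant: among the candidate replacements for words[index],
    choose the one that maximises the probability of the true (gold) genre."""
    best, best_prob = None, -1.0
    for cand in candidates:
        attacked = " ".join(words[:index] + [cand] + words[index + 1:])
        prob = predict_proba(attacked)[true_label]
        if prob > best_prob:
            best, best_prob = cand, prob
    return best, best_prob
```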

Language k=15 k=30 k=50
EN 233 (34.2%) 248 (36.4%) 254 (37.2%)
RU 317 (57.3%) 326 (59.0%) 328 (59.3%)
Table 9: The number of the texts broken by the targeted attack, USE threshold = 0.84

Table 9 lists for how many texts the classifier predictions can be improved by the attack mechanism. Targeted attacks are considerably harder than the untargeted ones.

Genre F1-Base F1-Robust Prec-Base Prec-Robust Rec-Base Rec-Robust
Argument 0.585 0.550 0.514 0.612 0.678 0.499
Fiction 0.685 0.677 0.902 0.697 0.553 0.658
Instruction 0.651 0.738 0.891 0.762 0.813 0.716
News 0.940 0.937 0.917 0.943 0.965 0.931
Legal 0.585 0.615 0.444 0.480 0.857 0.857
Personal 0.742 0.723 0.747 0.657 0.737 0.805
Promotion 0.333 0.408 0.316 0.369 0.353 0.456
Academic 0.273 0.489 0.250 0.440 0.300 0.550
Information 0.586 0.578 0.690 0.596 0.509 0.561
Review 0.571 0.535 0.550 0.559 0.595 0.514
Table 10: Comparison of the Base and the Robust XLM-RoBERTa results for English
Genre F1-Base F1-Robust Prec-Base Prec-Robust Rec-Base Rec-Robust
Argument 0.584 0.728 0.487 0.734 0.723 0.723
Fiction 0.913 0.928 0.977 0.907 0.858 0.950
Instruction 0.535 0.617 0.708 0.635 0.430 0.600
News 0.816 0.848 0.710 0.767 0.960 0.948
Legal 0.860 0.652 0.990 0.995 0.760 0.485
Personal 0.682 0.707 0.719 0.685 0.648 0.730
Promotion 0.819 0.881 0.914 0.912 0.742 0.853
Academic 0.892 0.886 0.823 0.816 0.975 0.968
Information 0.942 0.845 0.923 0.753 0.963 0.963
Review 0.812 0.774 0.892 0.857 0.745 0.705
Table 11: Comparison of the Base and the Robust XLM-RoBERTa results for Russian
Corpus | no attacked texts | k=15 | k=30 | k=50
En, Natural | 0.747 ± 0.026 | 0.796 ± 0.011 | 0.771 ± 0.010 | 0.776 ± 0.029
Ru, LiveJournal | 0.760 ± 0.003 | 0.756 ± 0.008 | 0.755 ± 0.009 | 0.756 ± 0.005
Table 12: Accuracy of the XLM-RoBERTa classifier trained on the attacked texts
Genre F1-Base F1-Robust Prec-Base Prec-Robust Rec-Base Rec-Robust
Argument 0.566 0.732 0.534 0.724 0.603 0.740
Fiction 0.914 0.929 0.951 0.913 0.88 0.945
Instruction 0.448 0.621 0.613 0.636 0.353 0.608
News 0.689 0.856 0.529 0.784 0.988 0.943
Legal 0.798 0.652 0.985 0.995 0.670 0.485
Personal 0.658 0.702 0.580 0.681 0.760 0.725
Promotion 0.502 0.885 0.802 0.915 0.365 0.858
Academic 0.910 0.888 0.883 0.820 0.940 0.968
Information 0.944 0.847 0.917 0.753 0.973 0.968
Review 0.752 0.777 0.933 0.865 0.630 0.705
Table 13: Comparison of the Base and the Robust XLM-RoBERTa results for the English natural annotation corpus

3.4 Adding attacked texts to train new genre classifiers

In the next step we add the broken texts with their correct labels to the training data, train a new model, and test it on the validation portion of the original training corpus as well as on the test corpora. Table 12 lists the performance of this robust classifier on the test corpora. It shows that the XLM-RoBERTa classifier trained with the attacked texts attains higher accuracy than the baseline classifier. Table 13 shows that for most genres the robust classifier achieves a higher F1 score; the same is true for precision and recall.

Training XLM-RoBERTa on the concatenation of the original and broken texts does not improve the classifier performance on the LiveJournal corpus, but it significantly increases the accuracy on the English genre corpus with natural annotation. Besides, the best result is attained with $k=15$. This suggests that the quality of the attacks is more important than the number of successfully attacked texts for boosting classifier performance. Table 10 shows that the robust classifier performs better for most genres. In Table 11 the improvement in terms of the F1 score is limited, since for many genres improving recall implies a deterioration of precision.
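A sketch of the augmentation step, assuming `broken_texts`/`broken_labels` collected during the attacks and the hypothetical `train_genre_classifier` helper from above.

```python
# broken_texts / broken_labels: successfully attacked texts from Section 3.2,
# paired with their original gold genre labels (not the labels the attack produced).
augmented_texts = train_texts + broken_texts
augmented_labels = train_labels + broken_labels

# Retrain with the same fine-tuning setup as the baseline (hypothetical helper).
robust_clf = train_genre_classifier(augmented_texts, augmented_labels)
```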

4 Related Work

Genre classification is not a new task, since non-topical classification is needed for many applications. There have been experiments with various architectures, from linear discriminant analysis [Karlgren and Cutting, 1994] to SVM [Dewdney et al., 2001] to recurrent neural networks [Kunilovskaya and Sharoff, 2019]. Early work on robust genre classification across different training and testing corpora [Sharoff et al., 2010] revealed the problem of topical biases in the genre corpora available at the time. In this paper we address the problem indirectly by improving the robustness of the classifiers.

[Petrenz and Webber, 2010] investigate a very important idea concerning the estimation of the reliability of genre classifiers via validation on a corpus with a different topical distribution but the same genre labels. Our study continues this line of research by using the natural annotation and LiveJournal datasets to estimate the model accuracy on out-of-domain testing corpora.

Our experiments with using adversarial attacks for genre classification are novel. The most efficient adversarial attack techniques for classifiers [Jin et al., 2019, Li et al., 2020] are based on word-level embeddings: for each word, a fixed number of the most similar words are found as replacement candidates. Our genre attacks are based on TextFooler [Jin et al., 2019] with the modification that we allow replacement of stop words and vary the USE threshold. TextFooler was chosen as the basis for genre attacks in this study due to its efficiency and flexibility, as it can be applied to various neural models. We also experimented with BertAttack, which differs from the TextFooler algorithm in its use of BERT token embeddings instead of pre-trained word-level embeddings. In our initial experiments we found it to be much slower than TextFooler and also somewhat less efficient for the genre attack task; for example, the percentage of texts successfully broken by BertAttack is lower than 15% for English. Therefore, we only report the results with TextFooler here. A recent experiment on adversarial attacks on personal style, another non-topical classification task [Emmery et al., 2021], is the closest to our study. They attacked author profile predictions using similar methods, but they did not investigate attacks on genre predictions.

5 Conclusion

In this paper we show that the XLM-RoBERTa genre classifier is resistant to simple attack methods, such as replacement of tf-idf keywords, unlike traditional feature-based methods, which are very sensitive to keywords. At the same time, even the XLM-RoBERTa classifier can be deceived by word-based adversarial attacks using mechanisms like TextFooler. In the case of the baseline classifier, more than 35% of the English texts in the training corpus can be successfully broken, rising to more than 50% for Russian. The number of successfully attacked texts can be considered an important metric for estimating the robustness of classifiers: the lower the number of broken texts, the more difficult it is to break the classifier, which implies higher robustness. We also find some important patterns in the attack results: in particular, the USE threshold has almost no effect on the number of attacked texts; the attacks are more efficient for Russian; and the higher the number of replacement candidates, the smaller the difference in reliability between the robust classifier and the original one.

Our experiments demonstrate the effectiveness of TextFooler for improving the robustness of genre classifiers via adversarial attacks. For example, adding broken texts (with their original labels) to the training data improves the overall accuracy, while the texts in the new collection cannot be broken by the same set of adversarial attacks, implying a more robust classifier. We also tried targeted attacks, but fewer texts can be broken and the classifiers trained on the targeted attacked texts performed worse than those trained on the untargeted ones.

References