
Call Larisa Ivanovna: Code-Switching Fools Multilingual NLU Models

Practical needs of developing task-oriented dialogue assistants require the ability to understand many languages. Novel benchmarks for multilingual natural language understanding (NLU) contain monolingual sentences in several languages, annotated with intents and slots. In this setup, models for cross-lingual transfer show remarkable performance in joint intent recognition and slot filling. However, existing benchmarks lack code-switched utterances, which are difficult to gather and label due to the complexity of their grammatical structure. The evaluation of NLU models is therefore biased and limited, since code-switching is left out of scope. Our work adopts recognized methods to generate plausible and naturally-sounding code-switched utterances and uses them to create a synthetic code-switched test set. Based on experiments, we report that state-of-the-art NLU models are unable to handle code-switching: at worst, performance measured by semantic accuracy drops from 80% to as low as 15% across languages. Further, we show that pre-training on synthetic code-mixed data helps to maintain performance on the proposed test set at a level comparable with monolingual data. Finally, we analyze different language pairs and show that the closer the languages are, the better the NLU model handles their alternation. This is in line with the common understanding of how multilingual models conduct transfer between languages.


1 Introduction

The usability of task-oriented dialog (ToD) assistants depends crucially on their ability to process users’ utterances in many languages. At the core of a task-oriented dialog assistant is a natural language understanding (NLU) component, which parses an input utterance into a semantic frame by means of intent recognition and slot filling [39]. Intent recognition identifies the user’s need (such as buy a flight ticket). Slot filling extracts the intent’s arguments (such as departure city and time).

Common approaches to training multilingual task-oriented dialogue systems rely on (i) the ability of pre-trained language models to transfer learning across languages [5, 24] and (ii) translate-and-align pipelines [17].

The former group of methods builds upon pre-training on large amounts of raw textual data in many languages. Different learning objectives are used to train representations aligned across languages. The alignment can be further improved by means of task-specific label projection [26] and representation alignment [13] methods.

The latter group of methods utilizes off-the-shelf machine translation engines to translate (i) the training data from resource-rich languages (almost exclusively English) into target languages or (ii) the evaluation data from target languages into English [17]. Word alignment techniques then help to match slot-level annotations [9, 29]. Finally, a monolingual model is trained to make the desired predictions.

A number of datasets for cross-lingual NLU have been developed. To name a few, MultiAtis++ [44], covering seven languages across three language families, contains dialogues related to a single domain, air travel. MTOP [22] comprises six languages and 11 domains. xSID [40] is an evaluation-only small-scale dataset, collected for 12 languages and six domains.

Recent research has adopted a new experimental direction aimed at developing cross-lingual augmentation techniques, which learn inter-lingual semantics across languages [9, 14, 20, 25, 30]. These works seek to simulate code-switching, a phenomenon in which speakers alternate between multiple languages within a conversation or a single utterance [34]. Experimental results consistently show that augmentation with synthetic code-switched data leads to significantly improved performance on cross-lingual NLU tasks. Moreover, leveraging such data in practice meets the needs of multicultural and multilingual communities. To the best of our knowledge, large-scale code-switched ToD corpora do not exist.

This paper extends the ongoing research on the benefits of synthetic code-switching for cross-lingual NLU. Our approach to generating code-switched utterances relies on grey-box adversarial attacks on the NLU model. We perturb the source utterances by replacing words or phrases with their translations into another language. Next, perturbed utterances are fed to the NLU model. Increases in the loss function indicate difficulties in making predictions for the utterance. This way, we can (i) generate code-switched adversarial utterances, (ii) discover insights on how code-switching with different languages impacts performance on the target language, and (iii) gather augmentation data for further fine-tuning of the language model. To sum up, our contributions are:

Our code and datasets will be released in open access upon acceptance.

  1. We implement several simple heuristics to generate code-switched utterances based on monolingual data from an NLU benchmark;

  2. We showcase that monolingual models fail to process code-switched utterances. At the same time, cross-lingual models cope much better with such texts;

  3. We show that fine-tuning the language model on code-switched utterances improves overall semantic parsing performance by up to a factor of two.

2 Related work

2.0.1 Generation of code-switched text

has been explored as a standalone task [16, 21, 32, 33, 38, 42] and as a way to augment training data for cross-lingual applications, including task-oriented dialog systems [9, 14, 20, 25, 30], machine translation [1, 12, 15], natural language inference and question answering [36], and speech recognition [46].

Methods for generating code-switched text range from simplistic re-writing of some words in the target script [12] to adversarial attacks on cross-lingual pre-trained language models [38] and complex hierarchical VAE-based models [33]. The vast majority of methods utilize machine translation engines [36], parallel datasets [1, 12, 15, 42], or bilingual lexicons [38] to replace segments of the input text with their translations. Bilingual lexicons may be induced from a parallel corpus with the help of the soft alignment produced by attention mechanisms [21, 25]. Pointer networks can be used to select segments for further replacement [16, 42]. If natural code-switched data is available, such segments can be identified with a sequence labeling model [15]. Other methods rely on linguistic theories of code-switching. To this end, the GCM toolkit [32] leverages two linguistic theories, which help to identify segments where code-switching may occur by aligning words and mapping parse trees of parallel sentences.

The quality of generated code-switched texts is evaluated by (i) intrinsic text properties, such as code-switching ratio and length distribution, and (ii) extrinsic measures, ranging from the perplexity of an external language model to performance on the downstream task for which the code-switched data was used as augmentation [33].

2.0.2 Natural Language Understanding

in the ToD domain has two main goals, namely intent recognition and slot filling [31]. Intent recognition assigns one of the pre-defined intent labels to a user utterance and is usually tackled with classification methods. Slot filling seeks to find the arguments of the assigned intent and is modeled as a sequence labeling problem. For example, the utterance I need a flight from Moscow to Tel Aviv on 2nd of December should be assigned the intent label find flight; three slots may be filled: departure city, arrival city, and date. These two interdependent NLU tasks are frequently approached via multitask learning, in which a joint model is trained to recognize intents and fill in slots simultaneously [3, 19, 43].
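To make the target representation concrete, a semantic frame for this utterance could look as follows (a sketch; the slot names are illustrative rather than taken from a particular dataset, and the tags follow the BIO scheme also used in Section 3.1):

# A hypothetical semantic frame for the example utterance.
# BIO scheme: B- opens a slot, I- continues it, O marks tokens outside any slot.
utterance = "i need a flight from moscow to tel aviv on 2nd of december".split()
frame = {
    "intent": "find_flight",
    "slots": [
        "O", "O", "O", "O", "O",             # i need a flight from
        "B-departure_city",                  # moscow
        "O",                                 # to
        "B-arrival_city", "I-arrival_city",  # tel aviv
        "O",                                 # on
        "B-date", "I-date", "I-date",        # 2nd of december
    ],
}
assert len(frame["slots"]) == len(utterance)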

2.0.3 Adversarial attacks

on natural language models have been categorized with respect to (i) what kind of information is exposed by the model and (ii) what kind of perturbation is applied to the input text [28]. White-box attacks [8] have access to the model’s entire inner workings. On the opposite side, black-box attacks [11] have no knowledge about the model. Grey-box attacks [45] access predicted probabilities and loss function values. Perturbations can be applied at the character, token, and sentence levels [11, 23, 4].

2.0.4 Other related research directions

include code-switching detection [27, 37], evaluation of pre-trained language models’ robustness to code-switching [41], analysis of language models’ inner workings with respect to code-switched inputs [35], and benchmarking downstream tasks on code-switched data [2, 18].

3 Our approach

In our work we train multilingual language models for the joint intent recognition and slot-filling task.

3.1 Dataset

We chose the MultiAtis++ dataset [44] as the main source of data. This dataset contains seven languages from three language families: Indo-European (English, German, Portuguese, Spanish, and French), Japonic (Japanese), and Sino-Tibetan (Chinese). The dataset is a parallel corpus for intent classification and slot labeling; in 2020, it was translated from English into the other six languages. The training set contains 4978 samples for each language; the test set contains 893 samples per language. Each object in the dataset consists of a sentence, slot labels in BIO format, and an intent label.

3.2 Joint intent recognition and slot-filling

We train a single model for the joint intent recognition and slot-filling task. The model has two heads: the first one predicts intents, and the second one predicts slots. We trained two different backbones, m-BERT and XLM-RoBERTa.
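A minimal sketch of such a two-headed model on top of XLM-RoBERTa using PyTorch and the transformers library (the head shapes and the first-token pooling are our assumptions, not necessarily the exact configuration used here):

import torch.nn as nn
from transformers import XLMRobertaModel

class JointNLUModel(nn.Module):
    """XLM-RoBERTa body with two heads: intent classification and slot filling."""

    def __init__(self, num_intents, num_slots, model_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)  # utterance-level head
        self.slot_head = nn.Linear(hidden, num_slots)      # token-level head

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Pool the first token (<s>) for the intent; label every token for slots.
        intent_logits = self.intent_head(out.last_hidden_state[:, 0])
        slot_logits = self.slot_head(out.last_hidden_state)
        return intent_logits, slot_logits

The joint training loss is then typically the sum of two cross-entropies, one over intent labels and one over slot tags.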

We aim to compare two setups: (i) training on the whole dataset and (ii) training only on its English subset followed by zero-shot inference for the other languages. For convenience, we use short names for the four models trained in this work: xlm-r, xlm-r-en, m-bert, m-bert-en.

We measure our models’ quality with three metrics: intent accuracy, slot F1 score (micro-averaged over classes), and semantic accuracy, i.e., the proportion of utterances for which both the intent and all slots are predicted correctly.
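Semantic accuracy, the strictest of the three, can be computed as in the following sketch (assuming aligned lists of predictions and references):

def semantic_accuracy(pred_intents, gold_intents, pred_slots, gold_slots):
    """Fraction of utterances whose intent AND entire slot sequence are correct."""
    correct = sum(
        p_i == g_i and p_s == g_s
        for p_i, g_i, p_s, g_s in zip(pred_intents, gold_intents, pred_slots, gold_slots)
    )
    return correct / len(gold_intents)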

3.3 Code-switching generation

We propose two variants of grey-box adversarial attacks. During the attack, we have access to the model’s loss on the input data. We strive to design the attacks so that the resulting adversarial perturbation of a source sentence is as close as possible to realistic code-switching. Performance under such adversarial attacks can then serve as a lower bound on the corresponding models’ performance in the presence of real code-switching in the input data.

Input: sentence and labels x, y; source model M; embedded target language l
Output: adversarial sample x'
L_best = GetLoss(M, x, y)
for i in permutation(len(x)) do
     Candidates = GetCandidates(M, x, y, token_id = i)
     Losses = GetLoss(M, Candidates)
     if Candidates ≠ ∅ and max(Losses) > L_best then
          L_best = max(Losses)
          x, y = Candidates[argmax(Losses)]
     end if
end for
return x
Algorithm 1 General view of the attack

We focus mainly on the lexical aspect of code-switching, where some words are replaced with their substitutes from other languages. During the attack, we replace some tokens of the source sentence with their equivalents from the attacking language. The method for determining the replacement depends on which attack is used. Since most people who code-switch are bilingual, we propose to analyze attacks that consist in embedding one language into another.

3.3.1 Overview of the attacks

The general attack scheme (Algorithm 1) is the same for both proposed attacks. The attack takes as input a source model, a pair of a sample sentence and its labels, and an embedded target language. We then iterate over the tokens of the sample sentence and strive to replace them with their equivalents from the embedded target language. If changing a token to its equivalent increases the source model’s loss, we replace the token with the proposed candidate. The two methods differ only in how they generate replacement candidates; a Python sketch of the shared loop is given below.
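A Python rendering of Algorithm 1’s greedy loop (get_candidates and get_loss stand for the attack-specific routines described in the following subsections; for simplicity, the sketch ignores index shifts when one token is replaced by several):

import random

def attack(model, x, y, get_candidates, get_loss):
    """Greedily replace tokens whenever some candidate increases the model loss."""
    best_loss = get_loss(model, x, y)
    for i in random.sample(range(len(x)), len(x)):  # visit tokens in random order
        candidates = get_candidates(model, x, y, token_id=i)
        if not candidates:
            continue
        losses = [get_loss(model, cx, cy) for cx, cy in candidates]
        if max(losses) > best_loss:
            best_loss = max(losses)
            x, y = candidates[losses.index(max(losses))]
    return x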

3.3.2 Word-level adversaries

The first attack (Algorithm 2) generates embedded-target-language substitutions by translating single tokens into the corresponding language. Attacked this way, the model’s performance gives a rough lower bound, since the attack considers neither the sentence context nor the ambiguity of words. To translate words into other languages, we use M2M-100, the large-scale many-to-many machine translation model from Facebook [10]. Table 1 shows an example of this attack.

Input: machine translation model MT
function GetCandidates(MT, x, y, token_id)
     if x[token_id] is translatable then
          tokens = MT(x[token_id])
          x[token_id] = tokens
          y[token_id] = ExtendSlotLabels(y[token_id], len(tokens))
     end if
     return x, y
end function
Algorithm 2 Word-level attack
Utterance en what are the flights from las vegas to ontario
Utterance adv what sind die flights from las vegas to ontario
Table 1: Example of attacking XLM-RoBERTa (xlm-r) with word-level attack.
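A sketch of word-level candidate generation with M2M-100 through the transformers library (the 418M checkpoint and the single-word translation call are our illustrative choices):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
mt_model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

def translate_word(word, src_lang="en", tgt_lang="de"):
    """Translate a single word; the result may consist of several tokens."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(word, return_tensors="pt")
    generated = mt_model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0].split()

print(translate_word("flights"))  # e.g. ['Flüge']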

3.3.3 Phrase-level adversaries

The second attack (Algorithm 3) generates equivalents from other languages by building alignments between sentences in different languages. Since we have a parallel dataset, each sentence comes with its translation. Candidates for each token are defined as the tokens of the embedded-language sentence to which that token is aligned. For aligning sentences, we use the awesome-align model based on m-BERT [7]. Table 2 shows an example of this attack.

Input: word alignment A between x and its translation
function GetCandidates(A, x, y, token_id)
     if x[token_id] in A then
          tokens = A[x[token_id]]
          x[token_id] = tokens
          y[token_id] = ExtendSlotLabels(y[token_id], len(tokens))
     end if
     return x, y
end function
Algorithm 3 Phrase-level attack
Algorithm 3 Phrase-level attack
Utterance en please find flights available from kansas city to newark
Utterance adv encontre find flights disponíveis from kansas city para newark
Table 2: Example of attacking XLM-RoBERTa (xlm-r) with phrase-level attack.
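The paper relies on awesome-align [7]; as a simplified stand-in, word alignments can be extracted directly from m-BERT by taking mutual nearest neighbors under cosine similarity of word embeddings (a sketch of the underlying idea, not awesome-align’s exact procedure):

import torch
from transformers import BertModel, BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
bert = BertModel.from_pretrained("bert-base-multilingual-cased")

@torch.no_grad()
def embed_words(words):
    """One vector per word: average the hidden states of its sub-tokens."""
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state[0]
    word_ids = enc.word_ids(0)
    return torch.stack([
        hidden[[i for i, w in enumerate(word_ids) if w == j]].mean(dim=0)
        for j in range(len(words))
    ])

def align(src_words, tgt_words):
    """Keep (i, j) pairs that are mutual nearest neighbors across languages."""
    s = torch.nn.functional.normalize(embed_words(src_words), dim=1)
    t = torch.nn.functional.normalize(embed_words(tgt_words), dim=1)
    sim = s @ t.T
    fwd, bwd = sim.argmax(dim=1), sim.argmax(dim=0)
    return [(i, int(j)) for i, j in enumerate(fwd) if int(bwd[j]) == i]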

3.4 Adversarial pre-training as protection from adversarial attacks

Adversarial pre-training protects the model against the proposed adversarial attacks. Most likely, it improves performance not only under adversarial perturbations but also on real code-switched data; however, this remains a hypothesis, since no real-life code-switched ToD data exists.

The adversarial pre-training method relies on domain adaptation techniques and has several steps:

  1. Generating an adversarial training set for the masked language modeling task.

  2. Fine-tuning the language model’s body on the newly generated set with the masked language modeling objective.

  3. Loading the fine-tuned model’s body before training for the joint intent classification and slot labeling task.

3.4.1 Generating adversarial training set

To generate an adversarial training set, we use an adaptation of the phrase-level attack algorithm (Algorithm 4). The difference is that tokens are replaced with their equivalents with a probability of 0.5, so a trained model is not required to generate the samples. The adversarial training set is a concatenation of the sets generated for all languages in the dataset except English. Each subset is generated by embedding the target language into the English training set of the MultiAtis++ dataset. After generation, we obtain six subsets of 4884 sentences each. The final adversarial training set consists of 29304 sentences; we divide it into training and test sets in a 9:1 ratio.

Input: training dataset X, set of embedded languages L
Output: adversarial training set X'
X' = [ ]
for l in L do
     for x in X do
          for i in permutation(len(x)) do
               Candidates = GetCandidates(l, x, y, token_id = i)
               if Candidates ≠ ∅ and U(0, 1) > 0.5 then
                    x, _ = random.choice(Candidates)
               end if
          end for
          X'.append(x)
     end for
end for
return X'
Algorithm 4 Generating adversarial training set
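A Python sketch of Algorithm 4, reusing the alignment-based candidate generation from Section 3.3.3 (the get_candidates routine is assumed from there; index shifts from multi-token replacements are again ignored):

import random

def build_adversarial_set(english_train, languages, get_candidates, p=0.5):
    """Stochastically embed every non-English language into the English
    training sentences and concatenate the resulting subsets."""
    adversarial = []
    for lang in languages:              # all dataset languages except English
        for x, y in english_train:
            x, y = list(x), list(y)
            for i in random.sample(range(len(x)), len(x)):
                candidates = get_candidates(lang, x, y, token_id=i)
                if candidates and random.random() < p:  # replace with probability p
                    x, _ = random.choice(candidates)    # labels unused for MLM
            adversarial.append(x)
    return adversarial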

3.4.2 Fine-tuning model’s body

After generating the adversarial training set, we fine-tune the pre-trained multilingual model with the masked language modeling objective [6]. We select 15% of the tokens for the model to predict: 80% of the selected tokens are replaced with the mask token, 10% are replaced with random words from the model’s vocabulary, and the remaining 10% are left unchanged. After fine-tuning, we dump the body of the model for future use.
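A sketch of this step with the standard transformers masking collator, which implements exactly the 15% selection and the 80/10/10 mask/random/keep split (the training hyperparameters and the tokenized dataset variable are illustrative):

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
mlm_model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Selects 15% of tokens; 80% -> mask token, 10% -> random token, 10% -> unchanged.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="adv-mlm"),
    data_collator=collator,
    train_dataset=tokenized_adversarial_set,  # the generated code-switched sentences
)
trainer.train()
mlm_model.base_model.save_pretrained("adv-body")  # dump the body for later reuse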

3.4.3 Loading fine-tuned model’s body

Before training the multilingual model for the joint intent classification and slot labeling task, we load the fine-tuned body of the model. For models pre-trained with the adversarial pre-training method, we add the suffix adv to the name.
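Continuing the JointNLUModel sketch from Section 3.2, loading the fine-tuned body might look as follows (the label counts are illustrative):

from transformers import XLMRobertaModel

nlu_model = JointNLUModel(num_intents=18, num_slots=84)  # illustrative label counts
nlu_model.encoder = XLMRobertaModel.from_pretrained("adv-body")
# Both heads stay randomly initialized and are trained on the NLU task as before.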

4 Experimental results

We compare models trained only on the English training set (zero-shot models) with models trained on the whole training set (full models). We evaluate quality with three metrics: accuracy for intents, F1 score for slots, and semantic accuracy, the proportion of sentences in which everything, both the intent and all slots, is classified correctly. We found that zero-shot models perform significantly worse than full ones, not only in languages other than English but even in English.

4.0.1 Joint intent classification and slot labelling

We achieved strong performance on the joint task of classifying intents and filling in slots. On the test set, full models showed on average 97% intent accuracy, and zero-shot ones on average 85%. Full models reached a slot F1 score of 0.93, zero-shot models 0.68. Full models classified 79% of sentences completely correctly, zero-shot ones only about 26%. This shows that zero-shot learning cannot compete with full training on this particular task. Figure 1 compares the models and languages by the intent accuracy metric. Full results are depicted in the figures of Appendix B and detailed in the tables of Appendix C.

Figure 1: Model comparison on test set of MultiAtis++ dataset by Intent accuracy metric.

4.0.2 Attacking models

We attacked all the models with our two algorithms. The word-level attack turned out to be the stronger one, leading to lower performance. For full models, intent accuracy fell from 98% to 88%; for zero-shot models, from 92% to 77%. Slot F1 fell from 0.95 to 0.6 for full models and from 0.88 to 0.48 for zero-shot models. The proportion of entirely correctly classified sentences fell from 83% to 14% for full models and from 60% to 5% for zero-shot ones. Figure 2 compares results before and after the word-level attack by the intent accuracy metric.

The phrase-level attack turned out to be softer, leaving performance higher than under the word-level attack. For full models, intent accuracy fell from 98% to 95%; for zero-shot models, from 92% to 80%. Slot F1 fell from 0.95 to 0.7 for full models and from 0.88 to 0.55 for zero-shot models. The proportion of entirely correctly classified sentences fell from 83% to 35% for full models and from 60% to 10% for zero-shot ones. Figure 3 shows the comparison after the phrase-level attack by the intent accuracy metric.

Figure 2: Model comparison after word-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.
Figure 3: Model comparison after phrase-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.

4.0.3 Adversarial pre-training

To protect the models from attacks, we fine-tuned the bodies of both models and loaded them before training for the joint intent classification and slot filling task. We found that the defense had almost no effect on the full models’ quality on the clean test set. For zero-shot models, the effect on the clean test set is mixed: intent accuracy fell for the Asian languages (Japanese and Chinese) but increased slightly for all others. As for the slots, we observe a negative effect for the m-BERT model and a positive effect for the XLM-RoBERTa model.

Under the word-level attack, we observe a slight deterioration in intent accuracy for the Asian languages and a positive effect for the other languages. After adversarial pre-training, slot quality increased for all models, which ultimately results in an almost two-fold increase in the proportion of entirely correctly classified sentences for zero-shot models and about a 15% relative improvement for full models.

Figure 4: Model comparison with protection after phrase-level attack on test set of MultiAtis++ dataset by Semantic accuracy metric.

Again, under the phrase-level attack, we observe a slight deterioration in intent accuracy for the Asian languages and a positive effect for the other languages. After the defense, slot quality dropped slightly for the Asian languages and increased significantly for the rest. This results in a two-fold increase in the proportion of entirely correctly classified sentences for zero-shot models and about a 15% relative improvement for full models (Figure 4).

5 Discussion

We approached the problem of recognizing intents and filling in slots for a multilingual ToD system. We studied the effect of code-switching on two multilingual language models, XLM-RoBERTa and m-BERT. Using two grey-box attacks, we showed that code-switching can become a noticeable problem when applying language models in practice. However, the proposed defense method shows promising results and helps to restore quality after the model is attacked.

6 Conclusion

This paper presents an adversarial attack on multilingual task-oriented dialog (ToD) systems that simulates code-switching. This work is motivated by both research and practical needs. First, the proposed attack reveals that pre-trained language models are vulnerable to synthetic code-switching. To this end, we develop a simplistic defense technique against code-switched adversaries. Second, our work is motivated by the practical need for multilingual ToDs to cope with code-switching, which is an essential phenomenon in multicultural societies. Future work directions include evaluating how plausible and naturally-sounding the code-switched adversaries are and adapting similar approaches to model-independent black-box scenarios.

References

  • [1] M. Abdul-Mageed and L. V. Lakshmanan (2021) Exploring text-to-text transformers for english to hinglish machine translation with synthetic code-mixing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 36–46.
  • [2] G. Aguilar, S. Kar, and T. Solorio (2020) LinCE: a centralized benchmark for linguistic code-switching evaluation. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 1803–1813.
  • [3] Q. Chen, Z. Zhuo, and W. Wang (2019) BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909.
  • [4] M. Cheng, J. Yi, P. Chen, H. Zhang, and C. Hsieh (2020) Seq2Sick: evaluating the robustness of sequence-to-sequence models with adversarial examples. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 3601–3608.
  • [5] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  • [7] Z. Dou and G. Neubig (2021) Word alignment by fine-tuning embeddings on parallel corpora. In Conference of the European Chapter of the Association for Computational Linguistics (EACL).
  • [8] J. Ebrahimi, A. Rao, D. Lowd, and D. Dou (2018) HotFlip: white-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 31–36.
  • [9] A. Einolghozati, A. Arora, L. S. Lecanda, A. Kumar, and S. Gupta (2021) El volumen louder por favor: code-switching in task-oriented semantic parsing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 1009–1021.
  • [10] A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Celebi, G. Wenzek, V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, and A. Joulin (2020) Beyond english-centric multilingual machine translation. arXiv preprint.
  • [11] J. Gao, J. Lanchantin, M. L. Soffa, and Y. Qi (2018) Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 50–56.
  • [12] D. Gautam, P. Kodali, K. Gupta, A. Goel, M. Shrivastava, and P. Kumaraguru (2021) CoMeT: towards code-mixed translation using parallel monolingual sentences. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pp. 47–55.
  • [13] M. Gritta and I. Iacobacci (2021) XeroAlign: zero-shot cross-lingual transformer alignment. arXiv preprint arXiv:2105.02472.
  • [14] Y. Guo, L. Shou, J. Pei, M. Gong, M. Xu, Z. Wu, and D. Jiang (2021) Learning from multiple noisy augmented data sets for better cross-lingual spoken language understanding. arXiv preprint arXiv:2109.01583.
  • [15] A. Gupta, A. Vavre, and S. Sarawagi (2021) Training data augmentation for code-mixed translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5760–5766.
  • [16] D. Gupta, A. Ekbal, and P. Bhattacharyya (2020) A semi-supervised approach to generate the code-mixed text using pre-trained encoder and transfer learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 2267–2280.
  • [17] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pp. 4411–4421.
  • [18] S. Khanuja, S. Dandapat, A. Srinivasan, S. Sitaram, and M. Choudhury (2020) GLUECoS: an evaluation benchmark for code-switched NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3575–3585.
  • [19] Y. Kim, D. Kim, A. Kumar, and R. Sarikaya (2018) Efficient large-scale neural domain classification with personalized attention. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, pp. 2214–2224.
  • [20] J. Krishnan, A. Anastasopoulos, H. Purohit, and H. Rangwala (2021) Multilingual code-switching for zero-shot cross-lingual intent prediction and slot filling. arXiv preprint arXiv:2103.07792.
  • [21] G. Lee and H. Li (2020) Modeling code-switch languages using bilingual parallel corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 860–870.
  • [22] H. Li, A. Arora, S. Chen, A. Gupta, S. Gupta, and Y. Mehdad (2021) MTOP: a comprehensive multilingual task-oriented semantic parsing benchmark. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2950–2962.
  • [23] L. Li, R. Ma, Q. Guo, X. Xue, and X. Qiu (2020) BERT-ATTACK: adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6193–6202.
  • [24] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742.
  • [25] Z. Liu, G. I. Winata, Z. Lin, P. Xu, and P. N. Fung (2020) Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • [26] Z. Liu, G. I. Winata, P. Xu, Z. Lin, and P. Fung (2020) Cross-lingual spoken language understanding with regularized representation alignment. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7241–7251.
  • [27] D. Mave, S. Maharjan, and T. Solorio (2018) Language identification and analysis of code-switched social media text. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 51–61.
  • [28] J. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, and Y. Qi (2020) TextAttack: a framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 119–126.
  • [29] M. Nicosia, Z. Qu, and Y. Altun (2021) Translate & fill: improving zero-shot multilingual semantic parsing with synthetic data. arXiv preprint arXiv:2109.04319.
  • [30] L. Qin, M. Ni, Y. Zhang, and W. Che (2020) CoSDA-ML: multi-lingual code-switching data augmentation for zero-shot cross-lingual NLP. arXiv preprint arXiv:2006.06402.
  • [31] E. Razumovskaia, G. Glavaš, O. Majewska, E. M. Ponti, A. Korhonen, and I. Vulić (2021) Crossing the conversational chasm: a primer on natural language processing for multilingual task-oriented dialogue systems. arXiv preprint arXiv:2104.08570.
  • [32] M. S. Z. Rizvi, A. Srinivasan, T. Ganu, M. Choudhury, and S. Sitaram (2021) GCM: a toolkit for generating synthetic code-mixed text. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 205–211.
  • [33] B. Samanta, S. Reddy, H. Jagirdar, N. Ganguly, and S. Chakrabarti (2019) A deep generative model for code-switched text. arXiv preprint arXiv:1906.08972.
  • [34] D. Sankoff and S. Poplack (1981) A formal grammar for code-switching. Research on Language & Social Interaction 14 (1), pp. 3–45.
  • [35] S. Santy, A. Srinivasan, and M. Choudhury (2021) BERTologiCoMix: how does code-mixing interact with multilingual BERT? In Proceedings of the Second Workshop on Domain Adaptation for NLP, pp. 111–121.
  • [36] J. Singh, B. McCann, N. S. Keskar, C. Xiong, and R. Socher (2019) XLDA: cross-lingual data augmentation for natural language inference and question answering. arXiv preprint arXiv:1905.11471.
  • [37] D. Sravani, L. Kameswari, and R. Mamidi (2021) Political discourse analysis: a case study of code mixing and code switching in political speeches. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pp. 1–5.
  • [38] S. Tan and S. Joty (2021) Code-mixing on Sesame Street: dawn of the adversarial polyglots. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3596–3616.
  • [39] G. Tur, D. Hakkani-Tür, and L. Heck (2010) What is left to be understood in ATIS? In 2010 IEEE Spoken Language Technology Workshop, pp. 19–24.
  • [40] R. van der Goot, I. Sharaf, A. Imankulova, A. Ustün, M. Stepanovic, A. Ramponi, S. O. Khairunnisa, M. Komachi, and B. Plank (2021) From masked language modeling to translation: non-english auxiliary tasks improve zero-shot spoken language understanding.
  • [41] G. I. Winata, S. Cahyawijaya, Z. Liu, Z. Lin, A. Madotto, and P. Fung (2021) Are multilingual models effective in code-switching? NAACL 2021, pp. 142.
  • [42] G. I. Winata, A. Madotto, C. Wu, and P. Fung (2019) Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 271–280.
  • [43] C. Wu, S. C. H. Hoi, R. Socher, and C. Xiong (2020) TOD-BERT: pre-trained natural language understanding for task-oriented dialogue. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 917–929.
  • [44] W. Xu, B. Haider, and S. Mansour (2020) End-to-end slot alignment and recognition for cross-lingual NLU. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5052–5063.
  • [45] Y. Xu, X. Zhong, A. J. Yepes, and J. H. Lau (2021) Grey-box adversarial attack and defence for sentiment classification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4078–4087.
  • [46] E. Yılmaz, H. v. d. Heuvel, and D. A. van Leeuwen (2018) Acoustic and textual data augmentation for improved ASR of code-switching speech. arXiv preprint arXiv:1807.10945.

Appendix

A Slots replacing algorithm during attack

function ExtendSlotLabels(slot_label, num_tokens)
     slot_labels = [slot_label]
     if num_tokens > 1 then
          if slot_label.startswith('B') then
               slot_labels += ['I' + slot_label[1:]] * (num_tokens - 1)
          else
               slot_labels = [slot_label] * num_tokens
          end if
     end if
     return slot_labels
end function
Algorithm A.1 Slots replacing algorithm during attack
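An equivalent Python helper (a direct transcription of Algorithm A.1):

def extend_slot_labels(slot_label, num_tokens):
    """Keep BIO tags consistent when one token becomes several: a B- tag is
    followed by matching I- tags, while O and I- tags are simply repeated."""
    if num_tokens <= 1:
        return [slot_label]
    if slot_label.startswith("B"):
        return [slot_label] + ["I" + slot_label[1:]] * (num_tokens - 1)
    return [slot_label] * num_tokens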

B Experiment results graphs

Figure B.1: Model comparison on test set of MultiAtis++ dataset by Intent accuracy metric.
Figure B.2: Model comparison on test set of MultiAtis++ dataset by Slots F1 score metric.
Figure B.3: Model comparison on test set of MultiAtis++ dataset by Semantic accuracy metric.
Figure B.4: Model comparison after word-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.
Figure B.5: Model comparison after word-level attack on test set of MultiAtis++ dataset by Slots F1 score metric.
Figure B.6: Model comparison after word-level attack on test set of MultiAtis++ dataset by Semantic accuracy metric.
Figure B.7: Model comparison after phrase-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.
Figure B.8: Model comparison after phrase-level attack on test set of MultiAtis++ dataset by Slots F1 score metric.
Figure B.9: Model comparison after phrase-level attack on test set of MultiAtis++ dataset by Semantic accuracy metric.
Figure B.10: Model comparison with protection on test set of MultiAtis++ dataset by Intent accuracy metric.
Figure B.11: Model comparison with protection on test set of MultiAtis++ dataset by Slots F1 score metric.
Figure B.12: Model comparison with protection on test set of MultiAtis++ dataset by Semantic accuracy metric.
Figure B.13: Model comparison with protection after word-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.
Figure B.14: Model comparison with protection after word-level attack on test set of MultiAtis++ dataset by Slots F1 score metric.
Figure B.15: Model comparison with protection after word-level attack on test set of MultiAtis++ dataset by Semantic accuracy metric.
Figure B.16: Model comparison with protection after phrase-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.
Figure B.17: Model comparison with protection after phrase-level attack on test set of MultiAtis++ dataset by Slots F1 score metric.
Figure B.18: Model comparison with protection after phrase-level attack on test set of MultiAtis++ dataset by Semantic accuracy metric.

C Experiment results tables

Tables C.1–C.18 compare the models row-wise (xlm-r, m-bert, xlm-r en, m-bert en; Tables C.10–C.18 use their adv counterparts) across the language columns en, de, es, fr, ja, pt, zh, and their average (the attacked settings omit en).

Table C.1: Model comparison on test set of MultiAtis++ dataset by Intent accuracy metric.
Table C.2: Model comparison on test set of MultiAtis++ dataset by Slots F1 score metric.
Table C.3: Model comparison on test set of MultiAtis++ dataset by Semantic accuracy metric.
Table C.4: Model comparison after word-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.
Table C.5: Model comparison after word-level attack on test set of MultiAtis++ dataset by Slots F1 score metric.
Table C.6: Model comparison after word-level attack on test set of MultiAtis++ dataset by Semantic accuracy metric.
Table C.7: Model comparison after phrase-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.
Table C.8: Model comparison after phrase-level attack on test set of MultiAtis++ dataset by Slots F1 score metric.
Table C.9: Model comparison after phrase-level attack on test set of MultiAtis++ dataset by Semantic accuracy metric.
Table C.10: Model comparison with protection on test set of MultiAtis++ dataset by Intent accuracy metric.
Table C.11: Model comparison with protection on test set of MultiAtis++ dataset by Slots F1 score metric.
Table C.12: Model comparison with protection on test set of MultiAtis++ dataset by Semantic accuracy metric.
Table C.13: Model comparison with protection after word-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.
Table C.14: Model comparison with protection after word-level attack on test set of MultiAtis++ dataset by Slots F1 score metric.
Table C.15: Model comparison with protection after word-level attack on test set of MultiAtis++ dataset by Semantic accuracy metric.
Table C.16: Model comparison with protection after phrase-level attack on test set of MultiAtis++ dataset by Intent accuracy metric.
Table C.17: Model comparison with protection after phrase-level attack on test set of MultiAtis++ dataset by Slots F1 score metric.
Table C.18: Model comparison with protection after phrase-level attack on test set of MultiAtis++ dataset by Semantic accuracy metric.