Multi-lingual Intent Detection and Slot Filling in a Joint BERT-based Model

07/05/2019, by Giuseppe Castellucci et al.

Intent Detection and Slot Filling are two pillar tasks in Spoken Natural Language Understanding. Common approaches adopt joint Deep Learning architectures in attention-based recurrent frameworks. In this work, we aim at exploiting the success of "recurrence-less" models for these tasks. We introduce Bert-Joint, i.e., a multi-lingual joint text classification and sequence labeling framework. The experimental evaluation over two well-known English benchmarks demonstrates the strong performance that can be obtained with this model, even when little annotated data is available. Moreover, we annotated a new dataset for the Italian language, and we observed similar performance without any change to the model.


1 Introduction

Conversational interfaces, e.g., Google’s Home or Amazon’s Alexa, are becoming pervasive in daily life. As an important part of any conversation, language understanding aims at extracting the meaning a partner is trying to convey. Spoken Language Understanding (SLU) plays a critical role in such a scenario. Generally speaking, in SLU a spoken utterance is first transcribed, then semantic information is extracted.

In this work, we concentrate on language understanding, i.e., extracting a semantic “frame” from a transcribed user utterance. Typically, this involves two tasks: Intent Detection (ID) and Slot Filling (SF) Tur et al. (2010). The former classifies a user utterance into an intent, i.e., the purpose of the user. The latter finds the “arguments” of such an intent. As an example, let us consider Figure 1, where the user asks to play a song (Intent=PlayMusic) (with or without you, Slot=song) by an artist (U2, Slot=artist).

Figure 1: An example of Slot Filling in IOB format for a sentence with intent PlayMusic.
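In IOB format, slot annotations attach a Begin/Inside/Outside tag to every token. A minimal sketch, assuming the utterance of Figure 1 is tokenized as below (the exact wording may differ from the figure):

```python
# IOB (Inside-Outside-Begin) annotation for the PlayMusic example.
# Slot names (song, artist) follow the labels mentioned in the text;
# the tokenization is illustrative.
tokens = ["play", "with", "or", "without", "you", "by", "u2"]
tags   = ["O", "B-song", "I-song", "I-song", "I-song", "O", "B-artist"]

def extract_slots(tokens, tags):
    """Group B-/I- tagged tokens into (slot_type, text) spans."""
    slots, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current: slots.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:
            if current: slots.append(current)
            current = None
    if current: slots.append(current)
    return [(t, " ".join(ws)) for t, ws in slots]

print(extract_slots(tokens, tags))
# → [('song', 'with or without you'), ('artist', 'u2')]
```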

Common approaches address the ID and SF tasks with joint Deep Learning architectures (e.g., Liu and Lane (2016); Goo et al. (2018)). In particular, encoder-decoder models Sutskever et al. (2014) and/or recurrent neural networks (RNNs) with attention Bahdanau et al. (2014) are trained to predict intents and slots at the same time. Recently, recurrence-less models Vaswani et al. (2017) shifted the attention to a neural computation for natural language that is not based on the recurrent processing typical of RNNs. Based on this idea, BERT Devlin et al. (2018) models multiple tasks with a unique deep attention-based architecture.

In this work, we extend BERT by jointly modeling the ID and SF tasks. In particular, we define a joint text classification and sequence labeling framework based on BERT, i.e., Bert-Joint. Specifically, we build on the BERT pre-trained representations and add on top of them a text classifier and a sequence labeler, which are trained jointly over a unique loss function. The experimental evaluation shows that the proposed approach achieves strong performance on the well-known ATIS Hemphill et al. (1990) dataset. Moreover, it reaches the state of the art on the newer SNIPS Coucke et al. (2018) dataset. Finally, we annotated a new dataset for the ID and SF tasks in Italian. We will show the applicability of Bert-Joint also to this dataset without the need to adapt the model.

In the following, section 2 presents the proposed approach and section 3 provides the experimental evaluation, which section 4 extends to the Italian language. Related work is discussed in section 5. Finally, section 6 draws the conclusions.

2 Joint Modeling of Intents and Slots within the BERT Framework

Figure 2: The Transformer encoder structure.

In this section, we present Bert-Joint. First, BERT is briefly introduced in section 2.1. In section 2.2 the proposed joint model is described.

Let us consider a dataset where each sentence is annotated with respect to:

  • an intent category;

  • a slot category associated with each token of the sentence, in IOB format.

2.1 Bert

Bidirectional Encoder Representations from Transformers (BERT) is an attention-based architecture to pre-train language representations. In particular, BERT pre-trains deep bidirectional representations by jointly conditioning on both left and right contexts in a Transformer Vaswani et al. (2017). It enables transfer learning, i.e., a single architecture is pre-trained and only minimal task-specific parameters are introduced (see also Peters et al. (2018); Radford et al. (2018)), eliminating the need for heavily-engineered task-specific architectures. The training is performed using two tasks, i.e., Masked Language Model (MLM) and Next Sentence Prediction (NSP). The former aims at capturing the properties of a language by modeling the conditional probability of a token given its left and right context. Differently from a classical language model, in MLM some tokens are randomly masked to avoid that a token observes itself in a multi-layered context. NSP aims at capturing information useful for sentence-pair oriented tasks.

BERT Model.

BERT is a Transformer Encoder Vaswani et al. (2017), whose main building block is depicted in figure 2. It is a multi-layered attention-based architecture, whose processing can be summarized as a sequence of Multi-Head Attention, Add&Normalization, and Feed-Forward layers repeated N times (with residual connections He et al. (2016)). Given a sequence of tokens, it computes a sequence of representations that capture salient contextual information for each token. For more details, please refer to Devlin et al. (2018).

2.2 Joint Sentence Classification and Sequence Labeling

Once BERT is pre-trained over a corpus, the learned representations model the tokens of a sequence in the context in which they are observed. In order to use this model for the final tasks, e.g., classification or sequence labeling, it must be fine-tuned over a task-specific dataset. For classification tasks, Devlin and colleagues suggest using the final hidden state of the [CLS] token, which by construction should represent a fixed-dimensional pooled representation of the sequence. For sequence labeling tasks, for every token in a sequence, the corresponding final hidden state can be used for classifying such token with respect to the target categories, e.g., the Named Entities. In this work, we aim at using both the token-level and sentence-level features to perform a joint classification of the sentence and token categories.

atis_flight show the [latest] flight from [denver] to [boston]
atis_city what time zone is [denver] in
atis_flight from [seattle] to [salt lake city]
atis_abbreviation what does fare code [qx] mean
Table 1: Examples from the ATIS dataset. The first column indicates the intent, while the second column contains the sentence and its slots (in brackets).
SearchScreeningEvent find [fish story]
PlayMusic can you play me some [eighties] music by [adele]
AddToPlaylist add this [track] to [my] [global funk]
BookRestaurant book a spot for [3] in [mt]
Table 2: Examples from the SNIPS dataset. The first column indicates the intent, while the second column contains the sentence and its slots (in brackets).

In order to achieve this goal, let us add the following parameters to the model:

  • W_int ∈ R^(d×I) and b_int ∈ R^I, i.e., the sentence-level classifier matrix and bias, respectively;

  • W_slot ∈ R^(d×S) and b_slot ∈ R^S, i.e., the token-level classifier matrix and bias, respectively,

where d is the dimension of the final hidden state, I is the number of sentence-level categories and S is the number of token-level categories.

In order to classify a sequence s with intent c and slots c_1, …, c_n, each token is passed through the BERT model, resulting in a set of representations h_0, h_1, …, h_n; h_0 is the final hidden state of the [CLS] token, while h_i is the final hidden state of token t_i, for i = 1, …, n. The sentence-level category probabilities can be obtained by

p(c | s) = softmax(h_0 W_int + b_int)

and the classified category by ĉ = argmax_c p(c | s).

The token-level category probabilities for token t_i can be similarly obtained through:

p(c_i | t_i) = softmax(h_i W_slot + b_slot)

and the category for token t_i by ĉ_i = argmax_{c_i} p(c_i | t_i).

In standard BERT, in order to train a sentence-level classifier, we minimize the cross-entropy L_int between the predicted label ĉ and the correct label c. Similarly, in order to train a sequence-level classifier, for each token t_i we can define the cross-entropy between the predicted label ĉ_i and the correct label c_i. Globally, for each sequence we can minimize the mean of the per-token cross-entropies, i.e., L_slot = (1/n) Σ_{i=1..n} CE(ĉ_i, c_i).

In our setting, we aim at learning the sentence-level and token-level classifier parameters jointly. In order to fine-tune the BERT model with respect to both tasks, we define a new loss function as the linear combination of L_int and L_slot:

L = α · L_int + β · L_slot,

where α and β are new parameters to be acquired during the fine-tuning stage. The fine-tuning is performed through gradient descent over all the BERT model parameters plus W_int, b_int, W_slot, b_slot, α and β.
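As an illustration, the joint head and loss can be sketched with toy dimensions and randomly initialized parameters (all names and numeric values below are illustrative, not the authors' actual configuration; the BERT encoder output is simulated by random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

d, I, S, n = 8, 3, 5, 6   # hidden size, #intents, #slot tags, #tokens (toy values)

# Hypothetical final hidden states from BERT: h[0] stands for the [CLS] state.
h = rng.normal(size=(n + 1, d))

# Task-specific parameters added on top of BERT.
W_int, b_int = rng.normal(size=(d, I)), np.zeros(I)
W_slot, b_slot = rng.normal(size=(d, S)), np.zeros(S)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p_intent = softmax(h[0] @ W_int + b_int)      # sentence-level probabilities
p_slots = softmax(h[1:] @ W_slot + b_slot)    # one distribution per token

y_intent = 1                                  # toy gold labels
y_slots = np.array([0, 2, 2, 0, 4, 0])

L_int = -np.log(p_intent[y_intent])                        # sentence cross-entropy
L_slot = -np.log(p_slots[np.arange(n), y_slots]).mean()    # mean token cross-entropy

alpha, beta = 0.5, 0.5   # in Bert-Joint these weights are learned during fine-tuning
L = alpha * L_int + beta * L_slot
```

During fine-tuning, gradients of L would flow into both classifier heads and the underlying encoder parameters.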

Train Valid Test Intent Slot
ATIS 4,478 500 893 26 83
SNIPS 13,084 700 700 7 39
Table 3: Datasets statistics.

3 Experimental Evaluation

ATIS SNIPS
System Slot Intent Sentence Slot Intent Sentence
Joint Seq Hakkani-Tür et al. (2016)
Attention BiRNN Liu and Lane (2016)
IntentCapsNet Xia et al. (2018) - - - -
Capsule-NLU Zhang et al. (2018)
Slot-Gated FA Goo et al. (2018)
Slot-Gated IA Goo et al. (2018)
Slot-Gated FA
Slot-Gated IA
Bert-Intent - - - -
Bert-Slot - - - -
Bert-Joint
Table 4: Performances over the ATIS and SNIPS datasets. Column Slot reports the F1 of classifying the slots in a sentence. Column Intent reports the Accuracy in finding the correct intent. Column Sentence reports the Accuracy in recognizing both the intent and all the slots. FA and IA refer to the Full and Intent Attention variants of the Slot-Gated models. Marked systems have been re-measured in this work. All the performances are measured over the training/validation/test split as in Goo et al. (2018).

In this section, the experimental evaluation of Bert-Joint is discussed. First, the datasets used in the experiments are presented in section 3.1 and the experimental setup in section 3.2. Then, in section 3.3 the experiments are discussed.

3.1 Dataset

We conducted experiments over two benchmark datasets for the English language. As a first benchmark, we adopted the Airline Travel Information System (ATIS) Hemphill et al. (1990), which is a well-known benchmark for the ID and SF tasks. It contains sentences annotated with respect to intents and slots in the airline domain. Table 1 shows some examples of the sentences and annotations in the ATIS dataset.

The other dataset used for evaluating the joint approach is the SNIPS dataset Coucke et al. (2018). It is a collection of commands typically used in a voice assistant scenario. In table 2 an excerpt of the dataset is shown. SNIPS represents a more realistic scenario compared to the single-domain ATIS dataset. The SNIPS dataset contains more varied intents, while in the ATIS dataset all intents are from the same domain.

The two datasets also represent different training scenarios, as they differ in the number of annotated examples: the SNIPS training set contains almost three times as many examples as the ATIS one. Please see table 3 for details about the datasets.

For all the experiments in the following sections, we adopted the same dataset split as proposed by Goo et al. (2018).

3.2 Experimental Setup

In the following experiments, we adopted the multi-lingual pre-trained BERT model, which is available on the BERT authors’ website (https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip). This model is composed of 12 layers and the size of the hidden state is 768. The multi-head self-attention is composed of 12 heads, for a total of about 110M parameters. We adopted a dropout strategy applied to the final hidden states before the intent/slot classifiers.

We tuned the following hyper-parameters over the validation set: (i) the number of training epochs; (ii) the dropout keep probability. We adopted the Adam optimizer Kingma and Ba (2014) with L2 weight decay and a fixed learning rate and batch size.
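The tuning procedure amounts to a small grid search over the validation set. A sketch, with hypothetical grid values since the actual ones are elided in the text:

```python
from itertools import product

# Hypothetical grids: the actual values tuned by the authors are not recoverable here.
epochs_grid = [3, 5, 10, 20]
dropout_keep_grid = [0.5, 0.7, 0.9]

def grid_search(train_and_eval):
    """Return the (epochs, keep_prob) pair maximizing validation accuracy.

    `train_and_eval(epochs=..., dropout_keep=...)` is a user-supplied callback
    that fine-tunes the model and returns a validation score.
    """
    best, best_score = None, float("-inf")
    for e, k in product(epochs_grid, dropout_keep_grid):
        score = train_and_eval(epochs=e, dropout_keep=k)
        if score > best_score:
            best, best_score = (e, k), score
    return best
```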

3.3 Experimental Results

Table 4 reports performance measures for both the ATIS and SNIPS datasets. For the ID task we computed the accuracy, while the SF performance is measured through the F1. Moreover, we report a sentence-based accuracy, i.e., the percentage of sentences for which both the intent and all the slots are correct. We compare our approach to different systems, all measured over the same training/validation/test split as reported in Goo et al. (2018) and Zhang et al. (2018). (Liu and Lane (2016) and Wang et al. (2018) also report results for ID and SF; however, their precise training/validation split is not known.) We re-measured the performances of the two Slot-Gated systems: we adopted the available code (https://github.com/MiuLab/SlotGated-SLU) released by Goo et al. (2018), and we tuned the number of units over the validation set with an early-stop strategy; these systems are marked in the Table. We also report the performances of two standard BERT-based systems, i.e., Bert-Intent and Bert-Slot. The former is a text classifier based on the standard formulation of BERT, i.e., it is trained by only optimizing the loss function over the ID task. The latter is a sequence labeler based on the standard formulation of BERT, i.e., it is trained by only optimizing the loss function over the SF task.
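The sentence-based accuracy described above counts a sentence as correct only when both its intent and its full slot sequence match the gold annotation. A minimal sketch (the gold/predicted pairs are toy examples):

```python
def sentence_accuracy(gold, pred):
    """Fraction of sentences where the intent AND every slot tag match.

    Each element is an (intent, slot_tag_list) pair.
    """
    hits = sum(
        1 for (gi, gs), (pi, ps) in zip(gold, pred)
        if gi == pi and gs == ps
    )
    return hits / len(gold)

gold = [("PlayMusic", ["O", "B-song"]), ("GetWeather", ["O", "B-city"])]
pred = [("PlayMusic", ["O", "B-song"]), ("GetWeather", ["O", "O"])]
print(sentence_accuracy(gold, pred))  # → 0.5
```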

Regarding the ATIS dataset, notice the Bert-Intent and Bert-Slot performances with respect to the other systems. The ID performance of Bert-Intent is higher than that of the best reported system Zhang et al. (2018). The SF performance of Bert-Slot is in line with, but still higher than, that of Zhang et al. (2018). This is consistent with the findings of Devlin et al. (2018), where a very good pre-training results in an effective transfer learning in natural language tasks. This also holds with respect to the Slot-Gated models Goo et al. (2018) re-measured here: the BERT-based models obtain higher performances, resulting in a notable error reduction for both ID and SF. Notice also the Bert-Joint performances on ID and SF. These results confirm that the joint modeling proposed here can be beneficial also when the base is a BERT model. In fact, the performances of Bert-Joint are higher than those of Bert-Intent and Bert-Slot on both tasks, resulting in a remarkable accuracy in correctly detecting the whole sentence and in a clear error reduction for whole-sentence prediction with respect to the best reported system.

Regarding the SNIPS dataset, we can observe very similar outcomes. Recall that this dataset is larger and covers a more varied domain than the ATIS dataset. The Bert-Joint approach sets the new state-of-the-art performance over this dataset on both ID and SF. As a consequence, the overall sentence accuracy also sets a new state of the art in correctly detecting the intent and all the slots of a sentence, with a clear error reduction with respect to the best reported system. Again, the Bert-Intent and Bert-Slot approaches perform very well on this task but, again, the joint model proposed here is beneficial.

3.4 Measuring the Impact of Joint Modeling

Figure 3: Learning curves for the ATIS dataset.

As discussed in section 3.3, Bert-Joint is very effective in classifying intents and slots. In fact, intents and slots are strongly related. In order to better understand the contribution of a joint approach, in this section we analyze how fast a joint model reaches high performances with respect to non-joint approaches. We thus compare the performances of the BERT-based systems Bert-Intent and Bert-Slot with Bert-Joint in low-resource training conditions. That is, we trained each of these models on training sets of growing sizes, i.e., on increasing percentages of the training data for both the ATIS and SNIPS datasets. We performed this evaluation with the best hyper-parameters found for the evaluations in section 3.3. We trained the models over different shuffles of the training set and we report the averaged results. Moreover, we report the same evaluations for the two Slot-Gated systems.
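The growing-size evaluation amounts to training on reproducible random subsets of the training data. A sketch, with hypothetical fractions since the exact percentages are elided in the text:

```python
import random

def subsample(train, fraction, seed):
    """Return a reproducible random subset of the training data."""
    rnd = random.Random(seed)
    data = list(train)
    rnd.shuffle(data)
    return data[: max(1, int(len(data) * fraction))]

# Hypothetical growing-size schedule, averaged over several shuffles (seeds).
train = list(range(100))          # stand-in for the annotated sentences
for frac in (0.01, 0.10, 0.25, 0.50):
    for seed in range(3):
        subset = subsample(train, frac, seed)
        # ... fine-tune Bert-Intent / Bert-Slot / Bert-Joint on `subset` ...
```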

Figure 4: Learning curves for the SNIPS dataset.

Figure 3 shows the learning curves for the ATIS dataset. First, notice that the BERT-based systems perform better than the Slot-Gated models at all training set sizes. This confirms that a good pre-training is beneficial in any training condition. Moreover, notice the benefits of using a joint approach: starting with only a small fraction of the training material, Bert-Joint is beneficial both for ID and SF. In ID, at the smallest training set size, there is a clear accuracy gap between Bert-Intent and Bert-Joint, i.e., a sizable relative error reduction. A similar outcome can be observed for Bert-Slot vs. Bert-Joint in terms of the F1 measure. When the training set size grows, the performances of the systems become more similar, but a clear advantage of Bert-Joint can always be observed.

Figure 4 shows the learning curves for the SNIPS dataset. Again, there is a benefit in using a joint approach at lower training set sizes both for ID and SF, and again the ID task seems to benefit more from the joint modeling. With a small percentage of the annotated material, the Bert-Joint model outperforms the Bert-Intent model in accuracy, resulting in a notable relative error reduction; similarly, Bert-Joint outperforms Bert-Slot in F1 with a fraction of the training material. Even with the SNIPS dataset we can observe that, when the training material grows, all the models perform better, but with a clear advantage of our joint approach.

4 Detecting Intent and Slots in Italian

In order to verify whether Bert-Joint can be applied to a different language, we evaluate it on an Italian dataset. We aim at checking whether the multilingual capability of BERT is preserved also when facing a joint learning task. In the following, we provide the dataset description in section 4.1; then, we discuss the experiments in section 4.2.

4.1 Dataset

To the best of our knowledge, there is no annotated dataset for SLU in Italian. In order to obtain a good-quality resource, we derived it from an existing one in another language. We used the SNIPS dataset as a starting point for these reasons: i) it contains a reasonable amount of examples; ii) it is multi-domain; iii) we believe it represents a more realistic setting in today’s voice-assistant scenario.

We performed a semi-automatic process consisting of two phases: an automatic translation of the sentences with contextual alignment of intents and slots, and a manual validation of the translations and annotations. In the first phase, we translated each English sentence into Italian by using the Translator Text API, which is part of the Microsoft Azure Cognitive Services (https://docs.microsoft.com/en-us/azure/cognitive-services/translator/translator-info-overview). The intent associated with each English sentence has been copied to its Italian counterpart. Slots have been transferred by using the alignment of source and target tokens provided by the Translator Text API. In order to create a more valuable resource in Italian, we also performed an automatic substitution of the names of movies, movie theatres, books, restaurants and locations with Italian counterparts: first, we collected from the Web a set of Italian versions of such entities; then, we substituted each entity in the sentences of the dataset with one randomly chosen from this set.

In the second phase, the dataset was split into different sets, and each has been annotated by one annotator and reviewed by another annotator. A further review was performed in case of disagreement between the annotators. Some interesting phenomena emerged for the different intents: the translation of GetWeather sentences was problematic because the main verb is often misinterpreted, while in the sentences related to the BookRestaurant intent a frequent failure occurred in the interpretation of prepositions.
For example, the sentence “Will it get chilly in North Creek Forest?” is translated as “Otterrà freddo in North Creek Forest?”, while the correct translation is “Sarà freddo a North Creek Forest?”. The verb “get” can be translated into Italian in different ways depending on the context; in this case, the system misinterpreted the context, assigning the wrong meaning to “get”.
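The slot-transfer step of the first phase can be sketched as projecting IOB tags through a token alignment. The (src_index, tgt_index) pair format below is a simplification of what the Translator Text API actually returns:

```python
def transfer_slots(src_tags, alignment, n_tgt):
    """Project IOB slot tags from source to target tokens via a word alignment.

    `alignment` is a list of (src_index, tgt_index) pairs; real MT APIs expose
    alignments in their own format, so this is a simplified sketch.
    """
    tgt = ["O"] * n_tgt
    for s, t in alignment:
        tgt[t] = src_tags[s]
    return tgt

# "play u2" -> "riproduci u2": a hypothetical one-to-one alignment.
src_tags = ["O", "B-artist"]
print(transfer_slots(src_tags, [(0, 0), (1, 1)], n_tgt=2))  # → ['O', 'B-artist']
```

A real pipeline would also need to repair IOB consistency after projection (e.g., an I- tag landing without its B-), which is part of what the manual validation phase catches.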

Finally, with this approach we obtained an Italian dataset (SLU-IT) composed of 7,142 sentences annotated with respect to intents and slots, almost equally distributed over the different intents. The effort spent on the construction of this new resource, according to the procedure described, is about 24 FTE (Full Time Equivalent) days, with an average production of about 300 sentences per day. We consider this effort lower than the typical effort needed to create linguistic resources from scratch.

4.2 Experimental Results

We selected from SLU-IT the same train/validation/test split proportions used for the English evaluations. We ran the experiments with the same setup used in the English scenario: we tuned the number of epochs and the dropout parameter, and we used the same settings for the Adam optimizer. We compare Bert-Joint to the non-joint versions of BERT, i.e., Bert-Intent and Bert-Slot. Moreover, we compare also with the Slot-Gated models: we adopted the available code released by the authors, and we tuned the number of units for these models with early stopping.

System Slot Intent Sentence
Slot-Gated FA
Slot-Gated IA
Bert-Intent - -
Bert-Slot - -
Bert-Joint
Table 5: Performances over SLU-IT. Column Slot reports the F1 of classifying the slots. Column Intent reports the Accuracy in finding the intent. Column Sentence reports Accuracy in recognizing both the intent and all the slots. FA and IA refer to the Full and the Intent Attention variants in the Slot-Gated models.
English Italian
System Slot Intent Sentence Slot Intent Sentence
Bert-Intent - - - -
Bert-Slot - - - -
Bert-Joint
Table 6: Multi-lingual experiments: each system is trained over both English and Italian training sets and tested separately over English and Italian. Column Slot reports the F1 of classifying the slots in a sentence. Column Intent reports the Accuracy in finding the correct intent. Column Sentence reports Accuracy in recognizing both the intent and all the slots.

In table 5 the performances of the systems are shown. The slot performance is the F1, while the Intent and Sentence performances are measured with accuracy. Notice that all the models perform similarly to their English counterparts (the training set is smaller than the English one, thus the Italian measures should be compared with the English measures at the corresponding point of the learning curve). First, notice the performances of the Slot-Gated models Goo et al. (2018) over this dataset. Regarding the SF task, the new language seems to be critical for both variants of the model, which reach a markedly low F1; in similar settings, i.e., with a comparable number of training examples, the English performance was considerably higher. Regarding the ID task, the performances are instead high also for this language. Again, we can observe that the Bert-Joint training is beneficial for obtaining higher ID performances with respect to the models without joint modeling (i.e., Bert-Intent and Bert-Slot). The SF task also benefits from the adoption of joint training. Notice that the proposed approach outperforms the Slot-Gated models. This is a remarkable result, as no modification of the model has been made for the Italian language.

4.3 Multi-lingual Detection of Intent and Slots

As pointed out in section 4.1, the SLU-IT dataset is obtained with a low-effort process. This results in performances that are lower with respect to their English counterparts, especially in correctly determining a whole sentence (intent+slots). One could think of exploiting the BERT multi-lingual capabilities to train an SLU system on the English language and to use it to generate annotations in Italian, in order to obtain a higher quality dataset or to increase the number of annotated examples. However, such a system would fail in correctly recognizing the slots: in a cross-lingual experiment, training on one language and testing on the other, the ID performance could be considered satisfactory, but the slot recognition is far worse. In fact, slots are very different in the two languages, as both their lexical surface and their syntax are highly language-specific.

For these reasons, we believe that a more consistent way of exploiting the capabilities of the BERT model is to train a multi-lingual model over both datasets. In this way, we aim at injecting into a low-effort dataset (SLU-IT) the information contained in a higher quality dataset (SNIPS). In Table 6 we provide the experimental results of such a setting. We trained the Bert-Intent, Bert-Slot and Bert-Joint models on both the English and Italian training sets and tested over the two test sets separately. Notice that the performances over English are slightly worse than, but comparable with, the monolingual training (see Table 4): the multi-lingual setting does not excessively degrade the performance on that language. However, the performances over the Italian dataset are higher, with a clear gain; in particular, the accuracy in correctly predicting a whole sentence increases substantially. Again, our joint approach performances are higher with respect to the non-joint versions of the model. Moreover, a multi-lingual SLU model is also more efficient for production purposes: only one model needs to be deployed for multiple languages, resulting in architectural savings.

5 Related Work

Intent Detection.

The ID task is addressed as a text classification problem, to which classical machine learning or deep learning methods have been widely applied. Many researchers employed support vector machines Chelba et al. (2003) or boosting-based classifiers Schapire and Singer (2000). Recently, many works exploited the ability of Deep Learning to learn effective representations. For example, Sarikaya et al. use Deep Belief Nets (DBNs) for natural language call routing, where a multi-layer generative model is learned from unlabeled data; the discovered features are then used to pre-initialize a feed-forward network which is fine-tuned on labeled data. In Xia et al. (2018), ID is addressed in a Zero-Shot Xian et al. (2017) framework with Capsule Networks Hinton et al. (2011).

Slot Filling.

The SF task is addressed through supervised sequence labeling approaches, e.g., MEMMs McCallum et al. (2000), CRFs Raymond and Riccardi (2007) or, again, Deep Learning, such as Recurrent Neural Networks (RNNs) Hochreiter and Schmidhuber (1997). Deep learning research started as extensions of Deep Neural Networks and DBNs (e.g., Deoras and Sarikaya (2013)), sometimes merged with Conditional Random Fields Xu and Sarikaya (2013). Later, Mesnil et al. proposed models based on recurrent neural networks (RNNs). On the same line of research is the work of Liu and Lane (2015), which uses RNNs but introduces label dependencies by feeding back previous output labels. Chen et al. address the error propagation problem in a multi-turn scenario by means of an End-to-End Memory Network Sukhbaatar et al. (2015) specifically designed to model the knowledge carryover.

Joint Models.

Recently, ID and SF have been addressed by jointly modeling the two tasks in a unique architecture Hakkani-Tür et al. (2016); Liu and Lane (2016); Wang et al. (2018); Goo et al. (2018); Zhang et al. (2018). In fact, it has been found that a model trained on both tasks jointly can achieve better performances on both. For example, in Hakkani-Tür et al. (2016) a single RNN architecture for domain detection, ID and SF in a single SLU model is proposed, showing gains on each task. In Liu and Lane (2016), ID and SF are investigated through an attention-based Bahdanau et al. (2014) mechanism within an encoder-decoder framework. In Wang et al. (2018), a Bi-model based RNN combines two task-specific networks, i.e., a Bidirectional LSTM and an LSTM decoder, trained without a joint loss function. Goo et al. extend an attention-based model for the joint task of ID and SF; in particular, a slot gate focuses on learning the relationship between intent and slot attention vectors. In Zhang et al. (2018), Capsule Networks Hinton et al. (2011) are adopted to jointly address ID and SF through a hierarchical capsule network structure, which should capture the inter-dependencies between words/slots and intents in a hierarchy of feature detectors.

6 Conclusion

In this work, we addressed the problem of Intent Detection and Slot Filling in Spoken Language Understanding. We built on the ability of the BERT model to provide effective pre-trained representations, and we adapted the original BERT fine-tuning to define a new joint learning framework. Bert-Joint acquires very effective representations for a joint learning task. We provided an extensive evaluation in the English language: BERT-based approaches perform very well on the intent detection and slot filling tasks, and the Bert-Joint learning scheme provides even better results, i.e., a new state of the art for these tasks. Moreover, we showed that this approach is beneficial when less annotated data is available. We also showed the multi-lingual capability of the model for dealing with the Italian language: we annotated a new SLU dataset in Italian and measured the performance of our approach over it; both in this setting and in the multi-lingual setting, Bert-Joint outperforms non-joint approaches. In the future, we aim at investigating languages with very different structures, e.g., Chinese or Arabic. It could also be interesting to adapt our model to multi-intent scenarios, or to model other semantic phenomena, e.g., jointly classifying frames and semantic arguments in Frame Semantics Fillmore (1985).

References