In this era of artificial intelligence (AI), chatbots are becoming more popular every day for their versatility, easy accessibility, personalization features, and, more importantly, their ability to generate automated responses. For these reasons, chatbots are on the rise everywhere, from personal to organizational and business websites and other online platforms, where a chatbot can be trained on suitable data to act, in a broader sense, as a virtual assistant representing the entity in question.
In light of this newly emerging scope, we explore how these conversational AI agents can be integrated properly and thus become an immensely useful tool for maintaining business activities. To better understand current chatbots and to identify possible modifications and more customized improvements, we choose a popular chatbot platform, Rasa [bocklisch2017Rasa], as our study subject.
However, as we progressed with our experiments, we found ourselves at a loss among the numerous pipeline choices available, and choosing a pipeline is a major requirement for a working Rasa chatbot. Moreover, each pipeline is built from several components, and each of these must also be decided upon. While Rasa does provide a default option for its users, this setting is not well suited to most real-life use cases. Additionally, among other challenges, a prominent one is having a skewed, poorly defined, or scarce dataset on which to train the chatbot. This turns out to be our case because we are dealing with a low-resource language, Bangla.
In this paper, first and foremost, we aim to present an extensive comparative analysis of not only the individual components but also the various pipelines. We also collected, annotated, and reviewed a Bangla dataset of Frequently Asked Questions (FAQs) that users might ask the chatbot, together with their relevant responses. Finally, we introduce a couple of custom components to resolve the performance throttling that otherwise results from using poorly suited default components.
II Related Works
In the history of conversational AI agents, ELIZA [bradevsko2012survey, zemvcik2019brief], one of the first rule-based chatbots, was an early attempt at passing the famous Turing Test and pioneered the path of guided computer responses. Even though it did not fully pass the test, it paved the way for other chatbots, which ranged from responding emotionally (PARRY) [bradevsko2012survey, zemvcik2019brief, abushawar2015alice] to simply having fun conversations by running pattern matching (Jabberwacky) [bradevsko2012survey]. Later, the field matured with the inception of AI-powered chatbots, namely Dr. Sbaitso [zemvcik2019brief] and A.L.I.C.E. (Artificial Linguistic Internet Computer Entity) [abushawar2015alice, bradevsko2012survey], which was able to mimic humans when chatting online or answering questions. From there, it was not long before Smarterchild, Siri, Google Assistant, and other personalized assistant-like chatbots or conversational AI agents came into existence. With conversational AI, anyone can now build, integrate, and use message-based or speech-based assistants to automate the communication process, whether in personal, organizational, or business settings. As facilitators of this process, many conversational AI platforms have come forward and contributed largely to the advancement of the field; one such platform is Rasa.
Rasa [bocklisch2017Rasa] is an open-source Python framework for building and customizing conversational chatbot assistants. Besides imitating humans, assistants built with Rasa can perform complex tasks, such as booking tickets for a movie or checking whether movie tickets are available for a particular slot, because the framework has functionality for interacting with databases and servers.
There are several chatbot frameworks other than Rasa that are currently available and can be used to build commercial chatbots, for example, (1) the Microsoft Bot Framework [sannikova2018chatbot], which uses either the QnA Maker service or the Language Understanding Intelligent Service (LUIS) to build a sophisticated virtual assistant, and (2) Google Dialogflow [reyes2019methodology], which has the advantage of being easily integrated with virtually all platforms, including home applications, speakers, and smart devices. Rasa, on the other hand, is composed of two separate and decoupled parts. (i) The Rasa Natural Language Understanding (NLU) module classifies intents and extracts entities from a given user input. Rasa now incorporates the DIET architecture [bunk2020diet] in the NLU pipeline, which handles both intent classification and entity extraction. (ii) The Rasa Core module performs dialogue management, that is, deciding what actions the chatbot assistant should take given the user input, the intents and entities extracted from that input, and the current state of the conversation. Rasa uses several machine learning-based policies, such as the Transformer Embedding Dialogue (TED) Policy, which leverages transformers to decide which action to take next, as well as rule-based policies such as the Rule Policy, which handles conversation parts that follow a fixed behavior and makes predictions using rules that have been set based on NLU pipeline predictions. In summary, the Rasa framework is designed for developers with programming knowledge.
Furthermore, Rasa assistants can be connected to personal and business websites by building website-specific connectors through which the chatbot communicates with the client-side server. Similarly, through built-in connectors, the framework provides functionality for integrating chatbots with Facebook pages, Slack, and other social media platforms. Virtual chatbot assistants supporting several languages can be developed by changing or even building language-specific custom components in the NLU pipeline. For example, the tokenization process for Mandarin differs from that for English because of the structure of the language. As a result, the need for a language-specific tokenizer arises. Nguyen et al. [nguyen2021enhancing] built a custom Vietnamese tokenizer and a custom featurizer that leveraged pre-trained fastText [athiwaratkun2018probabilistic] Vietnamese word embeddings, and achieved better results with their custom-made components than with the default pipeline components provided by Rasa. fastText [athiwaratkun2018probabilistic] provides word embeddings for 157 languages; thus, depending on the language in which the chatbot needs to be built, the corresponding featurizer that leverages the pre-trained word vectors must be designed and attached to the pipeline.
III Machine Learning Life Cycle
Since Bangla chatbots, Bangla transliteration chatbots, and multilingual chatbots that include Bangla are not widely explored, we had to work on all portions of the machine learning life cycle, namely dataset preparation, machine learning modeling, and deployment for production services. Figure 1 shows the full pipeline of the workflow.
III-A Dataset Preparation
Our challenge was to deal with Frequently Asked Questions (FAQs) and some other common questions in Bangla and in Bangla transliterated in English. The chatbot was also required to provide suitable and swift responses. We collected all the FAQs from our client's interaction history with customers on different social media platforms. The questions were then annotated with suitable answers by 20 employees of our client. After the annotation process, the files required to train the model were prepared by a group of data analysts.
The Rasa open-source architecture requires the data to be split into three files. (1) nlu.yml is required for training the NLU module and contains all the gathered FAQs categorized into 45 intents and 9 entities, with each intent containing at least 4 examples, for a total of more than 250 samples. (2) domain.yml contains more than 150 collected responses and actions (custom-defined functions called to access and retrieve data from a database) corresponding to the FAQs in the nlu.yml file. (3) stories.yml contains more than 110 sections, each of which contains a sequence of (i) intents and entities that will be extracted from FAQs in the nlu.yml file, and (ii) responses and actions that can be given when a user input categorized under a certain intent, with or without a defined entity, is received. Each section is essentially an attempt to model an actual conversation that might take place. The domain.yml and stories.yml files are required to train the Core module, which is responsible for dialogue management.
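The relationship between the three files can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionaries approximate the YAML layout, the example texts are placeholders, every intent and response name except course_details is hypothetical, and the helper validate is a name introduced here rather than part of Rasa.

```python
# Toy stand-ins for the three training files (placeholder content only).
nlu = {  # intent -> example questions, as in nlu.yml
    "greet": ["<example 1>", "<example 2>", "<example 3>", "<example 4>"],
    "course_details": ["<example 1>", "<example 2>", "<example 3>", "<example 4>"],
}
domain = {  # declared intents and responses, as in domain.yml
    "intents": ["greet", "course_details"],
    "responses": {
        "utter_greet": "<greeting text>",
        "utter_course_details": "<answer text>",
    },
}
stories = [  # each story is a sequence of intents and bot actions
    {
        "story": "greet then ask about a course",
        "steps": [
            {"intent": "greet"},
            {"action": "utter_greet"},
            {"intent": "course_details"},
            {"action": "utter_course_details"},
        ],
    },
]


def validate(nlu, domain, stories, min_examples=4):
    """Return a list of consistency problems (empty when the files agree).

    Simplification: only responses are checked; custom actions are ignored.
    """
    problems = []
    for intent, examples in nlu.items():
        if intent not in domain["intents"]:
            problems.append(f"intent {intent!r} is not declared in the domain")
        if len(examples) < min_examples:
            problems.append(f"intent {intent!r} has fewer than {min_examples} examples")
    for story in stories:
        for step in story["steps"]:
            if "intent" in step and step["intent"] not in nlu:
                problems.append(f"story uses unknown intent {step['intent']!r}")
            if "action" in step and step["action"] not in domain["responses"]:
                problems.append(f"story uses unknown action {step['action']!r}")
    return problems


print(validate(nlu, domain, stories))
```

A check of this kind is useful because a story that references an undeclared intent or response cannot be used to train the Core module.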
III-B Machine Learning Modeling
The Rasa open-source architecture is composed of two separate units that handle a conversation with a user: (1) NLU, which classifies a sentence into an intent and extracts entities from it if they are present as defined in the training dataset, and (2) Core, which decides what responses to utter or what custom actions to take when the user input requires a database query.
III-B1 Rasa NLU Module
To customize the NLU pipeline, which processes an input text in the order defined in the config file, we mainly consider three types of choices: (1) the choice of sparse featurizers, such as LexicalSyntacticFeaturizer and CountVectorsFeaturizer; (2) the choice of a pre-trained dense featurizer, for example, a featurizer that converts tokens into dense fastText vectors; and (3) the choice of an intent classifier and an entity extractor. For our NLU pipeline, we use the DIET classifier, which classifies intents and extracts entities using the same network.
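Why the order of components matters can be illustrated with a minimal sketch. The two functions below are simplified stand-ins, not Rasa components: each one reads what earlier components wrote into a shared message dictionary, so a featurizer placed before the tokenizer has nothing to work on. All names are hypothetical.

```python
def whitespace_tokenizer(message):
    """Split the raw text on whitespace and store the tokens."""
    message["tokens"] = message["text"].split()
    return message


def token_count_featurizer(message):
    """Build a sparse bag-of-tokens feature; requires tokens to exist."""
    counts = {}
    for token in message["tokens"]:
        counts[token] = counts.get(token, 0) + 1
    message["sparse_features"] = counts
    return message


# Components run in the listed order, as they would in a config file.
PIPELINE = [whitespace_tokenizer, token_count_featurizer]


def run_pipeline(text, pipeline=PIPELINE):
    """Pass a message through every component in sequence."""
    message = {"text": text}
    for component in pipeline:
        message = component(message)
    return message


print(run_pipeline("a b a")["sparse_features"])
```

Reversing the list raises a KeyError, because the featurizer looks for tokens that have not been produced yet.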
III-B2 Rasa Core Module
This module uses (1) a tracker, which maintains the state of the conversation, and (2) a set of rule-based and machine learning-based policies to select an appropriate response as defined in the domain file. The steps through which the Core module selects an appropriate response are listed below:
The NLU module converts the user’s message into a structured output including the original text, intents, and entities.
With the output from the previous step, the tracker updates its current state.
The defined policies use the output from the tracker to select an appropriate response as defined in the domain file, and the bot responds accordingly.
The policies that we use for our module are discussed below.
TED Policy: A machine learning-based architecture built on a transformer [vaswani2017attention] that predicts the next action in response to an input text from a customer.
Memoization Policy: A rule-based policy that checks if the current conversation matches our defined stories and responds accordingly.
Rule Policy: Used to implement bot responses for out-of-scope messages that our chatbot is not trained on so that it can fall back to a default response when the confidence values are below a defined threshold.
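The interplay between the tracker and the policies can be sketched as follows. This is a deliberately simplified illustration, not Rasa's implementation: the memoized turns, the response names, and the threshold value are hypothetical, and the TED Policy is left out, so any history that is not memoized falls straight through to the default response.

```python
from dataclasses import dataclass, field


@dataclass
class Tracker:
    """Keeps the conversation state as an ordered list of events."""
    events: list = field(default_factory=list)

    def add(self, kind, name):
        self.events.append((kind, name))

    def history(self):
        return tuple(self.events)


# Hypothetical memoized story turns: conversation history -> next action.
MEMOIZED_TURNS = {
    (("intent", "greet"),): "utter_greet",
    (
        ("intent", "greet"),
        ("action", "utter_greet"),
        ("intent", "course_details"),
    ): "utter_course_details",
}


def next_action(tracker, nlu_output, fallback_threshold=0.3):
    """Update the tracker with the NLU output, then pick the next action."""
    tracker.add("intent", nlu_output["intent"])
    if nlu_output["confidence"] < fallback_threshold:
        # Rule-style fallback for low-confidence predictions.
        action = "utter_default"
    else:
        # Memoization-style lookup; a learned policy would go here otherwise.
        action = MEMOIZED_TURNS.get(tracker.history(), "utter_default")
    tracker.add("action", action)
    return action


tracker = Tracker()
print(next_action(tracker, {"intent": "greet", "confidence": 0.9}))
print(next_action(tracker, {"intent": "course_details", "confidence": 0.8}))
```

Because the chosen action is written back into the tracker, the second turn is matched against the full history rather than the latest intent alone.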
III-C Deployment
We deployed the trained NLU and Core modules to a hosted web server. The web server takes user inputs as queries and provides swift and suitable responses. We built a custom server connector so that queries from other servers can be received and responses can be sent back. To test how our hosted server responds, we hosted another server built using the Flask framework [grinberg2018flask]. We also built a connection, using Facebook Developer Tools (https://developers.facebook.com/tools/), between the hosted web server and a demo Facebook page to test how our trained models work in real-life scenarios.
III-D Interaction Between User and Bot
After deployment, the web server is connected to both the Flask app server and the Facebook page. The Flask app server is used by testers so that they can provide feedback and qualitative evaluation, while the Facebook page deals with actual users. Whenever a user sends a message, the language of the message is first detected using Polyglot Word Embedding [kashmir]. For now, we work only with Bangla and Bangla transliterated in English. To convert Bangla transliterated in English to Bangla, we use the Google Transliteration API (https://developers.google.com/transliterate/v1/reference). The Bangla queries pass through the custom server connector that we built for seamless interaction between the Rasa app server and end users. The Rasa app server then returns a suitable response to each query.
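The routing step described above can be sketched as follows. The script check is a plain Unicode-range heuristic standing in for the Polyglot-based detector, and transliterate is a hypothetical callable standing in for the transliteration service, so neither reflects the deployed implementation.

```python
def detect_script(text):
    """Guess the script by counting Bengali-block and ASCII letters."""
    bangla = sum(1 for ch in text if "\u0980" <= ch <= "\u09ff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if bangla == 0 and latin == 0:
        return "unknown"
    return "bangla" if bangla >= latin else "latin"


def route_message(text, transliterate):
    """Return text ready for the NLU module.

    Latin-script input is assumed to be transliterated Bangla and is
    converted first; everything else is passed through unchanged.
    """
    if detect_script(text) == "latin":
        return transliterate(text)
    return text


# Stub transliterator for demonstration only.
print(route_message("ami bhalo achi", lambda s: f"<bangla for: {s}>"))
print(route_message("আমি ভালো আছি", lambda s: f"<bangla for: {s}>"))
```

Passing the transliterator in as an argument keeps the routing logic testable without network access.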
IV Experimental Analysis
IV-A Experimental Setup
As listed in Table I, we set up 8 different pipelines for our experiments, and we split our annotated data into train and test sets in an 80-20 ratio. For all experiments with the different pipelines, we trained the NLU module for 500 epochs with a learning rate of 0.05 using the Adam optimizer [kingma2014adam]. We trained the Core module for 200 epochs with a learning rate of 0.05, also using the Adam optimizer [kingma2014adam].
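A minimal sketch of an 80-20 split is given below, assuming a plain seeded shuffle; whether the split was stratified by intent is not stated, and the function name and seed are illustrative.

```python
import random


def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle a copy of the samples and cut off the test portion."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(250))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so that every pipeline is compared on the same test set.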
Among the 8 different pipelines listed in Table I, some include our two custom components, fastTextFeaturizer and CustomTokenizer, which are built specifically for working with a Bangla corpus. The additional arguments and parameters that we fixed for some of the components are as follows:
For LanguageModelFeaturizer, we used the "bert" model configuration from the HuggingFace transformers library [wolf2019huggingface] as the model and "Rasa/LaBSE" as the model weights.
For CountVectorsFeaturizer, we used "char_wb" as the analyzer and set the minimum n-gram length to 1 and the maximum to 4.
For FallbackClassifier, we set the threshold to 0.3 and, to handle ambiguity, the ambiguity threshold to 0.1.
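To make the CountVectorsFeaturizer setting concrete, the sketch below approximates what a "char_wb" analyzer with n-gram lengths 1 to 4 extracts: character n-grams taken inside each word, with the word padded by a space on both sides. It follows the behavior of the scikit-learn analyzer of the same name only approximately (very short words are handled slightly differently), and the function name is illustrative.

```python
def char_wb_ngrams(text, min_n=1, max_n=4):
    """Return character n-grams that never cross a word boundary."""
    ngrams = []
    for word in text.split():
        padded = f" {word} "  # the padding marks the word boundaries
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                break
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams("ab", min_n=1, max_n=2))
```

Character-level features of this kind can share evidence between inflected or misspelled forms of the same word, which is one reason they are attractive for a small dataset.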
Lastly, for all of our experiments, we kept the policies described in Section III-B2 unchanged.
IV-B Ablation Study
We conducted 8 experiments to evaluate which NLU pipeline components work best for our task. The corresponding pipeline configurations are listed in Table I. Finally, we observe our obtained results and identify how and why a particular component leads to each particular result.
As listed in Table I, NLU pipeline P1 uses (1) a whitespace tokenizer, which separates user inputs into individual tokens based on the white space that usually separates words in many languages; (2) a regex featurizer, a lexical syntactic featurizer, and a count vectors featurizer (which counts character n-grams as features), all of which generate sparse features; and (3) the DIET classifier, which is trained on these features to classify intents and extract entities. The results for pipeline P1 are shown in Table II. We consider this pipeline our baseline.
For pipeline P2, we replace the whitespace tokenizer with our custom tokenizer built specifically for Bangla, since the tokenization process varies from language to language. As a result, the test scores for accuracy, precision, recall, and F1-score increase. From this experiment, we conclude that our custom Bangla tokenizer works better on our Bangla dataset than the whitespace tokenizer.
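The sketch below is not the CustomTokenizer used in P2. It is a hypothetical regex-based tokenizer that illustrates one way whitespace splitting can fall short for Bangla: the sentence-final danda stays attached to the last word, so the same word can appear as two different tokens.

```python
import re

# Runs of Bengali-block characters (U+0980 to U+09FF) or ASCII letters/digits.
# The danda (U+0964) lies outside the Bengali block, so it is dropped.
TOKEN_PATTERN = re.compile(r"[\u0980-\u09FF]+|[A-Za-z0-9]+")


def whitespace_tokenize(text):
    """Baseline: split on whitespace only."""
    return text.split()


def bangla_tokenize(text):
    """Hypothetical tokenizer: keep word characters, drop punctuation."""
    return TOKEN_PATTERN.findall(text)


sentence = "আমি ভালো আছি।"
print(whitespace_tokenize(sentence))  # last token keeps the danda attached
print(bangla_tokenize(sentence))      # the danda is removed
```

A production tokenizer would also need to handle joiner characters, mixed-script tokens, and other cases that this pattern ignores.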
Aiming to improve test performance, we added two components to pipeline P1 to build pipeline P3: (1) EntitySynonymMapper, which maps synonymous entity values to the same value, and (2) FallbackClassifier, which is used to give a default response to user questions whose intents are out of scope. Table II shows that the test scores improve over pipeline P1.
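The decision rule behind a fallback of this kind can be sketched as follows, using the thresholds from our setup (0.3 and 0.1). The function name and the example confidence values are hypothetical, and this is a simplified reading of the rule rather than Rasa's code.

```python
def needs_fallback(confidences, threshold=0.3, ambiguity_threshold=0.1):
    """Decide whether to fall back, given a mapping intent -> confidence.

    Fall back when the best intent is not confident enough, or when the
    two best intents are too close to tell apart.
    """
    ranked = sorted(confidences.values(), reverse=True)
    if not ranked or ranked[0] < threshold:
        return True
    if len(ranked) > 1 and ranked[0] - ranked[1] < ambiguity_threshold:
        return True
    return False


print(needs_fallback({"course_details": 0.80, "course_access_duration": 0.10}))
print(needs_fallback({"course_details": 0.45, "course_access_duration": 0.40}))
```

The second call illustrates the ambiguity case: the top intent clears the 0.3 threshold, but its margin over the runner-up is below 0.1.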
To build pipeline P4, we add to pipeline P3 a custom fastTextFeaturizer built specifically for Bangla, expecting better results. The fastText featurizer returns dense word vectors trained using CBOW [mikolov2013efficient] with position weights on a huge Bangla corpus. However, as Table II shows, the scores on all four metrics decrease.
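How a dense featurizer can turn per-token word vectors into a single sentence feature is sketched below with mean pooling. Both the pooling choice and the two-dimensional toy vectors are assumptions made for illustration. Unknown tokens are simply skipped here, whereas fastText itself can compose vectors for unseen words from subword units.

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy embedding table standing in for pre-trained word vectors.
toy_embeddings = {"token_a": [1.0, 0.0], "token_b": [3.0, 2.0]}
print(sentence_vector(["token_a", "token_b", "unseen"], toy_embeddings, dim=2))
```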
To investigate the reason behind this decline in performance, in pipeline P5 we take out the regex featurizer and run the experiment again, and we see a sharp rise in all the metrics. We conclude that the regex featurizer does not work well with fastText word vectors: the sparse regex vectors appear to nullify the fastText character-level word representations when the two are concatenated, so removing the regex featurizer results in better performance.
Comparing pipeline P6 with pipeline P5, the only difference is the tokenizer component: P6 uses our custom Bangla tokenizer whereas P5 does not, and consequently P6 produces better results on all the metrics.
For pipeline P7, we replace our custom Bangla tokenizer with the BERT [devlin2018bert] language model tokenizer, and we replace fastTextFeaturizer with the BERT language model featurizer. Both of these components are pretrained on a huge Bangla corpus. We see slightly better scores than with pipeline P6.
Finally, in pipeline P8, we add RegexFeaturizer to pipeline P7. Table II shows that this setup gives the best result. The DIET architecture concatenates the sparse features from the regex featurizer with the dense features from the BERT language model featurizer. Our intuition is that featurizers in the same pipeline need to generate similar types of features so that they do not nullify each other. Following this intuition, the regex featurizer works better with the BERT language model featurizer because both of them deal with defined subwords or patterns.
IV-C Result Analysis
According to Table I and Table II, we can conclude that Pipeline P6 works best with fastText dense featurizer and Pipeline P8 works best with BERT language model featurizer. In this section, we compare the results from Pipeline P6 and P8 so that we can describe the difference more visually.
If we compare the confusion matrix of Pipeline P8 in Figure 4 with that of Pipeline P6 in Figure 4, we can clearly see that the sum of the diagonal of the Pipeline P6 confusion matrix is lower than that of Pipeline P8. Pipeline P6 confuses course_access_duration with course_details, while Pipeline P8 successfully distinguishes between them. A closer look reveals more such anomalies.
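The diagonal-sum comparison can be made concrete with a small sketch. The labels are the two intents named above, but the predictions are invented toy values, not results from our experiments.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true intents, columns are predicted intents."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def diagonal_sum(matrix):
    """Number of correctly classified samples."""
    return sum(matrix[i][i] for i in range(len(matrix)))


labels = ["course_access_duration", "course_details"]
y_true = ["course_access_duration", "course_access_duration",
          "course_details", "course_details"]
# Toy predictions in which one duration question is mistaken for details.
y_pred = ["course_access_duration", "course_details",
          "course_details", "course_details"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix, diagonal_sum(matrix))
```

Since both pipelines are evaluated on the same test set, a higher diagonal sum corresponds directly to a higher accuracy.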
Finally, we compare the intent histograms in Figures 3 and 5. Pipeline P8 is more confident when it provides correct responses, which is not the case for Pipeline P6, whose confidence varies a lot when the response is correct. On the other hand, Pipeline P6 is more confident when it provides wrong responses, whereas Pipeline P8 shows inconsistent confidence values for wrong responses.
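A histogram of this kind can be computed as sketched below, by binning prediction confidences separately for correct and wrong predictions. The bin count and the records are illustrative assumptions, not our experimental data.

```python
def confidence_histogram(records, n_bins=10):
    """Bin (confidence, is_correct) pairs into equal-width bins on [0, 1]."""
    histogram = {"correct": [0] * n_bins, "wrong": [0] * n_bins}
    for confidence, is_correct in records:
        # A confidence of exactly 1.0 belongs to the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        histogram["correct" if is_correct else "wrong"][index] += 1
    return histogram


toy_records = [(0.95, True), (1.0, True), (0.35, False)]
print(confidence_histogram(toy_records))
```

A well-behaved classifier concentrates its correct predictions in the high-confidence bins and its wrong predictions in the low-confidence bins.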
V Conclusion and Future Works
The default components of the Rasa framework perform poorly for Bangla, as it is still a low-resource language. Hence, the need arises for custom components built specifically for Bangla. In our experiments, we presented a detailed comparative analysis of the effects of each individual default and custom component. The dataset we collected, annotated, and reviewed was imbalanced, so we also needed to deal with the class imbalance problem.
In the future, we plan to extend and improve the quality of our existing dataset by collecting more data and going through more rigorous reviewing. We also plan to add more custom components to the Rasa pipeline, for example, state-of-the-art transformer models, state-of-the-art multilingual featurizers, and many others. Furthermore, we also have plans to build a multilingual chatbot that can interact with users of different languages from different countries and cultures around the globe.