In recent years conversational agents have become an efficient means of supporting business across a variety of industry sectors. Chatbots and other dialog systems are now routinely deployed for tasks such as booking appointments, answering users’ queries, generating leads and recommending products and services to the user. Consequently, a variety of cloud-based and non-cloud-based platforms have been launched for their development and hosting, such as aws lex (https://aws.amazon.com/lex/), Google’s Dialogflow (https://cloud.google.com/dialogflow) and Rasa (https://rasa.com/). These platforms have revolutionized the process of building a dialog system, and the research on which they are based has led to significant improvements in the quality of the conversational user experience.
Dialog systems can be classified in three primary categories: question answering systems that answer specific user queries, task-oriented systems that assist with specific tasks such as booking appointments, and social bots whose purpose is to make human-like conversation and act more like a companion to their users.
Traditional dialog systems involve a complex pipeline whose components interact to offer a human-like conversation. From an architectural perspective, social bots may use neural networks such as encoder-decoder or seq2seq models [6, 8, 2] and other generative models to generate responses to user input. By contrast, retrieval-based systems map an input to a response from a repository of responses. Recent surveys of dialog systems can be found in [3, 4, 7].
At the outset we chose to build a retrieval-based system rather than a generative system. The former approach is popular in commercial applications because it requires fewer resources, is cost-effective and is easy to maintain. This paper focuses on the architecture, linguistic resources and additional design features involved in building a dialog system specifically for the legal sector, although we believe that many of the insights are in fact domain-agnostic.
We present a composite system that combines question answering and task-oriented approaches, focusing on two specific tasks. The rest of the paper is organised as follows. Section 2 discusses these two tasks and explains the hierarchical structure of the bot. Section 3 describes the different components of the end-to-end prototype system. This is followed by section 4 on the linguistic resources, which describes the legal dataset. Section 5 examines the key challenges we faced and the additional features we included to improve the conversational flow. In section 6 we conclude the paper with a summary and future work.
2 Approach and Development of the Conversational agent
We present the approach we followed in identifying requirements and selecting a suitable platform on which to build the conversational agent. This is followed by a description of the hierarchical design of the bot.
2.1 Identifying requirements
The first step in building a dialog system for the legal domain is to understand the key processes and functions that take place within a law firm. This involves analysis of user journeys and enquiry data from case management systems to identify use cases and tasks that a dialog system could support. In this paper, we focus on two key use cases:
Frequently asked questions (FAQ): answering general questions related to the services offered by a law firm (note that this is not the same as giving legal advice specific to the user). Queries such as these are typically fulfilled in a single exchange, i.e. by providing a single answer to a user’s question, for example:
User: Can I bring my partner to the appointment?
Sys: Yes you can bring your partner to the appointment.
Fact finding (FF): The system attempts to identify the particular service the user needs based on their specific case description then records their contact details and the case description. This use case is fulfilled by recording the details and usually involves a multi-turn conversation, for example:
User: I want someone to review my contract.
Sys: Sure, to help you with that we would need your contact details for someone from the firm to contact you.
Sys: What is your name?
Sys: What is your phone number?
Sys: Perhaps you can describe the type of contract you want reviewed.
User: A housing contract.
Sys: Thanks for that. One of our legal experts will contact you as soon as possible.
The system currently supports FAQ and FF for various legal services, including:
ten services from commercial corporate
five from wills and life planning
three from commercial property
one from civil dispute resolution
one from immigration
Example services include Contract Review (CR) and Draft and Update Terms and Conditions (DUTnC) from commercial corporate.
The dialog model is based on the notion of identifying slots and user intents from the user text, both of which are defined according to the specific use case. A slot can take different related values in the input; for example, the “day” slot can take the days of the week as its values. A slot can be predefined with a set of accepted slot values. We define multiple slots to be identified in the user utterance, including consent to enter details (which takes “yes” or “no” as input), location of the firm, etc. To identify different legal services within a given user input, we define a custom slot “practice_type” with various legal services as its slot values.
An intent is the user’s objective behind their utterance. For example, the user text “What is the firm’s location?” implies the user’s intent to retrieve a list of the firm’s locations. For the FAQ use case, we define intents according to the input type. For example, a “Cost” intent is used to deal with questions relating to the price of different legal services, while a “Prep_app” intent handles questions related to the documents required at an appointment. Each of these FAQ intents will elicit a different response from the response resource (shown in Figure 2 and described in section 3) depending on the service identified in the slot. By contrast, FF intents are defined according to the specific legal service description. For example, “I want someone to look at a contract for me.” and “I need help with drafting a contract.” may look similar but they refer to different services (“CR” and “DUTnC” respectively). Our initial attempt to conflate these intents led to confusing responses, as it is difficult for the model to accurately identify the service from the description of a legal case such as in the examples above. It is therefore vital in FF to use the whole utterance as a description of the user’s legal case to identify and populate legal service slots. Consequently, FF intents are defined on a per-service basis, whereas FAQ intents are based on the combination of input type and particular service.
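The role of the “practice_type” slot can be illustrated with a minimal sketch. The phrase table, the function name `resolve_practice_type` and the substring-matching rule are illustrative assumptions; in the actual system the mapping is learned by lex from training utterances rather than hand-written.

```python
from typing import Optional

# Hypothetical slot values for the custom "practice_type" slot, each with a
# few illustrative surface forms (assumed, not taken from the deployed bot).
PRACTICE_TYPE_VALUES = {
    "CR": ["review my contract", "look at a contract"],
    "DUTnC": ["terms and conditions", "drafting a contract"],
}


def resolve_practice_type(utterance: str) -> Optional[str]:
    """Return the first slot value whose surface form occurs in the utterance."""
    text = utterance.lower()
    for slot_value, phrases in PRACTICE_TYPE_VALUES.items():
        if any(phrase in text for phrase in phrases):
            return slot_value
    return None  # no legal service recognised


print(resolve_practice_type("I want someone to look at a contract for me."))
print(resolve_practice_type("I need help with drafting a contract."))
```

Even in this toy form, the two superficially similar utterances resolve to different services, which is why FF intents are kept separate per service.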
2.2 Selecting a Platform
The next step is to select a platform from those available such as Google’s Dialogflow, Rasa and aws lex that provides appropriate functionality to develop and deploy the dialog system.
After analysing various frameworks, we selected lex for its simple development environment, comprehensive documentation, rapid development cycle, use of deep learning and other advanced natural language understanding functionality for intent recognition. It is a service offered by Amazon Web Services (aws) that provides an easy-to-use IDE to define intents and entities. lex also supports speech input and output, although at present our use of the platform extends only to text based dialogues.
In addition to defining intents specific to the two use cases, we use a few intents predefined by lex. lex provides a variety of inbuilt intents such as “greet_intent” and “goodbye_intent”. We use the former to identify utterances such as “hi” and “good morning” and the latter for utterances such as “bye” and “goodbye”. lex also provides inbuilt slots such as “First_name”, “Last_name”, “Number” and “Email_address”, which we use to persist user details in FF.
2.3 Bot Hierarchy
We use a hierarchical structure of multiple bots not only to overcome the limits on the number of permitted intents and slots within lex, but also to give a structured view of how intents are distributed across use cases and to make maintenance and improvement easier under such complicated intent relations. Three bots structured at two levels are designed to share a parent-child relationship as shown in Figure 1. The intents are divided amongst three bots such that the parent bot consists of intents and child bots with a total of intents (child bot-FF with intents and child bot-FAQ with intents).
As shown in the figure, the user interacts directly with the parent bot via the user interface. The parent bot handles basic user intents such as “greet_intent” and “goodbye_intent”. In addition to these, we include two custom intents in the parent bot: all_faq, which represents the sentences from all intents of child bot-FAQ, and all_ff, which consists of all sentences defined in child bot-FF. Thus, when the parent bot receives an FF or FAQ utterance, the underlying model classifies the utterance as either an FAQ or an FF query according to whether the all_faq or the all_ff intent is invoked. The query is then passed to the meta-classifier of the respective child bot, which identifies and returns the specific intent recognised for the input utterance. We define child bot-FAQ and child bot-FF to receive and classify FAQ and FF related queries respectively. Once the parent bot receives an intent for the input utterance from one of the child bots, it returns the corresponding response to the user. The response returned by the system is deterministic: once the parent bot receives an intent from a child bot, a response is retrieved from a predefined dataset known as the response resource, described in section 3.
The bot hierarchy can be further extended to include multiple levels of delegation. For instance, to include more use cases, multiple child bots at the same level or at different levels can be integrated such that bots at the same level are independent of each other (i.e. include intents with mutually exclusive utterances) and the connected bots at different levels communicate with each other. That is, bots at the same level cannot delegate to one another whereas bots at different levels can. Each bot, irrespective of level, can be designed to have its own classifier.
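The two-level delegation can be sketched as follows. The keyword-based classifiers are toy stand-ins for the trained lex models, and the routing rule in `parent_classifier` (questions go to the FAQ child) is purely an assumption made to keep the example self-contained.

```python
from typing import Callable, Dict

# A classifier maps an utterance to an intent name.
Classifier = Callable[[str], str]


def parent_classifier(utterance: str) -> str:
    text = utterance.lower()
    if text in {"hi", "good morning"}:
        return "greet_intent"
    if text in {"bye", "goodbye"}:
        return "goodbye_intent"
    # Toy rule: questions go to the FAQ child, everything else to the FF child.
    return "all_faq" if text.endswith("?") else "all_ff"


def faq_classifier(utterance: str) -> str:
    return "Cost" if "price" in utterance.lower() else "Prep_app"


def ff_classifier(utterance: str) -> str:
    return "CR" if "review" in utterance.lower() else "DUTnC"


# The parent delegates only the two umbrella intents to a child bot.
CHILD_BOTS: Dict[str, Classifier] = {
    "all_faq": faq_classifier,
    "all_ff": ff_classifier,
}


def route(utterance: str) -> str:
    """Classify at the parent level, then delegate to a child bot if needed."""
    intent = parent_classifier(utterance)
    child = CHILD_BOTS.get(intent)
    return child(utterance) if child else intent


print(route("hi"))
print(route("What is the price of a will?"))
print(route("I want someone to review my contract."))
```

Adding a further use case under this scheme amounts to registering another umbrella intent and child classifier in `CHILD_BOTS`.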
A bot in a commercial setting includes various building blocks from development to deployment as shown in Figure 2. We discuss in detail each component in the prototype development of a conversational agent.
An important component of this architecture is lex, an AWS chatbot service described in the previous section 2.2. lex has a chat window to facilitate conversational dialogue between the agent and the user. The user chats with the agent via chat box. The chat box is present on a chat interface such as a website or as an app on a collaborative platform such as slack. The chat is purely text-based and includes graphical features such as buttons to display different slot choices. For the end-to-end service, we utilise other services provided by aws which are described further in this section.
We use the aws serverless computing platform, aws Lambda (https://aws.amazon.com/lambda/), in the backend for several purposes. First, a Lambda function contains the algorithms that define and control the conversation flow between the user and lex. Of the programming languages supported by aws Lambda, we chose Python. A Lambda function also communicates with CloudWatch (https://aws.amazon.com/cloudwatch/) to automate the monitoring of conversations, store logs of transcripts, troubleshoot issues and visualise logs. For the FF use case, another Lambda function triggers the Amazon Simple Email Service (SES) (https://aws.amazon.com/ses/) to send emails containing user responses. Thus, when the user has entered their contact details and case-related information, these details are automatically sent to an inbox created specifically to receive emails from the bot. The inbox is accessible to key members of the firm such as the IT team, the marketing team and a set of solicitors. The user’s legal information is used by the legal expert to understand the user’s legal requirements prior to engaging directly with the user. Finally, to deploy the conversational agent on a website, we use aws CloudFormation (https://aws.amazon.com/cloudformation/), which launches and configures a collection of aws resources and their dependencies together as a stack.
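A Lambda function for the FF use case might look roughly like the sketch below. It assumes the Lex V1 event and response format (`currentIntent`, `dialogAction`); the inbox address, the email subject and the helper names `build_email` and `close` are hypothetical. The actual SES call is left as a comment so that the sketch runs without AWS credentials.

```python
from typing import Any, Dict

INBOX = "enquiries@example.com"  # hypothetical inbox address


def build_email(slots: Dict[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments in the shape expected by SES send_email."""
    body = "\n".join(f"{name}: {value}" for name, value in sorted(slots.items()))
    return {
        "Source": INBOX,
        "Destination": {"ToAddresses": [INBOX]},
        "Message": {
            "Subject": {"Data": "New enquiry from the conversational agent"},
            "Body": {"Text": {"Data": body}},
        },
    }


def close(message: str) -> Dict[str, Any]:
    """Return a response that ends the current intent (Lex V1 format assumed)."""
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": message},
        }
    }


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    slots = event["currentIntent"]["slots"]
    email = build_email(slots)
    # In a deployed function the email would be sent with, for example,
    # boto3.client("ses").send_email(**email).
    print(email["Message"]["Body"]["Text"]["Data"])
    return close(
        "Thanks for that. One of our legal experts will contact you as soon as possible."
    )
```

Keeping the payload construction in a pure function makes it possible to test the email content without invoking any aws service.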
In retrieval-based dialog models, the response returned by the system is deterministic. This means that once an intent is identified, a response is retrieved from a predefined dataset known as the response resource (see Figure 2). In our case, the response dataset consists of a set of responses per intent per service for FAQ and per intent for FF. These responses are acquired and curated with the help of legal experts.
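The keying of the response resource can be sketched as two lookup tables. The dictionary layout and the function name `get_response` are assumptions; the two sample responses are borrowed from the example dialogues in this paper rather than from the full curated resource.

```python
from typing import Dict, Optional, Tuple

# FAQ responses are keyed by (intent, service); FF responses by intent only.
FAQ_RESPONSES: Dict[Tuple[str, str], str] = {
    ("Cost", "NDA"): "We can provide a mutual NDA for a fixed price of £175 plus VAT.",
}
FF_RESPONSES: Dict[str, str] = {
    "CR": "Sure, to help you with that we would need your contact details.",
}


def get_response(intent: str, service: Optional[str] = None) -> Optional[str]:
    """Deterministic lookup: the same intent and service always give the same reply."""
    if service is not None and (intent, service) in FAQ_RESPONSES:
        return FAQ_RESPONSES[(intent, service)]
    return FF_RESPONSES.get(intent)


print(get_response("Cost", "NDA"))
print(get_response("CR"))
```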
4 Linguistic resources
In the following sections we describe the process of acquiring the linguistic resources to train the deep learning model driving lex.
4.1 Data collection
To improve the accuracy of intent recognition, the dialog model must be trained on a dataset that contains various paraphrases for each intent. One of the challenges in building a legal dialog system is the lack of publicly available datasets (possibly due to confidentiality issues). To address this, we identified three sources of data, which are described in turn in this section. An initial training dataset was generated by subject matter experts and representatives from each service such as solicitors, secretaries and paralegals. We refer to this henceforth as the ‘baseline dataset’, which was used to build and train the initial baseline model. We trained the baseline model on a total of 150 utterances. Table 1 shows the number of utterances allocated to each use case.
To improve the quality and quantity of the training data we investigated the use of crowdsourcing as a second data source. [9, 1] use a crowdsourcing approach to elicit diverse conversations and thus expand their dataset. Although their objective is similar to ours, there are two major differences. First, they invite participants in real-time on Twitter and Amazon Mechanical Turk with tests that are open to general public. By contrast, we deploy our baseline model and elicit responses via the Slack collaborative platform. Second, we recruit from a cohort of law students at a UK university as we require domain expertise and knowledge of legal terminology to elicit meaningful responses.
For example, the word “will” can be used as a modal verb to express beliefs about the future, or used to refer to a document that describes distribution of assets.
We recruited four law students to interact with the agent for a session of roughly hours. Each participant was assigned a private channel in the slack workspace and was given a number of hypothetical scenarios for various legal services. One of the scenarios is described below:
Name: David Clark
Email address: email@example.com
David owns a telecommunication company, Telecom Corp. Ltd. There are employees in his company. He needs help with drafting employment policies and procedures and would like information about some or all of the following:
length of whole process, visit Clacton office due to some reason, length of the appointment, bring someone to the meeting, attend meeting with someone, price, home visit, legal aid, prepare for an appointment, opening hours / days.
Finally, he wants to arrange an appointment with a solicitor.
In the scenario above, text can be replaced with whatever the participant thinks is suitable. This brings diversity in the text. For example, visit Clacton office due to some reason can constitute a question like ‘can I visit Clacton office as that is closest to my office?’.
The third data source comprises the daily user enquiries received over the last three years via an enquiry form on the firm’s website. These enquiries are not only a source of sentence paraphrases but also help identify new user intents, which were later added to the dataset. For instance, some of the new intents identified are the location of a legal firm, the urgency of the legal matter and the method of contact. The data is obtained as follows. The data from this source is unstructured: an enquiry usually consists of the sender’s name, subject, legal service, contact details, message body and other metadata. We automated the extraction of the message body from each enquiry; the rest of the content is discarded. A Python script creates an Excel file and writes the extracted messages to it, with one sheet per legal service, so that each sheet contains the enquiries related to a particular service. Once all the enquiries are processed, we manually extract the most relevant utterances, which become part of either the training or the test set as described further in this section.
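The grouping step of such a script can be sketched with the standard library alone. The field names `message` and `legal_service` and the function name `group_messages` are assumptions about the enquiry format; each key of the returned dictionary corresponds to one sheet, and writing the result to an Excel file is omitted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_messages(enquiries: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Keep only the message body of each enquiry, grouped by legal service."""
    sheets: Dict[str, List[str]] = defaultdict(list)
    for enquiry in enquiries:
        body = enquiry.get("message", "").strip()
        if body:  # skip enquiries with an empty message body
            sheets[enquiry.get("legal_service", "unknown")].append(body)
    return dict(sheets)


enquiries = [
    {"name": "A", "legal_service": "wills", "message": "How much does a will cost?"},
    {"name": "B", "legal_service": "immigration", "message": "  Where is your office? "},
    {"name": "C", "legal_service": "wills", "message": ""},
]
sheets = group_messages(enquiries)
for service, messages in sheets.items():
    print(service, messages)
```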
4.2 Training and Test Data preparation
The utterances collected from the last two data sources are manually evaluated for relevance, appropriateness and clarity. First, utterances that are not relevant to a specific context in the conversation, such as “I will keep an eye on my emails”, are discarded. Second, the utterances are evaluated for appropriateness. For example, “can you help me with employment law?” is a legitimate question, but out of scope as the dialog system is currently not trained to handle enquiries related to employment law. Third, utterances are evaluated for clarity. For example, an utterance such as “will validity” is ambiguous and therefore lacks clarity. Utterances that fail to meet any of these three criteria are discarded. The remaining utterances are then manually labelled with the corresponding intents. Finally, the labelled data is used as input to the system trained on the baseline dataset. Utterances that are correctly recognised are added to a set of regression test cases and the remainder are added to the training dataset. The regression test cases are used to ensure that modifications to the system do not have a negative impact on the performance of the model. We expect the training dataset and test cases to grow in size in future, and regression testing is performed every time the dialog model is modified or (re)trained. During regression testing, when an unseen utterance is encountered, lex uses a classification technique to assign a confidence score to each intent based on intent classification accuracy, and the intent with the highest score is returned (https://docs.aws.amazon.com/lex/latest/dg/confidence-scores.html).
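The split into regression test cases and additional training data can be sketched as below. The function names and the `toy_scores` stand-in for the baseline model are assumptions; in practice the confidence scores would come from lex.

```python
from typing import Callable, Dict, List, Tuple

Labelled = Tuple[str, str]  # (utterance, manually assigned intent)


def top_intent(scores: Dict[str, float]) -> str:
    """Return the intent with the highest confidence score."""
    return max(scores, key=lambda intent: scores[intent])


def split_utterances(
    labelled: List[Labelled],
    score_fn: Callable[[str], Dict[str, float]],
) -> Tuple[List[Labelled], List[Labelled]]:
    """Split labelled data into (regression test cases, new training data)."""
    regression: List[Labelled] = []
    training: List[Labelled] = []
    for utterance, gold in labelled:
        predicted = top_intent(score_fn(utterance))
        # Correctly recognised utterances guard against regressions;
        # misrecognised ones are what the model still needs to learn.
        (regression if predicted == gold else training).append((utterance, gold))
    return regression, training


def toy_scores(utterance: str) -> Dict[str, float]:
    # Stand-in for the baseline model: favours "Cost" when "price" is mentioned.
    cost = 0.9 if "price" in utterance.lower() else 0.2
    return {"Cost": cost, "Prep_app": 1.0 - cost}


data = [("What is the price of a will?", "Cost"), ("How much do you charge?", "Cost")]
regression, training = split_utterances(data, toy_scores)
print(len(regression), len(training))
```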
|Legal experts (Baseline)||54||96||150|
utterances were collected from our crowdsourcing exercise. Out of these, we discarded utterances as they did not satisfy the three criteria described above. Of the remaining utterances, utterances were correctly recognised by the system and thus were added to the set of regression test cases. The remaining were added to the training dataset. As seen in table 1, crowdsourcing elicited only one further instance for the FF use case (in addition to the in the baseline dataset). This is due to the fact that participants were able to choose a service using user interface buttons rather than textual input. Consequently, most crowdsourced utterances are for FAQ rather than FF. In future, we plan to suppress these buttons to give participants more incentive to articulate their intent and choice of service using textual input only.
The result of our crowdsourcing exercise was that utterances were added to the baseline training set of utterances. This resulted in a total of utterances which are then mapped to lower case prior to use in building the dialog model.
Figure 3 shows the number of intents (y-axis) by training sample size (x-axis) for both use cases combined. Evidently, the distribution is highly uneven, especially for FF, which can lead to a bias in favour of intents with greater numbers of training instances. This issue is mitigated, and the training data rebalanced, by using data from the third data source, that is, user queries received through the firm’s website.
Although modest in size, the training dataset has proven sufficient to train the dialog model in lex. The lex model achieves an accuracy of 93.69% when tested on test sentences. Due to limited space we show only a portion of the test cases in Figure 4. The first column in the figure is the index of the test sentence. The last column shows whether the model has classified the sentence correctly (Pass) or not (Fail), where a class represents an intent in one of the three bots.
5 Additional features
A number of additional features and functions have been added to the agent to improve the conversation flow and overall user experience.
5.0.1 Persist Linguistic context
First among these is a need to persist linguistic context so that unnecessary repetition or clarification can be avoided. To achieve this, we store information from previous intents and slots. For example, if a user enquires about the price for a service and then changes intent to speak to a solicitor regarding this service, both the previous intent and the previous service are maintained as part of the linguistic context. This allows more seamless switching of intents within a use case or switching from an intent in one use case to an intent in another use case. Storing the previous intent also helps the system interpret follow-up context based conversation, for example:
User: I want to know the price of selling a business.
Sys: It is very difficult to answer this question unless we have more information regarding your case. We would be happy to ring you to find out more if you can provide us with your contact details. You will not incur any charges until you have accepted any estimate which we give you.
User: How about an NDA?
Sys: We can provide a mutual NDA for a fixed price of £175 plus VAT.
In this example, the system interprets the “Price” context by storing the “Cost” intent (or input type) and the service as “selling a business”. For the follow-up question, the system recognises “How about…” and “What about…” style sentences and applies the last recorded input type, which in this case is the “Cost”, i.e. price for preparing an “NDA”.
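The follow-up handling can be sketched with a small context dictionary. The regular expression, the function name `interpret` and the context keys are assumptions chosen for illustration; the deployed Lambda function may store and match context differently.

```python
import re
from typing import Dict, Optional

# Matches "How about ..." and "What about ..." style follow-up questions.
FOLLOW_UP = re.compile(
    r"^(?:how|what) about (?:an? |the )?(?P<service>.+?)\??$", re.IGNORECASE
)


def interpret(
    utterance: str,
    intent: Optional[str],
    service: Optional[str],
    context: Dict[str, str],
) -> Dict[str, str]:
    """Update the stored context, reusing the last intent for follow-ups."""
    match = FOLLOW_UP.match(utterance.strip())
    if match and "intent" in context:
        # Follow-up: keep the previous intent, replace only the service.
        context["service"] = match.group("service")
    elif intent is not None:
        context["intent"] = intent
        if service is not None:
            context["service"] = service
    return context


context: Dict[str, str] = {}
interpret("I want to know the price of selling a business.", "Cost", "selling a business", context)
interpret("How about an NDA?", None, None, context)
print(context)
```

After the second turn the stored intent is still “Cost” while the service has changed to “NDA”, which is the combination needed to look up the price response.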
5.0.2 Restart / Resume
We give the user an option to restart (i.e. start the conversation again) or resume (i.e. pick up from the last conversation). In the latter case, the stored linguistic context is used to pick up from where the conversation was left. To facilitate this feature, we define an intent in the parent bot, “restart_intent”, that is invoked with sentences such as “restart the session” and “Can I restart the session?”. For both restart and resume intents, the system prompts the user with a clarification question such as “Are you sure you want to restart?”. To capture the user’s response to this question we define a slot, “restart_slot”, that can take values such as “yes”, “sure”, “no” etc.
5.0.3 Fallback
It is expected that some of the input utterances will not be identified by the model. In such cases, it is best to avoid generic responses such as “Sorry I do not understand.”. To address this, we define a “fallback” intent in the parent bot. This intent is triggered when the model fails to assign a confidence score of higher than to any intent. A confidence score indicates how likely it is that the user utterance belongs to an intent. Its value falls between and . Once this intent is triggered, the system invites the user to enter their contact details so that they may subsequently be contacted by a representative of the firm to assist them with their case. The score of was derived empirically based on multiple trials of the lex intent classifier. This score indicates that a given utterance is significantly different from the training data, and hence not recognised by the model. We later manually classify the unidentified sentences and include them in the training data to reduce further invocations of the “fallback” intent.
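The thresholding rule can be sketched as follows. The function name `select_intent` is hypothetical, and the threshold of 0.5 used in the usage lines is a placeholder only, not the empirically derived value.

```python
from typing import Dict

FALLBACK_INTENT = "fallback"


def select_intent(scores: Dict[str, float], threshold: float) -> str:
    """Return the best-scoring intent, or the fallback intent if no
    confidence score exceeds the threshold."""
    if not scores:
        return FALLBACK_INTENT
    best_intent, best_score = max(scores.items(), key=lambda item: item[1])
    return best_intent if best_score > threshold else FALLBACK_INTENT


# 0.5 is a placeholder threshold chosen for the example.
print(select_intent({"Cost": 0.8, "Prep_app": 0.1}, threshold=0.5))
print(select_intent({"Cost": 0.3, "Prep_app": 0.2}, threshold=0.5))
```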
5.0.4 Automated Testing
To test the overall bot performance when these additional features are followed by a use case (FF or FAQ) or vice versa, we have curated various single- and multi-turn conversations as a ‘conversation test set’, which helps to test the system end-to-end. This test set consists of more than conversations which aim to cover all possible conversation scenarios, including switching from one use case or feature (as outlined above) to another. Each scenario listed below consists of multiple conversations, each made up of user utterances.
Fallback in the parent bot
FF followed by restart / resume
FAQ followed by restart / resume
FF followed by fallback
FAQ followed by fallback
FAQ followed by FF followed by FAQ
FF followed by FAQ followed by FF
FAQ followed by fallback in the child bot-FAQ
FF followed by fallback in the child bot-FF
An example of FF followed by FAQ is shown below:
User: can you help me with employment contract?
Sys: Are you an employee or an employer?
Sys: What is your name?
Sys: What is your phone number?
Sys: What is your email address?
Sys: Do you require assistance with the review of an existing contract or would you like a new contract to be drafted?
User: i would like review of an existing one.
Sys: What is your preferred contact method? If by telephone, are there any specific times of the day we should avoid?
User: email, thanks.
Sys: Is your matter urgent (i.e. needs to be finalised within the next 48 hours)?
User: No
Sys: Thanks for that. One of our legal experts will contact you as soon as possible. Is there anything else we can help you with?
User: How long will the process take?
Sys: This varies from case to case dependent on the complexity of the issue. For straightforward reviews we aim for a turnaround of one or two days.
We set up testing via the aws Command Line Interface (commonly known as aws cli) to test the intent identification capability of the model. Each exchange in a conversation is submitted to lex and, once all the conversations are tested, a report is generated that shows the accuracy of the model, that is, the number of utterances in a conversation correctly identified by lex. This automated testing is conducted weekly in addition to regression testing, and also helps to identify and debug issues in the algorithms designed to control the conversation flows.
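The report generation can be sketched independently of how utterances reach lex. Here `classify` is a stub standing in for the call to lex, and the report is expressed as the fraction of correctly identified turns per conversation; both choices are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (user utterance, expected intent)


def run_conversation_tests(
    conversations: Dict[str, List[Turn]],
    classify: Callable[[str], str],
) -> Dict[str, float]:
    """Return, per conversation, the fraction of turns whose intent was correct."""
    report: Dict[str, float] = {}
    for name, turns in conversations.items():
        correct = sum(1 for utterance, expected in turns if classify(utterance) == expected)
        report[name] = correct / len(turns) if turns else 0.0
    return report


def stub_classify(utterance: str) -> str:
    # Stand-in for submitting the utterance to lex.
    return "Cost" if "price" in utterance.lower() else "fallback"


conversations = {
    "FAQ followed by fallback": [("what is the price?", "Cost"), ("asdf", "fallback")],
    "greeting then FAQ": [("hello", "greet_intent"), ("price please", "Cost")],
}
report = run_conversation_tests(conversations, stub_classify)
print(report)
```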
5.0.5 Free form user input
There are multiple ways to fill slots. The bot can be trained to match a keyword (a slot value) in the user text if there is an exclusive list of predefined slot values. Displaying a set of choices (slot values) in the form of buttons is another way to fill a slot. In addition to these approaches, we allow users to enter free-form text for a number of questions asked in the FF use case. For instance, the bot asks the user to provide a description of their legal case to fill the “case_desc” slot. The responses to this question are too varied to be predetermined, so the user answers questions of this kind in free-form text. Free-form text entry is not provided in lex, so we design a heuristic that assigns a dummy slot value such as “to_be_filled” to the “case_desc” slot. On the next pass of the algorithm, if the slot value is “to_be_filled”, the algorithm replaces the dummy slot value with the user input.
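The dummy-value heuristic can be sketched as a two-pass function. The function name `fill_free_form` and the plain dictionary used for slots are assumptions; only the “to_be_filled” marker and the “case_desc” slot come from the description above.

```python
from typing import Dict, Optional

PLACEHOLDER = "to_be_filled"


def fill_free_form(
    slots: Dict[str, Optional[str]], slot_name: str, user_input: str
) -> Dict[str, Optional[str]]:
    """Two-pass heuristic for capturing free-form text in a slot."""
    if slots.get(slot_name) is None:
        # First pass: mark the slot so the next user turn is captured verbatim.
        slots[slot_name] = PLACEHOLDER
    elif slots[slot_name] == PLACEHOLDER:
        # Second pass: replace the dummy value with the raw user input.
        slots[slot_name] = user_input
    return slots


slots: Dict[str, Optional[str]] = {"case_desc": None}
fill_free_form(slots, "case_desc", "I want someone to review my contract.")
print(slots["case_desc"])  # dummy value after the first pass
fill_free_form(slots, "case_desc", "A housing contract.")
print(slots["case_desc"])  # user text after the second pass
```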
5.0.6 Multiple line bot response
To further improve the user experience we split a bot response consisting of multiple lines into multiple responses. We use regular expressions to divide a long response so that it is returned to the user as a sequence of shorter responses, each restricted to no more than three sentences. For example, the bot divides the text shown below into two responses. The first response consists of the first three sentences and the remaining sentences are sent to the user as a second response.
“If you would like advice concerning your contracts I will be happy to help. If you are able to email copies of the contracts you would like advice on please send them to firstname.lastname@example.org. If you could provide a few more details and answer the following eight questions, I will get one of our legal experts to contact you to discuss your requirements. You will not incur any charges until you have accepted any estimate which we give you. What is your first name?”
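The splitting step can be sketched with a single regular expression. The pattern and the function name `split_response` are assumptions; the pattern treats any full stop, question mark or exclamation mark followed by whitespace as a sentence boundary, so abbreviations such as “i.e.” followed by a space would be split incorrectly and would need extra handling.

```python
import re
from typing import List

# A sentence ends at ., ? or ! when followed by whitespace. The dots inside
# an email address are not followed by whitespace and so are left intact.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!])\s+")


def split_response(text: str, max_sentences: int = 3) -> List[str]:
    """Split a long response into messages of at most max_sentences sentences."""
    sentences = [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


response = (
    "If you would like advice concerning your contracts I will be happy to help. "
    "If you are able to email copies of the contracts you would like advice on "
    "please send them to firstname.lastname@example.org. "
    "If you could provide a few more details and answer the following eight "
    "questions, I will get one of our legal experts to contact you to discuss "
    "your requirements. "
    "You will not incur any charges until you have accepted any estimate which "
    "we give you. What is your first name?"
)
for part in split_response(response):
    print(part)
    print("---")
```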
6 Conclusion and future work
In this paper we present a retrieval-based dialog system for the legal profession that supports the answering of FAQs and fact finding by identifying an appropriate legal service and recording case details. We describe the process by which requirements are acquired and the platform selected, along with the architecture and linguistic resources created in developing the system. The model achieved an accuracy of 93.69% on the regression test set. We also describe the proposed hierarchical bot design with multi-level delegation and the additional features we developed to improve the user experience.
One limitation of our work is that we rely heavily on the response resource designed with the help of legal experts. In future, we plan to develop a generative model that, unlike retrieval models, would automatically generate responses to user queries. A generative model introduces variability in bot responses. Further study would be needed to identify resources with multiple sentence paraphrases to train this model. We also plan to extend the bot to other use cases such as booking an appointment.
References
[1] (2012) Dialog system using real-time crowdsourcing and twitter large-scale corpus. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 227–231.
[2] (2016) Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
[3] (2017) A survey on dialogue systems: recent advances and new frontiers. ACM SIGKDD Explorations Newsletter 19 (2), pp. 25–35.
[4] (2018) Deep learning for dialogue systems. In Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts, pp. 25–31.
[5] (2018) Neural approaches to conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1371–1374.
[6] (2015) Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.
[7] (2015) A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.
[8] (2015) A neural conversational model. arXiv preprint arXiv:1506.05869.
[9] (2016) Chatbot evaluation and database expansion via crowdsourcing. In Proceedings of the chatbot workshop of LREC, Vol. 63, pp. 102.
[10] (2000) JUPITER: a telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing 8 (1), pp. 85–96.