When is ACL's Deadline? A Scientific Conversational Agent

by Mohsen Mesgar, et al.

Our conversational agent UKP-ATHENA assists NLP researchers in finding and exploring scientific literature, identifying relevant authors, planning or post-processing conference visits, and preparing paper submissions using a unified interface based on natural language inputs and responses. UKP-ATHENA enables new access paths to our swiftly evolving research area with its massive amounts of scientific information and high turnaround times. UKP-ATHENA's responses connect information from multiple heterogeneous sources which researchers currently have to explore manually one after another. Unlike a search engine, UKP-ATHENA maintains the context of a conversation to allow for efficient information access on papers, researchers, and conferences. Our architecture consists of multiple components with reference implementations that can be easily extended by new skills and domains. Our user-based evaluation shows that UKP-ATHENA already responds 45 of defined intents with 37





1 Introduction

Researchers need to be up-to-date about the latest status of their research areas to deliver novel contributions. However, the amount of such information is exploding as research areas are rapidly growing in various aspects such as the number of published papers, authors, conferences, conference participants etc.

Several solutions have been proposed to obtain new insights from such heterogeneous information. For example, GrapAL (https://allenai.github.io/grapal-website/) betts2019grapal is a web-based tool for exploring scientific literature that enables, e.g., finding experts on a given topic for peer reviewing. Google Scholar and Semantic Scholar are two web-based tools that provide information about researchers (e.g., their h-index) and papers (e.g., their citation counts). Some tools are specifically designed for the NLP research area: the ACL Anthology is one of the primary knowledge bases that collects papers published in NLP conferences and journals. CL Scholar singh2018cl develops a knowledge graph from the ACL Anthology. wan2019aminer propose solutions for expert finding, trend analysis, reviewer recommendation, and the like.

Despite the valuable outputs of these solutions, their benefits remain limited because they do not work together in a single interaction environment. Each tool has its own user interface, which is inconsistent with those of the other tools. None of them interacts with users via human (natural) language. Moreover, the insights provided by these disjoint solutions are independent of the history of interactions with users.

Recent advances in conversational agents have been shown to simplify the interactions between human users and computers in various tasks such as chitchat serban2017deep, recommending a restaurant wen-etal-2017-network, and booking a table bordes2016learning. While such agents are becoming available to consumers at a large scale, the NLP and ML research community, which largely contributes to the agents’ development, does not yet use this technique to boost the scientific process.


Figure 1: A general view of the components used in UKP-ATHENA (diagram labels: Dialogue Master, Skill Tree; arrows: input text, input state, response states, response text).

In this paper, we present UKP-ATHENA, a scientific conversational agent that provides easy access to massive scientific information. UKP-ATHENA responds to questions about various aspects of the NLP research area by retrieving answers from multiple scientific knowledge bases and services. To the best of our knowledge, UKP-ATHENA is the first open-source conversational agent developed for helping researchers in finding and exploring scientific literature, identifying relevant authors, planning conference visits, and preparing paper submissions.

We perform a human evaluation to measure the quality of dialogues between UKP-ATHENA and researchers. The researchers participating in our study find that UKP-ATHENA is beneficial for their research because it already fulfils many of their essential needs.

2 UKP-ATHENA

The general architecture of UKP-ATHENA is shown in Figure 1. Its modules are grouped into the user interfaces (UIs) and the three main dialogue components, natural language understanding (NLU), dialogue management (DM), and natural language generation (NLG), which are directed by a dialogue master to retrieve the requested information. The DM is backed by a tree of skills that return information from external knowledge bases (KBs) or services.

2.1 User Interfaces (UIs)

User interfaces (UIs) let users interact with machines easily. We implement a web-form, a command-based, and a web-service UI for UKP-ATHENA. The web-form UI is convenient for non-technical users, and the command-based UI fits developers and technical users. The web-service UI enables UKP-ATHENA to be used as a virtual member in chat rooms.

2.2 Dialogue Master

The dialogue master is responsible for communicating with the three main dialogue components, for which it uses our primary data structure, the “state”. A state encodes a dialogue state, which includes the salient information presented in an utterance: domain, intent, and slots. If the information in a state is extracted from an utterance said by a user, we refer to it as an input state. If it is provided by UKP-ATHENA for generating a response, we refer to it as a response state.

A domain indicates the topic of an utterance, e.g., conference, paper, and people. An intent refers to the intention of the speaker in saying an utterance, e.g., give-deadlines is an intent in the conference domain. Slots are lists of nominal entities that are required to fulfill an intent in a domain, e.g., {CONF-NAME} is a slot for the intent give-deadlines in the domain conference. The implementation of UKP-ATHENA released with this paper has intents and slots for the domains shown in Figure 2.
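The state structure described above can be pictured with a minimal sketch. The class name, field names, the `source` flag, and the `DEADLINE` slot are assumptions made for illustration; the paper does not specify how states are implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class State:
    """One dialogue state: the salient information of a single utterance."""
    domain: str   # topic of the utterance, e.g. "conference"
    intent: str   # intention of the speaker, e.g. "give-deadlines"
    slots: Dict[str, List[str]] = field(default_factory=dict)  # slot name -> values
    source: str = "input"  # "input" (from the user) or "response" (from the agent)


# Input state for an utterance such as "When is ACL's deadline?"
input_state = State(
    domain="conference",
    intent="give-deadlines",
    slots={"CONF-NAME": ["ACL"]},
)

# A response state keeps the input slots and adds what a skill retrieved.
# The DEADLINE slot and its placeholder value are purely illustrative.
response_state = State(
    domain=input_state.domain,
    intent=input_state.intent,
    slots={**input_state.slots, "DEADLINE": ["<date from KB>"]},
    source="response",
)
```

Keeping input and response states in the same structure is one way to let every dialogue component exchange the same object.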

Master
  General
    Context
    Exit
      Survey
    Fallback
    Feedback
    Greeting
    Identity
    Menu
  Task
    Paper
      Meta-data
      Discourse
    Conference
      Dates
        Deadlines
      Events
        Keynotes
        Social_events
        Tutorials
      Program
    People
    NLPnews

Figure 2: The hierarchical tree of domains implemented in the current version of UKP-ATHENA.

2.3 Natural Language Understanding (NLU)

The NLU module transforms an input text to an input state. More concretely, it performs three main tasks: (i) identifying the domain of an utterance, (ii) identifying the intent of an utterance, and (iii) extracting the values of slots that are mentioned in the utterance.

UKP-ATHENA already features a range of NLU models, including rule-based and machine learning (ML)-based approaches, that can be used independently or in combination and can be easily extended.

Rule-based NLU

Our rule-based NLU approach relies on a set of pre-defined templates. We overlay templates on a user input text to detect the domain and intent and to fill the values of the slots that appear in the matched template. A rule-based approach is highly precise in accomplishing NLU tasks and is therefore suitable for phrases with high frequency, low complexity, and low ambiguity. However, this approach suffers from low recall: if an input text differs even slightly from the templates, the NLU module fails to recognize domains and intents. Increasing the number of templates does not mitigate this issue but rather fosters ambiguity. For example, the template When does … start? matches both When does ACL 2020 start? and When does Deep Adversarial Learning for NLP start?, where the domain of the former question is conference and that of the latter is tutorial. To overcome this problem, we additionally train and integrate ML-based NLU models.
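A minimal sketch of template matching illustrates both the precision and the ambiguity discussed above. The templates, the `give-start` intent name, and the regex-based matching are assumptions for illustration, not the released implementation.

```python
import re

# Hypothetical templates: (domain, intent, text with {SLOT} placeholders).
TEMPLATES = [
    ("conference", "give-deadlines", "when is the deadline for {CONF_NAME}"),
    ("conference", "give-start", "when does {CONF_NAME} start"),
    ("tutorial", "give-start", "when does {TUTORIAL_TITLE} start"),
]


def compile_template(template):
    """Turn a template into a regex with one named group per slot."""
    # re.split with a capturing group alternates: literal, slot name, literal, ...
    parts = re.split(r"\{(\w+)\}", template)
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
    return re.compile(rf"^{regex}\??$", re.IGNORECASE)


def match_all(text):
    """Return every (domain, intent, slots) whose template matches the text."""
    hits = []
    for domain, intent, template in TEMPLATES:
        m = compile_template(template).match(text.strip())
        if m:
            hits.append((domain, intent, m.groupdict()))
    return hits


ambiguous = match_all("When does ACL 2020 start?")   # matches two templates
missed = match_all("Tell me when ACL starts")        # matches none
```

Here `ambiguous` contains one hit for the conference domain and one for the tutorial domain, while `missed` is empty because the wording deviates from every template.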

ML-based NLU

Machine learning models require annotated training data. Since there is no available dataset for training scientific conversational agents, we propose a new dataset by automatically augmenting our templates with paraphrases.

To do so, we extract the 74 most frequent n-grams (n ∈ {2, 3, 4}) from four non-scientific task-based dialogue corpora: ATIS hemphill1990atis, Snips (snips.ai), DSTC2 henderson2014second, and Frames asri2017frames. The rationale behind our approach is to capture frequent expressions that human users often use to chat with conversational agents (e.g., give me, I need to know, …) and then combine those with the informative parts of our templates (e.g., the deadline for {CONF_NAME}) to augment the templates. Our augmentation approach makes the ML-based NLU models robust to such variations in the input utterances. The informative expressions are automatically extracted from the templates using two approaches: (i) we extract explicit questions that start with where, when, which, or whose (e.g., I don’t know when is {CONF_NAME} → when is {CONF_NAME}); (ii) we extract phrases starting with the or a, given that the sentence contains what or who at the beginning (e.g., who is the author of {PAPER_TITLE} → the authors of {PAPER_TITLE}). The second approach ensures that the question is about the slot or event itself and not its location, time, or some other attribute. We extract informative expressions from our templates using the first approach, and using the second one. These phrases can be prepended with the extracted most frequent n-grams to produce new templates. Finally, for each template, we replace the slots with different concrete slot values obtained from KBs (e.g., {CONF_NAME} is replaced with different NLP conference names). The final number of instances for the training and test sets is shown in Table 1. Our dataset is publicly available (the link comes later).
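The last two steps of this augmentation, prepending frequent n-grams and filling slots with concrete values, can be sketched as follows. The n-grams, expressions, and slot values are small illustrative stand-ins, and the function names are hypothetical.

```python
from itertools import product

# Illustrative inputs only; the real n-grams, expressions, and slot values differ.
frequent_ngrams = ["give me", "i need to know"]
informative_expressions = ["the deadline for {CONF_NAME}", "when is {CONF_NAME}"]
slot_values = {"CONF_NAME": ["ACL", "EMNLP"]}


def augment(ngrams, expressions):
    """Prepend each frequent n-gram to each informative expression."""
    return [f"{ngram} {expr}" for ngram, expr in product(ngrams, expressions)]


def instantiate(template, values):
    """Replace the slots of a template with every combination of concrete values."""
    names = [name for name in values if "{" + name + "}" in template]
    instances = []
    for combo in product(*(values[name] for name in names)):
        instances.append(template.format(**dict(zip(names, combo))))
    return instances


new_templates = augment(frequent_ngrams, informative_expressions)
instances = [s for t in new_templates for s in instantiate(t, slot_values)]
```

With two n-grams, two expressions, and two conference names, this yields four new templates and eight training instances such as "give me the deadline for ACL".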

                                 Train   Test
# of human-provided templates      285    161
# of added templates              1621    816
# of instances                    1906    977
Table 1: Some statistics of our dataset for NLU.

We use two approaches to encode utterances into vectors. First, we represent utterances by their TF-IDF feature vectors. We then supply the vector representation of each utterance to a Support Vector Machine (SVM) with a linear kernel for domain and intent identification, and to a Hidden Markov Model (HMM) for slot filling. Second, we encode the words in a dialogue utterance with GloVe, as benchmark pre-trained word embeddings, to include the semantic relationships among words. We compute the average of the word embeddings in an utterance to represent the utterance by a vector. Table 2 shows the performance of the described models.
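The second encoding, averaging word embeddings, can be sketched in a few lines. The three-dimensional toy vectors stand in for pre-trained GloVe embeddings, and skipping out-of-vocabulary words is an assumption of this sketch rather than a documented design choice.

```python
# Toy 3-dimensional vectors standing in for pre-trained GloVe embeddings.
EMBEDDINGS = {
    "when": [1.0, 0.0, 0.0],
    "is": [0.0, 1.0, 0.0],
    "acl": [0.0, 0.0, 1.0],
}
DIM = 3


def encode(utterance):
    """Represent an utterance as the mean of its known word vectors."""
    vectors = [EMBEDDINGS[w] for w in utterance.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM  # no known word: fall back to the zero vector
    # zip(*vectors) groups the values of each dimension across all words.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


vector = encode("When is ACL")
```

The resulting fixed-length vector can then be passed to any classifier, regardless of how many words the utterance contains.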

Model Intent Detection Slot Filling
Random baseline
Majority baseline
Table 2: The accuracy(%) of the ML models for NLU.

Given the results, we use the SVM model for intent detection and the GloVe-based model for slot filling. To benefit from both rule-based and ML-based NLU models, we first use the rule-based models to transform an input utterance into an input state; if this fails, we use the ML-based models.
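The fallback order described above can be sketched as a small wrapper. Both models are stubs, and representing a failed match as `None` is an assumption of this sketch.

```python
def cascade(rule_based, ml_based):
    """Build an NLU function that tries the rules first, then the ML model."""
    def understand(text):
        state = rule_based(text)
        # The rule-based model signals failure by returning None.
        return state if state is not None else ml_based(text)
    return understand


# Stub models for illustration only.
def rule_stub(text):
    if text.lower() == "when is the deadline for acl?":
        return {"domain": "conference", "intent": "give-deadlines", "via": "rules"}
    return None


def ml_stub(text):
    return {"domain": "conference", "intent": "give-deadlines", "via": "ml"}


understand = cascade(rule_stub, ml_stub)
```

Calling `understand` with the exact template wording is answered by the rule stub, while any other wording falls through to the ML stub.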

2.4 Dialogue Manager (DM)

We associate each domain with a skill of UKP-ATHENA. The dialogue manager is in charge of triggering a sequence of skills to provide a response to an input utterance. To do so, we define a tree-based hierarchical relationship among skills (see Figure 2). We use the intents in each skill as the actions it should perform to react to input utterances. Each node in the tree is a skill, which may include several sub-skills that we refer to as “children”. This hierarchical relation among skills makes developing the DM module efficient because it narrows the scope of the active context at each dialogue turn. For instance, when a user asks about the title of a paper given its author names, all nodes on the path from the dialogue master, which is the root of the tree, to the Meta-data skill are active as a context for providing the response. A follow-up question, for example a request to show the abstract of that paper, is interpreted with the active path of the tree as its context. However, if the topic of the dialogue changes in the follow-up question, the entire path becomes inactive and another suitable path is activated. For any input state, the DM first checks whether there is an active path in the tree. If so, it uses the final skill of the active path and gives a response according to the state. If not, it classifies the state into a new path.

The tree structure enables UKP-ATHENA to consider the local context (the last few dialogue turns) when providing a response. To benefit from the long history of a conversation, we introduce a memory that retains the most essential information given and taken during a conversation with UKP-ATHENA. Since the most salient information of input and response utterances is encoded in states, we store the input and response states of each dialogue turn in a stack of states.
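The active-path lookup and the state memory can be sketched together. The class names, the mapping from domains to skills, and detecting a topic change by comparing the target skill with the end of the active path are all assumptions of this sketch.

```python
class Skill:
    """A node in the skill tree; children are its sub-skills."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def path_to(self, target):
        """Return the skill names from this node down to `target`, or None."""
        if self.name == target:
            return [self.name]
        for child in self.children:
            sub = child.path_to(target)
            if sub is not None:
                return [self.name] + sub
        return None


class DialogueManager:
    def __init__(self, root, domain_to_skill):
        self.root = root
        self.domain_to_skill = domain_to_skill  # hypothetical domain -> skill map
        self.active_path = None
        self.memory = []  # stack of states (only input states in this sketch)

    def handle(self, state):
        """Return the skill that should answer, reusing the active path if it fits."""
        target = self.domain_to_skill[state["domain"]]
        if self.active_path is None or self.active_path[-1] != target:
            # First turn or topic change: activate a new path from the root.
            self.active_path = self.root.path_to(target)
        self.memory.append(state)
        return self.active_path[-1]


root = Skill("Master", [Skill("Task", [
    Skill("Paper", [Skill("Meta-data"), Skill("Discourse")]),
    Skill("Conference", [Skill("Dates", [Skill("Deadlines")])]),
])])
dm = DialogueManager(root, {"paper": "Meta-data", "deadline": "Deadlines"})
first = dm.handle({"domain": "paper", "intent": "give-title"})
path_after_first = list(dm.active_path)
second = dm.handle({"domain": "deadline", "intent": "give-deadlines"})
```

After the first turn the active path is Master, Task, Paper, Meta-data; the second turn changes the topic, so the path is replaced by the one ending in Deadlines.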

2.5 Knowledge Bases (KBs)

We design and implement each domain as a skill of UKP-ATHENA to ease the process of extending its knowledge for future use cases. Each skill connects to at least one data source (mainly websites) to acquire the data needed for responding to questions. Table 3 shows the sources used for extracting data. We have one or two sources per domain to demonstrate our approach, while our underlying framework enables implementing connectors to a wide range of additional or alternative sources. The license terms of these websites permit the use of their data for research purposes.

Domain                Source
Papers and Authors    www.aclweb.org/anthology
NLP News              newsletter.ruder.io
Conference Deadline   http://www.wikicfp.com/cfp/
Conference Program    NAACL 2019 database
Table 3: The data sources used for different domains.

One of the goals of UKP-ATHENA is to help conference participants plan their visit effectively by alleviating the need to search through conference programs. Such programs present information about the time schedule, location, title, and other details of events (e.g., oral presentation sessions, keynotes, and tutorials) in a conference. We collect the information related to the conference schedules from their websites.

We also implement a script for each of the paper and author skills to crawl the websites that contain relevant information (see Table 3). These scripts have two main functions. One function obtains data from a website on demand by connecting to the website and querying the information required to respond to a question. This functionality is suitable for KBs that hold very large amounts of data, e.g., Google Scholar. The other function downloads essential data from the KBs at regular intervals, which ensures that UKP-ATHENA always provides up-to-date information.

For gathering the deadline dates of NLP and ML conferences, we use two source websites (see Table 3). Since conference deadlines are set months in advance, we use our website crawler to collect such data from the websites every 30 days. We design UKP-ATHENA to assist researchers in their daily work. It is crucial for researchers to know about the latest news in their area. Currently, the NLP news website (see Table 3) provides such information. UKP-ATHENA collects the news from the corresponding website and transforms it into a structured format for further processing, for example for summarizing the news.
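The periodic download of slowly changing data, such as conference deadlines, can be sketched as a cached source that re-fetches once its copy is older than a given interval. The class, the `fetch` callable, and the injectable clock are hypothetical; the actual crawler is not described at this level of detail.

```python
import time


class CachedSource:
    """Serve data from a local copy and re-download it after `max_age` seconds."""

    def __init__(self, fetch, max_age, clock=time.time):
        self.fetch = fetch      # hypothetical function that crawls the website
        self.max_age = max_age  # refresh interval in seconds
        self.clock = clock      # injectable so the behaviour can be tested
        self._data = None
        self._fetched_at = 0.0

    def get(self):
        """Return the cached data, refreshing it first if it is missing or stale."""
        now = self.clock()
        if self._data is None or now - self._fetched_at >= self.max_age:
            self._data = self.fetch()
            self._fetched_at = now
        return self._data


THIRTY_DAYS = 30 * 24 * 60 * 60  # refresh interval for conference deadlines
```

A source wrapped this way contacts the website only once per interval, whereas the on-demand function would call `fetch` for every question.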

2.6 Natural Language Generation (NLG)

We implement a template-based NLG module that receives the values for its slots from a response state. Slot values in response states are obtained from the input state and from the information that the corresponding active skill extracts from KBs. The task of the NLG module is to find a proper template for generating an informative response. To do so, among the templates defined for a skill, we keep the most informative ones, i.e., those with the largest number of slots that can be filled. We then randomly choose one of these templates to generate a natural response. Our motivation for selecting a template at random is to avoid repeating a response throughout a dialogue. The filled response templates are first sent to the dialogue master and then transferred to a UI for display to users.
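The template selection can be sketched as follows. Interpreting "most informative" as the fillable templates with the largest number of slots is an assumption of this sketch, and the templates and slot names are illustrative.

```python
import random
import re

SLOT = re.compile(r"\{(\w+)\}")


def fillable(template, slots):
    """A template is usable only if every slot it mentions has a value."""
    return all(name in slots for name in SLOT.findall(template))


def generate(templates, slots, rng=random):
    """Fill one of the most informative fillable templates, chosen at random."""
    usable = [t for t in templates if fillable(t, slots)]
    if not usable:
        raise ValueError("no template can be filled from this response state")
    most = max(len(SLOT.findall(t)) for t in usable)
    best = [t for t in usable if len(SLOT.findall(t)) == most]
    return rng.choice(best).format(**slots)


templates = [
    "The deadline is {DEADLINE}.",
    "The deadline for {CONF_NAME} is {DEADLINE}.",
    "{CONF_NAME} papers are due on {DEADLINE}.",
    "{CONF_NAME} takes place in {LOCATION}.",
]
response = generate(templates, {"CONF_NAME": "ACL", "DEADLINE": "<date>"})
```

The one-slot template is dropped as less informative, the last template is dropped because LOCATION has no value, and the response is drawn at random from the two remaining two-slot templates.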

2.7 Developing New Skills

We make the framework and the skill set of UKP-ATHENA open source with the goal of jointly building a conversational agent for NLP researchers in a community effort. Contributing new skills is easy because of the hierarchical structure among skills. A skill can be added to UKP-ATHENA by defining two disjoint sets of NLU and NLG templates. The framework then automatically recognizes all new skills and integrates them into the tree structure.

3 Human Evaluation

To evaluate UKP-ATHENA in a realistic scenario, we conduct an in-house user study with one postdoc, three PhD candidates, and two master's students from our lab who work on different NLP topics. Our study consists of three main tasks, i.e., intuitiveness, diversity, and information coverage. We also collect a general satisfaction survey.

Statement                                                                       1 (strongly agree)   2 (agree)   3 (disagree)   4 (strongly disagree)
UKP-ATHENA was able to “understand” my questions                                16.7%                50.0%       33.3%          0.0%
UKP-ATHENA was able to provide answers to my questions                          0.0%                 50.0%       50.0%          0.0%
I was satisfied with the informativeness of the answers provided by UKP-ATHENA  33.3%                33.3%       0.0%           33.3%
I was satisfied with the fluency of the answers provided by UKP-ATHENA          16.7%                50.0%       16.7%          16.7%
UKP-ATHENA could respond in a reasonable time                                   50.0%                33.3%       16.7%          0.0%
The GUI of UKP-ATHENA was suitable for reading the provided answers             50.0%                0.0%        33.3%          16.7%
UKP-ATHENA reduces my need to google a specific information                     16.7%                66.7%       0.0%           16.7%
UKP-ATHENA would help me save some time in my work                              33.3%                33.3%       33.3%          0.0%
I would like to use UKP-ATHENA in the future on a daily basis                   0.0%                 66.7%       0.0%           33.3%
I will use UKP-ATHENA to plan for my next conference                            16.7%                16.7%       33.3%          33.3%
Table 4: The output of the general survey.


Intuitiveness

We aim at estimating the intuitiveness of UKP-ATHENA for users who interact with it for the first time. We provide human judges with a set of slot values as the pieces of information that they can inquire about. They are asked to randomly choose a piece of information and then keep formulating different questions until UKP-ATHENA provides a correct response. We measure the average number of formulations that users tried before obtaining the correct response as a proxy for the intuitiveness of UKP-ATHENA. Users are asked to stop this task if UKP-ATHENA fails to deliver a correct response after 20 tries.


Diversity

We measure to what extent UKP-ATHENA identifies the intent of various input questions that target an identical piece of information. For this task, human judges are provided with a set of information. They are then asked to choose a piece of information (different from the one chosen in the intuitiveness task) to talk about with UKP-ATHENA. The difference from the intuitiveness task is that human judges are asked to try five different formulations for requesting the selected piece of information, regardless of whether UKP-ATHENA succeeds or fails in providing the correct response. We report the percentage of formulations for which UKP-ATHENA provides correct responses, with respect to the total number of formulations tried by all human judges.

Information coverage

We measure how well UKP-ATHENA responds to questions about different slot values given by human judges. We provide a set of information and the corresponding question templates for interacting with UKP-ATHENA. Each template contains one slot, which the human judges need to replace with a concrete value of that slot type. To narrow the domain of slot values, we ask human judges to focus on the NLP research area. We report the percentage of slot values for which UKP-ATHENA responded correctly, with respect to the total number of slot values tried by all human judges.
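The three task metrics reduce to an average and two pooled percentages, as the following sketch shows. The function names are hypothetical, the example inputs are made up and are not results of the study, and the sketch does not model judges who stop after 20 unsuccessful tries.

```python
def intuitiveness(tries_per_judge):
    """Average number of formulations needed until a correct response."""
    return sum(tries_per_judge) / len(tries_per_judge)


def percent_correct(outcomes):
    """Share of correct responses, pooled over all judges.

    Used for diversity (one outcome per formulation) and for
    information coverage (one outcome per slot value).
    """
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up inputs for illustration; these are not the study's data.
example_tries = [1, 1, 4]
example_outcomes = [True, False, True, True]
avg_tries = intuitiveness(example_tries)
share = percent_correct(example_outcomes)
```

Pooling outcomes over all judges, rather than averaging per-judge percentages, follows the wording "with respect to the total number ... tried by all human judges".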

General survey

After completing the above tasks, the human judges are asked to participate in a survey (Table 4) that assesses their general impression of interacting with UKP-ATHENA. They assign integer scores between 1 and 5 to answer the survey questions.


The evaluation scores for intuitiveness, diversity, and coverage, which are described above, are , , and , respectively. For the intuitiveness task, among 13 given pieces of information, human judges chose the number of citations of a paper, the keynote speakers in a conference, the conclusions of a paper, the deadline of a conference (twice), and the abstract of a paper, where UKP-ATHENA responded correctly to the questions about the latter two on the first try. For the diversity task, human judges were interested in the following given pieces of information: the authors of a paper, the conclusions of a paper (twice), the deadline of a conference, the tutorials in a conference, and the keynote times in a conference.

For information coverage, the human judges chose the following pieces of information: the deadline of a conference, the start time of a keynote at a conference, the h-index of a person, retrieving the figures in a paper (twice), and the bib entry of a paper. For the h-index, UKP-ATHENA responded correctly for all examined slot values (three out of three).

The results of the survey show general satisfaction with the interactions with UKP-ATHENA, confirming our motivation that such an agent helps researchers (see Table 4). 83% of participants agree that UKP-ATHENA reduces their need to search the web (e.g., using search engines) to obtain information related to their research, and 66% would like to use UKP-ATHENA in the future. However, 66% of human judges disagreed with using UKP-ATHENA for planning their schedule for a conference. This could be because the current version of UKP-ATHENA mainly retrieves information for users, whereas planning for a conference also requires some inference over such information.

This observation manifests itself more clearly in the answers to the question about possible future avenues for UKP-ATHENA: What features would be helpful for your daily work and would you like to see in UKP-ATHENA? We group the answers as follows: (i) UKP-ATHENA should gather some background information about its users and their research interests, either by asking questions during conversations or on its login page; (ii) UKP-ATHENA needs to cover more information from the web, such as social media and the content of papers; and (iii) UKP-ATHENA needs to make some inferences on the retrieved information to help users, for example by inferring whether a paper is interesting for a user.

4 Conclusions

The size of research communities is growing drastically, which leads to an explosion of information about them on the web. Accessing such an amount of heterogeneous information in a coherent way takes much of researchers' time and attention. We propose UKP-ATHENA to ease access to this information through a conversational environment. The current version of UKP-ATHENA achieves satisfactory results in our human evaluation. In future work, we would like to enable UKP-ATHENA to respond to questions about the content of scientific papers and to perform some inference on conference data. UKP-ATHENA is publicly available to chat with: http://athena.ukp.informatik.tu-darmstadt.de:5002.