Named Entities troubling your Neural Methods? Build NE-Table: A neural approach for handling Named Entities

04/22/2018 · Janarthanan Rajendran et al. · University of Michigan, IBM

Many natural language processing tasks require dealing with Named Entities (NEs) in the texts themselves and sometimes also in external knowledge sources. While this is often easy for humans, recent neural methods that rely on learned word embeddings for NLP tasks have difficulty with it, especially with out-of-vocabulary (OOV) or rare NEs. In this paper, we propose a new neural method for this problem, and present empirical evaluations on a structured Question-Answering task, three related Goal-Oriented dialog tasks and a reading-comprehension-based task. They show that our proposed method can be effective in dealing with both in-vocabulary and OOV NEs. We create extended versions of dialog bAbI tasks 1, 2 and 4 and OOV versions of the CBT test set, which will be made publicly available online.


1 Introduction and Problem Description

We come across Named Entities (NEs) in many Natural Language Processing (NLP) tasks. The need to interact well with NEs becomes critical in tasks such as Question-Answering (QA) and Goal-Oriented dialog, where they play a crucial role in task completion. Examples include QA systems for retrieving information from a given story or about courses offered at a university, and dialog systems that handle restaurant reservations, flight ticket booking, and so on. In many cases, these tasks also involve interaction with external knowledge sources such as DataBases (DB) which could have a large number of NEs. NEs in these systems include people names, course numbers, restaurant names, locations, phone numbers, etc.

Recently, there has been a lot of interest in building neural methods for NLP tasks. Interacting with NEs poses some unique challenges to neural methods. There are different ways in which past work has tried to handle NEs in neural systems. One straightforward way is to add each and every NE (including those in the DB) to the vocabulary. This approach has been evaluated only on synthetic or small tasks (Neelakantan et al., 2015). For real world tasks, especially those with large DBs, it causes an explosion in the vocabulary size and hence in the number of parameters to learn. There is also the problem of not being able to learn good neural embeddings for individual NEs, as an individual NE (e.g., a particular phone number) generally occurs only a few times in a dataset.

Another approach proposed in the literature is to encode all the NEs with random representations and keep them fixed throughout (Yin et al., 2015), but then we lose the meaning associated with learned neural embeddings and risk the representations of different NEs interfering and correlating with one another in unexpected ways.

Another simple way in which NEs are handled in many real world systems is to first recognize the NEs with either NE taggers (Finkel et al., 2005) or entity linkers (Cucerzan, 2007; Guo et al., 2013; Yang and Chang, 2015), and then replace them with NE-type tags. For example, all location names could be replaced with the tag NE_location. This prevents the explosion in vocabulary size; however, the system loses the ability to distinguish and reference different NEs of the same type. In addition, new NEs can arise at test time. In fact, many of the OOV words that arise at test time in NLP tasks are NEs.

Furthermore, there are many NLP tasks where it is easier and more accurate for the system to work with the actual exact values of NEs rather than their neural embeddings, such as providing a phone number to a user or searching for a faculty name over a DB. None of the above neural methods can interact with exact values while remaining fully within the gradient-based neural learning framework.

In this paper, we propose a simple idea for neural methods to interact with NEs that handles all the aforementioned issues, including robustness to OOV NEs at test time. The core idea is to not include any of the NEs in the vocabulary, but rather to generate a neural embedding for each NE on the fly when the agent encounters it, store these embeddings and the associated exact values in a table, and then use the generated representations to retrieve and use the actual NE value from the constructed table whenever required. We demonstrate our idea on three types of tasks: a reading-comprehension task, a simple structured Question-Answering (QA) task and three goal-oriented dialog tasks. The QA and dialog tasks involve interaction with a DB, for which we use a multiple-attention based neural retrieval mechanism. Our results clearly suggest that our proposed way of handling NEs is effective in many NLP tasks.

2 Details of Proposed Solution

To explain our idea in detail, consider a neural dialog system participating in a dialog with a user (though the idea is applicable to various NLP tasks, we choose this one to explain it). The system builds a predefined vocabulary, obtained from the training data by excluding all NEs. The sentence encoder (e.g., a Recurrent Neural Network (RNN)) processes the user utterance as shown in Figure 1 and described in Equation 1. A Named Entity Recognizer (NER) is used to identify named entities and their types. For tasks such as goal-oriented dialog systems with a DB, a NER is not required, as NEs and their types can be obtained easily by referring to the DB. Let u_t be the user utterance at time step t, and let w_1, ..., w_n be the words in u_t. For words that are part of the vocabulary, their neural embeddings can be obtained from the encoding matrix E. If a word w_i is a NE, then it will not be part of the vocabulary and hence will not have a neural embedding in the encoding matrix. For NEs, the dialog system uses its knowledge of the dialog so far (d_{t-1}), the current utterance so far (h_{i-1}) and the NE-type (NE_type(w_i), e.g., NE_course_number) of the NE encountered to generate a neural embedding (k_i) for it. This on-the-fly generated embedding, which crucially is a function of the dialog context so far, is used by the sentence encoder while encoding the NE. It is also stored in a separate table called the NE-Table. A separate, initially empty, NE-Table is used for each individual dialog. The NE-Table is populated with key-value pairs, where the key is the embedding generated (k_i) by the dialog system and the value is the actual NE (e.g., EECS 545) encountered.

Figure 1: Instantiation of the idea in an RNN.

The following are the equations associated with encoding a single word w_i of the user sentence u_t:

e_i = (1 - Is_NE(w_i)) * E(w_i) + Is_NE(w_i) * k_i,
k_i = f_W(d_{t-1}, h_{i-1}, E(NE_type(w_i))),
h_i = RNN(h_{i-1}, e_i),     (1)

where Is_NE(w_i) is 1 if w_i is a NE and 0 otherwise, NE_type gives the NE type of w_i (e.g., NE_type(EECS 545) = NE_course_number), e_i is the embedding fed to the encoder for word w_i, and h_i is the encoder state after reading w_i. Note that though NEs are not part of the vocabulary, their NE-type tags are; hence, the NE-type tags have an embedding in the encoding matrix E. The various W's (the parameters of f_W and of the RNN) are the parameters that are learned.

When the dialog system wants to refer back to a NE value, it can do so by generating a key to match against the keys in the NE-Table, retrieving the corresponding value (e.g., EECS 545) and using it. For example, it can retrieve a NE that it came across earlier in the dialog from the NE-Table and use it in its system utterance (output sentence), or use it to match over an attribute's (e.g., Course Number) values in an external DB. The specific action performed with the retrieved NE depends on the choice of the natural language generator or the DB retrieval mechanism.

Note that while the matching of a NE selected from the NE-Table with other NEs in the DB is done through exact value matching, the actual selection of that NE from the NE-Table happens through neural embedding (key) matching. This makes the process differentiable and allows the system to learn this selection through the gradient signals obtained from the downstream task or module (e.g., a retrieval module) that performs the selection. NEs encountered in system utterances of a dialog are handled in the same way. Thus, all and only the NEs that have appeared in that particular dialog so far will be present in the NE-Table associated with that dialog. The system learns to generate representations for the NEs as they come in, such that the representations carry enough relevant information to allow it to match and retrieve them when required later.
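To make the mechanism concrete, below is a minimal Python/PyTorch sketch of an NE-Table, with assumed module and variable names (key_generator, context, etc.) that are illustrative rather than taken from our implementation. A key is generated on the fly from the dialog context and the NE-type embedding and stored next to the exact NE string; later, a value is retrieved by differentiable attention over the stored keys, and only the final lookup uses the exact string.

import torch
import torch.nn as nn

class NETable(nn.Module):
    """Per-dialog table of (key embedding, exact NE value) pairs (illustrative sketch)."""

    def __init__(self, context_dim, type_dim, key_dim):
        super().__init__()
        # Generates a key from the dialog/utterance context and the NE-type embedding.
        self.key_generator = nn.Linear(context_dim + type_dim, key_dim)
        self.keys = []    # list of 1-D key tensors
        self.values = []  # list of exact NE strings, e.g. "EECS 545"

    def add(self, context, ne_type_emb, ne_value):
        """Generate and store a key for a newly encountered NE; return it to the encoder."""
        key = torch.tanh(self.key_generator(torch.cat([context, ne_type_emb], dim=-1)))
        self.keys.append(key)
        self.values.append(ne_value)
        return key  # used in place of a word embedding while encoding the NE

    def retrieve(self, query):
        """Soft (differentiable) selection over keys; hard lookup of the exact value."""
        keys = torch.stack(self.keys)                 # (num_NEs, key_dim)
        scores = torch.softmax(keys @ query, dim=0)   # attention over stored keys
        idx = int(torch.argmax(scores))               # exact value used for DB matching etc.
        return scores, self.values[idx]

During training, gradients flow into the key generator and the query through the attention scores, while the exact value is used only for non-differentiable operations such as exact matching over a DB column or filling a slot in a response.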

3 Experiments and Results

We evaluate our idea on three types of tasks: a reading-comprehension task, a structured Question-Answering (QA) task and three goal-oriented dialog tasks. Rather than adding our NE-Table idea to a specialized architecture for each of the three task types, we chose the end-to-end memory network architecture from Sukhbaatar et al. (2015) as the base architecture for all tasks and added our proposed NE handling; our goal is to evaluate the idea rather than to obtain state-of-the-art performance on a particular task or dataset. Our proposed NE-Table idea is generic and can be added to the state-of-the-art approaches for these tasks.

3.1 Reading Comprehension Task

We test our idea on the Children's Book Test dataset (CBT) introduced by Hill et al. (2015), designed to test the role of memory and context in language processing and understanding. The CBT is built from children's books from Project Gutenberg. Example 'questions' are formed by enumerating 21 consecutive sentences: the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. The specific task is to identify the answer word among a set of 10 candidate answers appearing in the context sentences and the query. There are four question types (Named Entities, (Common) Nouns, Verbs and Prepositions); naturally, we test our idea on the Named Entities questions.

We use the Window memory architecture proposed by Hill et al. (2015) and perform two baseline evaluations, encoding the windows using BoW and using an LSTM. In Window memory, each memory slot refers to a window of text from the context centred on an individual mention of a candidate, instead of a full sentence from the context. We use a single-hop architecture for all of our experiments on the CBT dataset.

Model Validation Test
Window Memory (BoW encoding) 0.4955 0.4169
Window Memory (LSTM encoding) 0.4940 0.4110
NE-Table (BoW) 0.5705 0.5128
NE-Table (LSTM) 0.5575 0.5108
Table 1: Results (accuracy) on the CBT dataset

For the two NE-Table models that incorporate our idea into the two baselines, we generate representations of NEs on the fly using an LSTM. The NE embeddings are generated by passing window memories through an LSTM; these representations are then added to window memory in place of the NE. For a fair comparison, as done in Hill et al. (2015), we only create windows for words mentioned in the candidates, instead of creating windows for all NEs present in the story. Since the task is to predict the correct NE, we can directly perform attention over our NE-Table, instead of over the candidates, to retrieve the correct answer. Table 1 shows that replacing the baseline handling of NEs with our NE-Table based handling achieves higher performance than both the BoW and LSTM window encoding baseline models, across both validation and test sets. (For the CBT dataset, a window size of 5 was the optimal value reported in Hill et al. (2015); we think that because the window size is small, the BoW and LSTM models perform similarly.)

Figure 2: Results on CBT-OOV test sets

To further evaluate how OOV NEs impact the baselines and our idea, we created additional OOV test sets as follows (the OOV versions of the CBT-NE test data will be made publicly available online). There are 422 unique NEs (answers) among the 2500 samples in the test set. We generate new test sets by replacing these NEs with new NEs not present in the train and validation sets. We generate 5 such OOV test sets with varying percentages of OOV NEs (20%, 40%, 60%, 80% and 100%). Figure 2 shows the comparison of our model with the baselines on the OOV test sets. The baseline models perform very poorly as the OOV percentage increases, going down to as low as 5% from 41%. Our NE-Table models are far more robust on the OOV test sets. We observe a slight reduction in accuracy, from 51% to 46%, because the new entities are also part of the windows, which are used to generate NE embeddings. These additional experiments clearly illustrate that our model's performance is robust to OOV NEs. Detailed results of our experiments on the OOV test sets are in Appendix D.
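As an illustration of how such OOV test sets can be constructed, here is a small plain-Python sketch with hypothetical field names (context, query, candidates, answer); it replaces a chosen fraction of the unique answer NEs with replacement names that do not occur in the training or validation data.

import random

def make_oov_test_set(test_samples, replacement_nes, oov_fraction, seed=0):
    """Replace a fraction of unique answer NEs with NEs unseen in train/valid (sketch).

    test_samples: list of dicts with 'context' (list of sentences), 'query',
    'candidates' and 'answer'. replacement_nes: pool (list) of NE strings absent
    from the training and validation sets.
    """
    rng = random.Random(seed)
    unique_answers = sorted({s["answer"] for s in test_samples})
    n_replace = int(round(oov_fraction * len(unique_answers)))
    mapping = dict(zip(rng.sample(unique_answers, n_replace),
                       rng.sample(replacement_nes, n_replace)))

    def swap(text):
        for old, new in mapping.items():
            text = text.replace(old, new)
        return text

    return [{
        "context": [swap(sent) for sent in s["context"]],
        "query": swap(s["query"]),
        "candidates": [mapping.get(c, c) for c in s["candidates"]],
        "answer": mapping.get(s["answer"], s["answer"]),
    } for s in test_samples]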

3.2 Multiple-attention based neural retrieval mechanism

The remaining tasks involve interaction with an external DB. Here, we describe the neural mechanism that we use for that and present our results on those tasks in the following sections.

For our structured QA and extended bAbI dialog tasks, information is present in a single database table, where each row corresponds to an entity of interest and the columns correspond to the different attributes associated with it. For example, in the structured QA task, each row corresponds to a course and the columns correspond to course attributes such as course number, course name, instructor name, etc. Each column of the table has a column heading, which labels the attribute of that column. These headings are part of the vocabulary. While the non-NEs in the DB are part of the vocabulary and represented by their learned neural embeddings, the NEs (not part of the vocabulary) are represented by their exact values.

The DB retrieval module performs attention over column headings and attention over rows to select the final cell(s) in 3 steps. In step 1, the column(s) that the final cell(s) belong to are selected by neural embedding attention over the column headings. For the example question Who teaches EECS545?, step 1 selects the column 'instructor name'. In step 2, the system selects the attributes (columns) with which it wants to represent the rows, by attending over the column headings again (in the example above, the NE column 'course number' is selected).

The third step is to perform attention over the rows. For each non-NE column selected in step 2, the column embeddings are added together along each row to generate an embedding for each row, and we perform attention over these row embeddings to select matching row(s). For each NE column selected in step 2, an NE value is retrieved from the NE-Table and an exact match search is done over that NE column to select matching row(s). The intersection of these matching row(s) gives the final set of selected row(s), and their intersection with the set of column(s) selected in step 1 gives the retrieved cell(s). For our example, only one column is selected to represent the rows, 'course number', which is an NE column. Therefore, an NE value is retrieved from the NE-Table (EECS545 in our example) and an exact match search is done over the course number column. Appendix B provides further explanation and details of the mechanism with examples.
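A compact sketch of the three retrieval steps is given below, under assumed tensor shapes and variable names (none of which come from our implementation), and simplified to a single NE value retrieved from the NE-Table: attention over column headings selects the answer column(s) and the row-representation column(s); rows are then scored by embedding attention for the selected non-NE columns and by exact matching for the selected NE columns, and the final cells are the intersection of the selected rows and answer columns.

import torch

def retrieve_cells(col_head_emb,    # (C, d) embeddings of the column headings
                   col_is_ne,       # (C,) bool, True for NE columns
                   row_cell_emb,    # (R, C, d) embeddings of non-NE cells (zeros for NE cells)
                   row_cell_value,  # R x C nested list of exact cell strings
                   acc_key,         # (d,) key for selecting the answer column(s)
                   acr_key,         # (d,) key for selecting the row-representation column(s)
                   arr_nonne_key,   # (d,) key for row attention over non-NE columns
                   ne_value,        # exact NE string retrieved from the NE-Table
                   threshold=0.5):
    # Step 1: which column(s) the answer cells belong to.
    answer_cols = torch.sigmoid(col_head_emb @ acc_key) > threshold
    # Step 2: which column(s) are used to represent the rows.
    repr_cols = torch.sigmoid(col_head_emb @ acr_key) > threshold
    nonne_cols = repr_cols & ~col_is_ne
    ne_cols = repr_cols & col_is_ne

    # Step 3a: embedding attention over rows, using the selected non-NE columns.
    row_emb = (row_cell_emb * nonne_cols.float()[None, :, None]).sum(dim=1)   # (R, d)
    rows_soft = torch.sigmoid(row_emb @ arr_nonne_key) > threshold            # (R,)

    # Step 3b: exact-value match over rows, using the selected NE columns.
    rows_exact = torch.ones(len(row_cell_value), dtype=torch.bool)
    for c in torch.nonzero(ne_cols).flatten().tolist():
        match = torch.tensor([row[c] == ne_value for row in row_cell_value])
        rows_exact &= match

    rows = rows_soft & rows_exact if bool(nonne_cols.any()) else rows_exact
    return [(r, c) for r in torch.nonzero(rows).flatten().tolist()
                   for c in torch.nonzero(answer_cols).flatten().tolist()]

In training, the column and row attentions receive gradient signals (here from attention supervision or the downstream loss), while the exact match in step 3b uses the NE value whose selection from the NE-Table is itself learned through key matching.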

3.3 Structured QA from DB

The task here is to retrieve an answer (a single cell in a table) from the DB in response to structured one-line questions. We used the details of course offerings at a university to create these question-answer pairs. Each row in the DB table corresponds to a unique course, and the columns correspond to course attributes. The DB is a single table of 100 rows and 4 columns (Course Number, Course Name, Department, Credits), where course numbers and course names are treated as NEs.

The question/answer pairs are generated automatically, following the format:
Q: NE-type-1 NE-type-1-value NE-type-2 ?
A: NE-type-2-value
where Course Number and Course Name are the two NE types. 500 question-answer pairs were created in the above format, and the data is split randomly between training and test sets (400-100), where the random split results in new OOV NEs in the test set that are not present in the training set. (The statistics for the dataset are as follows: 100 unique course numbers, 96 unique course names, 10 unique department names and 4 unique numbers of credits.)

Example structured question-answer pair:
Q: Course Number EECS545 Credits? A: 4
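For illustration, a small sketch of how question-answer pairs in this format could be generated from such a course table (column names follow the DB above; the function and field names are illustrative, not our dataset-generation code):

import random

COLUMNS = ["Course Number", "Course Name", "Department", "Credits"]
NE_COLUMNS = ["Course Number", "Course Name"]

def generate_qa_pairs(db_rows, n_pairs, seed=0):
    """db_rows: list of dicts keyed by COLUMNS, e.g. {"Course Number": "EECS545", ...}."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        row = rng.choice(db_rows)
        key_col = rng.choice(NE_COLUMNS)                            # attribute given in the question
        ans_col = rng.choice([c for c in COLUMNS if c != key_col])  # attribute asked about
        question = f"{key_col} {row[key_col]} {ans_col} ?"
        pairs.append((question, str(row[ans_col])))
    return pairs

# e.g. ("Course Number EECS545 Credits ?", "4")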

The experiments were performed with two models. Both models use a simple RNN to encode the question and the multiple-attention based neural retrieval mechanism to retrieve answers. The baseline model (W/O-NE-Table) does not distinguish NEs from normal words: all words (including NEs) that occur in questions and the DB are part of the vocabulary and have individual word embeddings. The With-NE-Table model builds an NE-Table to handle the NEs (course numbers and course names in this task).

For the example question above, both models perform attention over the column headings to identify the correct column, Credits, required for the answer. Then, both models attend over the column headings to find the column, Course Number, used for representing the rows. For the W/O-NE-Table model, since all course numbers are part of the vocabulary, each row is represented by the neural embedding associated with its course number, and neural embedding attention is done over the row embeddings. For the With-NE-Table model, since course numbers are NEs, each row is represented by its exact course number value. Neural attention over the NE-Table is performed to return the NE value EECS545, which is then used to perform an exact match against the NE row representations.

Model Retrieval accuracy (%)
W/O-NE-Table 81.0
With-NE-Table 100.0
Table 2: Results on structured QA task

Table 2 shows the retrieval accuracy for both models. The training accuracy for both models was 100%. For the W/O-NE-Table model, one reason for the 19% drop in performance at test time is the OOV NEs encountered in the questions at test time: of the 19%, an 11% drop is due to OOV NEs. The remaining 8% can be attributed to the model's inability to learn good representations for unique NEs that were seen during training and also encountered during testing; these NEs are in the DB, and hence are part of the vocabulary, but have random representations that did not change during training. The task was specifically constructed to be simple and with a small table to show that, even in this very simple task where the W/O-NE-Table model achieves 100% accuracy at training time, its test accuracy is affected significantly by OOV NEs at test time. This does not pose a problem for our With-NE-Table model, which can also easily scale to large datasets with thousands of NEs without a drop in performance.

3.4 Goal-Oriented Dialog Tasks

The dialog bAbI dataset from Bordes and Weston (2016) is a goal-oriented dialog dataset for restaurant reservation. It has 5 tasks - Task 1: Issuing API calls, Task 2: Updating API calls, Task 3: Displaying Options, Task 4: Providing extra information and Task 5: Conducting full dialogs (a combination of tasks 1-4). Each of the four tasks (1-4) tests a different capability required in a general goal-oriented dialog system. The system is evaluated in a retrieval setting: at each turn of the dialog, the system has to select a candidate response from a list of possible candidates.

In the original bAbI tasks, the process of retrieving information from the DB is bypassed by providing all possible system utterances, with all combinations of information pre-retrieved from the DB, in a large candidate response list. We extend the tasks by adding an actual external DB so that the system can also be tested on learning to actually retrieve the required information from the DB. We evaluate our idea on extended versions of tasks 1, 2 and 4, described below. (Task 3 requires learning to sort, and Bordes and Weston (2016) achieve close to 0% accuracy on full dialogs for it, so we skip tasks 3 and 5, since task 5 includes task 3 dialogs, to focus on evaluating our approach for NE handling.) Appendix A gives examples of the original and the extended dialog bAbI tasks; the extended versions of dialog bAbI tasks 1, 2 and 4 will be made publicly available online.

We implement the NE-Table in an end-to-end memory network (Sukhbaatar et al., 2015), similar to the model used by Bordes and Weston (2016), except that we encode the sentences using an RNN, while they use a bag-of-words (BoW) encoding. The embeddings for the dialog history are stored in the memory, and the RNN-learned embedding of the last user utterance (query) is used to attend over the memory to get relevant information from it. This is done multiple times (3 hops in our experiments). The last internal state generated is used both to select the candidate response and to generate the key embeddings for performing DB retrieval with the multiple-attention based neural retrieval mechanism.
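For reference, a minimal sketch of the multi-hop memory attention just described, with assumed dimensions and names (this is not our exact architecture): the query embedding attends over the memory of dialog-history embeddings for a fixed number of hops, and the final internal state is used both for scoring candidate responses and for generating the DB-retrieval keys.

import torch
import torch.nn as nn

class MultiHopMemory(nn.Module):
    def __init__(self, dim, hops=3):
        super().__init__()
        self.hops = hops
        self.update = nn.Linear(2 * dim, dim)  # combines the state with the retrieved memory

    def forward(self, query_emb, memory_emb):
        """query_emb: (d,) RNN encoding of the last user utterance.
        memory_emb: (M, d) encodings of the dialog history stored in memory."""
        state = query_emb
        for _ in range(self.hops):
            attn = torch.softmax(memory_emb @ state, dim=0)   # (M,) attention over memory
            read = attn @ memory_emb                          # (d,) weighted sum of memories
            state = torch.tanh(self.update(torch.cat([state, read], dim=-1)))
        # 'state' is used to score candidate responses and to produce the DB attention keys.
        return state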

We use the DB to identify the NEs along with their types (if a word is present in an NE column in the DB, it is a NE; the column where it appears gives its type). The NE-type information is given to both the With-NE-Table and the W/O-NE-Table models. For the W/O-NE-Table model, all input words are part of the vocabulary; for NEs, however, the embedding given to the sentence encoder is the sum of the NE word embedding and the embedding associated with its NE-type. For the With-NE-Table model, during the process of encoding the dialog sentences with an RNN, a NE key is generated when a NE is encountered and stored in the NE-Table. The ground truth attention labels are used for training the DB retrieval module.

3.4.1 Extended dialog bAbI tasks 1 and 2

In the original bAbI task 1, the conversation between the system and the user involves getting information necessary to issue an api_call with the appropriate argument values. In task 2, the user can ask the system to update his/her request for information by changing some of their preferences. The system has to take this into account and make an updated api_call.

In our extended versions of these tasks, once the system determines that the next utterance is an api_call, it also has to actually retrieve the restaurant details from the database (rows of the DB table) that match the user's preferences. The system is evaluated on carrying out the conversation with the user, issuing the api_call and retrieving the correct information from the DB. The DB is represented as a single table, with each row corresponding to a unique restaurant and the different columns corresponding to attributes, e.g., cuisine, location etc.

We have two models, W/O-NE-Table and With-NE-Table. Both models first select the four relevant columns (cuisine, location, price range and number of people) to represent each row (restaurant). The W/O-NE-Table model then selects the rows using attention over the row embeddings obtained from the combined (additive) representation of the four attributes selected above. In the With-NE-Table model, whenever cuisine and location names (which are NEs) occur in the dialog, a NE key is generated on the fly and stored in the NE-Table along with the NE value. The model splits the row selection into two simpler problems. For cuisine and location, one NE value each is selected from the NE-Table and an exact match over the DB is performed. The neural embeddings of the non-NE attributes (price range and number of people) are added together to perform attention for selecting rows. The final retrieved rows are the intersection of the rows selected by the NE-column and non-NE-column based selections. Additional details of the DB retrieval mechanism for tasks 1 and 2 are provided in Appendix C.

The results for task 1 and task 2 are shown in Table 3. The With-NE-Table model achieves close to 100% accuracy on both tasks, while W/O-NE-Table performs poorly. During DB retrieval, for the With-NE-Table model, two NEs are chosen from the NE-Table and exact matching is done over the different cuisines and locations in the DB table, whereas W/O-NE-Table has to learn embeddings for these NEs. This results in poor performance of W/O-NE-Table, as a particular location or cuisine value occurs only a few times in the dataset, resulting in poorly learned embeddings. In addition, OOV cuisine and location values can occur at test time.

Per-dialog accuracy does not involve DB retrieval. (None of the system responses in tasks 1, 2 and 4 contain any NEs; therefore, the baseline W/O-NE-Table model reaches 100% per-dialog accuracy on the test sets, but performs poorly on DB-retrieval accuracy, which requires interaction with NEs.) Here, the system needs to understand user utterances, which might contain NEs, and select the correct response from the candidates. Both models perform well on the normal test set. However, on the OOV test set for task 1, the W/O-NE-Table model is affected by OOV NEs (90.3%), while the With-NE-Table model's performance remains robust (99.0%).

3.4.2 Extended dialog bAbI task 4

The original task 4 starts at the point where the user has decided on a particular restaurant. The system is given all the information about only that particular restaurant as part of the dialog history, and the user can ask for the phone number, the address or both. The system must learn to use the given information to answer these questions by selecting the correct response from a list of candidate responses that contains responses with all possible restaurant phone numbers and addresses.

In the extended version, the system needs to search the full DB of all the restaurants. The NEs in candidate responses are replaced with their NE-type tags; for example, Suvai_phone is replaced with NE_phone. The system has to select candidates with NE-type tags and then replace the tags with the actual NE values obtained from DB retrieval. (Our setting is closer to how a human would do this task: when someone asks for the phone number of a restaurant, we do not try to memorize it or figure out how it is related to another phone number; rather, we search for it in the DB.) The restaurant name, phone number and address are the NEs here. This setting is similar to the system action templates proposed in Hybrid Code Networks (Williams et al., 2017).

For the With-NE-Table model, the restaurant name that appears in the dialog is stored in the NE-Table. When the user asks for information such as the phone number, the restaurant name stored in the NE-Table is selected and used for retrieving its corresponding phone number from the DB. In the W/O-NE-Table model, all input words (including NEs) are part of the vocabulary, and the phone number is selected by neural embedding attention over all restaurants, using the embedding of the restaurant name mentioned by the user.

The results for task 4 are shown in Table 3. We observe that both models perform well in per-dialog accuracy (retrieving candidate responses). The W/O-NE-Table model fails in DB retrieval (0%) because it needs to learn neural embeddings for all restaurant names, while our With-NE-Table model performs well (100%), as it uses the NE-Table to generate NE embeddings on the fly and later uses the actual NE values for exact value matching over restaurant names in the DB.

Task Model DB-Retrieval Per-Dialog Per-Dialog + DB-Retrieval
Task 1 W/O-NE-Table 10.2 (7) 100 (90.3) 10.2 (6.7)
With-NE-Table 98.5 (99.0) 98.8 (99.0) 97.3 (98.0)
Task 2 W/O-NE-Table 0.75 (0.95) 100 (100) 0.0 (0.1)
With-NE-Table 99.6 (99.8) 100 (99.9) 99.2 (99.7)
Task 4 W/O-NE-Table 0.0 (0.0) 100 (100) 0.0 (0.0)
With-NE-Table 100 (100) 100 (100) 100 (100)
Table 3: Results for extended bAbI tasks 1, 2 and 4. % Accuracy for Test and Test-OOV (given in parentheses). DB-Retrieval %: Retrieval accuracy for rows (tasks 1, 2) and a particular cell (task 4). Per-Dialog %: Percentage of dialogs where every dialog response is correct. Per-Dialog + DB-Retrieval %: Percentage of dialogs where every dialog response and information from DB retrieval are correct.
Task Model Evaluation Task 1 Task 2 Task 4
Original bAbI tasks Baseline(MemN2N + match-type + RNN-encoding) Per-Dialog 100 (100) 99.9 (50.6) 100 (100)
Extended bAbI tasks With-NE-Table Per-Dialog + DB-Retrieval 97.3 (98.0) 99.2 (99.7) 100 (100)
Table 4: Performance comparison of our model on the extended dialog bAbI tasks with a baseline model on the original bAbI tasks. Accuracies in % for Test and Test-OOV (given in parentheses).

3.4.3 Comparison with original dialog bAbI tasks

We choose the best model (MemN2N + match-type features) from Bordes and Weston (2016), which uses match-type features for dealing with entities, and update this baseline model by using RNN encoding for sentences (similar to With-NE-Table). Note that our updated baseline achieves higher accuracy on the original bAbI tasks than reported in Bordes and Weston (2016), which we attribute to the use of an RNN for encoding sentences (they use a BoW encoding).

For match-type features, Bordes and Weston (2016) add special words (R_CUISINE, R_PHONE, etc.), one for each KB entity type (cuisine, phone, etc.), to the vocabulary. The special word (e.g., R_CUISINE) is added to a candidate if a cuisine (e.g., italian) appears in both the dialog and the candidate. For example, for a task 4 dialog with restaurant information about RES1, only one candidate, "here it is RES1_phone", will be modified, to "here it is RES1_phone R_PHONE". Now, if the user utterance (query) asks for the restaurant's phone number, the match-type features essentially reduce the output search space for the model and allow it to attend to specific candidates better. Hence, match-type features can only work in a retrieval setting and will not work in a generative setting where the next system utterance is generated word by word. Our With-NE-Table model works in both retrieval and generative settings.
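For clarity, a small sketch of this match-type augmentation (the entity lists below are illustrative and not the actual bAbI KB): a type word is appended to a candidate whenever the same entity of that type occurs both in the dialog so far and in the candidate.

# Illustrative KB entity values, grouped by match-type word.
ENTITIES_BY_TYPE = {
    "R_CUISINE": {"italian", "british", "indian"},
    "R_LOCATION": {"london", "paris", "bombay"},
    "R_PHONE": {"The_Place_phone", "Suvai_phone"},
}

def add_match_type_features(dialog_text, candidate):
    """Append match-type words (e.g. R_CUISINE) to a candidate response."""
    dialog_words = set(dialog_text.split())
    candidate_words = set(candidate.split())
    augmented = candidate
    for type_word, values in ENTITIES_BY_TYPE.items():
        # Added only if some entity of this type occurs in both the dialog and the candidate.
        if values & dialog_words & candidate_words:
            augmented += " " + type_word
    return augmented

# e.g. with "The_Place_phone" appearing in the given restaurant information (dialog history),
# "here it is The_Place_phone" becomes "here it is The_Place_phone R_PHONE".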

Table 4 compares the performance of the With-NE-Table model on the extended bAbI tasks with that of the baseline method on the original bAbI tasks. Note that the extended dialog bAbI tasks require the dialog system to do strictly more work than the original dialog bAbI tasks. Though not a strictly fair comparison for our model, we observe that the performance of our With-NE-Table model on the extended bAbI tasks is as good as the performance of the updated baseline model on the original bAbI tasks. In addition, on the bAbI task 2 OOV test set, the With-NE-Table model's performance is actually much higher than the baseline model's (99.7% vs. 50.6%).

4 Related Work

NE in QA: Neelakantan et al. (2015) and Yin et al. (2015) transform a natural language question/query into a program that can run on databases, but those approaches are verified only on small or synthetic databases. Other papers dealing with large Knowledge Bases (KB) usually rely on entity linking techniques (Cucerzan, 2007; Guo et al., 2013), which link entity mentions in texts to KB entities. Yih et al. (2015), Yin et al. (2016) and Yu et al. (2017) compare the text spans in questions with KB entity names at the character level for entity linking; after the linked entities have their properties extracted, the corresponding text spans are replaced with special NE tags for further text processing such as KB relation extraction. Recently, Liang et al. (2016) extended end-to-end neural methods to QA over KBs, which can handle large KBs and large numbers of entities. However, their method still relies on entity linking (Yang and Chang, 2015) to generate, in advance, a short list of entities linked to text spans in the questions. Yin et al. (2015) propose the 'Neural Enquirer', a neural network architecture similar to the neural retrieval mechanism used in this work, to execute natural language queries on a DB. In the Neural Enquirer, the randomly initialized embeddings of the NEs are kept fixed as a way to handle NEs and OOV words.

NE in Dialog: There has been a lot of interest in end-to-end training of dialog systems (Vinyals and Le, 2015; Serban et al., 2016; Lowe et al., 2015; Kadlec et al., 2015; Shang et al., 2015; Guo et al., 2017). Among recent work, Williams and Zweig (2016) use an LSTM model that learns to interact with APIs on behalf of the user; Dhingra et al. (2017) use reinforcement learning to build the KB look-up in task-oriented dialog systems, but their look-up actions are defined over each entity in the KB and are therefore hard to scale up. Most of these papers do not actually discuss the issue of interacting with NEs, even though NEs are present.

Williams et al. (2017) proposed Hybrid Code Networks which combine an RNN with domain-specific knowledge encoded as software. They achieved state-of-the-art performance on the Facebook bAbI dataset, but their approach involves a developer writing domain-specific software components.

NE in Reading Comprehension and others: For certain tasks such as Machine Translation and summarization, neural copying mechanisms Gulcehre et al. (2016); Gu et al. (2016) have been proposed as a means to handle OOV words. We could use our NE-Table idea along with such copying mechanisms for cases like dialog generation.

5 Conclusion and Future work

In this paper we proposed a method for interacting with NEs in neural settings. Our experiments on the CBT dataset illustrate that our model with the NE-Table performs better than models without NE handling and clearly outperforms the baseline models on the OOV test sets. From our experiments on the structured QA task and the goal-oriented bAbI dialog tasks, we show that our method can be used for interacting with entities from a provided DB, within the neural framework. Overall, our experiments on the three NLP tasks show that handling NEs in the proposed way works and can be useful for the many NLP tasks where it is helpful to work with actual NE values, where it is hard to learn good neural representations for NEs because they occur rarely, or where OOV NEs are encountered at test time.

In future work, we are interested in testing the NE-Table idea with other retrieval mechanisms, such as the 'Neural Enquirer' proposed in Yin et al. (2015), which can work with multiple tables, instead of the multiple-attention mechanism used in our work. We are also interested in exploring the use of this idea in the more general setting of working with both symbols and distributed representations, exploiting the benefits of both wherever appropriate, while staying within an end-to-end trainable neural setting.

References

Appendix A
Extended dialog bAbI tasks’ examples

Original dialog bAbI task 1:
U: Hi !
S: Hello, what can I help you with today?
U: I’d like to book a table for six people in an expensive price range with British food.
S: I’m on it
S: Where should it be?
U: In London
S: Ok, let me look into some options for you.
S: api_call British, London, Six, Expensive
Extended dialog bAbI task 1:
U: Hi !
S: Hello, what can I help you with today?
U: I’d like to book a table for six people in an expensive price range with British food.
S: I’m on it
S: Where should it be?
U: In London
S: Ok, let me look into some options for you.
S: api_call
S (Retrieved Information):
{The_Place, british, london, …, }
{The_Fancy_Pub, british, london, …, }

Original dialog bAbI task 2:
U: Hi !
S: Hello, what can I help you with today?
U: I’d like to book a table for six people in an expensive price range with British food in London.
S: Ok, let me look into some options for you.
S: api_call British, London, Six, Expensive
U: Instead, could it be Indian cuisine
S: Sure, is there anything else to update?
U: No.
S: Ok, let me look into some options for you
S: api_call Indian, London, Six, Expensive

Extended dialog bAbI task 2:
U: Hi !
S: Hello, what can I help you with today?
U: I’d like to book a table for six people in an expensive price range with British food in London.
S: Ok, let me look into some options for you.
S: api_call
S (Retrieved Information):
{The_Place, british, london, …,}
{The_Fancy_Pub, british, london, …, }
U: Instead, could it be Indian cuisine
S: Sure, is there anything else to update?
U: No.
S: Ok, let me look into some options for you
S: api_call
S (Retrieved Information):
{Suvai, Indian, london, …, }
{Madras_Masala, Indian, london, …, }

Original dialog bAbI task 4:
DB Information given :
Name:The_Place, Cuisine:british,
Location:london, Price Range:Expensive,
Rating:7, Phone:The_Place_phone,
Address:The_Place_address, Availability:six
U: Hello
S: Hello, what can I help you with today?
U: I’d like to book a table at The_Place
S: Great, let me do the reservation.
U: May I have the address of the restaurant.
S: Here it is, The_Place_address.

Extended dialog bAbI task 4:
DB Information given : Entire DB
U: Hello
S: Hello, what can I help you with today?
U: I’d like to book a table at The_Place
S: Great, let me do the reservation.
U: May I have the address of the restaurant.
S: Here it is, NE_address.
S: The_Place_address (retrieved information)

Appendix B
Multiple-attention based neural retrieval mechanism

Figure 3: Multiple-attention based neural retrieval mechanism. When the encoder RNN encounters a NE, it generates a key representation for it and stores it in the NE-Table. When the dialog manager/decoder RNN wants to retrieve information from the DB, it attends to the relevant rows and columns of the DB by generating attention key embeddings ACC, ACR and ARR.

Figure 3 shows a schematic of the entire retrieval process. In order to retrieve a particular cell from the table, the system needs to find the correct column and row corresponding to it. The DB retrieval module does this by generating 3 different attention key embeddings (vectors): Attention over Columns for Columns (ACC), Attention over Columns for Rows (ACR), and Attention over Rows for Rows (ARR).

The column(s) that the final retrieved cell(s) belong to are selected by matching the ACC key embeddings with the neural embeddings of the column headings (Course Number, Instructor, Credits, etc.). A separate ACC key embedding is generated for every column heading and matched with that heading's embedding to provide attention scores for all the columns. For the example Who teaches EECS545?, the system wants to retrieve the name of the instructor; therefore, the Instructor column heading alone will have a high attention score and be selected. In our experiments, the attention scores are computed through dot products followed by a sigmoid operation, which allows for multiple selections.

Now that the column(s) are chosen, the system has to select row(s), so that it can get the cell(s) it is looking for. Each row in the table contains the values (EECS545, Machine Learning, Scott Mathew etc) of several attributes (Course Number, Course Name, Instructor etc). But we want to assign attention scores to the rows based on particular attributes that are of interest to the present scenario (Course Number in this example). The column/attribute headings that the system has to attend to for selecting these relevant attributes are obtained by matching ACR (Attention over Columns for Rows) key embeddings with the neural embeddings of the different column headings.

The last step in the database retrieval process is to select the relevant rows using the ARR (Attention over Rows for Rows) key embedding. ARR is split into two parts ARR NE and ARR non-NE. In a general scenario, ACR can select multiple columns to represent the rows. For each selected column that is a NE column, a separate NE value is retrieved from the NE-Table using a separate ARR NE embedding for each of them. These NE values are used to do exact match search along the corresponding columns (in the NE row representations) to select the matching rows. For the non-NE columns that are selected by ACR, their neural embeddings are combined together along each row to get a fixed vector representation for each row in the DB (e.g. weighted sum of their embeddings, weighted by the corresponding column attention scores). ARR non-NE is then used to match these representations for selecting rows. The intersection of the rows selected in the NE row representations and the non-NE row representations is the final set of selected rows.

In short, the dialog system can use neural embedding matching for non-NEs, exact value matching for NEs and therefore a combination of both to decide which rows to attend to. Depending on the number of columns and rows we match with, we select zero, one or more output cells. For our running example, ARR NE is used to match with the keys in the NE-Table to select the row corresponding to EECS 545 and the value EECS 545 is returned to do an exact match over the NE row representations (represented by the course number values). This gives us the row corresponding to EECS 545 and hence the cell Scott Mathew.

We could use our NE-Table idea with potentially many types of neural retrieval mechanisms to retrieve information from the DB. The multiple-attention based retrieval mechanism, described above, is only one such possible mechanism.

Task Model ACR ARR non-NE ARR NE DB-Retrieval Per-response Per-Dialog Per-Dialog + DB-Retrieval
Task 1 W/O-NE-Table 100 (100) 9.0 (6.9) - 10.2 (7) 100 (98.2) 100 (90.3) 10.2 (6.7)
With-NE-Table 99.4 (98.1) 96.9 (96.7) 100,100 (100,100) 98.5 (99.0) 99.8 (99.8) 98.8 (99) 97.3 (98.0)
Task 2 W/O-NE-Table 100 (100) 8.6 (7.6) - 0.8 (1.0) 100 (100) 100 (100) 0.0 (0.1)
With-NE-Table 100 (100) 99.1 (99.8) 100,100 (100,100) 99.6 (99.8) 100 (100) 100 (100) 99.2 (99.7)
Table 5: Results for extended dialog bAbI tasks 1 and 2. Accuracy % for Test and Test-OOV (given in parentheses). ARR non-NE columns are price and number of people. ARR NE columns are cuisine and location. DB-Retrieval %: Retrieval accuracy for rows (tasks 1, 2) and a particular cell (task 4). Per-Dialog %: Percentage of dialogs where every dialog response is correct. Per-Dialog + DB-Retrieval %: Percentage of dialogs where every dialog response and information from DB retrieval are correct.
Model ACR ACC ARR non-NE ARR NE DB-Retrieval Per-response Per-Dialog Per-Dialog + DB-Retrieval
W/O-NE-Table 100 (100) 100 (100) 0.0 (0.0) - 0.0 (0.0) 100 (100) 100 (100) 0.0 (0.0)
With-NE-Table 100 (100) 100 (100) - 100 (100) 100 (100) 100 (100) 100 (100) 100 (100)
Table 6: Results for extended dialog bAbI task 4. Accuracies in % for Test and Test-OOV (given in parentheses). DB-Retrieval %: Retrieval accuracy for rows (tasks 1, 2) and a particular cell (task 4). Per-Dialog %: Percentage of dialogs where every dialog response is correct. Per-Dialog + DB-Retrieval %: Percentage of dialogs where every dialog response and information from DB retrieval are correct.

Appendix C
Goal oriented dialog tasks: extended results

Extended results for tasks 1 and 2
The detailed results for task 1 and task 2 are shown in Table 5.
With-NE-Table:
For issuing an api_call in tasks 1 and 2, four argument values are required - cuisine, location, price range and number of people. We consider cuisine and location to be NEs. So whenever cuisine and location names occur in the dialog, a NE key is generated on the fly and is stored in the NE-Table along with the NE values.

  • ACC: For tasks 1 and 2, ACC is not required as we are interested in retrieving rows from the table.

  • ACR: ACR is used to select the columns required to represent the rows. There are four columns: NE columns (cuisine and location) and non-NE columns (price range and number of people).

  • ARR-non-NE: Each row in the DB is represented by weighted vector sum of its price range and number of people values. The model returns the relevant rows using attention on the non-NE columns embeddings.

  • ARR-NE: The model attends over the NE-Table by matching its generated key with the keys present in the NE-Table to retrieve NE values. The selected NE values are then matched (exact-match) with cuisine and location values in DB to retrieve the relevant rows. The final retrieved rows are the intersection of the rows selected by both these parts.

W/O-NE-Table:
ACR is used to attend to the four relevant columns. However, each row is represented by the combined neural embedding representation of all four attribute values, cuisine, location, price range and number of people. ARR non-NE is used to retrieve the relevant rows.

From Table 5, we can see that both models perform well in selecting the relevant columns, but the W/O-NE-Table model performs poorly in retrieving the rows, while With-NE-Table performs very well. As a result, the With-NE-Table model achieves close to 100% accuracy in DB retrieval while W/O-NE-Table performs poorly.

This is because, in the With-NE-Table model, the row retrieval task is split into two simpler tasks. The NEs are chosen from the NE-Table, and then exact matching is used (which also helps in handling OOV NEs). The non-NEs, price range and number of people, have a limited set of possible values (low, moderate or expensive for price range, and 2, 4, 6 or 8 for number of people, respectively). This allows the system to learn good neural representations for them and hence achieve high accuracy in ARR non-NE. In the W/O-NE-Table model, on the other hand, ARR non-NE also involves the neural representations of the cuisine and location values, where a particular location or cuisine value occurs only a few times in the dataset. In addition, new cuisine and location values can occur at test time (Test-OOV set; performance shown in parentheses).

For the dialog part of extended tasks 1 and 2 (which does not involve the DB retrieval aspect), the system utterances do not contain any NEs. However, the user utterances contain NEs (the cuisine and location the user is interested in), so the system has to understand them in order to select the right system utterance. The accuracy in performing the dialog (by selecting responses from the candidate set) is similar for both models on the normal test set. However, on the OOV test set for task 1, where the system has to maintain the dialog state to track which attribute values have not yet been provided by the user, the W/O-NE-Table model is affected, while the With-NE-Table model is robust: W/O-NE-Table gets a per-dialog accuracy of 90.3% on the OOV test set, while With-NE-Table gets 99%.

Test set Window Memory (BoW) Window Memory (LSTM) NE-Table (BoW) NE-Table (LSTM)
Original 0.4169 0.4110 0.5128 0.5108
20% OOV 0.4068 0.3918 0.5076 0.5112
40% OOV 0.2724 0.2684 0.5036 0.5056
60% OOV 0.1736 0.1702 0.4887 0.4919
80% OOV 0.154 0.1538 0.4743 0.4751
100% OOV 0.0524 0.0508 0.4663 0.4751
Table 7: Results (accuracy) on the CBT-NE OOV test sets

Extended results for task 4
The detailed results for task 4 are shown in Table 6.
With-NE-Table:
In task 4, the user tells the system the restaurant at which he/she wants to book a table. The restaurant name, which is a NE, is stored in the NE-Table along with its generated key. When the user asks for information about the restaurant, such as the phone number, the restaurant name stored in the NE-Table is selected and used for retrieving its corresponding phone number from the DB. For this particular case, ACC attends over the column Phone and ACR attends over Restaurant Name. Since the column selected by ACR is a NE column, the NE value (here the actual restaurant name given by the user) is retrieved from the NE-Table using ARR NE. The retrieved NE value is used to do an exact match over the DB column selected by ACR to select the row. The cell at the intersection of the selected row and the column selected by ACC is returned as the retrieved information and used to replace the NE-type tag in the output response.

W/O-NE-Table:
Here, all input words (including NEs) are part of the vocabulary, and for NEs, the embedding given to the sentence encoder is the sum of the NE word embedding and the embedding associated with its NE-type. The candidate response retrieval (dialog) is the same as in the above model, and the column attentions are also similar. However, the models differ with respect to the attention over rows. Since NEs are not treated specially here, attention over rows happens through ARR non-NE. For this task, when ACR is selected correctly (Restaurant Name), each row is represented by the neural embedding of its restaurant name. ARR non-NE generates a key to match these neural embeddings and attend to the row corresponding to the restaurant name mentioned by the user.

Appendix D
Results on CBT-NE OOV datasets

To further evaluate our idea of handling NEs, we create additional OOV test sets from the original CBT-NE test set. The detailed results of our experiments on the CBT-NE OOV test sets are shown in Table 7.