Answer Extraction in Question Answering using Structure Features and Dependency Principles

10/09/2018, by Lokesh Kumar Sharma, et al.

Question Answering (QA) is a significant and challenging task in Natural Language Processing. QA aims to extract an exact answer from a relevant text snippet or document. The motivation behind QA research is the need of users of state-of-the-art search engines, who expect an exact answer rather than a list of documents that probably contain it. Successful answer extraction from relevant documents requires extracting several efficient features and relations, including various lexical, syntactic, semantic, and structural features. The proposed structural features are extracted from the dependency parses of the question and the supporting document, using new design principles that capture long-distance relations. Experimental results show that the structural features, designed using the dependency principles, improve the accuracy of answer extraction when combined with the basic features; the long-distance relations captured by the design principles are a likely reason for this improvement in overall answer extraction accuracy.




I Introduction

Feature extraction (Agarwal et al. 2016) for question answering is a challenging task (Sharma et al. 2017). Several answer extraction approaches (Severyn et al. 2013; Wei et al. 2006) use feature extraction. The issue with existing techniques is that they work with limited features, and their success depends on a particular dataset. In this work, this issue is resolved by proposing new features (i.e., structural features), which have been tested on publicly available datasets (TREC and WebQuestions) and an original KBC dataset. First, features are collected automatically (Yao et al. 2013) from an unstructured text document or a question. A few algorithms are designed to extract basic features using some design principles, and further feature extraction algorithms extract new features from the dependency parses of the question and the document. Prominent features are selected using feature selection techniques, and their relevance is decided using feature relevance techniques. In the question answering task (Bishop et al. 1990; Brill et al. 2002; Bunescu et al. 2010), in the vector space model, a question (Q) is represented as (Equation 1):


Q = {(f_1, v_1), (f_2, v_2), ..., (f_N, v_N)}    (1)

where (f_i, v_i) is a feature of the question Q together with its value, and N is the total number of features in Q. Due to the size of the feature space, only non-zero-valued features are kept in the vector model; therefore the number of features stored for an individual question is quite small despite the large size of the feature space. These features are categorized into i) basic features and ii) proposed features, and feature extraction algorithms are designed for both. The basic features, including all the lexical, semantic, and syntactic features, are added to the feature space.
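As an illustrative sketch (the helper name and whitespace tokenization are assumptions, not from the paper), the sparse vector of Equation 1 can be stored as a mapping that keeps only non-zero feature values:

```python
from collections import Counter

def question_vector(question: str) -> Counter:
    """Represent a question as a sparse (feature, value) mapping.

    Only non-zero features are stored, so each question's vector stays
    small even though the overall feature space is large.
    """
    tokens = question.lower().rstrip("?").split()
    return Counter(tokens)  # term frequency as the feature value

q = question_vector("Which sportsperson was made the brand ambassador "
                    "of the newly formed state of Telangana?")
# 'of' occurs twice, so its stored value is 2; absent words take no space.
```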

Fig. 1: Raw dataset collection, features extraction, selection and new data generation for QA systems

The origin of these features and their extraction and selection procedure to create the new dataset is shown in Figure 1. Figure 1 shows that the data is taken from the KBC game show questions of a particular episode (especially season 5). Apart from KBC, the TREC (8 and 9) (Voorhees 1999; Singhal et al. 2000; Chowdhury 2003) and WebQuestions (WQ) (Wang et al. 2014) datasets are also selected. In the first stage, preprocessing is done and features are extracted using the feature extraction algorithms; after the sampling process, the dataset is split into training and test question sets. These sets are further processed to select the relevant features, and scaling is performed on them. After this, the relevant features are selected for training and testing to produce the final model, and these features are applied for successful answer extraction in QA. In the next sections, the two categories of features are discussed in detail.

II Basic Features

Lexical Features-

These are usually selected based on the words present in the question. If a single word is considered as a feature, it is called a unigram feature; the unigram is a particular case of the n-gram features. To extract n-gram features, a sequence of n words in a question is counted as a feature. Consider, for example, the question 'Which sportsperson was made the brand ambassador of the newly formed state of Telangana?' from the KBC dataset. Basic features of the lexical category are shown in Figure 2.

Fig. 2: Lexical features present in a KBC question

The unigram feature space is: {(Which, 1), (sportsperson, 1), (was, 1), (made, 1), (the, 1), (brand, 1), (ambassador, 1), (of, 2), (newly, 1), (formed, 1), (state, 1), (Telangana, 1), (?, 1)}. Each pair has the form (feature, value), and only features with non-zero values are kept in the feature vector. The frequency of a word in the question (its feature value) can be viewed as a weight: different feature spaces can be joined with different weights by multiplying the weight of a feature space with the feature values (term frequencies). If every two consecutive words are considered as a distinct feature, the feature space becomes much larger than the unigram feature space, which demands a larger training set; therefore, with the same training set, unigrams often perform better than bigrams or trigrams. Nevertheless, in most of our answer extraction experiments, bigrams give better results than unigrams or other features.
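The n-gram extraction just described can be sketched as follows (the function name is illustrative; the terminal '?' is kept as a token, matching the unigram listing above):

```python
def ngrams(question, n):
    """Count each sequence of n consecutive tokens as one feature.

    Only n-grams that actually occur are stored, so the vector stays
    sparse even though the n-gram space grows rapidly with n.
    """
    tokens = question.rstrip("?").split() + ["?"]
    feats = {}
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        feats[gram] = feats.get(gram, 0) + 1
    return feats

q = ("Which sportsperson was made the brand ambassador "
     "of the newly formed state of Telangana?")
unigrams = ngrams(q, 1)   # repeated words accumulate: 'of' has value 2
bigrams = ngrams(q, 2)    # e.g. ('newly formed', 1)
```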

Huang et al. (2011) examine a separate feature: the question's wh-word. They distinguish the wh-words which, how, where, what, why, when, who, and a 'remaining' category. For example, this feature of the question 'What is the deepest ocean of the world?' is 'what'. According to the experimental studies, treating the wh-word as a separate feature improves the performance of QA. Another kind of lexical feature is the word shape (), which refers to the possible shapes of a word: upper case, all digit, lower case, and other. Word shapes alone are not a reliable feature set for question answering, but combined with other features they improve QA performance. A further lexical feature is the question's length, i.e., the total number of words in the question. These features are represented in a similar way to Equation 1.
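A minimal sketch of the wh-word, word-shape, and question-length features (names are illustrative; the 'remaining' category covers questions whose first word is not a wh-word):

```python
WH_WORDS = {"which", "how", "where", "what", "why", "when", "who"}

def word_shape(word):
    """Map a word to one of the four shape classes."""
    if word.isupper():
        return "upper"
    if word.isdigit():
        return "digit"
    if word.islower():
        return "lower"
    return "other"

def lexical_profile(question):
    tokens = question.rstrip("?").split()
    first = tokens[0].lower()
    return {
        "wh_word": first if first in WH_WORDS else "remaining",
        "shapes": [word_shape(t) for t in tokens],
        "length": len(tokens),  # the question-length feature
    }

profile = lexical_profile("What is the deepest ocean of the world?")
```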

Syntactical Features- The most basic syntactical features are Part-of-Speech (POS) tags and headwords. POS tags indicate the grammatical category of a word, such as NN (noun) or JJ (adjective). For the question above, the POS tags are: Which/WDT sportsperson/NN was/VBD made/VBN the/DT brand/NN ambassador/NN of/IN newly/RB formed/VBN state/NN of/IN Telangana/NNP. A POS tagger obtains the POS tags of a question, and in QA all the POS tags of a question can be added to the feature vector as a bag-of-POS-tags.

A further feature is the tagged unigram, a unigram expanded with its part-of-speech tag. Instead of using common unigrams, tagged unigrams can help to identify a word occurring with different tags as two separate features.
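A sketch of tagged unigrams, using the POS output listed above as given (the tagger output is hard-coded here rather than produced by a tagger):

```python
def tagged_unigrams(tagged_tokens):
    """One word/tag pair is one feature, so the same surface word under
    two different tags yields two distinct features."""
    feats = {}
    for word, tag in tagged_tokens:
        key = word + "/" + tag
        feats[key] = feats.get(key, 0) + 1
    return feats

tagged = [("Which", "WDT"), ("sportsperson", "NN"), ("was", "VBD"),
          ("made", "VBN"), ("the", "DT"), ("brand", "NN"),
          ("ambassador", "NN"), ("of", "IN"), ("newly", "RB"),
          ("formed", "VBN"), ("state", "NN"), ("of", "IN"),
          ("Telangana", "NNP")]
features = tagged_unigrams(tagged)   # 'of/IN' appears with value 2
```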

Fig. 3: Syntactic features present in a KBC question

Among syntactic features, the headword is the most informative word in a question, i.e., the word that represents the object the question is about. Identifying the headword can improve the efficiency of a QA system. For example, for the question 'Which is the newly formed state of India?', 'state' is the headword; the word 'state' contributes strongly to the classifier tagging the question LOC:state. Extracting a question's headword is challenging. The headword is frequently selected based on the syntax tree of the question, so to extract it, the question must be parsed to form the syntax tree. The syntax (parse) tree is a tree that represents the syntactic structure of a sentence based on grammar rules. Basic syntactic features are shown in Figure 3.
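The rule-based traversal can be sketched on a toy parse tree of nested tuples; this single descend-into-NP rule is a simplification standing in for the manually formulated rules of Figure 5:

```python
# Parse tree for 'Which is the newly formed state of India?'
# as nested tuples: (label, children...), with (tag, word) leaves.
TREE = ("SBARQ",
        ("WHNP", ("WDT", "Which")),
        ("SQ",
         ("VBZ", "is"),
         ("NP",
          ("NP", ("DT", "the"), ("JJ", "newly"),
                 ("VBN", "formed"), ("NN", "state")),
          ("PP", ("IN", "of"), ("NNP", "India")))))

def find_headword(node):
    """Descend to the innermost-left NP and return its last noun."""
    label, children = node[0], node[1:]
    if label == "NP":
        for child in children:          # prefer a nested NP
            if isinstance(child, tuple) and child[0] == "NP":
                return find_headword(child)
        nouns = [c[1] for c in children
                 if isinstance(c, tuple) and c[0].startswith("NN")
                 and isinstance(c[1], str)]
        return nouns[-1] if nouns else None
    for child in children:              # search for the first NP
        if isinstance(child, tuple):
            hw = find_headword(child)
            if hw:
                return hw
    return None

headword = find_headword(TREE)   # 'state'
```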

Semantic Features- These are extracted from the question on the basis of the meaning of its words. Semantic features (Corley et al. 2005; Islam et al. 2008; Jonathan et al. 2013) require third-party resources such as WordNet (Miller 1995) to obtain semantic knowledge of questions. The most commonly used semantic features are hypernyms, related words, and named entities.

Hypernyms place words in a lexical hierarchy with important semantic notions using WordNet. For example, a hypernym of the word 'school' is 'college', of which a hypernym is 'university', and so on. As hypernyms provide abstraction over particular words, they can be useful features for QA. Extracting hypernyms is not easy because:

  1. It is difficult to know the word(s) for which one needs to find the hypernyms.

  2. Which part of speech should be considered for focus word selection?

  3. The focus word(s) to be expanded may have several meanings in WordNet. Which meaning should be used in the given question?

  4. To which level of the hypernym tree should one go to achieve a prominent feature set?

To overcome the problem of obtaining a proper focus word, the headword of the question can be considered as the focus word and expanded to its hypernyms, with all nouns in the question treated as candidate words. If the focus word and the hypernym are the same, the word can be expanded further. Consider again the question 'What is the most populated city in India?'. The headword of this question is 'city'. The hypernym features of this word, with six as the maximum depth, are: (area, 1) (seat, 1) (locality, 1) (city, 1) (region, 1) (location, 1). The 'location' feature can help the classifier categorize this question as LOC.
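A sketch of the depth-bounded hypernym expansion; the toy hypernym map below is illustrative only (real chains would be read from WordNet):

```python
# Illustrative hypernym map standing in for WordNet lookups.
HYPERNYMS = {"city": "locality", "locality": "region", "region": "location"}

def hypernym_features(focus_word, max_depth=6):
    """Walk up the hypernym chain from the focus word (the headword),
    up to max_depth levels, adding each ancestor as a feature."""
    feats = {focus_word: 1}
    word = focus_word
    for _ in range(max_depth):
        word = HYPERNYMS.get(word)
        if word is None:   # reached the top of the known chain
            break
        feats[word] = 1
    return feats

feats = hypernym_features("city")   # includes the abstract 'location'
```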

Fig. 4: Semantic features in a KBC question

Named entities are predefined categories such as name, place, and time. Available methods achieve an accuracy of more than 92.0% in determining named entities. For example, for the question 'Who was the second person to reach at the moon surface?', a NER system identifies the following named entities: 'Who was the [number second] person to reach at the [location moon surface]?' In question answering, the identified named entities improve performance when added to the feature vector. Figure 4 shows the basic semantic features. Apart from these basic features, the proposed features for answer extraction are discussed in the next section.
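Adding recognized entities to a feature vector can be sketched as follows (the entity spans are hard-coded here; a real NER system would produce them, and the NE: key prefix is an assumption):

```python
# NE spans as a recognizer might return them for the example question.
entities = [("second", "number"), ("moon surface", "location")]

def add_entity_features(feature_vector, entity_spans):
    """Append NE labels to an existing sparse feature vector;
    the category label becomes the feature, not the surface string."""
    for _, label in entity_spans:
        key = "NE:" + label
        feature_vector[key] = feature_vector.get(key, 0) + 1
    return feature_vector

v = add_entity_features({}, entities)
```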

III Proposed Features and Feature Extraction Algorithms

The proposed structural features of a question are extracted from its dependency parse (Lally et al. 2012; Pedoe et al. 2014) with additional Design Principles (DP), discussed later in detail. These new features contribute equally to the feature vector used for EG, NLQA, and answer extraction in QA systems. Before going into the details of the proposed features and their extraction rules, the feature extraction algorithms for the basic features are discussed; the steps of the algorithm are explained in the next subsection. Algorithm 1 extracts all six lexical features to form the lexical feature vector of basic features.

INPUT: Question set (Q)
OUTPUT: Lexical feature vector from Q
Variables Used:

1:  for questions in dataset  do
2:     if  then
3:        extract lexical features of the question
4:     end if
5:     if  extract a unigram then
9:        if  extract a bigram then
13:           if  extract a trigram then
17:           end if
18:        end if
19:     end if
20:     if input is word shape or question length then
26:        if  then
28:        end if
29:     end if
30:  end for
31:  Return
Algorithm 1: Lexical feature extraction

III-A Explanation of Algorithm to Extract Basic Features

Lexical features are easy to extract because they are obtained from the question itself, and no third-party resource (e.g., WordNet) is required. For a given question set (e.g., KBC), all lexical features are extracted to form a feature vector called the lexical feature vector. In Algorithm 1, steps 2 to 4 check the question length. Then, in steps 5 to 8, 9 to 12, and 13 to 16, the unigram, bigram, and trigram features are extracted respectively and added to the feature vector. Then, in steps 20 to 29, mainly the word shape and question length features are added to the feature vector. The lexical feature vector is the outcome of the overall algorithm and is used to train the question model on the lexical features.

  1. (Lines 1 to 7 in Algorithm 1)

  2. (Lines 8 to 22 in Algorithm 1)

  3. (Lines 24 to 29 in Algorithm 1)

  4. (Lines 30 to 32 in Algorithm 1)

Algorithm 1 shows the combined feature extraction algorithm for all lexical features. The accuracy of each feature is shown in Table I at the end of this section. A total of 500 KBC questions are used to examine the feature extraction accuracy, and the algorithm attains 100% extraction accuracy for all lexical features. Syntactic features are extracted in a similar manner, as shown in Algorithm 2.

INPUT: Question set (Q)
OUTPUT: Syntactic feature vector from Q
Variables Used:

1:  for questions in dataset  do
2:     if  then
3:        extract syntactic features of the question
4:     end if
5:     if  extract a tagged_unigram then
9:        if  extract a POS_tags then
11:           if  extract a Headword then
12:               (using headword extraction algorithm)
14:               ( headword hypernym of i word )
15:           end if
16:        end if
17:     end if
18:     if input has multiple then
20:     end if
21:  end for
22:  Return
Algorithm 2: Syntactic feature extraction

III-B Explanation of Algorithm to Extract Syntactic Features

For a given question set, all syntactic features are extracted to form a feature vector called the syntactic feature vector. Apart from the similarities between Algorithm 1 and Algorithm 2, steps 8 to 11 extract the tagged unigram. Then steps 12 to 20 extract the headword and headword tag, and step 21 checks for multiple headwords. The syntactic feature vector is the outcome of the overall algorithm and is used to train the question model on the syntactic features.

Syntactic features are more difficult to extract because, although they are extracted from the question, they also require a third-party resource (e.g., WordNet). For a given question set (e.g., KBC), all syntactic features are extracted and placed into a feature vector called the syntactic feature vector.

Algorithm 2 shows the combined feature extraction algorithm for syntactic features; the accuracy of each of these features is shown in Table I at the end of this section.

  1. (Lines 1 to 7 in Algorithm 2)

  2. (Lines 8 to 20 in Algorithm 2)

  3. (Lines 14 to 17 in Algorithm 2)

  4. (Lines 21 to 23 in Algorithm 2)

The tree traversal rules shown in Figure 5 are implemented to obtain the headword, which is used as a significant syntactic feature of the question. The accuracy of this headword extraction algorithm is 94.3% on KBC questions, as the traversal rules are formulated manually.

Fig. 5: Tree traversal rules for headword
Accuracy of Lexical Feature Extraction Algorithm (Total No. of Questions = 500)

Lexical Feature | Extracted Correctly | Extracted Incorrectly | Accuracy
Unigram | 500 | 0 | 100%
Bigram | 500 | 0 | 100%
Trigram | 500 | 0 | 100%
Wh-Word | 500 | 0 | 100%
Word Shape | 500 | 0 | 100%
Question Length | 500 | 0 | 100%

TABLE I: Accuracy of lexical feature extraction algorithms

Table I shows the accuracy of feature extraction for the basic (lexical) features. The accuracy of finding the correct headword is 94.3% (as discussed); it can be improved by learning methods.

The headword extraction algorithm in Algorithm 3 uses the traversal rules shown in Figure 5. Loni (2011) uses these traversal rules for headword extraction in question classification. For example, for the question "Who was the first man to reach at the moon?", the headword is "man"; the word "man" contributes to determining the Expected Answer Type (EAT). Algorithm 3 extracts the headword using the rulebase (Mohler et al. 2009).

1:  procedure Extract_Tree
2:  if  then
3:     return tree
4:  else
5:     root_node apply-traversal-rules (tree)
6:     return Extract-Question-Headword
7:  end if
8:  end procedure
Algorithm 3: Headword extraction algorithm

A few examples below show the headword of a question; the words in bold are the possible headwords. A clear distinction should be made between a headword and a focus word.

  1. What is the national flower of India?

  2. What is the name of the company that launched JIO 4G in 2016?

  3. What is the name of the world’s second-longest river?

  4. Who was the first man to reach at the moon?

III-C Proposed Structural Features

The proposed structural features are obtained from the output of the dependency parse. These structural features capture complicated relations present in similar questions and exploit the distinctive, efficient constituents available in the parsing results.

Fig. 6: Structural features in a question used to align two words

The Question Feature Form (QFF) produced for a question contains one composite feature function. Structural features allow the model to adapt to all questions used for alignment via the question structure. Figure 6 shows the structural features available in a KBC question. There are relations where 'state' can be aligned with 'newly formed' and 'of Telangana', and another structural feature where 'made' is aligned with 'ambassador' and 'sportsperson'. The link between the newly formed state of Telangana and 'made' cannot be identified directly; the connection provides a structural confirmation, described in detail later in this section.

III-C1 Dependency Rules for Structural Features

Researchers in different domains have successfully used Dependency Rules (DR) or Textual Dependencies (TD); in Recognizing Textual Entailment, the increase in the application of TD is distinctly apparent.

Fig. 7: Structural features in a KBC question

When rules are designed from the dependencies to extract a relation between question and document, a system with DR semantics considerably outperforms the previous basic features on the KBC dataset (by a 9% average improvement in F-measure). The tree-based approach uses dependency paths to form a graph of dependencies, and systems that use TD demonstrate improved performance over purely feature-based techniques.

In Figure 7, the structural features of a KBC question are highlighted. The parsing technique uses the relation 'Vidhya Balan, a film character, has worked as Ahmed Bilkis in 2014', separated by commas on the NP. The parser handles the diverse dependencies present in questions and relevant documents. Another example is the PP, where many relations admit alternative attachments to structures. By targeting semantic dependencies, TD provides an adequate representation for a question.

Fig. 8: Transforming structural features into binary relations

The structural features are transformed into binary relations by removing the non-contributing words (i.e., stopwords). Figure 8 shows such a transformation for two structural features of a question. There can be more than two structural features in a question, and hence more than two structural transformations.

The KBC dataset has manually annotated data for open-domain information retrieval tagged with the TD scheme. Conversion rules are used to transform the TD tree into binary structural relations.
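The transformation of Figure 8 can be sketched as a stopword filter over dependency triples (the stopword list and the sample triples below are illustrative, not the paper's conversion rules):

```python
# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "a", "was", "in", "to", "is"}

def to_binary_relations(dependencies):
    """Keep only dependency edges whose two words are contributing
    (non-stopword) terms, yielding binary structural relations."""
    relations = []
    for rel, head, dep in dependencies:
        if head.lower() not in STOPWORDS and dep.lower() not in STOPWORDS:
            relations.append((head, dep, rel))
    return relations

deps = [("amod", "state", "formed"), ("det", "ambassador", "the"),
        ("nmod:of", "state", "Telangana")]
rels = to_binary_relations(deps)   # the det edge to 'the' is dropped
```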

III-C2 Design Principles for Structural Features

The structural feature representation maps strongly onto the feature vector space and, more directly, describes grammatical relations (Severyn et al. 2013; Sharma et al. 2015). These design principles are used as a starting point for extracting the structural features. For obtaining SF, the TD helps in structural sentence representation, especially in relation extraction. SF makes two options available: relating a node to other nodes, and modifying relations by folding prepositions into them.

The intended use of structural extraction in SF attempts to adhere to these six design principles (DP1 to DP6):
DP1: Every dependency is expressed as a binary relation obtained after the structural transformation.
DP2: Dependencies should be meaningful and valuable to EG and NLQA.
DP3: The structural relations should use concepts of traditional grammar to link the most frequently related words.
DP4: The relations with the maximum number of branches should be available, to help resolve the complexities of indirect relations useful for alignment.
DP5: Relations should, as far as possible, hold between NP words and should not be indirectly mediated by non-contributing words.
DP6: Initially, the SF is the longest meaningful connection containing fewer non-contributing words than linguistically expressed relations.
From the dependency rules and design principles for structural features, the feature extraction algorithm aims to extract all possible structural features of the question. The structural feature extraction algorithm built on these design principles is discussed in the next section.

III-C3 Structural Feature Extraction Algorithm and Score

The proposed structural features are obtained from the dependency structure of a question on the basis of the DPs; the root of a relation participates in the structural features.

The relations of the dependency tree (Wei et al. 2006) are used to extract SFs that capture long-distance connections in the question and text. For the sentence 'Which sportsperson was made the brand ambassador of the newly formed state of Telangana', the dependency relations are as follows: dobj(made-4, Which-1), nsubjpass(made-4, sportsperson-2), auxpass(made-4, was-3), root(ROOT-0, made-4), det(ambassador-7, the-5), compound(ambassador-7, brand-6), dobj(made-4, ambassador-7), case(state-11, of-8), advmod(formed-10, newly-9), amod(state-11, formed-10), nmod:of(ambassador-7, state-11), case(Telangana-13, of-12), nmod:of(state-11, Telangana-13). The structural features are designed using the dependency principles of the proposed algorithm, shown in Algorithm 4. The root word and its siblings are expanded to measure the design principles.
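The root-expansion step can be sketched over the dependency triples listed above (hard-coded here rather than produced by a parser; the function name is illustrative):

```python
# Dependency triples (relation, head, dependent) for the example question.
DEPS = [("dobj", "made", "Which"), ("nsubjpass", "made", "sportsperson"),
        ("auxpass", "made", "was"), ("root", "ROOT", "made"),
        ("det", "ambassador", "the"), ("compound", "ambassador", "brand"),
        ("dobj", "made", "ambassador"), ("case", "state", "of"),
        ("advmod", "formed", "newly"), ("amod", "state", "formed"),
        ("nmod:of", "ambassador", "state"), ("case", "Telangana", "of"),
        ("nmod:of", "state", "Telangana")]

def expand(head, deps):
    """Return the direct dependents of a head word: the expansion step
    applied to the root and its siblings when checking the DPs."""
    return [(dep, rel) for rel, h, dep in deps if h == head]

root = next(dep for rel, h, dep in DEPS if rel == "root")   # 'made'
children = expand(root, DEPS)   # direct dependents of the root
```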

The TD representation includes many relations which are considered as structural features. For instance, for the sentence 'Indian diplomat Devyani Khobragade posted where, when she was arrested in a visa case in 2013', the following relations are obtained under the TD representation:
amod(Khobragade-4, Indian-1)
det(case-15, a-13)
compound(case-15, visa-14)
nmod(arrested-11, case-15)
The algorithm extracts four structural relations, including the nominal modifier relation between 'arrested' and 'case'. The algorithm also provides a relation between 'posted' and 'arrested'. The relations between these words represent the best possible links available in the text; the textual dependencies connecting the verbs are:
dep(posted-5, where-6)
advmod(arrested-11, when-8)
advcl(posted-5, arrested-11)
nmod(arrested-11, case-15)

These outcomes show that SF proposes a wider set of dependencies, capturing long-distance relations that can contribute to evidence gathering and question alignment. The parallel structural representations help in linking two words which cannot be linked otherwise, and this is the reason for choosing NP words as roots.

The TD scheme offers the option of involving prepositional dependencies. In the example 'Name the first deaf-blind person who receive a bachelor of arts degree?', instead of having two relations case(degree-12, of-10) and dobj(receive-7, bachelor-9), or nmod(bachelor-9, degree-12) and acl:relcl(person-5, receive-7), SF gives a relation between the phrases: case(degree-12, person-5). These links are used later in this work in EG and NLQA. Some more useful structural extractions arise in, e.g., 'Which sport uses these terms reverse swing and reverse sweep?'. TD gives direct links between 'swing' and 'sweep' for dobj:
dobj(reverse-6, swing-7)
dobj(reverse-9, sweep-10)
(reverse, sweep, dobj)

The information in such a triple is not apparent in the TD alone; following dobj in a similar way, relations are formed with three parameters, (word_i, word_j, TD).

Fig. 9: Adding named entities to structural features

The SF representation is enhanced with the addition of named entities, as for the sentence in Figure 9. In Figure 9, structural features are extracted from the design principles, dependency rules, and named entities, yielding structural features with NER. The information available for the word Telangana in the SF scheme is (Telangana∼5, location). The structural information becomes more valuable with the use of named entities, and SF provides the root to relate words from the named entities. The structural feature extraction algorithm using the DR and DP extraction rules is shown below in Algorithm 4.

INPUT: Question set (Q)
OUTPUT: Structural feature vector from Q
Variables Used:

1:  for questions in dataset  do
2:     if  then
3:        backtrack tree
4:     else
5:        expand_root procedure
6:     end if
7:  end for
8:  procedure expand_root
9:  if  has child nodes then
10:     for each child in tree  do
11:        if  then
12:           backtrack tree
13:        else
14:           head_child apply_rules from (Rule 1 to 6)
16:           head_child apply_rules from
18:           head_child apply_NER
20:        end if
21:     end for
22:  end if
24:  Return
Algorithm 4: Structural feature extraction and weight calculation

III-C4 Feature Alignment with Individual Feature Score

It is important to calculate the individual feature values, which are provided to a formula for the final feature score. The formula is tested over 100 KBC questions having at least two similar questions. Extraction algorithms for all features have been discussed in earlier sections. As a running example, the document contains the passage 'Indian tennis star Sania Mirza was today appointed Brand Ambassador of Telangana.'

Lexical Score- Here it is shown how relevant lexical features are extracted from a question. Take, from the KBC dataset, the question (Q) 'Which sportswoman was made the brand ambassador of the newly formed state of Telangana?' and the document (D) 'Indian tennis star Sania Mirza was today appointed Brand Ambassador of Telangana.' The lexical features extracted and the individual feature scores calculated from them are shown in Table II. Equation 2 shows the value for the unigram feature; similarly, Equations 3, 4, 5, 6, and 7 show the respective feature lines and values for the bigram, trigram, wh-word, word shape, and question length features.

Average = 0.403

Average = 0.335

Average = 0.556

TABLE II: Calculation of average feature score to get the final feature form score

1) Unigrams- Unigrams of the question are tagged as (Which, 1) (sportswoman, 2) (was, 3) (made, 4) (the, 5) (brand, 6) (ambassador, 7) (of, 8) (the, 9) (newly, 10) (formed, 11) (state, 12) (of, 13) (Telangana, 14). Refer to Table II for the unigram feature score calculation; the unigram feature regression line is shown in Equation 2.


2) Bigrams- Bigrams of the question are tagged as (Which-sportswoman, 1) (sportswoman-was, 2) (was-made, 3) (made-the, 4) (the-brand, 5) (brand-ambassador, 6) (ambassador-of, 7) (of-the, 8) (the-newly, 9) (newly-formed, 10) (formed-state, 11) (state-of, 12) (of-Telangana, 13). Refer to Table II for the bigram feature score calculation; the regression line is shown in Equation 3.


3) Trigrams- Trigrams of the question are tagged as (Which-sportswoman-was, 1) (sportswoman-was-made, 2) (was-made-the, 3) (made-the-brand, 4) (the-brand-ambassador, 5) (brand-ambassador-of, 6) (ambassador-of-the, 7) (of-the-newly, 8) (the-newly-formed, 9) (newly-formed-state, 10) (formed-state-of, 11) (state-of-Telangana, 12). Refer to Table II for the trigram feature score calculation; the trigram feature regression line is shown in Equation 4.


4) Wh-word, Word Shape, and Q-Length- The wh-word of the example is 'which', and the word shape feature provides (UPPERCASE, 2). Refer to Table II for the feature score calculations of the wh-word, word shape, and question length. Their feature regression lines are shown in Equations 5, 6, and 7.


III-D Multiple Regression Analysis on Features

To calculate the final score of the basic and proposed features, the formula is obtained by Multiple Regression (MR) on the feature scores. In MR, the coefficient of determination R² is the statistic generally used to estimate model fit. R² is 1 minus the ratio of residual variability: when the variability of the residual values around the regression line is small relative to the overall variability, the regression equation is well fitted. For example, if there is no link between the X and Y variables, the ratio of the residual variability of Y to the original variance is 1.0, and R² is 0. If X and Y are perfectly linked, the ratio is 0.0, making R² = 1. In most cases, the ratio falls between 0.0 and 1.0. If R² is 0.2, the variability of the Y values around the regression line is 1 − 0.2 = 0.8 times the original variance; in other words, 20% of the original variability has been explained, leaving 80% residual variability. Equation 8 is used for all the features and tested with MR on the available features. The regression lines of the individual features used to calculate QFF and DFF are shown in Figure 10.
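The R² computation described above can be sketched for a simple one-variable least-squares fit (multiple regression generalizes this; the function name is illustrative):

```python
def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line:
    1 minus residual variability over total variability."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(xs, ys))      # residual variability
    ss_tot = sum((y - my) ** 2 for y in ys)    # overall variability
    return 1 - ss_res / ss_tot

r2_perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linked: 1.0
r2_noisy = r_squared([1, 2, 3, 4], [2, 4, 5, 9])    # between 0.0 and 1.0
```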

Fig. 10: Regression lines of features used to calculate the QFF and DFF formula

III-D1 Intermediate QFF Score

A logical form is typically used to query a knowledge base. The intermediate QFF generated in this work is not concerned with querying any KB; rather, it represents the question by its QFF weight, which is then mapped to other QFF weights. These QFF weights are the QFF scores calculated from the formula shown in Equation 9 (generated from the regression analysis of the features). In Equation 10, the terms represent the lexical, syntactic, semantic, and structural features respectively. In Equation 8, since the value is used only for alignment, the effect of the coefficient can be ignored and treated as constant. The QFF score is shown in Equation 9.

Fig. 11: Diagram showing the effect of log

A QFF score and a Document Feature Form (DFF) score can be compared because question and document are described in terms of each other; one can also use multiple regression coefficients to compare QFF and DFF. In this work, the complete dataset has questions paired with options and answers, and the documents containing the answer evidence are ranked. The equation merely shows that all features contribute equally to the QFF and DFF scores. Equation 11 calculates the FFScore of each question in the dataset.


III-E Proposed Feature Relevance Technique

Individual features are tested on the different datasets to get the final answer (the features are used in QA). The feature contributing the highest accuracy in the QA system is marked as the most relevant feature. The accuracy of answer correctness after including each individual feature is shown in Table III.

Correct Answers (%)

Basic Features | WebQ | TREC | KBC | Relevance (Fr: 1-5)
Unigrams | 61 | 63 | 67 | 4
Bigrams | 82 | 79 | 88 | 5
Trigrams | 58 | 55 | 52 | 3
Wh-word | 48 | 35 | 32 | 3
Word Shape | 51 | 43 | 48 | 3
Question Length | 28 | 23 | 19 | 2
Tagged Unigram | 43 | 42 | 46 | 3
POS tags | 46 | 51 | 56 | 3
Headword | 87 | 88 | 91 | 5
Headword Tag | 62 | 58 | 52 | 4
Focus Word | 76 | 72 | 80 | 4
HW Hypernyms | 66 | 54 | 63 | 4
Named Entity | 83 | 82 | 77 | 5
Headword NE | 57 | 52 | 49 | 3

Proposed Features (structural features)
with DP | 56 | 61 | 65 | 4
with DR | 67 | 68 | 72 | 4
with NER | 92 | 88 | 91 | 4
TABLE III: Basic and proposed features with their relevance in QA

The feature relevance is calculated by Equation 12 as the ratio of the number of correctly answered questions (in KBC, WebQuestions, and TREC) to the total number of questions.


The relevance score of a feature is useful when the feature vector is redundant and the feature space needs to be reduced. In such situations, the features with a low relevance score can be removed from the feature vector. Feature selection techniques (Agarwal et al. 2014) are also used to select the features carrying relevant information.
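The relevance-based pruning just described can be sketched as follows. The binning of the correct-answer ratio onto the 1-5 Fr scale and the pruning threshold are assumptions for illustration; Equation 12 itself is only the correct/total ratio.

```python
def relevance(correct, total, levels=5):
    """Equation 12 sketch: fraction of correctly answered questions,
    binned onto a 1..levels integer scale (the binning is an assumption)."""
    ratio = correct / total
    return max(1, min(levels, 1 + int(ratio * levels)))

def prune(features, scores, threshold=3):
    """Drop features whose relevance falls below the threshold,
    shrinking a redundant feature vector."""
    return [f for f, s in zip(features, scores) if s >= threshold]

# Correct-answer counts per 100 WebQuestions items, taken from Table III.
feats = ["unigrams", "bigrams", "question_length", "headword"]
frs = [relevance(c, 100) for c in (61, 82, 28, 87)]
print(prune(feats, frs))  # question_length (Fr = 2) is removed
```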

IV Dataset and Result Analysis

IV-A Dataset Used

To measure the correctness of the proposed features and their extraction algorithms, the publicly available KBC dataset is used. This dataset consists of open-domain questions with options and answers. For more accurate and stable experiments, the TREC and WebQuestions datasets, which contain the relevant documents, are also used.

IV-B Performance Metrics

The performance of the feature extraction algorithms on the KBC and other datasets is measured by the total number of questions accurately answered by each feature and by combinations of features.

  • Correct Answers (CA): the number of correct answers provided by a particular feature.

  • Incorrect Answers (IA): the number of incorrect answers provided by a particular feature.

  • Correct Documents (CD): the number of correct documents selected by a particular feature.

  • Incorrect Documents (ID): the number of incorrect documents selected by a particular feature.

The feature accuracy is employed to estimate the performance of the basic and proposed features in QA. The precision of a feature is the number of questions correctly expressed by the feature divided by the total number of questions to be expressed (the sum of TP and FP), as given in Equation 14.


The recall is the number of correctly expressed questions or documents divided by the total number of questions or documents that are to be expressed (the sum of TP and FN), as given in Equation 15.


F-measure is the harmonic mean of precision and recall and is given by Equation 16.


In this analysis, the accuracy of answer extraction (also termed F-measure) is reported to assess the performance of the feature representation for QA systems and, later, of the NLQA and EG algorithms.
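The three metrics can be sketched directly from their definitions; this is a minimal illustration, not the paper's code.

```python
def precision(tp, fp):
    """Correctly expressed items over everything the feature expressed (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Correctly expressed items over everything that should be expressed (TP + FN)."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall -- the accuracy of answer
    extraction reported in the result tables."""
    return 2 * p * r / (p + r)

p, r = precision(tp=80, fp=20), recall(tp=80, fn=20)
print(f_measure(p, r))  # ~0.8 when precision == recall == 0.8
```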

IV-C Results and Discussion

Ten-fold cross validation is employed to estimate the accuracy of the proposed methods. The question data and documents are split randomly into 90% training and 10% testing. Wh-words (who, where, when) give an idea of the expected answer type of a question, which is why it is important to handle them in QA. In the experiments, a simple approach is adopted: finding the Expected Answer Type (EAT) in the document. For example, for 'which is the largest city of the world', 'city' is the EAT, and the document is searched for all named entities with the NE tag 'Location'. The weighting formulae for the individual features discussed in Algorithms 1, 2, and 3 are used to calculate the weights of the basic and proposed features.
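The EAT-matching step in the example above can be sketched as follows. The EAT-to-NE-tag table and the pre-tagged document are illustrative assumptions; the paper does not specify this exact mapping.

```python
# Map a question's EAT headword to the NE tag searched for in the document.
EAT_TO_NE = {"city": "Location", "person": "Person", "year": "Date"}

def candidate_answers(eat, tagged_tokens):
    """Return document tokens whose NE tag matches the question's EAT."""
    target = EAT_TO_NE.get(eat.lower())
    return [tok for tok, tag in tagged_tokens if tag == target]

# A toy pre-tagged document ('O' marks tokens outside any named entity).
doc = [("Tokyo", "Location"), ("is", "O"), ("the", "O"),
       ("largest", "O"), ("city", "O"), ("on", "O"), ("Earth", "Location")]
print(candidate_answers("city", doc))  # ['Tokyo', 'Earth']
```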

IV-C1 Determination of Prominent Features

In the experiments it can be observed that the bigram features perform better than any other feature on all datasets, as shown in Table IV. The bigram feature set gives an accuracy of 69%, compared to 62%, 58%, and 52% for the unigram, trigram, and word-shape features respectively on the KBC dataset (Table IV). The probable causes can be explained as follows. The unigram feature set contains many irrelevant features, which depresses the accuracy of answer extraction. The trigram feature set is more scattered than bigrams, which also lowers accuracy. Word-shape features are of little value for the question and matter mainly for document analysis; the word-shape feature set carries too little information for answer extraction in QA, so these features perform worst when used separately. The dependency features are present in both the question and the document, resulting in more specific features; that is why they are used to design the structural features and contribute so much to QA.
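The contrast between the n-gram feature sets can be illustrated with a short sketch (illustrative code, not the paper's implementation):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

q = "which is the largest city of the world".split()
unigrams = ngrams(q, 1)   # many, mostly uninformative single tokens
bigrams = ngrams(q, 2)    # pairs such as ('largest', 'city') are more specific
trigrams = ngrams(q, 3)   # sparser, more scattered across the data
print(bigrams[3])  # ('largest', 'city')
```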

Table IV presents the accuracy of answer extraction for all the basic features, and Table V presents it for all the proposed features. The accuracy of the bigram features is taken as the baseline for these experiments.

Basic Features on KBC Dataset
                  (%)     (%)
1 Unigram          62      84
2 Bigram           69      84
3 Trigram          58      66
4 Wh-word          38      72
5 Word Shape       52      67
6 Length           18      64
(a)              49.5      84
(b)             +19.5   +34.5
1 Tagged           45      78
2 POS Tag          52      58
3 Headword         62      78
4 Tag              53      70
5 Focus Word       61      72
(c)              54.6    71.2
(d)              +7.4    +6.8
(e)              52.1    83.6
(f)             +14.4   +12.8
1 Hypernym         44      78
2 Named Entity     65      78
3 NE               62      67
(g)                57    74.3
(h)                +8    +3.7
(i)             +15.5     -27
(j)              53.7    89.6
(k)               +12     +32
(l)                +5     +35
TABLE IV: Accuracy of answer extraction of basic features on the KBC dataset
Proposed Features on KBC Dataset
                  (%)     (%)
1 with DR          64      78
2 with DP          68      82
3 with NER         58      78
(m)              63.3      82
(n)               5.7   +32.5
(o)               1.3   +27.4
(p)               6.3     +25
TABLE V: Accuracy of answer extraction of proposed features on the KBC dataset

The proposed feature sets, with the addition of DP, DR, and NER, increase the accuracy on all datasets (WebQuestions, TREC, and KBC). For example, the structural features with DR increased the accuracy from 64% to 78% (+14%) on the KBC dataset, and those with DP increased it from 68% to 82% (+14%). Adding NER to the structural features improves accuracy by dropping unnecessary and unrelated features. Structural features built with the design principles, in addition to NER, attain an accuracy of 82%, compared to 89.6% for the comparable combined basic feature set, i.e. 7.6 percentage points below the combined basic features, yet still meaningful given the much smaller feature set. The structural features are therefore highly relevant and selective. Among the basic features, the bigram features performed best when used independently, whereas among the proposed features the dependency-based structural features are more prominent than the basic features (including the bigram features); the accuracy over the datasets is presented in Table VI.

Without structural features
Question Dataset (No. of questions taken)
      WebQ (W)       TREC (T)      KBC (K)       (W + T + K)
      (5500)         (2000)        (500)         (300 + 300 + 300)
1     63.2           60.1          67.3          65.2
2     68.1           55.3          62.6          66.1
3     59.2           56.3          62.4          58.3
4     38.3           53.6          58.4          61.2
5     52.1           56.8          64.7          67.1
6     16.1           19.4          21.2          23.2
      49.5           50.25         56.03         56.85
1     43.2           55.8          57.5          53.6
2     50.8           52.1          61.9          64.3
3     60.8           86.5          71.3          70.6
4     57.2           58.9          61.4          66.3
5     53.8           55.9          59.6          55.4
      53.16          61.84         62.34         62.02
1     44.5           55.3          75.8          74.7
2     65.1           55.8          73.9          71.6
3     60.5           52.4          68.3          72.9
      56.7           54.5          72.67         73.06

With structural features
1     68.2 (+5)      62.4 (+2.3)   72.7 (+5.4)   71.7 (+6.5)
2     69.8 (+1.7)    58.9 (+3.6)   71.1 (+8.5)   73.3 (+7.2)
3     60.1 (+0.9)    57.7 (+1.4)   63.2 (+0.8)   68.9 (+10.6)
4     41.4 (+3.1)    55.5 (+1.9)   61.2 (+2.8)   72.6 (+11.4)
5     55.3 (+3.2)    62.4 (+6.1)   57.7 (-7.0)   68.3 (+1.2)
6     15.3 (-0.8)    19.5 (+0.1)   22.1 (+0.9)   23.1 (-0.1)
      51.68 (+2.18)  52.73 (+2.48) 58.0 (+1.97)  62.98 (+6.13)
1     61.3 (+18.1)   55.9 (+0.1)   69.7 (+12.2)  75.9 (+18.4)
2     55.3 (+4.5)    56.7 (+4.6)   71.4 (+9.5)   74.3 (+10.0)
3     65.5 (+4.7)    86.9 (+0.4)   72.2 (+0.9)   66.9 (-3.7)
4     58.1 (+0.9)    62.7 (+3.8)   72.3 (+10.9)  73.9 (+7.6)
5     61.3 (+7.5)    56.4 (+0.5)   67.5 (+7.9)   71.3 (+15.9)
      60.3 (+7.14)   63.72 (+1.88) 70.62 (+8.28) 72.46 (+10.44)
1     79.1 (+34.6)   69.4 (+14.1)  76.2 (+0.4)   78.6 (+3.9)
2     81.3 (+16.2)   78.3 (+22.5)  87.6 (+13.7)  82.4 (+10.8)
3     66.4 (+5.9)    71.2 (+18.8)  84.3 (+16.0)  79.9 (+7.0)
      75.6 (+18.9)   72.9 (+18.4)  82.7 (+10.03) 80.3 (+7.24)

TABLE VI: Basic features on each question dataset and comparisons after adding structural features

The proposed structural features give better results than the basic bigram features with a much smaller feature set. For example, adding the structural features produced an accuracy of up to 82.7% (+10.03%) over the 15 basic features on the KBC dataset, as shown in Table VI. Similarly, all the proposed features are constructed using dependency rules and perform better than the comparable basic features: the comparable basic feature group attained an accuracy of 73.06%, whereas adding the structural features raised it to 80.3% (+7.24) on the combination of the WebQ, TREC, and KBC datasets (Table VI). The structural feature set reaches 82.7% (+10.03%) on the KBC dataset because, by including the dependency rules, design principles, and NER, the dependency features carry relevant semantic information and capture a large number of long-distance relations. The proposed features produce an accuracy of up to 75.6%, with a maximum single increase of +18.9, as shown in Table VI.

The proposed features resolve the issue of hidden features while decreasing the feature space by combining them with the basic and NER features. The structural features include the noun-phrase dependency distance as per the design principles. They are also attractive for their ease of extraction, and they reduce the feature vector size significantly. It is observed in the experiments that a very small structural feature vector does not perform well, and that performance improves as the vector size increases: with a larger vector, grouping of the root words in the structure becomes more likely. The experimental outcomes show that the proposed structural features, added to the basic features, perform better.

V Conclusion

The performance of several basic features over relevant documents is examined on three datasets, namely WebQuestions, TREC (8 and 9), and KBC. In addition to the basic features, new structural features are proposed, together with feature extraction algorithms for both the basic and the proposed structural features.

The proposed structural features are combined with DP, with DR, and with NER. Further, each feature has been assigned a relevance value, calculated from the answer extraction accuracy the individual feature achieves on QA systems. It is also shown that adding the proposed structural features to the basic features improves the performance of answer extraction in QA systems.

Furthermore, it is noticed that the proposed structural features improve on the bigram features, and the structural features with NER provide better results than any basic feature. The accuracy of the question length feature was near 20%, the minimum among all features; the main cause is that question length only gives an idea of the question's complexity. It is observed that when two basic features are combined, the combination gives better results than either individual feature, although the combination of bigrams with question length does not perform well for QA systems.

All the features used in this work are useful for gathering evidence, and an indirect-reference-based approach for evidence gathering is proposed and combined with feature-based evidence gathering.



  • [1] Agarwal, B. and Mittal, N., Prominent feature extraction for review analysis: an empirical study, Journal of Experimental & Theoretical Artificial Intelligence, pp. 485-498, 2016.
  • [2] Agarwal, B. and Mittal, N., Semantic feature clustering for sentiment analysis of English reviews, IETE Journal of Research, vol. 6, pp. 414-422, 2014.
  • [3] Bishop, Lawrence C. and George Ziegler, "Open ended question analysis system and method", U.S. Patent No. 4,958,284, 18 Sep. 1990.
  • [4] Brill E., Susan D., and Michele B., An analysis of the AskMSR question-answering system, In Proceedings of the ACL-02 conference on Empirical methods in natural language processing, vol. 10, pp. 257-264, 2002
  • [5] Bunescu R., Huang Y., Towards a general model of answer typing: Question focus identication, In Proceedings of The 11th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), pp. 231-242, 2010.
  • [6] Chowdhury G. G., Natural language processing, In Annual review of information science and technology, vol. 1, pp. 51-89, 2003.
  • [7] Corley C., and Mihalcea R., Measuring the semantic similarity of texts, In Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment, pp. 13-18, 2005.
  • [8] Huang, Cheng-Hui, Jian Yin, and Fang Hou., A text similarity measurement combining word semantic information with TF-IDF method, In Jisuanji Xuebao (Chinese Journal of Computers), pp. 856-864, 2011
  • [9] Islam, Aminul, and Diana Inkpen., Semantic text similarity using corpus-based word similarity and string similarity, In ACM Transactions on Knowledge Discovery from Data (TKDD) 2, pp. 10-19, 2008.
  • [10] Jonathan B., Chou A., Roy F. and Liang P., Semantic Parsing on Freebase from Question-Answer Pairs, In Proceedings of EMNLP, pp. 136-157, 2013.
  • [11] Lally, Adam, John M. Prager, Michael C. McCord, Branimir K. Boguraev, Siddharth Patwardhan, James Fan, Paul Fodor, and Jennifer Chu-Carroll., Question analysis: How Watson reads a clue, In IBM Journal of Research and Development, 2012.
  • [12] Loni B., A survey of state-of-the-art methods on question classification, In Literature Survey, Published on TU Delft Repository, 2011.
  • [13] Miller G. A., WordNet: A Lexical Database for English, In Communications of the ACM, vol. 38, pp. 39-41, 1995.
  • [14] Mohler, M. and Rada Mihalcea, Text-to-text semantic similarity for automatic short answer grading, In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 567-575, 2009.
  • [15] Pedoe W. T., True Knowledge: Open-Domain Question Answering Using Structured Knowledge and Inference, In Association for the Advancement of Artificial Intelligence, vol. 31, pp. 122-130, 2014.
  • [16] Severyn, Aliaksei and Moschitti, Alessandro, Automatic feature engineering for answer selection and extraction, In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 458-467, 2013.
  • [17] Sharma L. K., Mittal N., An Algorithm to Calculate Semantic Distance among Indirect Question Referents, International Bulletin of Mathematical Research, vol. 2, pp. 6-10, ISSN: 2394-7802, 2015.
  • [18] Sharma LK, Mittal N. Prominent feature extraction for evidence gathering in question answering. Journal of Intelligent & Fuzzy Systems. 32(4):2923-32; Jan 2017.
  • [19] Singhal A. and Kaszkiel M, TREC-9, In TREC at ATT, 2000
  • [20] Voorhees, E.M., The TREC-8 Question Answering Track Report, In TREC, vol. 99, pp. 77-82, 1999.
  • [21] Wang, Z., Yan, S., Wang, H. and Huang, X., An overview of Microsoft deep QA system on Stanford WebQuestions benchmark, Technical report, Microsoft Research, 2014.
  • [22] Wei Xing, Croft Bruce. W, Mccallum Andrew, Table extraction for answer retrieval,In Information Retrieval, pp. 589-611, vol 9, 2006.
  • [23] Yao X., Durme V. B., Clark P., Automatic Coupling of Answer Extraction and Information Retrieval, In Proceedings of ACL short, Sofia, Bulgaria, pp. 109-116, 2013.