All It Takes is 20 Questions!: A Knowledge Graph Based Approach

11/12/2019 ∙ by Alvin Dey, et al. ∙ IIIT Delhi

20 Questions (20Q) is a two-player game. One player is the answerer, and the other is the questioner. The answerer chooses an entity from a specified domain and does not reveal it to the other player. The questioner can ask the answerer at most 20 questions to guess the entity. The answerer replies to each question with yes, no, or maybe. In this paper, we propose a novel knowledge-graph-based approach for designing the 20Q game on Bollywood movies. The system assumes the role of the questioner and asks questions to predict the movie the answerer has in mind. It uses a probabilistic learning model for template-based question generation and answer prediction. A dataset of interrelated entities is represented as a weighted knowledge graph, whose weights are updated as the game progresses. An evolutionary approach helps the model gain a better understanding of user choices and predict the answer in fewer questions over time. Experimental results show that our model predicted the correct movie in fewer than 10 questions in more than half of the games played. This kind of model can be used to design applications that detect diseases by asking questions about symptoms, to improve recommendation systems, etc.




1. Introduction

The 20 Questions game started as a spoken parlor game in the early 19th century. In its early form, it was known as ‘Animal, Plant and Mineral’. In this version, the answerer was supposed to tell the questioner the category of the chosen entity. The principle behind the game is that one player thinks of an entity, and the other player asks a series of at most 20 questions to guess that entity. Each question should be answerable with yes, no, or maybe.

Mathematically, the game allows identifying $2^{20}$ (about one million) arbitrary objects if each question eliminates half of the remaining entities. A practical strategy is therefore to ask questions that cut the list of possible answers roughly in half. This game has considerable potential for real-life applications. It can be used to develop healthcare applications, where the patient answers simple questions, and the system predicts the disease. The application models the questions over symptoms such as:


  • Are you feeling cold?

  • Are you feeling nauseous?

  • Are you having a headache?

These systems help to collect data about human health and improve healthcare facilities. In this paper, we test our model on a set of Bollywood movies. We chose Bollywood movies because of their popularity and the large amount of metadata available for building such a large-scale system.
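As a worked example of the halving argument in this section, the sketch below computes how many ideal yes/no questions are needed to single out one entity. It assumes every question splits the candidates exactly in half, which real questions rarely do. The function name `questions_needed` is illustrative; the two set sizes are the dataset sizes reported later in this paper.

```python
def questions_needed(num_entities: int) -> int:
    """Smallest q such that 2**q >= num_entities (ideal halving questions)."""
    if num_entities < 1:
        raise ValueError("need at least one entity")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using exact integers.
    return (num_entities - 1).bit_length()


print(2 ** 20)                  # entities that 20 ideal questions can separate
print(questions_needed(18481))  # full DBpedia movie set described in Section 3
print(questions_needed(200))    # reduced set used in the experiments
```

Under this idealised assumption, both sets fit comfortably within the 20-question budget, which leaves room for imperfect splits and incorrect answers.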

The existing baseline systems try to model the same problem. However, they do not account for human errors: if the answerer unknowingly answers a question incorrectly, the system should be intelligent enough not to eliminate the affected possibilities outright. For example, answering ‘no’ to a question like ‘Was Aamir Khan an actor of your movie?’ should not completely remove the possibility of a movie like ‘3 Idiots’, since the answerer may have answered incorrectly. The probability of ‘3 Idiots’ may decrease, but if the answerer answers all other questions correctly, the model is still expected to predict the correct movie.

In this paper, we present a novel approach to predicting movies in the 20Q game using a knowledge graph and a probabilistic learning model that evolves as the game is played and predicts the correct movie in fewer than 20 questions. We design the system in five individual components (discussed in detail in the model architecture, Section 4). The model starts with equal probability for every movie, which changes over subsequent questions. It attains fault tolerance by re-balancing the movie probabilities so that no movie is completely discarded or accepted after a single answer. The question generator poses questions based on three components:

  1. Probability from past experience.

  2. Probability based on the density of edge connectivity in the knowledge graph.

  3. Cumulative probability of movies under a category during the current run (based on player’s responses).

The proposed model addresses the main limitation of the baseline models, namely their lack of robustness to incorrect answers. The major contributions of the paper are as follows:


  • We collected a dataset of 18,481 Indian movies from DBpedia. The dataset includes 113 features per movie such as movie name, movie length, director, producer, actors, genre, subject, etc.

  • We developed and evaluated a novel architecture that uses a knowledge graph to predict Bollywood movies in the 20Q game.

  • The proposed model is robust enough to handle incorrect answers given by the answerer. It predicts the correct answer in 90.8% of cases. In about 50% of cases, it predicts correctly after asking fewer than 10 questions.

The codes and datasets are available publicly at

2. Related Work

During the initial phases of AI and NLP, the primary notion for question answering (QA) was that machines would answer by converting the question to a machine-readable form and then matching it against a background knowledge stack. However, no such system has been built to represent questions definitively. START (Katz et al., 2005) was the first QA model, based on parsing through structured data for answer prediction. Similar to the human approach, the QA model was paired with the ability to find relations between possible questions for efficient answer set reduction.

In the AURA and HALO models (Gunning et al., 2010), the answers were formulated from questions based on structured documents containing principles from the related field. AURA faced issues while relating query templates with underlying entities.

Existing open-ended QA systems deploy a pipeline of passage-search techniques against a related corpus to generate possible answers that match the expected type. For the search component, web information has been used for generating possible answers (Clarke et al., 2001; Dumais et al., 2002; Katz et al., 2003) as well as confirming present possibilities (Magnini et al., 2002; Ko et al., 2007). Wikipedia and other online resources have been referenced as a standard corpus by many question answering models (Kupiec, 1993; Ahn et al., 2004) as well as in the CLEF validation metric (Giampiccolo et al., 2008). However, these models treated the corpus mentioned above as an extension of a newswire corpus. They did not use its underlying properties to improve performance.

From the answer generation viewpoint, existing QA models formulate a semantic-type based method to suggest answers which try to match the expected answer type (Prager et al., 2000; Moldovan et al., 2000). In contrast, our model does not depend on the nature of the problem – it tries to gather a possible candidate set based on associated meta-tags of the movies such as director, actor, release year, etc. Even though our method generates a much broader set of candidate answers, it can outperform semantic-based methods in a wider array of fields.

Advanced techniques in deep learning have shown remarkable performance in QA (Tay et al., 2018; Yoon et al., 2019; Tayyar Madabushi et al., 2018). Zheng et al. (Zheng et al., 2018) tried to find ‘help’ documents based on the user query. The user query is used to find out the best possible query template using text mining and simultaneously building a semantic dependency graph. We design our algorithm to reduce the size of possible answers from the knowledge graph as discussed in (Zheng et al., 2018). Yang et al. (Yang et al., 2017) proposed a solution for relating questions to relevant documents using a probabilistic scoring approach. Here we use a similar mechanism for questionnaire generation.

Most of the existing studies develop an answering model for the questions asked by the user. However, there are a few models (Cohen et al., 2007; Chu-Carroll et al., 2012), which create carefully curated questions to predict the information as per the user’s answers, like the model discussed in this paper.

3. Dataset Details

We use DBpedia, a structured content extraction project built from Wikipedia. It allows us to semantically query relations and properties associated with Wikipedia resources and other datasets. We extract data for 18,481 movies from DBpedia. Each movie has 113 metadata tags associated with it.

3.1. Data Acquisition

  1. We acquire the dataset and formulate the knowledge graph using a two-way inverted index.

  2. The forward index maps each movie to its metadata tags.

  3. The backward index maps each metadata tag to all the movies associated with it.

Figure 1 shows the forward and backward index on a subset of data.
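The two-way index described above can be sketched as a pair of dictionaries. This is a minimal illustration and not the authors' implementation: the function name `build_indexes`, the record layout, and the placeholder entries `Movie B`, `Movie C`, and `Actor X` are assumptions made for the example.

```python
from collections import defaultdict


def build_indexes(records):
    """Build a forward (movie -> tags) and a backward (tag -> movies) index."""
    forward = {}
    backward = defaultdict(set)
    for movie, tags in records.items():
        # Flatten {"actor": ["A", "B"]} into {("actor", "A"), ("actor", "B")}.
        pairs = {(tag, value) for tag, values in tags.items() for value in values}
        forward[movie] = pairs
        for pair in pairs:
            backward[pair].add(movie)
    return forward, dict(backward)


# Illustrative records; "Movie B", "Movie C" and "Actor X" are placeholders.
records = {
    "3 Idiots": {"actor": ["Aamir Khan"], "era": ["2000s"]},
    "Movie B": {"actor": ["Aamir Khan", "Actor X"], "era": ["1990s"]},
    "Movie C": {"actor": ["Actor X"], "era": ["1990s"]},
}
forward, backward = build_indexes(records)
print(sorted(backward[("actor", "Aamir Khan")]))
```

With both directions available, a question about a tag can look up all affected movies through the backward index, and a candidate movie can be described through the forward index.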

Figure 1. Sample of data representing the forward and the backward indexing between movies and metadata.

Figure 2. Model architecture of the proposed system. The initial knowledge graph with equiprobable nodes along with likelihood estimator is provided to the question generator. It generates a question (Q) from one of the levels. The user’s response to the question modifies the probabilities of nodes in the graph. If the stopping criteria are met, the model predicts the answer; else, the system iterates using the updated graph.

3.2. Data Preprocessing

We filter the data during preprocessing by eliminating redundancies and inconsistencies. The preprocessing steps are as follows:

  1. Reduce the dataset to 200 popular movies for conducting the experiments.

  2. Remove the tags that are present in fewer than 10% of the movies in the dataset.

  3. Create additional tags like ‘Era’ to signify the decade in which the movie was released.

  4. Keep only relevant values for tags like ‘Genre’ and ‘Subject’, e.g., Indian crime films, Indian romance films, and Indian thriller films, and remove less commonly used values such as masala films, circus films, and films about courtesans in India.

  5. Manually add the missing values for movies in which higher-order attributes such as Director, Music composer, etc. were missing.
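Steps 2 and 3 of the list above can be sketched as follows. This is an illustration under assumptions rather than the authors' pipeline: the record layout (movie name to a dictionary of tags), the `release_year` tag name, and the function names are hypothetical.

```python
from collections import Counter


def era_of(year: int) -> str:
    """Map a release year to its decade label, e.g. 1995 -> '1990s'."""
    return f"{year // 10 * 10}s"


def add_era(records, year_tag="release_year"):
    """Return a copy of the records with an 'Era' tag derived from the year."""
    out = {}
    for movie, tags in records.items():
        tags = dict(tags)
        if year_tag in tags:
            tags["Era"] = era_of(int(tags[year_tag]))
        out[movie] = tags
    return out


def drop_rare_tags(records, min_fraction=0.10):
    """Remove tags that appear in fewer than min_fraction of the movies."""
    counts = Counter(tag for tags in records.values() for tag in tags)
    keep = {tag for tag, n in counts.items() if n >= min_fraction * len(records)}
    return {
        movie: {tag: value for tag, value in tags.items() if tag in keep}
        for movie, tags in records.items()
    }


# Twenty placeholder movies; only one of them carries "rare_tag" (5% coverage).
records = {f"Movie {i}": {"director": "D", "release_year": 1990 + i} for i in range(20)}
records["Movie 0"]["rare_tag"] = "x"
cleaned = drop_rare_tags(add_era(records))
print(cleaned["Movie 0"])
```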

4. Proposed Approach

We broadly classify the questions into two layers: (i) the primary layer and (ii) the secondary layer. Primary-layer questions cover a wider range of movies, while secondary-layer questions are more specific and target a smaller set of movies. Figure 2 shows the architectural details of the model.

| Primary Questions | Secondary Questions |
|---|---|
| Is your movie from the 1990s era? | Is Aamir Khan an actor of your movie? |
| Is Bollywood romance the genre of your movie? | Is Karan Johar the director of your movie? |
| Is feminist films the subject of your movie? | Is A.R. Rahman the music composer of the movie? |

Table 1. Different primary and secondary questions generated by the system during a game play

Our model is divided into five components:

1. Question Generator: The generator is a template-based, hierarchically structured model. It traverses the knowledge graph to ask questions based on learned experience, the answers received during the current run, and the most likely movies according to the scores assigned to each movie. In the primary layer, it poses questions that take user-specific data into account to reduce the size of the most probable set. In the secondary layer, it poses more specific questions that apply to a limited set of movies, to gain deeper insight into the user’s choice. Table 1 shows instances of primary and secondary layer questions.
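A template-based generator of this kind can be sketched as below. The template strings follow the example questions in Table 1. The tag names, the split of tags into the two layers, and the rule of asking about the highest-scoring entity are assumptions made for illustration, not the authors' exact procedure.

```python
# Question templates per metadata tag, phrased after the examples in Table 1.
TEMPLATES = {
    "era": "Is your movie from the {value} era?",
    "genre": "Is {value} the genre of your movie?",
    "subject": "Is {value} the subject of your movie?",
    "actor": "Is {value} an actor of your movie?",
    "director": "Is {value} the director of your movie?",
    "music composer": "Is {value} the music composer of the movie?",
}

# Assumed assignment of tags to layers, inferred from Table 1.
PRIMARY_TAGS = ("era", "genre", "subject")
SECONDARY_TAGS = ("actor", "director", "music composer")


def make_question(tag: str, value: str) -> str:
    """Fill the template for one (tag, value) entity."""
    return TEMPLATES[tag].format(value=value)


def next_question(scores: dict) -> str:
    """Ask about the entity with the highest score (hypothetical selection rule)."""
    tag, value = max(scores, key=scores.get)
    return make_question(tag, value)


print(make_question("actor", "Aamir Khan"))
print(next_question({("era", "1990s"): 0.6, ("genre", "Bollywood romance"): 0.3}))
```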

2. Answer Predictor: The predictor outputs a list of five movies in descending order of probability. It makes a guess once the total probability of the five most likely movies reaches the empirically chosen value of 0.5. If the user replies no to these five guesses, the predictor removes them from the probable choices. If the user says yes, the game stops and asks the user for the exact movie (from the five movies). It then alters the edge probabilities in the graph for future games. We perform this adjustment because every choice a player makes is an indication of the popularity of the movie and its associated entities.
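The guessing rule of the predictor can be sketched as follows, assuming the current distribution is held in a dictionary from movie name to probability. The function name `top_guesses` and the placeholder movie names are illustrative.

```python
def top_guesses(dist, k=5, threshold=0.5):
    """Return the k most probable movies once their total mass reaches threshold.

    Returns None while the model is not yet confident enough to guess.
    """
    ranked = sorted(dist, key=dist.get, reverse=True)[:k]
    if sum(dist[movie] for movie in ranked) >= threshold:
        return ranked
    return None


# At the start of a game every one of the 200 movies is equally likely.
uniform = {f"Movie {i}": 1 / 200 for i in range(200)}
print(top_guesses(uniform))  # no guess yet: the top five hold far less than 0.5

# After several answers the mass has concentrated on a few movies.
peaked = {"F": 0.40, "A": 0.30, "B": 0.15, "C": 0.05, "D": 0.05, "E": 0.05}
print(top_guesses(peaked))
```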

3. Likelihood Estimator: For the primary layer, we store two different probability values: (i) a probability on each level, determined by the number of movies in which the specific entity is present, and (ii) a probability based on how many times users have selected that particular entity on the given level.

Let $P_l(v)$ denote the probability of entity $v$ at level $l$, and $P_g(v)$ denote the probability of entity $v$ stored throughout the run of the game, denoted by $g$. We assign the total probability as a weighted sum of both probabilities:

$$P(v) = \alpha \, P_l(v) + (1 - \alpha) \, P_g(v)$$

Here $\alpha$ is preset to 0.2. This ensures that the model learns more from the games played so far than from the static scores at each level. We add an additional component for user-specific likelihood estimation. For example, to estimate the era of a movie, we use the following formula:


For the secondary layer, an additional probability component for the total distribution score under entity $v$ during the current run of the game is computed as follows:

$$P_c(v) = \sum_{m \in set(v)} D(m)$$

where $set(v)$ denotes the set of movies under the entity $v$ and $D(m)$ is the current probability of movie $m$.
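The two estimates described for the Likelihood Estimator can be sketched numerically. The sketch assumes that the weighted sum places the preset weight 0.2 on the static per-level score and the remaining 0.8 on the score learned from past games, and that the secondary-layer score of an entity is the sum of the current probabilities of the movies under it. The function names and sample values are hypothetical.

```python
ALPHA = 0.2  # preset weight from the text; assumed to apply to the static score


def combined_probability(static_p: float, learned_p: float, alpha: float = ALPHA) -> float:
    """Weighted sum of the static level score and the score learned from past games."""
    return alpha * static_p + (1 - alpha) * learned_p


def category_score(movies_under_entity, dist) -> float:
    """Total current probability of the movies that carry a given entity."""
    return sum(dist[movie] for movie in movies_under_entity)


# Illustrative values only.
print(combined_probability(static_p=0.5, learned_p=0.25))
dist = {"3 Idiots": 0.4, "Movie B": 0.35, "Movie C": 0.25}
print(category_score({"3 Idiots", "Movie B"}, dist))
```

With a small weight on the static score, an entity that users pick often keeps a high overall probability even if it appears in few movies.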

4. Distribution Modifier: For cases where the user answers maybe, the distribution remains unchanged. For definitive answers, the distribution is modified as follows:


for the set of movies where the user’s answer is no, or for yes. is fixed to 1 empirically. is the current distribution. is the normalization factor for current distribution.

The Modifier essentially takes the set of movies for which the user answered yes and increases their probability, while decreasing the probability of the set of movies for which the user answered no (using Eq. 5). For the set of predicted movies that the user marks as incorrect, we set their probability to 0 and distribute it equally among all the remaining movies.

We perform the above step for two reasons: distributing the probability equally does not change the relative difference in score among the other movies, and the movies rejected by the user will never appear as a guess again.
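The behaviour of the Modifier can be sketched as below. The multiplicative update in `soft_update` and its `boost` parameter are an assumed stand-in chosen for illustration and are not the update rule used in the paper. `reject_guesses` follows the redistribution step described in the text for guesses the user marks as incorrect.

```python
def soft_update(dist, matching, answer, boost=2.0):
    """Raise or lower the mass of the movies matching the asked entity, then renormalise.

    Assumed update rule: no movie ever reaches probability zero from an answer.
    """
    if answer == "maybe":
        return dict(dist)  # a 'maybe' leaves the distribution unchanged
    factor = boost if answer == "yes" else 1.0 / boost
    scaled = {m: p * (factor if m in matching else 1.0) for m, p in dist.items()}
    total = sum(scaled.values())
    return {m: p / total for m, p in scaled.items()}


def reject_guesses(dist, rejected):
    """Zero out rejected guesses and share their mass equally among the rest."""
    remaining = [m for m in dist if m not in rejected]
    if not remaining:
        raise ValueError("every movie has been rejected")
    share = sum(dist[m] for m in rejected) / len(remaining)
    return {m: 0.0 if m in rejected else dist[m] + share for m in dist}


dist = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
after_no = soft_update(dist, matching={"A"}, answer="no")
print(after_no["A"])          # lower than before, but still above zero
after_reject = reject_guesses(dist, rejected={"A"})
print(after_reject)
```

Because a mistaken answer only scales a movie down, later correct answers can still bring it back to the top of the ranking.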

5. Answer Tracer: Every time the model predicts the movie, the answerer is asked if the prediction is correct. The next prediction, therefore, is based on the response of the answerer. If the system is unable to predict within 20 questions, it gives a trace of user answers along with the corresponding facts related to the movie.

5. Baseline Methods

We compare our model with the following baselines, which we designed ourselves because no existing methods were available for this particular task. We use the number of questions asked and the cumulative probability of ranked guesses as the evaluation metrics for the comparative analysis.


Baseline 1: The model frames questions systematically from six aspects of a movie: era, genre, subject of the story, actors, director, and music composer. Each question eliminates a subset of possible answers after a definite reply by the user. An answer of maybe does not add to the model’s understanding and leaves the current state unchanged. The model poses questions based on the possibilities it gathers over the current run of answers. It eliminates answers in a strict binary fashion without accounting for human error during the game. Figure 3 highlights the game proceedings for question selection.

Figure 3. Question selection model in Baseline 1.

Baseline 2: This model frames questions from the same six aspects of a movie as baseline 1 along with a learning model added to the graph traversal. It poses questions hierarchically, giving weight to initial questions with the maximum possibility of answer set reduction. It also takes into account the user’s personal information such as ‘birth year’ to adapt the timeline of movies to be questioned for a more efficient guess. The model associates probability scores to each subcategory of the aspects mentioned above to determine the most likely category. Like baseline 1, this model also suffers from the lack of robustness towards human errors. However, due to the learning aspect it performs more efficiently than baseline 1.
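The strict elimination that makes both baselines sensitive to mistakes can be sketched as a set filter. The function name `strict_filter` and the placeholder movies are illustrative. The point of the example is that a single wrong 'no' removes the target movie for the rest of the game.

```python
def strict_filter(candidates, matching, answer):
    """Binary elimination: keep or drop the movies matching the asked entity."""
    if answer == "yes":
        return candidates & matching
    if answer == "no":
        return candidates - matching
    return set(candidates)  # 'maybe' keeps the current state


candidates = {"3 Idiots", "Movie B", "Movie C"}   # "Movie B" and "Movie C" are placeholders
aamir_khan_movies = {"3 Idiots", "Movie B"}

# The answerer is thinking of "3 Idiots" but mistakenly answers 'no'.
after_wrong_no = strict_filter(candidates, aamir_khan_movies, "no")
print(after_wrong_no)
```

Once the target is gone from the candidate set, no later answer can restore it, which is the failure mode the probabilistic model is designed to avoid.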

6. Experiments and Results

The evaluation is conducted through a user study. A total of 50 participants interacted with the game. Each participant played the game 5 times with different movies. They also played with the same set of movies against each of the baselines. 21 participants were female, and 29 were male. Their age group distribution is shown in Table 2.

| Age group (in years) | No. of participants |
|---|---|
| 18-25 | 37 |
| 25-35 | 8 |
| 35+ | 5 |

Table 2. Age group distribution of human participants.

We trained baseline 2 and our model for 50 random movies before the participants interacted with them. This provides a preliminary idea about the distribution to the model so that it won’t suffer from the problem of a cold start. In each game, every competing model generated 5 movies as output in a ranked fashion. Each output is considered as a single attempt for the evaluation.

| No. of questions | Baseline 1 | Baseline 2 | Proposed model |
|---|---|---|---|
| <10 | 37 | 72 | 127 |
| 10-15 | 38 | 93 | 62 |
| 15-20 | 109 | 39 | 38 |
| Not Answered | 66 | 46 | 23 |

Table 3. Number of questions asked to predict the correct movie. The game was played 250 times.

| Rank | Baseline 1 | Baseline 2 | Proposed model |
|---|---|---|---|
| 1 | 0.032 | 0.104 | 0.564 |
| 2 | 0.052 | 0.148 | 0.712 |
| 3 | 0.064 | 0.196 | 0.788 |
| 4 | 0.096 | 0.232 | 0.828 |
| 5 | 0.108 | 0.256 | 0.860 |

Table 4. Cumulative probability of the competing models to predict the correct movie within the $k$-th rank ($1 \le k \le 5$).

Table 3 compares the number of questions asked before each model predicted the correct movie. Baseline 2 shows a substantial improvement over Baseline 1 because of the learning aspect incorporated into the model. Baseline 2 poses relevant questions during the initial stages using previous experience. However, our model outperforms both baselines: it predicts the correct movie in fewer than 10 questions in ~50% of the games, with an error rate of ~10%, i.e., games in which the model was unable to guess the movie within 20 questions.
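The percentages quoted for Table 3 follow directly from the game counts. The short check below recomputes them; the counts are copied from Table 3, while the dictionary layout and function names are only illustrative.

```python
TOTAL_GAMES = 250

# Games per question bucket, copied from Table 3.
TABLE_3 = {
    "Baseline 1": {"<10": 37, "10-15": 38, "15-20": 109, "not answered": 66},
    "Baseline 2": {"<10": 72, "10-15": 93, "15-20": 39, "not answered": 46},
    "Proposed model": {"<10": 127, "10-15": 62, "15-20": 38, "not answered": 23},
}


def rate(model: str, bucket: str) -> float:
    """Fraction of all games that fall into one bucket."""
    return TABLE_3[model][bucket] / TOTAL_GAMES


def success_rate(model: str) -> float:
    """Fraction of games in which the movie was predicted within 20 questions."""
    return (TOTAL_GAMES - TABLE_3[model]["not answered"]) / TOTAL_GAMES


for model, buckets in TABLE_3.items():
    assert sum(buckets.values()) == TOTAL_GAMES  # each column covers all 250 games
    print(f"{model}: under 10 questions {rate(model, '<10'):.1%}, "
          f"solved {success_rate(model):.1%}")
```

For the proposed model this gives 127/250 = 50.8% of games solved in fewer than 10 questions and 227/250 = 90.8% solved overall, matching the figures stated in the contributions.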

Table 4 shows the cumulative probability of the ranks within which the movie is predicted correctly in the first attempt. Our model significantly outperforms both the baselines.

Our model ranks the movies effectively in fewer questions because it learns to assign probabilities to each movie rather than just associating movies with metadata. It predicts the correct movie even when the participants answer a few questions incorrectly. This is evident from the number of movies that each model was not able to predict (Table 3). Answering the initial questions incorrectly makes it difficult for our model to predict the movie in one attempt. However, in most cases it recovers and predicts the movie in later stages of the game.

7. Conclusion

An essential aspect for achieving good QA accuracy is to make sure the correct answer remains in the candidate set. It needs to be maintained despite human errors in judgment. Taking this into account, we propose a knowledge graph-based approach to develop a 20Q game on Bollywood data. The model overcomes the major issue present in the baselines of handling human errors in answering questions by distributing probabilities intelligently. Our model predicted correct movies in fewer questions as compared to the baselines.

This work can be extended to improve recommendation systems, to create applications that ask simple questions and predict a class, etc. The question generator can be further extended to generate questions based on the context of the movie plot, characters, script, etc. using a sequence-to-sequence model. This would help the model ask more specific questions and predict correct values more effectively.


References

  • D. Ahn, V. Jijkoun, G. Mishne, K. Müller, M. de Rijke, and S. Schlobach (2004) Using Wikipedia at the TREC QA track. In TREC.
  • J. Chu-Carroll, J. Fan, B.K. Boguraev, D. Carmel, D. Sheinwald, and C. Welty (2012) Finding needles in the haystack: search and candidate generation. IBM Journal of Research and Development 56, pp. 6:1–6:12.
  • C. L. A. Clarke, G.V. Cormack, T.R. Lynam, C.M. Li, and G.L. McLearn (2001) Web reinforced question answering (MultiText experiments for TREC 2001).
  • D. Cohen, E. Amitay, and D. Carmel (2007) Lucene and Juru at TREC 2007: 1-million queries track.
  • S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng (2002) Web question answering: is more always better? In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 291–298.
  • D. Giampiccolo, P. Forner, J. Herrera, A. Peñas, C. Ayache, C. Forascu, V. Jijkoun, P. Osenova, P. Rocha, B. Sacaleanu, and R. Sutcliffe (2008) Overview of the CLEF 2007 multilingual question answering track. In Advances in Multilingual and Multimodal Information Retrieval, C. Peters, V. Jijkoun, T. Mandl, H. Müller, D. W. Oard, A. Peñas, V. Petras, and D. Santos (Eds.), Berlin, Heidelberg, pp. 200–236. ISBN 978-3-540-85760-0.
  • D. Gunning, V. K. Chaudhri, P. E. Clark, K. Barker, S. Chaw, M. Greaves, B. Grosof, A. Leung, D. D. McDonald, S. Mishra, J. Pacheco, B. Porter, A. Spaulding, D. Tecuci, and J. Tien (2010) Project Halo update—progress toward Digital Aristotle. AI Magazine 31 (3), pp. 33–58.
  • B. Katz, J. Lin, D. Loreto, W. Hildebrandt, M. W. Bilotti, S. Felshin, A. Fernandes, G. Marton, and F. Mora (2003) Integrating web-based and corpus-based techniques for question answering. pp. 426–435.
  • B. Katz, G. Marton, G. C. Borchardt, A. Brownell, S. Felshin, D. Loreto, J. Louis-Rosenberg, B. Lu, F. Mora, S. Stiller, Ö. Uzuner, and A. Wilcox (2005) External knowledge sources for question answering. In TREC.
  • J. Ko, L. Si, and E. Nyberg (2007) A probabilistic framework for answer selection in question answering. pp. 524–531.
  • J. Kupiec (1993) MURAX: a robust linguistic approach for question answering using an on-line encyclopedia. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’93, New York, NY, USA, pp. 181–190. ISBN 0-89791-605-0.
  • B. Magnini, M. Negri, R. Prevete, and H. Tanev (2002) Is it the right answer?: exploiting web redundancy for answer validation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Stroudsburg, PA, USA, pp. 425–432.
  • D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Girju, R. Goodrum, and V. Rus (2000) The structure and performance of an open-domain question answering system. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, pp. 563–570.
  • J. Prager, E. Brown, A. Coden, and D. Radev (2000) Question-answering by predictive annotation. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’00, New York, NY, USA, pp. 184–191. ISBN 1-58113-226-3.
  • Y. Tay, L. A. Tuan, and S. C. Hui (2018) Multi-cast attention networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, pp. 2299–2308. ISBN 978-1-4503-5552-0.
  • H. Tayyar Madabushi, M. Lee, and J. Barnden (2018) Integrating question classification and deep learning for improved answer selection. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA.
  • S. Yang, L. Zou, Z. Wang, J. Yan, and J. Wen (2017) Efficiently answering technical questions — a knowledge graph approach. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pp. 3111–3118.
  • S. Yoon, F. Dernoncourt, D. S. Kim, T. Bui, and K. Jung (2019) A compare-aggregate model with latent clustering for answer selection. CoRR abs/1905.12897.
  • W. Zheng, J. X. Yu, L. Zou, and H. Cheng (2018) Question answering over knowledge graphs: question understanding via template decomposition. Proc. VLDB Endow. 11 (11), pp. 1373–1386. ISSN 2150-8097.