Machine Reading Comprehension (MRC) has made significant strides, with an array of neural models rapidly approaching human parity on benchmarks such as SQuAD. However, existing methods remain in their infancy in terms of cognitive intelligence. Recently, brain science and psychology have provided an important basis for the development of brain-like computing and for simulating human perception, thinking, understanding, and reasoning abilities.
Thinking is the generalization and indirect reflection by the human brain of the nature, interrelationships, and internal regularities of objective things. Psychology distinguishes two complementary types of thinking: inertial thinking, which proceeds from a previous stimulus to a subsequent one, and reverse thinking, which proceeds from a subsequent stimulus back to a previous one. Inertial thinking is a conventional mode that approaches and solves problems from established ideas. Reverse thinking is a creative mode that works in the opposite direction. In the MRC task, the two types of thinking can be regarded as reasoning from questions to answers and from answers to questions, respectively. For example, as shown in Fig. 1, we can obtain the answer easily by locating the entities pregnant women and loquat. Conversely, the generated question, which can be reasoned out by reading the answer and the passage, covers two aspects: whether pregnant women can eat loquat and what benefit eating loquat brings to pregnant women. We hope that this ability of reverse reasoning can improve performance on reading comprehension tasks.
Previous methods [28, 23, 26, 17] only consider the obverse logical relationship, which is based on the given question and the passage; they ignore the reverse relationship between the given passage and the answer. Although prior work proposes a joint model that both asks and answers questions, it couples all the knowledge in a single module rather than decoupling it into separate modules, as the psychological account would suggest. We hypothesize that the ability to reason in reverse can help models achieve better QA performance. This is motivated partly by observations in psychology that devising questions while reading helps students improve in reader-based processing of text.
Therefore, insights into solutions can be gained from human cognitive processes. Complementary Learning Systems Theory (CLST) [13, 14, 8, 15] suggests that the human brain contains complementary learning systems that support the simultaneous use of many sources of information as we seek to understand an experienced situation. One system gradually acquires an integrated body of knowledge through interleaved learning, including our knowledge of word meanings, the properties of frequently encountered objects, and the characteristics of familiar situations; it resembles inertial thinking, which learns relationships between things in the real world over a long period. The other is a fast learning system similar to reverse thinking, which stimulates and strengthens infrequently used circuits in the brain from an unusual perspective.
In this paper, we propose the Bi-directional Cognitive Knowledge Framework (BCKF), and we design the corresponding Bi-directional Cognitive Thinking Network (BCTN) to validate the effectiveness of reverse thinking, as shown in Fig. 2 and introduced in detail in Section 4.
The contributions can be summarized as follows:
From the perspective of Complementary Learning Systems Theory, we propose a Bi-directional Cognitive Knowledge Framework (BCKF) and corresponding Bi-directional Cognitive Thinking Network (BCTN), which simulate the connection of neural circuits in the brain, and determine the stimulus intensity of reverse thinking in memory based on the gate mechanism.
The proposed model, with its Gate-Reverse Thinker and No-Gate-Inertial Thinker, can reason questions in reverse, which assists inertial thinking in generating more accurate answers in a decoupled fashion.
We conduct extensive experiments, and the results show that our proposed BCTN is strongly competitive with other models. An ablation study validates the importance of its different modules.
2 Related Work
2.1 MRC Datasets
Recently, with the release of new datasets, MRC tasks have attracted significant attention. Early datasets were cloze-style, requiring the machine to find an entity to fill a blank; later ones were multiple-choice [9, 21] and extractive [20, 1]. After that, datasets transitioned from extractive tasks to generative tasks [16, 4].
2.2 MRC Models
Many models treat machine reading comprehension as extracting an answer span from the given passage, obtained by predicting the probability distributions of the start and end positions of the answer. Some models [26, 17] focus on generating answers from the question and multiple passages with an extraction-then-synthesis framework or a multi-style abstractive summarization approach. Another inspiring line of work is the joint model of Wang et al. (2017), which uses a sequence-to-sequence framework that encodes the document and generates a question (answer) given an answer (question). However, that model mixes the bi-directional knowledge in the same module; it is hybrid rather than decoupled, which is inconsistent with cognitive psychology. Our Bi-directional Cognitive Thinking Network does not consider the two directions simultaneously; instead, it decouples them into a bi-directional way of thinking rooted in cognitive psychology.
2.3 Pre-trained Language Model
Another interesting line of study is pre-trained language models, which have made a great contribution to a wide range of MRC tasks. Previous work [5, 19] adapts traditional language models to improve downstream tasks. A significant recent milestone is BERT, which achieved new state-of-the-art results on eleven natural language processing tasks. Subsequent work [36, 12], such as XLNet and RoBERTa, introduces more data and bigger models for better performance. However, such models are difficult to execute effectively on resource-restricted devices, so Lan et al. (2019) propose ALBERT to reduce memory consumption and increase training speed. In this paper, we use the RoBERTa models to encode contextual information and regard them as baselines.
3 Bi-directional Cognitive Knowledge Framework
Motivated by Complementary Learning Systems Theory (CLST) [13, 14, 8, 15], we propose the Bi-directional Cognitive Knowledge Framework (BCKF). As shown in Fig. 2 (a), the blue box contains the neocortical system, which is organized around a set of inputs. The red box is the medial temporal lobes (MTL) system, where the blue ovals (Fusion System, Inertial Thinker, Reverse Thinker, Reasoner and Gate Controller) represent different modules that are directly or indirectly associated with the orange ovals, defined as a neutral pool containing a small amount of information such as visual and linguistic inputs. Green arrows represent learned connections between the blue ovals, which bind the elements of an embedding together for later reactivation. Green dotted lines indicate that the bi-directional thinkers contain the reasoning module. Blue arrows represent information transfer between systems. The red and blue circular arrows represent self-learning and self-updating. The controller determines the stimulus intensity of reverse thinking in memory so as to make different decisions in different situations. Finally, the understanding system guides the behavior of the model and its understanding of language by combining inertial thinking and reverse thinking.
Following the overview in Fig. 2, the proposed model that derives from our Bi-directional Cognitive Knowledge Framework consists of the following modules, and the training of the model contains two stages.
The Backward Encoder models interactions between the answer and passage during the first stage (Backward Encoder → Gate-Reverse Thinker → Fusion Layer → Soft Decoder), called Reverse Thinking Training (§4.2).
The Forward Encoder, similar to the Backward Encoder but with different parameters and inputs, retrains with the given passage and question during the second stage (Forward Encoder → No-Gate-Inertial Thinker → Gate-Reverse Thinker → Fusion Layer → Soft Decoder), called Retraining with Inertial Thinking (§4.3).
The Medial Temporal Lobes (MTL) System contains Gate-Reverse Thinker, No-Gate-Inertial Thinker and Fusion Layer. The Gate-Reverse Thinker learns the reverse connections of the neurons from the negative side and determines the stimulus intensity of reverse thinking in memory. The No-Gate-Inertial Thinker builds the obverse relationship. And the Fusion Layer combines the bi-directional knowledge to prepare for decoding.
The Soft Decoder outputs an answer (question) sentence with soft-gated pointer-generator copying, synthesizing the distribution over the vocabulary and the distribution over tokens from the source input into a single output distribution.
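The two-stage wiring above can be sketched in code. This is a minimal illustration of the data flow only; every module name here is a hypothetical placeholder, not the paper's released implementation:

```python
# Illustrative sketch of the BCTN two-stage wiring. All module names
# (backward_encoder, gate_reverse_thinker, ...) are hypothetical
# placeholders standing in for the components described in the text.

def stage1_reverse_thinking(passage, answer, modules):
    """Stage 1: infer the question from (answer, passage)."""
    h = modules["backward_encoder"](answer, passage)
    r = modules["gate_reverse_thinker"](h)
    # In stage 1 the inertial side of the fusion layer is a random
    # vector (per §4.2); None stands in for it here.
    fused = modules["fusion_layer"](None, r)
    return modules["soft_decoder"](fused)          # -> question tokens

def stage2_inertial_retraining(passage, question, modules):
    """Stage 2: answer the question, reusing the trained reverse thinker."""
    h = modules["forward_encoder"](question, passage)
    i = modules["no_gate_inertial_thinker"](h)
    r = modules["gate_reverse_thinker"](h)         # reverse circuit from stage 1
    fused = modules["fusion_layer"](i, r)
    return modules["soft_decoder"](fused)          # -> answer tokens
```

A concrete model would populate `modules` with the encoders, thinkers, fusion layer, and decoder of Fig. 2; the sketch only fixes the order in which they are called.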
4.1 Problem Formulation
The goal of our task can be formulated as follows. Given a question and a passage, with the corresponding answer, our model predicts an output sequence conditioned on the MTL system, which combines the bi-directional knowledge between the passage and the question (answer). Training is divided into two stages. In the first stage, the model infers the question from the passage and answer, learning the reverse connections of the neurons from the negative side. In the second stage, the model takes reverse thinking into account and retrains with the given passage and question in a bi-directional thinking schedule. During prediction, the model combines the bi-directional knowledge to answer the question.
4.2 Reverse Thinking Training
In this stage, we train the Gate-Reverse Thinker with the answer and passage; the resulting parameters are regarded as the connections of the reverse circuit in the brain. The gate is analogous to the controller in Fig. 2 (a), which determines the stimulus intensity of reverse thinking in memory so as to make different decisions in different situations. Finally, the decoder infers a question based on the answer.
Backward Encoder: Following the implementation of BERT, we add the special classification embedding ([CLS]), which encodes entailment information between the two sentences, and separate the answer and passage with a special token ([SEP]). The total length of the input is the sum of the lengths of the answer and the passage. In order to review the answer and locate answer-relevant semantic information, we also encode the answer separately and obtain a pure answer vector over its tokens, where the representations of the first token of each sequence are passed through a fully connected layer.
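As a concrete illustration of the input packing, the following sketch assembles the BERT-style pair; this assumes the standard [CLS]/[SEP] convention, and the truncation rule is illustrative rather than the authors' exact one:

```python
def pack_pair(answer_tokens, passage_tokens, max_len=512):
    """Pack two token lists in the BERT input convention:
    [CLS] answer [SEP] passage [SEP], with segment ids 0/1.
    Illustrative truncation: cut the passage side first."""
    tokens = ["[CLS]"] + answer_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    segments = [0] * (len(answer_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    if len(tokens) > max_len:
        overflow = len(tokens) - max_len
        tokens = tokens[: -1 - overflow] + ["[SEP]"]  # keep a closing [SEP]
        segments = segments[:max_len]
    return tokens, segments
```

A real tokenizer (e.g. the RoBERTa one) also maps tokens to ids; the sketch only shows the sequence layout the encoder sees.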
Gate-Reverse Thinker: As shown in Fig. 3, the reasoning module (blue) contains a stack of reasoning blocks, each consisting of a start (orange) and an end (green) sub-block. These two sub-blocks have a timing dependency: the calculation of the end sub-block must consider the result of the start sub-block. The reasoning blocks imitate the process of human thinking, continually mining the relationships between the answer and the passage through multiple reasoning steps. The start and end reasoning vectors at each step can be considered hidden states that enhance the representation, and the final reasoning vectors fuse all possible reasoning fragments based on their relevance to the answer (or question). Besides, the thinker computes a gate conditioned on the decoded tokens to determine the stimulus intensity of reverse thinking in memory.
Specifically, the module first concatenates the reasoning vectors with the context (Fig. 3 - ①), repeating each vector for consistent dimensions, following Wang and Jiang (2017), as:
where the superscript denotes the reasoning step; at step 0, we use randomly initialized vectors as the start and end reasoning vectors, respectively, and the repeat operation produces a matrix by tiling the vector the required number of times.
To get the semantic segments attended to by the answer (question), the module rereads the context using the pure answer (question) vector. It then computes a start probability distribution over the entire context, which is continuously updated during the reasoning process, as:
Then, we use the start probability distribution to obtain the updated start reasoning vector (Fig. 3 - ③) as a guide for later reasoning.
Having calculated the probability distribution and reasoning vector of the start position, and considering the temporal relationship between the end and start positions, we introduce the start reasoning vector into the calculation of the end sub-block (Fig. 3 - ④). Similarly, the end probability distribution and the end reasoning vector are computed by:
where the remaining symbols denote trainable parameters.
The model updates historical and current information iteratively for further reasoning. We use the start and end evidence vectors of the current block as the initial states of the subsequent block. Proceeding cyclically, we obtain the updated representation (Fig. 3 - ⑤).
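The multi-step start/end loop can be sketched numerically. Toy dot-product attention stands in for the paper's learned projections, so only the control flow — start sub-block first, end sub-block conditioned on it, states carried across steps — is illustrated; all weights and dimensions are assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reasoning_steps(context, answer_vec, num_steps=2):
    """Iteratively refine start/end reasoning vectors over the context.

    context: list of token vectors; answer_vec: pure answer vector.
    Each step attends over the context conditioned on the current
    reasoning state, then pools the context into an updated state."""
    dim = len(answer_vec)
    s = [0.0] * dim  # start reasoning vector (randomly initialised in the paper)
    e = [0.0] * dim  # end reasoning vector
    for _ in range(num_steps):
        # Start sub-block: query combines the answer vector and start state.
        q_s = [a + x for a, x in zip(answer_vec, s)]
        p_start = softmax([dot(q_s, c) for c in context])
        s = [sum(p * c[i] for p, c in zip(p_start, context)) for i in range(dim)]
        # End sub-block: its query also sees the freshly updated start vector.
        q_e = [a + x + y for a, x, y in zip(answer_vec, s, e)]
        p_end = softmax([dot(q_e, c) for c in context])
        e = [sum(p * c[i] for p, c in zip(p_end, context)) for i in range(dim)]
    return s, e, p_start, p_end
```

Stacking reasoning blocks then amounts to feeding the final `s`, `e` of one block in as the initial states of the next.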
In order to make different decisions in different situations, we introduce a gate mechanism to determine the stimulus intensity of reverse thinking in memory. We take the decoded tokens into account, where the attention comes from multi-head attention, and the gate input is defined as:
where the first term is the length of the decoder outputs. Then the context vector, which forms part of the input to the gate, is modeled by max-pooling over multiple time steps, as:
where the gate is produced by a sigmoid function and applied by element-wise multiplication.
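A minimal sketch of this gate follows. A max-pool over the decoded states stands in for the multi-head attention plus pooling described above, and the weight names are illustrative assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reverse_gate(decoded_states, reverse_vec, w, b):
    """Scale the reverse-thinking vector by a scalar gate computed
    from the tokens decoded so far.

    decoded_states: one vector per already-decoded token.
    w, b: weights of the gate's linear layer (hypothetical names)."""
    dim = len(reverse_vec)
    # Pool the decoded states into a single context vector.
    ctx = [max(s[i] for s in decoded_states) for i in range(dim)]
    # Scalar gate: sigmoid of a linear layer over [ctx; reverse_vec].
    feats = ctx + reverse_vec
    g = sigmoid(sum(wi * f for wi, f in zip(w, feats)) + b)
    # Element-wise scaling: the "stimulus intensity" of reverse thinking.
    return [g * r for r in reverse_vec], g
```

With the gate near 0 the reverse signal is suppressed; near 1 it passes through at full strength, which is the behavioural range the controller of Fig. 2 (a) is meant to cover.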
Fusion Layer: To combine reverse thinking and inertial thinking, we adopt the fusion kernel used in Wang et al. (2018) for better semantic understanding:
where one input is the output of the No-Gate-Inertial Thinker described in Sec. 4.3 (but here initialized with a random vector in the first training step), and the combination weights are hyper-parameters.
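Fusion kernels of this family are commonly implemented as a gated combination of the two inputs together with their element-wise product and difference; the exact form below is an assumption for illustration (per-feature scalar weights stand in for full linear projections):

```python
import math

def fuse(x, y, w_r, w_g, bias_r=0.0, bias_g=0.0):
    """Sketch of a gated fusion of inertial vector x and reverse vector y:
    r = tanh(W_r [x; y; x*y; x-y]),  g = sigmoid(W_g [x; y; x*y; x-y]),
    out = g * r + (1 - g) * x."""
    out = []
    for xi, yi, wr, wg in zip(x, y, w_r, w_g):
        feat = xi + yi + xi * yi + (xi - yi)   # collapsed [x; y; x*y; x-y]
        r = math.tanh(wr * feat + bias_r)
        g = 1.0 / (1.0 + math.exp(-(wg * feat + bias_g)))
        out.append(g * r + (1.0 - g) * xi)     # gate decides how much to fuse
    return out
```

When the gate saturates at 0 the layer passes the inertial vector through unchanged, which matches the first-training-step behaviour where the inertial side is still random.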
Soft-Decoder: Following previous work, we use a stack of Transformer decoder blocks on top of the embeddings provided by the word embedding layer and self-attention. The second and third sub-layers perform multi-head attention over the encoder outputs (Eqn. 3) and the fused representation (Eqn. 15). We also use the pointer-softmax mechanism, which learns to switch between copying words from the document and generating words from a prescribed vocabulary. We omit the details due to space limitations and directly give the probability distribution of each output token, as:
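The pointer-generator mixture underlying the soft decoder can be sketched as follows (a generic formulation of pointer-softmax copying, not the authors' exact parameterization):

```python
def pointer_softmax(vocab_dist, copy_attn, source_ids, p_gen, vocab_size):
    """Mix generation and copying into one output distribution:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of copy attention over
    source positions holding w. Source ids >= vocab_size are OOV slots
    appended to the extended vocabulary."""
    n_oov = max([i - vocab_size + 1 for i in source_ids] + [0])
    final = [p_gen * p for p in vocab_dist] + [0.0] * n_oov
    for a, tok_id in zip(copy_attn, source_ids):
        final[tok_id] += (1.0 - p_gen) * a
    return final
```

Because both inputs are probability distributions and `p_gen` is a convex weight, the output is again a valid distribution over the extended vocabulary.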
4.3 Retraining with Inertial Thinking
To satisfy the situation described in Sec. 3, we retrain our model, conditioned on the reverse thinking, in the forward direction.
Given the passage and question, the Forward Encoder, which has different parameters from the Backward Encoder, encodes the contextual information similarly to Eqns. (1-3). The No-Gate-Inertial Thinker then builds the obverse logical relationship in a multi-step reasoning fashion (Fig. 3) without a gate, as in Eqn. (10):
Then, the Gate-Reverse Thinker outputs the reverse thinking information based on the encoded context, as in Eqn. (14):
After calculating the inertial and reverse vectors, we similarly obtain the bi-directional knowledge as in Eqn. (15) and use it to decode the probability distribution of each answer token, as:
where the two weights are manually specified and determine the proportions of the bi-directional thinking.
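This final combination is a simple convex mixture; the sketch below assumes, per the values reported in §5.2, that the two manually specified proportions are 0.8 and 0.2:

```python
def combine_bidirectional(p_inertial, p_reverse, lam1=0.8, lam2=0.2):
    """Weighted mixture of the inertial and reverse output distributions.
    lam1/lam2 are the manually specified proportions of the two
    thinking directions (0.8/0.2 assumed from the experimental setup)."""
    return [lam1 * a + lam2 * b for a, b in zip(p_inertial, p_reverse)]
```

Since lam1 + lam2 = 1, the mixture of two probability distributions remains a probability distribution.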
5 Experimental Setup
5.1 Dataset & Evaluation Metrics
To demonstrate the effectiveness of our work, we choose the DuReader benchmark dataset, which is built from real-world search engine data (Baidu). It contains 300k questions, split into a training set (290k pairs) and a development set (10k pairs). The test split of DuReader is hidden from the public; we therefore randomly take 5k question-answer pairs from the development data as a validation set and use the rest of the development data to report test results. As for evaluation metrics, the answers are human-generated and not necessarily sub-spans of the passages, so the DuReader metrics are ROUGE-L (R-L) and BLEU-4 (B-4).
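For reference, sentence-level ROUGE-L is the LCS-based F-measure of Lin (2004); a self-contained sketch follows (β = 1.2 is a common implementation default, not necessarily the exact DuReader setting):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """Sentence-level ROUGE-L F-score over token lists."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p = lcs / len(candidate)   # LCS precision
    r = lcs / len(reference)   # LCS recall
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

Because ROUGE-L matches in-order subsequences rather than exact n-grams, it rewards generated answers that preserve the reference's content and ordering even when wording differs, which suits DuReader's human-written answers.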
5.2 Implementation Details
The BERT-style baselines use the same hyperparameters given in the original papers. We use the Adam optimizer for training, with an initial learning rate of 3e-5 and a mini-batch size of 32, and train for 12 epochs on DuReader. To improve the efficiency of both training and testing, following He et al. (2018), we select the single most relevant paragraph from the labeled documents as input. To coordinate the bi-directional knowledge, the two combination weights are set to 0.8 and 0.2, respectively.
In this section, we evaluate our model on the DuReader benchmark dataset. The pre-trained language model RoBERTa (RB) and several state-of-the-art models are used as baselines against the proposed BCTN. We use the large pre-trained language model (-large) in the main experiments to obtain strong results; to increase training speed, the base model (-base) is used in the ablation study and other experiments. The experimental results and analyses validate the effectiveness of our model.
|Model||Dev ROUGE-L||Dev BLEU-4||Test ROUGE-L||Test BLEU-4|
|Deep Cascade ||-||-||50.71||49.39|
6.1 Main Experiment
On the DuReader dataset, baselines fall into three types: state-of-the-art models, the RoBERTa-base (RB-base) model, and the RoBERTa-large (RB-large) model. RB-base and RB-large indicate that we directly use the pre-trained language model as the encoder without the MTL System. In order to reduce model complexity, previous methods [23, 31, 35] turn DuReader into an extractive task; we therefore divide the models into extractive and generative models. As shown in Table 1, our single model on DuReader outperforms the BERT-style baselines: a 3.86% gain on ROUGE-L and 4.34% on BLEU-4 over the RoBERTa-base model, and a 2.26% gain on ROUGE-L and 2.66% on BLEU-4 over the RoBERTa-large model. Although our model loses a little on BLEU-4 compared to the extractive models, which usually perform better on these metrics, it outperforms them on ROUGE-L by about 8.4%.
6.2 Ablation Study & Effect of Different Parameters
We conducted an ablation study to discuss the impact of the augmented components that can be removed from our framework. Table 2 shows the effectiveness of the different parts of the proposed BCTN. Note that with all added elements removed, configuration 3 reduces to the RB-base model.
According to Table 2, the R-L and B-4 of the original system are 54.18% and 38.85%, and we can conclude that: 1) the Gate Mechanism, which determines the stimulus intensity of reverse thinking in memory, is effective for the proposed model; 2) removing the Backward Encoder, which validates the importance of reverse thinking, results in a performance loss of about 4.14% on BLEU-4; 3) the medial temporal lobe system contributes to the overall performance, confirming our hypothesis that combining the bi-directional knowledge is necessary for answering questions. Every component is important to the proposed model.
We manually set different values of the two weights (Eqn. 15) to explore how the bi-directional knowledge affects the performance of the BCTN. From Table 3, we observe that the model reaches 57.02% on ROUGE-L when using only inertial thinking, but achieves a peak after adding reverse thinking. When the model uses only reverse thinking and ignores inertial thinking, its effectiveness drops significantly. This is consistent with human behavior in psychology: reverse thinking can assist inertial thinking to generate more accurate answers, and neither type of thinking is sufficient on its own.
6.3 Qualitative Examples
Qualitatively, we observed interesting examples in DuReader before and after adding the bi-directional thinkers. As shown in Table 3, in case one, the proposed model outputs a generated question, How to pass the master level of "End of Nightmare", which has the same semantics as the gold question, and our proposed BCTN gets the correct answer with a more detailed explanation (marked in blue). In contrast, RB-base outputs a wrong and ambiguous answer, especially the sentence they must reach the blood. In case two, the same conclusion holds: the RB-base answer describes the nutrients of the loquat rather than addressing the real question, whereas BCTN not only responds correctly but also explains why pregnant women can eat loquat. The answers become more explicable and appropriate with the help of our model, illustrating that our ideas can indeed assist question answering.
|Document||Players who want to clear the master level of “End of Nightmare” need to kill the hero Bian Que. If you want to kill Bian Que, you must arrive at Bian Que before the Iron Face Man, because the Iron Face Man is invincible in this level. In the level, players should have as few contacts with monsters as possible …|
|Gold Question||How to get through "End of Nightmare" (Master Difficulty) ?|
|Generated Question||How to pass the master level of "End of Nightmare" ?|
|Answer||RB-base: Players want to clear the level, they must reach the blood before the Iron Faced Man.|
|BCTN: To clear the master level at the end of the nightmare, you need to kill the hero Bian Que. If you want to kill Bian Que, you must arrive at Bian Que before the Iron Face Man, because the Iron Face Man is invincible in this level.|
|Document||Loquat is a kind of southern fruit …… Loquat has a unique flavor and is rich in nutrients. Pregnant women can eat it. It contains protein, fat, fruit acid, fructose and calcium, phosphorus, iron, sodium, potassium and other minerals. Pregnant women eating loquat can increase appetite, relieve heat and relieve thirst. Loquat can stimulate the secretion of digestive glands, and has a good effect on increasing appetite, helping digestion and absorption, and quenching thirst and relieving heat …|
|Gold Question||Loquat, can pregnant women eat it ?|
|Generated Question||Can pregnant women eat loquat ?|
|Answer||RB-base: Fructose, fat, calcium, phosphorus, iron, sodium, potassium.|
|BCTN: Pregnant women can eat loquat which increases appetite, relieves heat and relieves thirst. Loquat can stimulate digestive gland secretion, and have a good effect on increasing appetite, helping digestion and absorption.|
In this paper, we presented the Bi-directional Cognitive Thinking Network (BCTN), which corresponds to the Bi-directional Cognitive Knowledge Framework (BCKF), from the perspective of psychology. The BCTN answers questions with bi-directional knowledge by simulating inertial thinking and reverse thinking, and we decouple these two kinds of knowledge rather than coupling them in the same module. To determine the stimulus intensity of reverse thinking in memory, we compute a score from the decoded tokens via a gate mechanism. We show that the proposed BCTN is highly effective and competitive with previous methods on DuReader with a single model. In future work, we will consider using different datasets and designing various models that simulate the behavior of the brain to capture human-level language understanding and intelligence. Finally, we believe that our framework can generalize to other generative tasks, such as summarization and image captioning.
We thank all anonymous reviewers for their constructive comments. This work is supported by the National Natural Science Foundation of China (No.62006222).
- (2018) QuAC: question answering in context. In EMNLP, pp. 2174–2184.
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186.
- (2016) Pointing the unknown words. In ACL.
- (2018) DuReader: a Chinese machine reading comprehension dataset from real-world applications. In ACL, pp. 37–46.
- (2018) Universal language model fine-tuning for text classification. In ACL, pp. 328–339.
- (2015) Adam: a method for stochastic optimization. In ICLR.
- (1976) The psychology of mathematical abilities in schoolchildren. University of Chicago Press.
- (2016) What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences 20 (7), pp. 512–534.
- (2017) RACE: large-scale reading comprehension dataset from examinations. In EMNLP, pp. 785–794.
- (2018) Modeling reverse thinking for machine learning. CoRR abs/1803.00158.
- (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.
- (2019) RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- (1991) Simple memory: a theory for archicortex. In From the Retina to the Neocortex, pp. 59–128.
- (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102 (3), pp. 419.
- (2019) Extending machine language models toward human-level language understanding. CoRR abs/1912.05877.
- (2016) MS MARCO: a human generated machine reading comprehension dataset. In NIPS.
- (2019) Multi-style generative reading comprehension. In ACL, pp. 2273–2284.
- (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318.
- (2018) Deep contextualized word representations. In NAACL-HLT, pp. 2227–2237.
- (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pp. 2383–2392.
- (2013) MCTest: a challenge dataset for the open-domain machine comprehension of text. In EMNLP, pp. 193–203.
- (2017) Get to the point: summarization with pointer-generator networks. In ACL, pp. 1073–1083.
- (2017) Bidirectional attention flow for machine comprehension. In ICLR.
- (2017) Cognitive approach to natural language processing. Elsevier.
- (1982) Active comprehension: problem-solving schema with question generation for comprehension of complex short stories. Reading Research Quarterly, pp. 166–186.
- (2017) S-Net: from answer extraction to answer generation for machine reading comprehension. CoRR abs/1706.04815.
- (2017) Attention is all you need. In NIPS, pp. 5998–6008.
- (2017) Machine comprehension using match-LSTM and answer pointer. In ICLR.
- (2017) A joint model for question answering and question generation. CoRR abs/1706.01450.
- (2017) Gated self-matching networks for reading comprehension and question answering. In ACL, pp. 189–198.
- (2018) Multi-passage machine reading comprehension with cross-passage answer verification. In ACL, pp. 1918–1927.
- (2006) Computational thinking. Communications of the ACM 49 (3), pp. 33–35.
- (2018) Large-scale cloze test dataset created by teachers. In EMNLP, pp. 2344–2356.
- (2017) Dynamic coattention networks for question answering. In ICLR.
- (2019) A deep cascade model for multi-document reading comprehension. In AAAI, pp. 7354–7361.
- (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.