The Training of Neuromodels for Machine Comprehension of Text. Brain2Text Algorithm

03/30/2018 ∙ by A. Artemov, et al.

The Internet is a vast, exponentially growing information space, and the problem of searching for relevant data is more pressing than ever. The algorithm proposed in this article makes it possible to run natural language queries against the content of a document and obtain comprehensive, meaningful answers. The problem is partially solved for English, as SQuAD contains enough data to learn from, but no such dataset exists in Russian, so the methods currently in use are not applicable to Russian. The Brain2 framework addresses this problem: it can be applied to small datasets and does not require substantial computing power. The algorithm is illustrated on the text of the Sberbank of Russia Strategy and uses a neuromodel consisting of 65 mln synapses. The trained model is able to construct word-by-word answers to questions based on a given text. Its current limitations are the inability to identify synonyms, pronoun relations and allegories. Nevertheless, the experiments showed high capacity and generalisation ability of the suggested approach.




1 Introduction

Although natural language processing is one of the most rapidly developing fields in computer science, reading comprehension and question answering remain areas where humans outperform any model aimed at comprehensively understanding text. For a long time, the main difficulty was the lack of a dataset of appropriate quality and size to train on. An attempt to resolve the issue was made in [Rajpurkar et al., 2016] [1]. The authors designed a text corpus based on Wikipedia articles, containing a total of 23 215 paragraphs. They developed a logistic regression model, trained on correct answers to questions about the content of the given paragraphs, which is able to answer any question from any other text source. An important property of the model is that it is able to match nouns with related pronouns and to distinguish synonyms. The data for the training set was obtained by crowdsourcing: people were paid to construct questions and answers on given paragraphs. The features used related to matching words, bigrams, roots and other lexical properties of the question and answer. An F1-score of 0.51 was obtained, while for humans this measure equals 0.86. Also, [Wang, Jiang, 2016] [2] presented a model architecture based on a match-LSTM whose F1-score is 71% on the test dataset.

Another pressing issue in the engineering of linguistic models is the number of minimal basis elements into which the text is divided. For example, in Word2Vec models (Mikolov et al., Vector Representation of Words) such as SkipGram and CBOW, it is a 5-word ’window’ in which the 4 adjacent words account for the context. On this problem, the authors propose the following hypothesis: two words are enough to determine the unique meaning of a third word.

It can be interpreted in terms of Euclidean geometry in the following way: ”If we decompose any word in the basis of two given words, then its semantics in this basis is represented as a line which passes through (0,0) and the coordinates of the word with that meaning”. Based on this intuition, several algorithms were developed, which together constitute a complex sequence-to-sequence model.

2 Algorithm description

The process of getting an answer consists of 4 stages. 7 models are used over the whole iteration:

  1. Part-of-speech (POS) determination.
    This algorithm determines the parts of speech which should be included in the answer. It takes words from the question as input, e.g. ”When” / ”Who” / ”How” / ”What happened” / ”Which one”, and returns ”Numerical” / ”Noun” / ”Adverb” / ”Noun” / ”Adjective” respectively.

  2. Minimal linguistic semantic unit (MLSU).
    This model takes the question’s tokens and the required POS as input, and returns the ID of the MLSU.

  3. Verb determining algorithm.
    It takes the MLSU ID and returns the verb that is most suitable for the answer.

  4. MLSU tokens determination.
    The model takes the MLSU ID and returns the set of tokens to be used in the answer.

  5. Next token determination.
    The current word and the 2 previous words are given as input, and the model returns the token (or set of tokens) which should go next.

  6. Previous token determination.
    The inverse of the previous algorithm.

  7. Token-to-word model.
    The inputs are: adjacent words, the token, the token’s POS and the MLSU’s verb. It returns the next word that will be used in the answer.

At the first stage, the question is preprocessed. It is separated into two parts, informative and interrogative. MLSUs are extracted from the informative part (they have the following structure: ”verb (or nouns) + START + contextual tokens + END”).

Next, the POS of the searched item (X) is determined using the interrogative part.
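
The split into an interrogative and an informative part could be sketched as follows. This is only an illustration under stated assumptions: the function name `split_question`, the English question constructions (borrowed from the feature names listed in Table 2) and the longest-match rule are hypothetical and are not taken from the Brain2 implementation, which operates on Russian text.

```python
# Hypothetical list of question constructions (feature names from Table 2).
CONSTRUCTIONS = (
    "which", "by whom", "when", "to whom", "who", "what", "what is", "whose",
)


def split_question(question):
    """Return (interrogative construction, informative words).

    Assumption: the interrogative part is the longest known construction
    found at the very start of the question.
    """
    words = question.lower().rstrip("?").split()
    # Try multi-word constructions first so "what is" wins over "what".
    for construction in sorted(CONSTRUCTIONS, key=lambda c: -len(c.split())):
        parts = construction.split()
        if words[:len(parts)] == parts:
            return construction, words[len(parts):]
    return None, words


interrogative, informative = split_question("Who listens to classical music?")
```

Here `interrogative` would feed the POS model, while `informative` would be the material from which MLSUs are extracted.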

At the next stage, using the MLSUs and X’s POS, the algorithm determines the ID of the content, which is represented as a set of tokens.

Finally, the sentence is synthesised in the following way: the token of the next word is chosen using the two previous words. After that, the token and the previous words are used to determine the final form of the word. The process starts from the verb, as it is stored as a word, not a token.

Example: John listens to classical music every day while his sisters listen to emo.

There are two MLSUs:

1. MLSU ID1 = [listens; Context ID1 = {John, classical, music, every, day, to, START, END}]

2. MLSU ID2 = [listen; Context ID2 = {while, he, sister, emo, to, START, END}]

Each one consists of a verb and a set of tokens, and together they constitute a semantic unit.
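
As a minimal sketch, the two units above could be held in a small data structure like the one below. The class name `MLSU`, its field names and the helper `best_mlsu_id` are hypothetical. The helper is a plain token-overlap count that stands in for the trained model which maps question tokens to an MLSU ID; it is not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLSU:
    mlsu_id: int
    verb: str            # stored as a word, not a token
    context: frozenset   # contextual tokens, including START and END


UNITS = [
    MLSU(1, "listens", frozenset(
        {"John", "classical", "music", "every", "day", "to", "START", "END"})),
    MLSU(2, "listen", frozenset(
        {"while", "he", "sister", "emo", "to", "START", "END"})),
]


def best_mlsu_id(units, question_tokens):
    """Stand-in for the MLSU model: pick the unit sharing most tokens."""
    query = set(question_tokens)
    best = max(units, key=lambda u: len(query & (u.context | {u.verb})))
    return best.mlsu_id
```

For instance, `best_mlsu_id(UNITS, ["John", "listens"])` selects the first unit, because both tokens occur in it and neither occurs in the second.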

The algorithm iterates over all MLSUs, and after each iteration the chosen word is removed from the set, until it gets START or END.

For example, for the second MLSU we have the tokens {while, he, sister, emo, to, START, END}.

  1. The algorithm starts iterating from ”none” + ”listen” and gets ”to” (”to” is a token).

  2. The pair ”listen” and ”to” gives the final form of ”to”: ”to”.

  3. Then, using ”listen” and ”to”, we obtain ”emo”.

  4. The same procedure is repeated until we get END.

So we have reduced our set to {while, he, sister, START} for the choice of the previous token and its word form. Then we go from right to left:

  1. ”To” and ”listen” give ”sister”.

  2. From ”listen” and ”sister” the model returns ”sisters”, and so on. (For Russian, lexemes were used instead of tokens due to linguistic characteristics.)
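
The two passes can be sketched with toy lookup tables standing in for the trained models. Everything in this block is an assumption made for illustration: the table contents, the names `NEXT_TOKEN`, `PREV_TOKEN`, `WORD_FORM` and `synthesise`, and the simplification that the word form depends on the token alone (in the text it also depends on the neighbouring words).

```python
# Hypothetical stand-ins for the trained models (second MLSU only).
NEXT_TOKEN = {                      # (word before last, last word) -> next token
    (None, "listen"): "to",
    ("listen", "to"): "emo",
    ("to", "emo"): "END",
}
PREV_TOKEN = {                      # (first word, second word) -> previous token
    ("listen", "to"): "sister",
    ("sisters", "listen"): "he",
    ("his", "sisters"): "while",
    ("while", "his"): "START",
}
WORD_FORM = {"sister": "sisters", "he": "his"}   # token -> final word form


def synthesise(verb, tokens):
    remaining = set(tokens) - {"START", "END"}
    words = [verb]                  # the verb is already a word

    # Left to right: the two previous words select the next token.
    while True:
        before_last = words[-2] if len(words) > 1 else None
        token = NEXT_TOKEN.get((before_last, words[-1]), "END")
        if token == "END":
            break
        remaining.discard(token)    # a chosen token leaves the set
        words.append(WORD_FORM.get(token, token))

    # Right to left: the two following words select the previous token.
    while True:
        second = words[1] if len(words) > 1 else None
        token = PREV_TOKEN.get((words[0], second), "START")
        if token == "START":
            break
        remaining.discard(token)
        words.insert(0, WORD_FORM.get(token, token))

    return words, remaining


words, remaining = synthesise(
    "listen", {"while", "he", "sister", "emo", "to", "START", "END"})
```

With these tables the walk reproduces the clause of the example sentence, and the token set is exhausted when both passes stop.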

3 Model Training

3.1 Informational Neurobayesian Approach

Weights in the model are calculated using our approach, named the Informational Neurobayesian Approach (INA), in which every weight represents the quantity of information carried by the feature of the object which activates the given neuron (a modification of Pointwise Mutual Information).

The weight depends on the coefficient of emergence of the system for the class (layer) [Lutsenko E.V. 2002] [3] and the feature; on the number of possible conditions of the system (outputs); on the number of features figuring in the decision process; and on the bias, a parameter for activation of the class. The emergence coefficient represents the information obtained from the synthesis of several classes, which was not available before.

A thorough description of the approach can be found in [Artemov et al., 2017] [4].
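
For orientation, the standard (unmodified) Pointwise Mutual Information between a feature and a class can be computed from co-occurrence counts as below. This is the textbook definition only: it does not include the emergence coefficient or the bias of INA, and the function name `pmi` and its argument names are illustrative assumptions.

```python
import math


def pmi(joint_count, feature_count, class_count, total):
    """Standard PMI in bits: log2( p(f, c) / (p(f) * p(c)) ).

    joint_count   - objects of the class that have the feature
    feature_count - objects that have the feature
    class_count   - objects of the class
    total         - all objects
    All counts are assumed to be positive.
    """
    return math.log2(joint_count * total / (feature_count * class_count))


# The feature occurs only together with the class: positive information.
informative = pmi(joint_count=20, feature_count=20, class_count=50, total=100)
# The feature is statistically independent of the class: zero information.
independent = pmi(joint_count=10, feature_count=20, class_count=50, total=100)
```

A positive value means the feature speaks for the class, a negative value speaks against it, which matches the signed weights shown later in Table 2.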

Figure 1: Summation Process Architecture

This neural network can be represented by Table 1.

Column headers: 1, j, W. Row labels: Features 1; Number of objects of class.
Table 1: Table representation of the knowledge

In the scheme above, each input is one feature of the object, and the inputs taken together are all of its features.

For every object with a given set of features, the neural network chooses, over all classes, the one with the maximal informational criterion, represented as an activated sum over all features.

Y is the class chosen for the given features.

3.2 Training

The problem can be formulated as follows: train the model on natural text and design the algorithm of lexeme determination, which together will be used to answer natural language questions. First, the algorithm chooses the words to use in the answer; at the next stage, it composes a coherent sentence as the answer.

There are 5 types of models designed for the problem’s solution. The first one determines which part of speech is not present in the question but should be included in the final answer. The second searches for words. Two models are designed to build the sequence of words in the sentence, and the last one gets a token from the word.

The training process for the first model goes in the following way:

{Why is it light during daytime? – Sun shines}. Full answer: {It is light at daytime, because the sun shines}. X is Action (A).

There were 7 types of answers predetermined for the model:

  1. O – object (nouns, pronouns);

  2. OD – object’s description (adjective, numerical, participle, gerund);

  3. S – subject (noun, pronoun);

  4. SD – subject’s description (adjective, numerical, participle, gerund);

  5. A – action (verb);

  6. AD – action’s description (pronoun);

  7. OT – other.

The unknown word represents an action, so X is a verb. The input data is preprocessed so that the interrogative and contextual parts are extracted from it. This is what allows the model to be trained to distinguish the type of word to search for. The data for the first stage of training is shown in Table 2.

Features               Intervals of features   Intervals of classes
                                               Adjective   Noun     Any      Adverb
Question construction  Bias                    0           0        0        0
                       which                   0.534       -0.239   0.649    0.113
                       by whom                 0.040       0.191    0.555    0
                       when                    0           0.092    0        0.643
                       to whom                 -0.186      0.244    0        0
                       who                     0           0.261    0.658    0
                       what                    0.117       0.164    0        0
                       what is                 0.274       -0.032   0        -0.061
                       whose                   0           0.216    0.894    0
Lexical part of POS    Bias                    0           0        0        0
                       any                     0.064       0.101    0.111    -0.361
                       adverb                  0.025       -0.050   -0.108   -0.052
                       adjective               0.084       -0.004   0.026    -0.144

Table 2: A fragment of the knowledge neuromodel for choosing a part of speech
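
The class choice described in Section 3.1 can be illustrated with the weights of the Table 2 fragment. The sketch below assumes a plain sum of the weights of the active features plus the bias, followed by an argmax over classes; the actual activation function is not specified here, and the names `WEIGHTS`, `scores` and `choose_class`, as well as the `pos:` prefix for lexical features, are hypothetical.

```python
CLASSES = ("Adjective", "Noun", "Any", "Adverb")
BIAS = (0.0, 0.0, 0.0, 0.0)

# feature -> weights per class, copied from the Table 2 fragment.
WEIGHTS = {
    "which":         (0.534, -0.239, 0.649, 0.113),
    "by whom":       (0.040, 0.191, 0.555, 0.0),
    "when":          (0.0, 0.092, 0.0, 0.643),
    "to whom":       (-0.186, 0.244, 0.0, 0.0),
    "who":           (0.0, 0.261, 0.658, 0.0),
    "what":          (0.117, 0.164, 0.0, 0.0),
    "what is":       (0.274, -0.032, 0.0, -0.061),
    "whose":         (0.0, 0.216, 0.894, 0.0),
    "pos:any":       (0.064, 0.101, 0.111, -0.361),
    "pos:adverb":    (0.025, -0.050, -0.108, -0.052),
    "pos:adjective": (0.084, -0.004, 0.026, -0.144),
}


def scores(features):
    """Sum the information of all active features for every class."""
    totals = list(BIAS)
    for feature in features:
        for k, weight in enumerate(WEIGHTS.get(feature, (0.0,) * len(CLASSES))):
            totals[k] += weight
    return dict(zip(CLASSES, totals))


def choose_class(features):
    """Return the class with the maximal summed information (argmax)."""
    totals = scores(features)
    return max(totals, key=totals.get)
```

With this fragment, the single feature ”when” selects the class Adverb (0.643 against 0.092 for Noun), and ”to whom” selects Noun.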

The framework will now be demonstrated in Table 3, using the sentence ”The sun shines at morning and men go to work” as an example.

Columns: Model; Training data; Structured data (Features; Classes). Rows, in the order given:

  1. Training data: Q: Why is it light at morning? A: The sun shines. Q: Where do men go? A: Men go to work.
    Features: Question + content tokens: ”Why” + {adverb, noun}; ”Where” + {verb, men}.
    Classes: Unknown POS: Verb; Unknown POS: Noun.

  2. Training data: The sun shines at morning. Men go to work.
    Features: Shines (morning, sun); Go (to, men, work).
    Classes: Content ID.

  3. Training data: Initial sequence of three tokens: {Man, go, to}.
    Features: Tokens and their POS: man_noun, go_verb.
    Classes: Next token.

  4. Training data: Initial sequence of three tokens: {The, sun, shines}.
    Features: Tokens and their POS: sun_noun, shines_verb.
    Classes: Previous token.

  5. Training data: Initial sequence of words’ pairs: Sun shines, men work.
    Features: Token from MLSU_ID and its POS: sun_noun, to_any.
    Classes: Next token: shines, work.

  6. Training data: Initial sequence of words’ pairs: The sun, men go.
    Features: Token from MLSU_ID and its POS: sun_noun, to_any.
    Classes: Previous token: the, men.

Table 3: Data for training
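
The three-token sequences in Table 3 suggest that training samples for the next-token and previous-token models can be cut from a sentence with a sliding window. The sketch below shows one way to do that; the function names, the default window of three tokens and the use of plain tuples are assumptions, and the real training data also carries POS tags that are omitted here.

```python
def next_token_samples(tokens, context_size=3):
    """Pair every window of `context_size` tokens with the token after it."""
    return [
        (tuple(tokens[i:i + context_size]), tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]


def previous_token_samples(tokens, context_size=3):
    """Same construction on the reversed sentence.

    The context is therefore listed right to left, and the target is the
    token that precedes the window in the original order.
    """
    return next_token_samples(tokens[::-1], context_size)


forward = next_token_samples(["man", "go", "to", "work"])
backward = previous_token_samples(["the", "sun", "shines", "at"])
```

Here `forward` holds the single sample that maps (man, go, to) to ”work”, and `backward` maps the reversed window (at, shines, sun) to ”the”.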

The authors will be glad to provide access to the training data corpus on request.

4 Experiment Results

Sberbank’s strategy text was used for model training. The text is in Russian and consists of 656 sentences, with a total volume of 8200 words, 2048 tokens and 514 verbs. The Brain2 framework was used to design the network; the total number of connections is over 65 mln. The model parameters are shown in Table 4.
The original document is available on Sberbank’s website.


The algorithm’s execution is demonstrated in a web interface, accessible at the URL:

Figure 2: Framework Demo
Model                    Size    Number of   Number of   Number of     F1-measure   Precision /
                                 classes     features    connections                Recall
1. Question Processing   10 Kb   12          133         1 596         0.24         0.34 / 0.1
2. Word Search           1 Mb    636         1 577       1 002 972     0.56         0.62 / 0.5
3. Text Composing        23 Mb   1 488       8 031       11 950 128    0.750        0.87 / 0.6
4. Text Composing        23 Mb   1 464       8 069       11 813 016    0.771        0.91 / 0.6
5. Next word             77 Mb   2 656       15 118      40 153 408    0.98         0.99 / 0.9
Table 4: Model’s parameters
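
In Table 4 the number of connections of each model equals its number of classes multiplied by its number of features, which is what one would expect if every feature is connected to every class. The check below only re-does this arithmetic on the table values; the names `ROWS`, `fully_connected`, `mismatches` and `total_connections` are illustrative.

```python
# (model, classes, features, connections) copied from Table 4.
ROWS = [
    ("1. Question Processing", 12, 133, 1_596),
    ("2. Word Search", 636, 1_577, 1_002_972),
    ("3. Text Composing", 1_488, 8_031, 11_950_128),
    ("4. Text Composing", 1_464, 8_069, 11_813_016),
    ("5. Next word", 2_656, 15_118, 40_153_408),
]


def fully_connected(classes, features):
    """Connections when every feature is linked to every class."""
    return classes * features


# Rows whose reported connection count differs from classes * features.
mismatches = [name for name, c, f, n in ROWS if fully_connected(c, f) != n]

# Sum over the five listed models.
total_connections = sum(n for _, _, _, n in ROWS)
```

All five rows agree with the product, and the five listed models together account for 64 921 120 connections, in line with the figure of roughly 65 mln quoted in the text.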

4.1 Tests

Three groups of questions were designed to train on: 1) Questions based on the content of the text; the correctness of answers was checked against them. 2) Questions on irrelevant topics. 3) Meaningless questions (for tuning the Type II error). In addition, more than 6000 questions were generated automatically and divided into the same 3 groups.

The interface was designed to illustrate the process the system goes through.

Every obtained answer was compared with the original one: 1 point was assigned in case of a match, half a point if the answer was classified as an alternative, and 0 otherwise. The confidence was calculated as a fraction of the mean information on a feature. The integral estimate is the dot product of the points and their confidences. The consecutive training approach gives better results (Tables 5 and 6).
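
The scoring rule can be written down in a few lines. In this sketch the verdict labels, the function name `integral_estimate` and the confidence values are made up for illustration; only the point values (1, 0.5, 0) and the dot product come from the description above.

```python
# Points per verdict, as described in the text.
POINTS = {"match": 1.0, "alternative": 0.5, "miss": 0.0}


def integral_estimate(verdicts, confidences):
    """Dot product of the points per answer and the model's confidences."""
    if len(verdicts) != len(confidences):
        raise ValueError("one confidence is needed per answer")
    return sum(POINTS[v] * c for v, c in zip(verdicts, confidences))


# Three hypothetical answers with hypothetical confidences.
verdicts = ["match", "alternative", "miss"]
confidences = [0.8, 0.6, 0.9]
estimate = integral_estimate(verdicts, confidences)   # 0.8 + 0.3 + 0.0
plain_score = sum(POINTS[v] for v in verdicts)        # 1.5 points without weighting
```

Because each point is scaled by a confidence of at most 1, the integral estimate can never exceed the plain count of points, which is consistent with the two ”Correct answers” rows of Table 5.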

Parameter                             Parallel   Consecutive
Questions asked                       30         30
Correct answers                       18         24
Correct answers (integral estimate)   8.9        19.2
Type I Error                          48%        23%
Type II Error
Type I Error (integral measure)       81%        41%
Type II Error (integral measure)
Table 5: Expert questions-based testing results

Parameter         Parallel   Consecutive
Questions asked   6000       6000
F-measure         0.83681    0.88736
Precision         0.85206    0.8624
Recall            0.82208    0.91382
Table 6: Technical questions-based testing results
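
As a quick consistency check, the F-measure values in Table 6 agree with the harmonic mean of the reported precision and recall. The snippet assumes the balanced F1 definition; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from Table 6.
parallel = f_measure(0.85206, 0.82208)
consecutive = f_measure(0.8624, 0.91382)
```

Both results match the table to within rounding (about 0.8368 and 0.8874).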

5 Conclusion

The results of the experiments confirmed the validity of the two-words hypothesis. The presented natural language processing model is able to answer questions with a recall of 0.822-0.914 and a precision of 0.852-0.862 (Table 6). The model is not yet able to distinguish synonyms, pronouns and allegories. Nevertheless, given the current restrictions, it shows quite promising results. In the future, it is planned to develop the algorithm to recognize synonyms, pronouns etc., and also to make it available in English.


  • [1] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv, 2016.
  • [2] Shuohang Wang and Jing Jiang. Machine Comprehension Using Match-LSTM and Answer Pointer. arXiv, 2016.
  • [3] E.V. Lutsenko. Conceptual principles of the system (emergent) information theory and its application for the cognitive modelling of the active objects (entities). Computer society, 2002.
  • [4] A. Artemov, E. Lutsenko, E. Ayunts and I. Bolokhov. Informational Neurobayesian Approach to Neural Networks Training. Opportunities and Prospects. arXiv, 2017.