Pre-trained Language Models as Symbolic Reasoners over Knowledge?

Nora Kassner et al., Universität München, 06/18/2020

How can pre-trained language models (PLMs) learn factual knowledge from the training set? We investigate the two most important mechanisms: reasoning and memorization. Prior work has attempted to quantify the number of facts PLMs learn, but we present, using synthetic data, the first study that establishes a causal relation between facts present in training and facts learned by the PLM. For reasoning, we show that PLMs learn to apply some symbolic reasoning rules; but in particular, they struggle with two-hop reasoning. For memorization, we identify schema conformity (facts systematically supported by other facts) and frequency as key factors for its success.


1 Introduction

Rule | Definition | Example
Reflexivity | r(e1, e1) | Is(dog, dog)
Symmetry | r(e1, e2) ⇒ r(e2, e1) | Married(B. Obama, M. Obama) ⇒ Married(M. Obama, B. Obama)
Inversion | r1(e1, e2) ⇒ r2(e2, e1) | ContainedIn(lactose, milk) ⇒ Contains(milk, lactose)
Composition | r1(e1, e2) ∧ r2(e2, e3) ⇒ r3(e1, e3) | Faster(leopard, sheep) ∧ Faster(sheep, snail) ⇒ Faster(leopard, snail)
Equivalence | r1(e, a1) ⇔ r2(e, a2) | DivisibleBy(number, 2) ⇔ Is(number, even)
Negation | r(e, a) ⇒ ¬r(e, ā) | Is(Jupiter, big) ⇒ IsNot(Jupiter, small)
Implication | r1(e, a) ⇒ r2(e, a1), r2(e, a2), … | Is(dog, Mammal) ⇒ Has(dog, hair), Has(dog, neocortex), etc.
Table 1: Symbolic reasoning: relational rules (top) and logical rules (bottom) for knowledge triples, with an example in natural language. e, e1, e2, e3 are entities, r, r1, r2, r3 relations and a, a1, a2 attributes; ā denotes the antonym of a.

Pre-trained language models (PLMs) like BERT Devlin et al. (2019), GPT-2 Radford et al. (2019) and RoBERTa Liu et al. (2019) have emerged as universal tools that capture a diverse range of linguistic and – as more and more evidence suggests – factual knowledge Petroni et al. (2019); Radford et al. (2019).

Recent work on knowledge captured by PLMs is focused on probing, a methodology that identifies the set of facts a PLM has command of. But little is understood about how this knowledge is acquired during pre-training and why. We analyze the ability of PLMs to acquire factual knowledge focusing on two mechanisms: reasoning and memorization. We pose the following two questions:

a) Symbolic reasoning: Are PLMs able to infer knowledge not seen explicitly during pre-training? b) Memorization: Which factors result in successful memorization of a fact by PLMs?

We conduct our study by pre-training BERT from scratch on synthetic corpora. The corpora are composed of short knowledge-graph like facts: subject-relation-object triples. To test whether BERT has learned a fact, we mask the object, thereby generating a cloze-style query and then measure prediction accuracy.

Symbolic reasoning. We create synthetic corpora to investigate four relational rules (reflexivity, symmetry, inversion, composition) and three logical rules (equivalence, negation, implication); see Table 1. For each rule, we create a corpus that contains facts from which the rule can be learned. We test BERT’s ability to use the rule to infer unseen facts by holding out some facts in a test set. For example, for composition, BERT should infer, by having seen that leopards are faster than sheep and sheep are faster than snails, that leopards are faster than snails. This type of inference is hard because we do not provide the necessary information (“the premise”) at inference time and, during training, it is scattered over the training corpus and interleaved with all other facts.

This setup is similar to link prediction in the knowledge base domain and therefore can be seen as a natural extension of the question “Language models as knowledge bases?” Petroni et al. (2019). In the knowledge base domain, prior work Sun et al. (2019); Zhang et al. (2020) has shown that models that are able to learn relational rules are superior to ones that are not.

Talmor et al. (2019) also investigate symbolic reasoning in BERT and use cloze-style queries. However, in their setup, there are two possible reasons for BERT having answered a cloze-style query correctly: the corresponding fact (i) was correctly inferred or (ii) was seen during training. In contrast, since we pre-train BERT from scratch, we have full control over the training setup and can distinguish cases (i) and (ii).

We find that BERT does well on learning one-hop rules (e.g., symmetry), but struggles with two-hop rules (e.g., composition). However, by providing richer semantic context, even two-hop rules can be learned.

Given that BERT can in principle learn reasoning rules, the question arises whether it does so for standard training corpora. We find that standard BERT has only partially learned the types of rules we investigate here. For example, BERT has learned that “X is the opposite of Y” is symmetric, but it fails to understand rules like symmetry in many other cases.

Memorization. During the course of pre-training, BERT sees more data than any human could read in a lifetime, an amount of knowledge that surpasses its storage capacity. We simulate this with a scaled-down version of BERT and a corresponding training set that ensures that BERT cannot memorize all facts in the training set. We identify two important factors that lead to successful memorization of facts. (i) Frequency. Other things being equal, low-frequency facts (e.g., singletons) are not learned whereas frequent facts are. (ii) Schema conformity. Facts that conform with the overall schema of their entities (e.g., “sparrows can fly” in a corpus with many similar facts about birds) are easier to memorize than exceptions (e.g., “penguins can dive”).

2 Data

Natural corpora are inadequate for thoroughly testing PLMs' reasoning capabilities, since it is impossible to control what the model might have seen during training (e.g., on Wikipedia). Synthetic corpora provide an effective way to investigate reasoning by giving full control over what knowledge is seen and what kind of underlying rules the corpora follow.

In our investigation of PLMs as knowledge bases, it is natural to use subject-relation-object triples as basic units of knowledge, which we refer to as facts. The underlying vocabulary is composed of a set of entities E, relations R and attributes A. Two types of triples are generated. (i) Attribute facts: relations linking entities to attributes, e.g., Is(leopard, fast). (ii) Entity facts: relations linking entities to entities, e.g., Capital(Paris, France).

In the test set, we mask the objects and generate cloze-style queries of the format “subject relation [MASK]”. We evaluate performance with prediction accuracy.
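To illustrate this query format, the following minimal Python sketch turns a fact triple into a training statement and a test query (helper names are illustrative, not part of our released code):

def fact_to_text(subj, rel, obj):
    # A fact triple rendered as a short textual statement, e.g. "leopard FasterThan sheep".
    return f"{subj} {rel} {obj}"

def fact_to_cloze(subj, rel):
    # The object slot is replaced by the [MASK] token to form the test query.
    return f"{subj} {rel} [MASK]"

print(fact_to_text("leopard", "FasterThan", "sheep"))   # training fact
print(fact_to_cloze("leopard", "FasterThan"))           # test query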

2.1 Symbolic Reasoning

The vocabulary includes 5000 entities, 200 relations and 500 attributes.

Table 1 gives definitions and examples for the rules in question. These definitions serve as templates to generate facts. The templates involve entity, relation and attribute slots, which are filled by sampling from the underlying vocabulary. A rule can link multiple facts; we call these linked facts one instance of the rule. We construct a separate corpus for every rule by filling these templates. Filling the respective templates follows a two-step process. The first step fills the template slots defining the rule. The second generates 800 instances of that rule. These two consecutive steps are repeated 50 times, creating 50 × 800 rule instances in total.

The relational rules we test are reflexivity, symmetry, inversion and composition, which involve entity facts. First, we sample relations to fill the relation slots defining the rule. Second, we sample entities to fill the entity slots, thereby generating multiple instances of the rule.

A reflexive rule instance is defined by one fact and one relation r. We sample r and an entity e and generate r(e, e).

A symmetric rule instance is defined by one relation r but two facts. We sample pairs of entities (e1, e2) and generate r(e1, e2) and r(e2, e1).

An inverse rule instance is defined by two facts and two relations r1, r2. We sample pairs of entities (e1, e2) and generate r1(e1, e2) and r2(e2, e1).

A composition rule instance is defined by three facts and three relations r1, r2 and r3. We sample triples of entities (e1, e2, e3) and generate r1(e1, e2), r2(e2, e3) and r3(e1, e3).
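To make the two-step sampling concrete, here is a minimal Python sketch of the relational-rule corpora (the helper names and the independence of the 50 repetitions are our assumptions, not released code):

import random

ENTITIES = [f"e{i}" for i in range(5000)]
RELATIONS = [f"r{i}" for i in range(200)]

def symmetry_instances(n=800):
    r = random.choice(RELATIONS)                # step 1: fix the relation defining the rule
    facts = []
    for _ in range(n):                          # step 2: generate n instances of the rule
        e1, e2 = random.sample(ENTITIES, 2)
        facts += [(e1, r, e2), (e2, r, e1)]     # both directions together form one instance
    return facts

def inversion_instances(n=800):
    r1, r2 = random.sample(RELATIONS, 2)
    facts = []
    for _ in range(n):
        e1, e2 = random.sample(ENTITIES, 2)
        facts += [(e1, r1, e2), (e2, r2, e1)]
    return facts

def composition_instances(n=800):
    r1, r2, r3 = random.sample(RELATIONS, 3)
    facts = []
    for _ in range(n):
        e1, e2, e3 = random.sample(ENTITIES, 3)
        facts += [(e1, r1, e2), (e2, r2, e3), (e1, r3, e3)]
    return facts

# Both steps are repeated 50 times per rule, giving 50 * 800 rule instances per corpus.
corpus = [fact for _ in range(50) for fact in composition_instances()]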

The logical rules we test are equivalence, implication and negation, which fall into the attribute fact category. Filling the respective templates now also involves filling attribute slots.

Equivalence is defined by two facts, two relations and two attributes. We sample pairs of relations (r1, r2) and attributes (a1, a2) to define the rule. For each pair we sample an entity e and generate r1(e, a1) and r2(e, a2), thereby creating multiple instances of the rule.

For implication, we link a single cause fact to 5 entailed effects. Therefore, we first sample a cause relation and an entailed relation as well as a cause attribute and 5 entailed attributes. Second, we sample entities to generate the instances of the rule: for each entity, the cause fact and its 5 entailed effect facts.

To model negation, the “not” token is added to the vocabulary. The set of attributes is split into two halves, and attributes are paired across the halves as antonyms (ā denotes the antonym of a). We sample a relation r, an entity e and an attribute a and generate r(e, a) and not r(e, ā).
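A hedged sketch of how a negation instance could be generated, assuming antonyms are formed by pairing the two halves of the attribute set and the negated fact simply prepends the added “not” token to the relation:

import random

ATTRIBUTES = [f"a{i}" for i in range(500)]
half = len(ATTRIBUTES) // 2
# Pair the i-th attribute of the first half with the i-th attribute of the second half.
ANTONYM = {ATTRIBUTES[i]: ATTRIBUTES[i + half] for i in range(half)}
ANTONYM.update({v: k for k, v in ANTONYM.items()})

def negation_instance(entities, relations):
    r = random.choice(relations)
    e = random.choice(entities)
    a = random.choice(list(ANTONYM))
    positive = (e, r, a)                    # e.g. Is(Jupiter, big)
    negated = (e, f"not {r}", ANTONYM[a])   # negated counterpart with the "not" token, cf. IsNot(Jupiter, small)
    return [positive, negated]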

Figure 1 depicts the training and test split. For each rule we generate a separate training corpus. 90% of the instances of the rule are seen completely during training, e.g., both MarriedTo(Michelle Obama, Barack Obama) and MarriedTo(Barack Obama, Michelle Obama). From those instances the PLM needs to learn the rule that MarriedTo is symmetric. For the remaining 10%, only one direction is seen during training, e.g., only MarriedTo(Pierre Curie, Marie Curie). At test time we query the other direction, e.g., “Marie Curie is married to [MASK]”.

As multiple true objects are possible for the same subject-relation pair, we count the number of correct answers in the top-m ranked predictions and normalize by m, where m is the number of true objects.
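In code, the metric amounts to the following (a minimal sketch; how the ranking is obtained from the masked-LM head is omitted):

def top_m_accuracy(predicted_ranking, true_objects):
    # predicted_ranking: vocabulary tokens sorted by the model's score for the [MASK] position.
    # true_objects: all objects that are correct for the queried subject-relation pair.
    m = len(true_objects)
    hits = sum(1 for token in predicted_ranking[:m] if token in true_objects)
    return hits / m   # correct answers among the top-m predictions, normalized by m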

In addition to the facts following the rule, we add equally many facts that form a control group and do not follow the rule. The control group consists of relations generating random facts. We add an additional pool of 200 relations to the vocabulary. We sample from this pool of relations and from the original set of entities and attributes to generate random triples. The rationale is that a truly learned rule needs to be distinguishable from facts that do not follow it.

In contrast to Clark et al. (2020), our setup lets the model see only a single subject-relation-object triple per datapoint. This means that the rule cannot be inferred from a single training point; multi-hop reasoning is required.

We test whether i) BERT memorizes the facts seen during training, both those following and those not following the rule, and ii) whether it is able to generalize to the test set for the relations following the rule.

2.2 Memorization

First, we identify the number of facts needed to max out scaled-down BERT’s memorization capacity. The vocabulary includes 125,000 entities, 22 relations and 2250 attributes.

For the frequency baseline we generate 800,000 random but unique triples. In the training corpus, these triples occur between 1 and 10 times. We test on the same facts and report prediction accuracy over the number of training occurrences.
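As a sketch, such a corpus could be assembled by assigning each unique triple a repetition count before shuffling (the uniform choice of counts is an assumption):

import random

def build_frequency_corpus(unique_triples, min_count=1, max_count=10):
    corpus, count_of = [], {}
    for triple in unique_triples:
        c = random.randint(min_count, max_count)
        count_of[triple] = c        # kept so accuracy can later be binned by frequency
        corpus += [triple] * c
    random.shuffle(corpus)
    return corpus, count_of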

For the schema-conformant facts vs. exceptions experiments, we split the entities into 250 disjoint groups. For each group member we generate a set of facts:

Figure 1: Exemplary depiction of the training and test split for symbolic reasoning by means of the inversion (a), composition (b) and implication (c) rule. Each green or red box displays a single fact. Facts in one line together capture one instance of the rule. This requires 2 facts for inversion and 3 facts for composition. To learn the rule BERT sees 90% of the instances completely. For the remaining 10%, only incomplete instances of a rule are seen during training. The rest is put in a test set.
  • one attribute relation that defines group membership, e.g., IsMember(robin, bird). We only put a fraction of 0.3 of all group membership facts into the training set.

  • one entity relation that links group members, e.g., IsLinkedTo(robin, falcon). We only put a fraction of 0.0005 of all group member facts into the training corpus.

  • a set of 10 attribute relations defining common group attributes, e.g., Can(robin, fly), Has(robin, feathers), ReproductionVia(robin, eggs), etc. We only put a fraction of 0.3 of all common group facts into the training set.

  • a set of 10 attribute relations defining unique attributes per entity, e.g., Has(robin, redbreast), Is(robin, sedentary), OccursIn(robin, Eurasia), etc.

Additionally, we add a set of 300 exceptions per group. Random entities are chosen from the group and facts contradicting the group attributes are generated, e.g., Can(penguin, dive); a generation sketch follows below.
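Putting the four fact types and the exceptions together, one group's facts could be generated roughly as follows (a sketch; relation and attribute names are placeholders, and the subsampling into the training set is not shown):

import random

def group_facts(group_name, members, common_attrs, unique_attrs, n_exceptions=300):
    # Illustrative fact generation for one entity group (e.g. birds).
    facts = []
    for e in members:
        facts.append((e, "IsMember", group_name))                # group membership
        facts.append((e, "IsLinkedTo", random.choice(members)))  # link to another group member
        for rel, attr in common_attrs:                           # 10 attributes shared by the group
            facts.append((e, rel, attr))
        for rel, attr in unique_attrs[e]:                        # 10 attributes unique to this entity
            facts.append((e, rel, attr))
    for _ in range(n_exceptions):                                # facts contradicting the group schema
        e = random.choice(members)
        facts.append((e, "Can", "dive"))                         # placeholder exception, e.g. Can(penguin, dive)
    return facts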

We test on the same facts and report prediction accuracy for the different types of facts.

In a third experiment we combine the frequency baseline with schema conformity. We use the same setup as for schema conformity, but we repeat the exceptions 10 times whereas the facts that conform with the schema are only seen once during training.

Figure 2: Corpus generation for the composition rule: The entity sets A, B and C are linked via the relational rule. Every member of A is linked via r1 to every member of B, and every member of B is linked via r2 to every member of C. Therefore, following the composition rule, every member of A is linked via r3 to every member of C. For the test set, we hold out all facts linking A to C via r3 in 10% of the cases.

3 BERT Model

BERT uses a deep bidirectional Transformer Vaswani et al. (2017) encoder to perform masked language modeling. During pre-training, BERT randomly masks positions and learns to predict the masked tokens. It is then fine-tuned on downstream NLP tasks. We use the source code provided by Wolf et al. (2019) (github.com/huggingface/transformers). Like RoBERTa Liu et al. (2019), we perform dynamic masking and no next-sentence prediction.

For symbolic rules we use BERT-base as is. Only the vocabulary file is adapted to the synthetic corpus. For the memorization experiments we need to max out BERT’s memorization capacity. Therefore, we scale it down to a single hidden layer with 3 attention heads, a hidden size of 192 and an intermediate size of 768.
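With the Hugging Face code referenced above, the scaled-down model can be instantiated roughly as follows (the vocabulary size is our assumption; the other sizes are those stated in the text, and everything else keeps library defaults):

from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=130000,        # assumption: roughly the synthetic vocabulary plus special tokens
    hidden_size=192,
    num_hidden_layers=1,      # a single Transformer layer
    num_attention_heads=3,
    intermediate_size=768,
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")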

Rule Training Test
reflexivity + +
symmetry + +
inversion + +
composition + -
transitive + -
enhanced composition + +
Table 2: Relational rule generalization: + indicates that the rule is generalized, - that it is not. Top: The reflexivity, symmetry and inversion rules are generalized to the test set; composition is not learned. Bottom: The transitive rule, a simplification of composition with r1 = r2 = r3, is also not learned. A semantically enhanced version of composition is learned when facts stating group membership are added to the training corpus.
Rule Training Test
implication + +
equivalence + +
negation + -
negation simplified + +
Table 3: Logical rule generalization: + indicates that the rule is generalized, - that it is not. Top: Implication and equivalence are generalized; negation is not learned. Bottom: A simplified version of negation, where each entity is only linked to one pair of opposites, is learned.

4 Results

4.1 Symbolic Reasoning

4.1.1 Relational Rules

Results are reported in Table 2. Accuracy on the training set is 1.0, both for relations following a relational rule and for the relations randomly linking entities.

The reflexivity rule is a baseline which only requires one fact. Therefore the model generalizes this rule to the test set within a few training epochs.

BERT is also able to perform one-hop reasoning: The symmetry and inversion rules are fully generalized to the test set. In the case of symmetry, we also analyze what BERT predicts when flipping subject and object for the random facts and masking the object. We see that BERT over-generalizes symmetry to these as well.

We try to prevent this by adding a set of relations explicitly going against symmetry (anti-rule), which take the form r(e1, e2), r(e2, e3) with e3 ≠ e1. Still, the preference for symmetry dominates for facts in the test set. Even some of the anti-rule facts in the training set remain unlearned. We observe that BERT has a general tendency to predict symmetrically even on datasets that are completely random.

relation | rule | consistent completions | inconsistent completions | example
“shares borders with” | symmetry | 152 (152) | 2 | consistent: Ecuador ↔ Peru; inconsistent: Togo → Ghana, Ghana → Nigeria
“is the opposite of” | symmetry | 179 (170) | 71 | consistent: demand ↔ supply; inconsistent: injustice → justice, justice → truth
“is the capital of” / “’s capital is” | inversion | 59 (59) | 1 | consistent: Indonesia ↔ Jakarta; inconsistent: Canada → Ottawa, Ottawa → Ontario
“larger/smaller” (countries) | inversion | 54 (23) | 99 | consistent: Russia larger Canada ↔ Canada smaller Russia; inconsistent: Brazil smaller Russia, but Russia smaller Brazil
“larger/smaller” (planets) | inversion | 9 (9) | 36 | consistent: Jupiter larger Mercury ↔ Mercury smaller Jupiter; inconsistent: Sun bigger Earth, but Earth bigger Sun
Table 4: Testing pre-trained BERT’s ability to systematically represent relational rules found in natural language: We report the more consistent result from probing both BERT-large-cased and RoBERTa-large. (i) First part of the table: For a given set of entities (e.g. countries) we mask the object and then swap the predicted object token with the subject. We then count all entities with consistent and inconsistent predictions. (ii) Second part of the table (similar setup to Talmor et al. (2019)): Entities are ordered by an attribute, e.g. countries by size. We query “Country A is [MASK] than Country B” and, reversed, “Country B is [MASK] than Country A”. If “larger” is ranked higher than “smaller” for one direction and the opposite holds for the reversed direction, we count this as consistent. For consistent predictions we indicate in brackets how many of them were factually correct. The last column shows examples of consistent and inconsistent completions.

BERT has difficulties with two-hop reasoning: The composition rule remains unlearned in the standard setup. To test whether this is due to over-parametrization, we scale BERT down to 6 layers, but composition still remains unlearned.

We investigate this further in three experimental steps. i) We simplify composition to transitivity with r1 = r2 = r3. Still the rule remains unlearned. ii) We enhance our corpus with more semantics: In the standard setup we sampled three relations and three entities to form a complete instance of the composition rule. In the enhanced setting we again sample three relations, but instead of single entities we now sample three distinct groups of entities A, B and C of size 10. The template slots are then filled for each group member. The full instance of the rule is now comprised of 10 * 10 * 3 facts. This is illustrated in Figure 2. For example, the entity groups could be car entities, bike entities and human entities and the relation FasterThan (with r1 = r2 = r3). We then produce facts such that the following holds: every car is FasterThan every bike, every bike is FasterThan every human, and every car is FasterThan every human. For the purpose of testing, we hold out all facts involving the third relation.

But to fully learn composition, we have to add more semantics: iii) We additionally enhance grouped composition by introducing facts stating group membership. Therefore, we introduce an additional relation “MemberOf” and add group names to the vocabulary, e.g., for each member e of group A we state MemberOf(e, A). With this semantic help, the composition rule is generalized to the test set.
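A sketch of the enhanced corpus for one grouped rule instance (group and relation names are placeholders; how the held-out r3 facts are selected is not shown):

def grouped_composition(groups, r1, r2, r3, with_membership=True):
    # groups: dict mapping a group name to its 10 member entities, e.g. {"A": [...], "B": [...], "C": [...]}.
    A, B, C = groups["A"], groups["B"], groups["C"]
    facts = []
    if with_membership:                                      # the additional semantic help of step iii)
        for name, members in groups.items():
            facts += [(e, "MemberOf", name) for e in members]
    facts += [(a, r1, b) for a in A for b in B]              # every member of A is linked via r1 to every member of B
    facts += [(b, r2, c) for b in B for c in C]              # every member of B is linked via r2 to every member of C
    facts += [(a, r3, c) for a in A for c in C]              # held out entirely for 10% of rule instances
    return facts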

4.1.2 Logical Rules

Results are reported in Table 3. Accuracy on the training set for logical rules and the random facts is 1.0.

Implication and equivalence are learned. Implication shows signs of over-fitting: an entity that is part of the test set has been seen during training with relations and attributes other than the ones in question. When over-fitting, the model falls back to the seen ones instead of generalizing the implication for the relation in question.

Negation remains unlearned. By tweaking design parameters such as the total number of attributes, we see an indication of generalization: with 200 attributes an accuracy of 0.8 is achieved. In a very simplified setting where an entity is only linked to a single attribute and its negated antonym, which is very similar to equivalence, the model can generalize easily.

Why is negation more challenging than implication? Implication allows the model to generalize over several entities that all follow the same rule. On the other hand, for a model to actually learn negation and not implication, it cannot be allowed to generalize by seeing many examples of entities with the same set of attributes and their negated antonyms. Instead, it must learn antonym attributes from their usage in various contexts, which seems to be more difficult.

4.2 Pre-trained BERT

We test BERT pre-trained on natural language for a consistent representation of the learned relational rules symmetry and inversion, e.g., whether BERT consistently predicts “X is the opposite of Y” and “Y is the opposite of X”, even if the fact itself might be incorrect. If BERT’s predictions are asymmetric, we assume it has not learned to generalize. However, if predictions are symmetric, we can only assume it has generalized if the predictions are factually false or at least many answers are plausible in both directions. This is because if the predictions are symmetric and also factually correct, the model could simply have seen both directions in training.

As many relational facts are not captured by BERT or answers span over multiple tokens, testing BERT’s consistency quantitatively is not easy.

We probe BERT-large-cased and RoBERTa-large for the symmetry relations “X is the opposite of [MASK]” and “X shares borders with [MASK]”, and for the inverse relations “X is the capital of [MASK]” and “X’s capital is [MASK]”. For countries and planets we also test the inverse relation pair larger/smaller by masking the comparative: “X is [MASK] than Y”, similar to Talmor et al. (2019). Our probing setup and results are explained in Table 4.
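The swap-based consistency check can be approximated with the standard fill-mask interface (a sketch; entity lists, multi-token answers and any filtering are not handled):

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-large-cased")

def check_symmetry(subject, template="{} is the opposite of {}."):
    # Predict the object, swap it into the subject slot, and check
    # whether the original subject is predicted back.
    mask = fill.tokenizer.mask_token
    obj = fill(template.format(subject, mask))[0]["token_str"].strip()
    back = fill(template.format(obj, mask))[0]["token_str"].strip()
    return back == subject, obj

print(check_symmetry("demand"))   # ideally (True, "supply")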

Some probes in Table 4 show consistent predictions for the masked-object queries. But many of them are facts likely seen in training or due to more trivial co-occurrences. The larger/smaller probes are highly inconsistent. The opposites probe exhibits several cases that indicate a lack of generalization, but also some where the model intuitively seems to have some notion of symmetry. To see whether the model has understood antonymy, we probe with the semantically opposite relation “X is the same as [MASK]”, but keep the antonym pairs for the subject and object slots. This causes the number of consistent predictions to drop from 94 to 61, e.g., the model generates “high is the same as low” and “low is the same as high”. These results show that the model has not properly learned a symmetric representation of “is the opposite of” but instead might just assign a high probability to an antonym in many scenarios.

Figure 3: Memorization experiments (panels A: Frequency, B: Exceptions, C: Exceptions & Frequency): A scaled-down version of BERT is trained from scratch to memorize more facts than the parameters of the model are able to store. A) The training corpus consists of randomly generated unique facts that occur between 1 and 100 times. We report prediction accuracies per count. BERT remembers the frequent facts; rare facts are forgotten. B) The training corpus follows a synthetic semantic schema consisting of different types of facts. We report the respective learning curves. BERT remembers facts that follow that schema; exceptions are forgotten. C) The training corpus combines experiments A) and B). During training, facts following the semantic schema are seen once whereas exceptions occur 10 times. We again report the respective learning curves. BERT instantly remembers the exceptions and is not able to store all facts following the schema.

4.3 Memorization

Experimental results for the memorization experiments are shown in Figure 3.

In A) we show results for the frequency baseline. We report prediction accuracies over the number of times a fact occurred during training. We see that frequent facts are preferred over rare facts. Facts that occurred 100 times per training epoch were stored while facts occurring only once were fully forgotten.

In B) we show results for the semantic schema conformity experiment. For each type of fact the respective learning curves are displayed. In the beginning of training, BERT focuses on attributes common to individual groups as well as exceptions going against those group attributes. In the course of training, BERT picks up on group membership and unique facts distinguishing entities. Prediction accuracies for the exceptions plateau when BERT’s storage capacity is maxed out. All other facts that conform with the overall schema reach a prediction accuracy of 1.0.

In C) we show results for the combination of the frequency baseline with the semantic schema. We again report learning curves for the respective types of facts. Right after the first training epoch, BERT’s prediction accuracy for the frequent exceptions jumps up and plateaus at 1.0 early in training. The other learning curves pick up slowly. Again, the group attributes are stored earlier than the other facts that conform with the schema. Towards the end of training, the preference for frequent facts remains unchanged even though they contradict the overall schema. This time, facts that conform with the overall schema are forgotten and exceptions are stored.

5 Discussion

Symbolic Reasoning: We find that BERT is able to perform one-hop reasoning over datapoints seen during training. The relational rules symmetry and inversion are generalized. Two-hop reasoning seems more difficult, but with enough semantic enhancement it too is generalized.

The logical rules equivalence and implication are generalized. Negation is difficult to generalize. Experiments that, e.g., reduce the total number of attributes show a trend towards generalization. This is in agreement with findings by Kassner and Schütze (2020), where negation was not generalized during pre-training but fine-tuning on binary sequence classification enabled BERT to learn negation.

We see a tendency of BERT to fall back to the most similar datapoints seen during training. We find that BERT has a natural tendency towards symmetry, and we observed over-fitting in the case of implication. In that sense, the language modeling objective seems problematic for generalization to unseen data.

Symbolic rules present in natural language diverge from the synthetic setting, posing a more difficult scenario. In natural language, rules are not presented in a consistent fashion, can be more complex and are interleaved with many other datapoints not following any rules. This scope of reasoning might well be beyond PLMs.

Exemplary tests of symmetry and inversion using pre-trained BERT suggest an inconsistent representation of relational rules. Talmor et al. (2019) support this finding. Indications of symbolic rules captured by pre-trained BERT are likely due to similar facts seen during training. Comparisons with the OlderThan relation only worked for age spans in the range of human lifetimes.

Memorization: We see that fact frequency is the dominant factor influencing whether a fact is stored or forgotten. Considering real-world factual knowledge, this seems desirable, as many world concepts and categories go along with exceptions. Exceptions are not factually false, e.g., a penguin can dive but not fly. BERT has an incentive to store exceptions if they are seen frequently enough during training. For rare facts not deducible via a semantic schema, this incentive is missing.

6 Related Work

Radford et al. (2019) show PLMs’ ability to capture knowledge in a zero-shot question answering setting. Similarly, Petroni et al. (2019) investigate PLMs’ ability to capture knowledge-graph-like facts. Our work builds on this and analyzes PLMs’ ability to capture knowledge not explicitly seen during training.

Sun et al. (2019); Zhang et al. (2020) show in the knowledge graph domain that models that capture relational rules like symmetry, inversion and composition outperform those that do not. We investigate whether BERT is also able to capture these rules and thereby generalize knowledge in the manner of link prediction in the knowledge graph setting.

Talmor et al. (2019) test PLMs’ symbolic reasoning capabilities by probing pre-trained and fine-tuned models with cloze-style queries. In contrast, we investigate whether BERT is able to learn symbolic rules from relational and logical rule instances seen during pre-training. Additionally, we extend their probing setup and test pre-trained BERT for a consistent representation of relational rules present in natural language.

Clark et al. (2020) fine-tune BERT for symbolic reasoning. In contrast to our work, they fine-tune pre-trained BERT by explicitly stating the rule and providing all necessary information in a single training point. We analyze BERT’s pre-training objective, require multi-hop rule learning and do not state the rules explicitly.

There is a range of benchmarks for complex reasoning QA Yang et al. (2018); Sinha et al. (2019). In a transfer setting, PLMs are fine-tuned on the downstream task. PLMs are employed as a black box, without a clear understanding of whether reasoning capabilities are captured by the PLM or by the task-specific component.

Another line of work Gururangan et al. (2018); Kaushik and Lipton (2018); Dua et al. (2019); McCoy et al. (2019) shows that much of PLMs’ performance on reasoning tasks is due to statistical artifacts in datasets rather than true reasoning and generalization capabilities. With the help of synthetic corpora, we have full control over the training data.

Richardson et al. (2020) introduce a collection of synthetic corpora testing logic and monotonicity reasoning. They show that BERT performs poorly on these new datasets but can be quickly fine-tuned to good performance.

Roberts et al. (2020) show that the amount of knowledge captured by a PLM increases with model size. Our memorization experiments investigate which factors influence what knowledge is stored and what is forgotten.

Hupkes et al. (2020) study neural models’ ability to capture the compositionality of language. We study different rules enabling knowledge acquisition in PLMs. Still, they support our finding that Transformers have the ability to capture rules and exceptions at the same time. They neither study different rules nor consider fact frequency or what happens when model capacity is exhausted.

Guu et al. (2020) modify the PLM objective to incentivize knowledge capture. This does not go beyond the scope of memorization.

7 Conclusion

This work is a first study towards understanding BERT’s ability to capture knowledge seen during pre-training by investigating its reasoning and memorization capabilities. We identified factors that influence which knowledge is stored, which is forgotten, and what is learnable beyond the knowledge explicitly seen during training. We saw that, in principle, BERT is able to infer facts not explicitly seen during training via symbolic rules. Future work should investigate how to enable BERT to use this capability during pre-training. We see the need to incentivize PLMs to capture symbolic rules and factual knowledge, as this could also improve PLMs’ performance on downstream tasks where reasoning capabilities or implicit knowledge are needed.

References

Appendix A Hyper-parameters

For all reported results we trained with a batch size of 1024 and a learning rate of 6e-5.
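For reference, these settings map onto the Hugging Face training setup roughly as follows (a sketch; the output path and epoch count are placeholders, not values from the paper):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-synthetic",        # placeholder path
    per_device_train_batch_size=1024,   # batch size used for all reported results
    learning_rate=6e-5,
    num_train_epochs=100,               # placeholder: the number of epochs is not specified here
)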

Appendix B Symbolic rules

We show datasets illustrating grouped symmetry, grouped composition with enhancement, and implication. Each line is one datapoint. We also include, at the end of each corpus, the control group that does not follow any rule. “{…}” indicates the sampled group, which is not part of the actual dataset.

B.1 Symmetry

{Alex, Mo, Lucy, Sato, Didier}

Alex SiblingOf Mo

Mo SiblingOf Alex

Alex SiblingOf Lucy

Lucy SiblingOf Alex

Alex SiblingOf Sato

Sato SiblingOf Alex

Alex SiblingOf Didier

Didier SiblingOf Alex

Mo SiblingOf Lucy

Lucy SiblingOf Mo …

Sato SiblingOf Didier

Didier SiblingOf Sato

{Lea, Keith, Fatima, Gertrud, Klaus}

Lea SiblingOf Keith

Keith SiblingOf Lea

{Nick, Ayla, Olga, Maya, Sinead}

Nick ColleagueOf Ayla

Ayla ColleagueOf Nick

Spock RandomRelation Paulo

Ailnee RandomRelation Nora

B.2 Composition with semantic enhancement

{e8, e2, e8, e5}

e8 InstanceOf A

e2 InstanceOf A

e4 InstanceOf A

e11 InstanceOf A

{e15, e13, e12, e19}

e15 InstanceOf B

e13 InstanceOf B

{e25, e24, e29, e20}

e25 InstanceOf C

e24 InstanceOf C

e8 r1 e15

e8 r1 e13

e8 r1 e12

e8 r1 e19

e2 r1 e15

e2 r1 e13

e5 r1 e19

e15 r2 e25

e15 r2 e24

e15 r2 e29

e15 r2 e20

e19 r2 e20

e8 r3 e25

e8 r3 e24

e8 r3 e29

e8 r3 e20

e133 r61 e23

e56 r61 e29

B.3 Implication

{(Flu), (Cough, RunningNose, Headache, Fever)}

Kevin HasDisease Flu

Kevin HasSymptom Cough

Kevin HasSymptom RunningNose

Kevin HasSymptom Headache

Kevin HasSymptom Fever

Mariam HasDisease Flu

Mariam HasSymptom Cough