Recently, pre-trained deep learning models have proven their effectiveness in a variety of tasks [radford2018improving, devlin2018bert, brown2020language]. One consequence of this is that linguistic resources have received less attention. In the digital age, the amount of data generated per second is enough to train a model capable of abstracting many different patterns. With a sufficiently large amount of data, these models have been shown to abstract linguistic features on their own [liu2019linguistic]. Even so, training a model in an unsupervised fashion on internet data can produce noisy, biased models [bender2021dangers] and makes deep learning models black boxes to humans. Therefore, we want to find a balanced solution between these two extremes.
Pre-trained models derive their power from the attention mechanism, especially multi-head self-attention [vaswani2017attention]. This mechanism mimics human observation, i.e., it attends only to the aspects of the data that are relevant to the decision to be made. Models using attention mechanisms are often trained on data through unsupervised tasks. This helps to increase performance on downstream tasks without requiring much human intervention. Even so, automation without explanation can carry risks when it is widely adopted. Training models using only data can produce models that are highly accurate yet still fallacious. Table 1
shows a small dataset where, if the model bases its decision on the occurrence of the word "is", it can correctly classify 100% of the sentiment labels of the samples. Linguistically, "is" is a function word and should not be used to make a decision in this case. This is a contrived example, but the same thing can happen in a huge dataset that humans are not capable of verifying [chen2021evaluating]. For that reason, prior to this work, there have been proposals that force deep learning models to follow the knowledge encoded in linguistic resources [zhang2020semantics, zhou2019head, zhang2020sg]. We observe that such coercion could reduce the flexibility of the model and its overall effectiveness in exceptional cases.
|Sentence|Label|
|---|---|
|This is a great product.|Positive|
|This product is great.|Positive|
|The battery of this product is very good.|Positive|
|I don’t like this restaurant.|Negative|
|The song is perfect.|Positive|
|This is another awesome product from Google.|Positive|
|I think this product should not be sold.|Negative|
|It was a terrible experience.|Negative|
Seeking a middle ground, we investigate the possibility of injecting linguistic knowledge that pre-trained models can refer to without depending on it completely. With this approach, the model can both use linguistic knowledge for the majority of cases and observe the actual cases appearing in the data to adjust its understanding. In addition, this approach guides the model to use features that are accepted by humans, so that its decision-making is explainable to humans. The challenge of this approach is to find a balance between the information learned during pretraining and the injected knowledge.
We propose a novel approach to incorporating linguistic knowledge into the transformer architecture. This approach has three advantages compared to other methods. First, it is a knowledge injection method that can take advantage of many different types of annotated resources in NLP. Second, the injected information is not rigid but only a reference source for the model to reach its final conclusion. Third, training and storing the appended component are extremely efficient, helping to reduce environmental and financial costs. Although we only experiment with the dependency structure as the knowledge source in the scope of this paper, the approach generalizes to all linguistic knowledge that expresses relationships between the internal elements of a sentence, is friendly to the transformer architecture, and is extendable. We believe this approach will open a new research direction in injecting knowledge into the transformer architecture.
Our paper makes the following main contributions:
We first propose a novel framework to pretrain attention heads separately using linguistic information, without pretraining the whole model again.
Next, we introduce HYDRA, a knowledge-injected component to the transformer model, which is pretrained with dependency structure relations.
Finally, from the experimental result, we draw conclusions and future directions of this research.
In the remainder of this paper, we present related work on the attention mechanism and knowledge injection methods in Section 2, and the details of our method in Section 3. Next, we investigate the effectiveness of the proposed method in comparison with the baseline in Section 4. Finally, the conclusion and future work are presented in Section 5.
2 Related Work
2.1 Attention Mechanism
The transformer [vaswani2017attention] has become the architecture that drastically changed the way deep learning models are designed and used. With previous models such as the LSTM or CNN, all the information of the input would be compressed into a single representation. NLP applications need to handle longer and longer sentences, and with long inputs these models run into problems: their representations are not able to store all the information of the input. Instead of propagating information through recurrence as in the LSTM, the transformer uses the attention mechanism to perceive constraints within the input. Attention can be described as aggregating a query and key-value pairs into a representation. The output (attention vector) is calculated as the weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key. In the paper, the authors propose combining multiple views of the data into the final representation, so-called multi-head attention. This mechanism allows the model to view the data from different aspects to make decisions.
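The weighted-sum computation described above can be written compactly. The following is a minimal NumPy sketch of scaled dot-product attention (not the paper's implementation); the matrices `Q`, `K`, `V` are randomly generated for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key compatibility
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                   # weighted sum of values

rng = np.random.default_rng(0)
n, d = 4, 8  # sequence length, model dimension (illustrative values)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
# each row of w is a probability distribution over the n positions
```

Each attention head computes exactly this, with its own learned projections of the input; multi-head attention concatenates the per-head outputs.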
2.2 Knowledge Injection
Knowledge injection is the method of providing a deep learning model with external knowledge. In their paper, Nguyen and colleagues [nguyen2021knowledge]
use the similarity of distributions as additional knowledge for the learning process. The knowledge was injected into a neural network (LSTM) to guide the model to make better predictions. The authors also conducted experiments to extract the knowledge from the data and use it to guide the model in another task. However, with the advent of transformers, the LSTM gradually became obsolete.
Pretrained transformer-based language models such as BERT [devlin2018bert]
have demonstrated their effectiveness in many NLP tasks. However, these bulky models also reveal weaknesses through unsupervised learning of data[chen2021evaluating, bender2021dangers]. As a result, using additional knowledge is a promising way to improve the models. Recent works [zhang2020sgnet, sachan-etal-2021-syntax] discover that the performance of transformer models could be improved by incorporating either semantic information or syntactic information into the transformer models.
To encode syntactic information into a transformer model, the authors of SG-Net [zhang2020sgnet] incorporate a syntactic representation into the attention mechanism, in a so-called syntax-guided network (SG-Net). The key component in SG-Net is the SDOI-SAN, which is a combination of the syntactic dependency of interest (SDOI) and a self-attention network (SAN). SG-Net uses both the SDOI-SAN and the SAN of the original transformer model for a better linguistics-inspired representation. The SDOI is an $n \times n$ matrix, with $n$ the number of words in the input sentence. An element of this matrix is assigned the value 1 if there is a dependency link between the words in the corresponding row and column; otherwise, it is assigned 0.
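Such a dependency mask is straightforward to construct. Below is a minimal sketch (our illustration, not SG-Net's code) that builds the $n \times n$ 0/1 matrix from a list of dependency arcs; the arcs themselves are hypothetical:

```python
import numpy as np

def sdoi_mask(arcs, n):
    """0/1 mask of shape (n, n): entry (i, j) is 1 iff tokens i and j
    are linked by a dependency arc (in either direction)."""
    mask = np.zeros((n, n), dtype=np.int64)
    for head, dep in arcs:
        mask[head, dep] = 1
        mask[dep, head] = 1
    return mask

# Hypothetical (head, dependent) arcs for a 4-token sentence,
# for illustration only.
arcs = [(1, 0), (2, 1), (2, 3)]
mask = sdoi_mask(arcs, 4)
```

In practice the arcs come from a dependency parser, and the resulting mask constrains or supervises the attention weights.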
With the same idea of injecting knowledge to the model, Zhang et al. [zhang2020semantics]
propose an architecture that combines semantic information with the attention mechanism, the so-called semantics-guided neural network (SGN). The authors assign semantic labels to every token, encode them, and integrate this kind of knowledge into the pretrained model using a convolutional neural network. We observe that, in terms of implementation, this design is not friendly to the existing transformer architecture and is hard to extend.
3 Proposed Method

Observing the limitations of the previous methods, in this paper we propose a novel partial pretraining method for the transformer architecture. Attention heads are essentially abstractions of the data from different perspectives, formed during the pretraining process. The information in the heads interacts to make the final decision. We therefore want to verify the possibility of pretraining the heads separately with linguistic information and then incorporating them into the general architecture as an ordinary component.
The overall architecture of the proposed framework is shown in Figure 1. Based on the transformer’s architecture, we add one more transformer layer at the end of the architecture, containing heads that are pretrained with linguistic knowledge. Thanks to the residual connections, the model can choose to use the information of these heads or ignore it. Therefore, we can use linguistic knowledge as a reference source for the model without forcing it to accept such knowledge rigidly. Since the idea is derived with a dependency structure, we give this architecture an inspiring name, HYDRA, which stands for Hyper Dependency Representation Attentions.
3.1 Pretraining HYDRA Heads
Let the input be $X = (x_1, \dots, x_n)$, with $n$ the sequence length; we obtain an $n \times n$ matrix $SDOI$ containing the linguistic relationships of the words, as introduced by Zhang et al. [zhang2020sgnet]. After passing the input through the transformer layers, we get the hidden state $H^L$ at the last layer, with $L$ the number of layers of the original transformer body. We initiate and append a layer $L+1$ containing HYDRA heads and pass $H^L$ to this layer. With $W_i^Q$ and $W_i^K$ the learnable parameters of head $i$ in layer $L+1$, we calculate the query and key vectors as in Equations 1 and 2:

$$Q_i = H^L W_i^Q \quad (1)$$
$$K_i = H^L W_i^K \quad (2)$$
With $d_k$ the dimension of the key matrix $K_i$, we calculate the attention matrix $A_i$ for each head following Equation 3:

$$A_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) \quad (3)$$
We then minimize the element-wise MSE loss ($\mathcal{L}_{MSE}$) between $A_i$ and $SDOI$ for every HYDRA head:

$$\mathcal{L}_{MSE} = \frac{1}{n^2} \sum_{j=1}^{n^2} \left(\mathrm{flat}(A_i)_j - \mathrm{flat}(SDOI)_j\right)^2 \quad (4)$$
where $j$ is the index of an element in the flattened attention matrix ($j \in \{1, \dots, n^2\}$). The weights of the transformer body are frozen, and the weights $W_i^Q$ and $W_i^K$ in every HYDRA head are updated via the backpropagation process.
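The pretraining objective above can be sketched in a few lines of NumPy. This is an illustrative toy (single head, random data, a diagonal placeholder standing in for a real SDOI mask), not the paper's training code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hydra_pretrain_loss(H, Wq, Wk, sdoi):
    """Element-wise MSE between the head's attention matrix and the
    0/1 SDOI mask. H is the frozen body's last hidden state; only
    Wq and Wk would receive gradient updates."""
    Q, K = H @ Wq, H @ Wk
    A = softmax(Q @ K.T / np.sqrt(Wk.shape[-1]))
    return np.mean((A - sdoi) ** 2)

rng = np.random.default_rng(0)
n, d = 4, 8                           # illustrative sizes
H = rng.normal(size=(n, d))           # frozen last hidden state
Wq = rng.normal(scale=0.1, size=(d, d))
Wk = rng.normal(scale=0.1, size=(d, d))
sdoi = np.eye(n)                      # placeholder mask for the demo
loss = hydra_pretrain_loss(H, Wq, Wk, sdoi)
```

In the actual framework, this loss would be minimized with an autodiff library while the transformer body's parameters stay frozen.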
After pretraining, we get the HYDRA heads corresponding to the transformer body coupled to it. These HYDRA heads are saved in the storage and ready to be delivered as a form of deep learning resource. Compared to storing pretrained transformer models, storing HYDRA heads saves much more storage space. A transformer model checkpoint can be measured in gigabytes, while a HYDRA head checkpoint in our experiment is less than 10 megabytes.
3.2 Fine-tuning with HYDRA Heads
Different from approaches that use linguistic knowledge that is rigidly injected into the model, we attach pretrained HYDRA heads to the transformer bodies. These models can refer to but are not limited by the linguistic knowledge learned in the HYDRA heads.
At this phase, the weights of the transformer body and the HYDRA heads are both updated to match the training data. The final model now contains $L+1$ layers: the first $L$ layers come from the original transformer body, and the last layer contains the HYDRA heads. We simply pass the input through the model and fine-tune it as with the vanilla model [devlin2018bert]. We conduct experiments to observe the performance of the model in Section 4. The intuition in this phase is that, by acquiring linguistic knowledge, the system can properly model the problem rather than trying to fit into local minima.
4 Experiments

4.1 Experimental Settings
4.1.1 Pretraining Stage
In the pretraining phase, we collect data from Wikimedia Downloads (https://dumps.wikimedia.org/) using the WikiExtractor package [Wikiextractor2015]. We use the variant of the dependency parser described by Honnibal and Johnson [honnibal2015improved], provided by spaCy (https://spacy.io/). After data processing, we obtain 330,000 samples for training and 50,000 samples for validation.
We use BERT [devlin2018bert]
as the base model to pretrain the HYDRA heads. We set the maximum sequence length to 512 and only use sentences whose length does not exceed this number to pretrain the HYDRA heads. The goal of this setting is to help the model learn the linguistic structure of complete sentences of the longest possible length. With 1.8 million parameters, in our experiment, it takes only one or two epochs for the HYDRA heads to reach optimal loss on both the training set and the validation set.
4.1.2 Task Adaptation Stage
To understand the model’s behavior and the method’s effectiveness, we run experiments and compare the performance of the HYDRA variants with the corresponding non-HYDRA baselines. The experiments are run on the datasets and standard settings of several well-known natural language processing benchmarks, which include:
QNLI [rajpurkar2016squad], MNLI [N18-1101, bowman2015large], RTE [dagan2005pascal, bar2006second, giampiccolo2007third, bentivogli2009fifth]: Natural Language Inference and Textual Entailment
Metric for QNLI and RTE: Accuracy.
Metric for MNLI: Accuracy for both matched (m) and mismatched (mm) versions.
QQP [iyer2017qqp] and STS-B [cer2017semeval]: Semantically Equivalence Judgment
Metric for QQP: Accuracy.
Metric for STS-B: Pearson and Spearman correlation.
SQuAD [rajpurkar2016squad] and SQuAD 2.0 [rajpurkar2018know]: Question Answering
Metric for SQuAD and SQuAD 2.0: Exact Match and Macro-averaged F1 Score.
With these benchmarks, we can investigate in detail the effectiveness of the proposed method. We conduct experiments with both the cased and uncased versions of BERT and average the performance.
4.2 Experimental Result and Discussion
Table 2 shows the results of the models on the benchmarks and metrics described in Section 4.1.2. BERT is a very strong baseline; it achieves high performance on all benchmarks. Even so, our model can still slightly improve on BERT's results with the appended HYDRA heads. This result supports the hypothesis that the linguistic knowledge pretrained into HYDRA heads can boost the performance of the vanilla model.
Beyond the performance improvement, this is a novel approach to injecting linguistic knowledge into language models. One notable feature of our improvement is that the new component is lightweight, requiring little pretraining computation and storage. With this paradigm, we can improve bulky transformer models without pretraining the whole network again or forcing them to follow rigid linguistic rules.
|Metric|BERT|BERT + HYDRA|
|---|---|---|
|SQuAD 2.0 EM|0.7137|0.7161|
|SQuAD 2.0 F1|0.7458|0.7480|
This paper proposes an architecture-friendly and extensible method to improve the effectiveness of transformer-based language models by pretraining new knowledge-guided heads and appending them to the architecture. We conduct experiments with BERT as the base model and dependency information as the external knowledge. Our experiments show that this lightweight component can help boost the performance of transformer models and provides a flexible paradigm for partially injecting knowledge into bulky models without pretraining them again. Extending this work, we can analyze the possibility of pretraining this component with different forms of knowledge for problems in narrower domains, or explore the potential of this approach for other data types such as images or videos.