PEL-BERT: A Joint Model for Protocol Entity Linking

01/28/2020 ∙ by Shoubin Li, et al. ∙ Institute of Software, Chinese Academy of Sciences ∙ Beijing Jiaotong University ∙ Microsoft

Pre-trained models such as BERT are widely used in NLP and are fine-tuned to consistently improve performance on various NLP tasks. Nevertheless, a fine-tuned BERT model trained on our protocol corpus still performs weakly on the Entity Linking (EL) task. In this paper, we propose a model that joins a fine-tuned language model with an RFC Domain Model. Firstly, we design a Protocol Knowledge Base as the guideline for protocol EL. Secondly, we propose a novel model, PEL-BERT, to link named entities in protocols to categories in the Protocol Knowledge Base. Finally, we conduct a comprehensive study of the performance of pre-trained language models on descriptive texts and abstract concepts. Experimental results demonstrate that our model achieves state-of-the-art performance in EL on our annotated dataset, outperforming all the baselines.




1 Introduction

Internet protocol analysis is an advanced computer networking topic that uses a packet analyzer to capture, view, and understand Internet protocols. Internet specifications and communication protocols are documented in Request for Comments (RFC) memoranda, which are informative resources for protocol analysis. Entity Linking (EL) recognizes and disambiguates named entities in RFCs and links them to a Protocol Knowledge Base (PKB), which is useful for comprehensive protocol analysis (Fig. 1). However, RFCs are informational or experimental, and their formats are not standardized[7], as they are written in an informal way. Besides, RFCs have been released by different institutions and individuals over many years. Hence, various writing styles and conventions are used, leaving RFCs replete with abbreviations, simplifications, and obsolete expressions (Fig. 2). These characteristics make EL in RFC documents extremely difficult, since each ontology in the PKB is associated with entities that differ greatly in how they are written.

Figure 1: Overview of Entity Linking in RFCs. I. Entity Extraction. II. Context Inference. III. Entity Linking.

Pre-trained language models[1, 2, 3] have become a robust way to deal with Entity Linking (EL). They capture rich language information from text by unifying pre-trained language representations and downstream tasks, thus improving accuracy in many NLP applications. Among these models, BERT[1] has been the most prominent one in recent NLP studies. Through the self-attention mechanism, BERT encodes bidirectional contextual information at the character, word, and sentence levels, which reduces the discrepancies among single words. Fine-tuning BERT also yields strong results in various downstream tasks, including Named Entity Recognition[4], Text Classification[5], and EL[6]. However, the original BERT was pre-trained on generic datasets, such as Wikipedia and BookCorpus, for general purposes. It has no preference towards specific domains and, for our research, it lacks the protocol-specific knowledge found in RFCs. Our experiments show that a standalone fine-tuned BERT is not adequate for highly accurate EL in RFCs.

Figure 2: Examples of Various Writing Styles in RFCs. Data frames are extracted from RFC3451 and RFC791. Header field "Version" is written as Version in RFC791, whereas the abbreviation V is used in RFC3451. Header field "Header Length" is written as IHL in RFC791, whereas HDR_LEN is used in RFC3451. Header field "Flag" is written as Flag in RFC791, whereas every flag bit is displayed in RFC3451.

In this paper, we propose a novel model, PEL-BERT, to tackle EL in RFCs. The key idea is to combine a fine-tuned language model with an RFC Domain Model. Experimental results demonstrate that our model achieves an accuracy of 72.9% on our annotated RFC dataset, outperforming all the baselines. In brief, our contributions include:

  • We design a Protocol Knowledge Base (PKB) as the guideline for protocol Entity Linking.

  • We propose PEL-BERT, which joins a fine-tuned language model with an RFC domain model. It achieves the highest result in EL on our annotated dataset, outperforming all the baselines.

  • We give a comprehensive analysis of the performance of pre-trained language models on descriptive texts and abstract concepts.

In the upcoming sections, we first present an overview of related work, followed by a detailed description of the approach and experiments and an elaboration of the evaluation methods. In the end, we give some concluding remarks.

2 Related Works

2.1 Entity Linking

Entity linking is the task of linking entity mentions in text to their corresponding ontologies. Most EL work aims to link mentions to a comprehensive Knowledge Base (KB)[8]. Recent approaches have used neural networks[9, 10] to capture the correspondence between a mention's context and a proposed entity in the KB. Graph-based[11, 12] and various joint methods[13, 14] are also widely used. Instead of linking to a KB, some approaches perform EL against ad-hoc entity lists[15]. In our experiment, we focus on the former, namely EL against a KB. We aim to bridge the gap between mentions in real-world text and entities in well-established theoretical schemas. A similar study by Jero[19] links mentions from RFC documents to a list of ontologies in order to perform grammar-based fuzzing. Given limited training data, they generalized this problem by assigning to each header field the property with the maximum key phrase overlap.

2.2 Fine-tuning BERT for Classification

BERT demonstrates high accuracy in classification tasks and has been widely applied in many specific domains. Lee[16] proposed a way of fine-tuning BERT for patent classification, outperforming DeepPatent[17]. Adhikari[18] applied BERT to four accessible datasets for document classification, improving on the classification baselines. However, few studies have applied BERT to classification in the protocol domain.

2.3 Learning with Scarce Annotations

In this paper, we also consider the problem of data scarcity. Since we manually annotate our dataset, addressing data scarcity allows us to train our model on a relatively small dataset, which greatly reduces human effort. Transfer Learning (TL) has been widely applied in research[21, 22]; it uses classifiers trained on large datasets that are similar, but not identical, to the target dataset to perform new tasks. Active Learning (AL) is another way of dealing with data scarcity[23, 24]. It selects the queries or sub-spans that are most informative and reinforces its learning result through iteration. Bootstrapping, where classifiers use their own predictions to teach themselves, is widely used for Entity Set Expansion (ESE)[25, 26]. It enriches the dataset by acquiring new samples in each iteration. We utilize AL and Bootstrapping in our experiment.

3 Approach

In this section, we elaborate on the implementation details of our PEL-BERT model. PEL-BERT consists of four modules: Embedding, Fine-tuned BERT Model, RFC Domain Model, and Fusion. The overall architecture of the model is shown in Fig. 3. In the following subsections, we describe each of these four modules in detail.

Figure 3: PEL-BERT Model Architecture

I. Embedding: The input for each experiment is a header description concatenated with its header field. The Header Field is parsed from the header diagrams (Fig. 1), whose layout is highly consistent across RFCs. The Description is the text chunk that refers to its corresponding header field. We infer the Description from the context near its header field using Zero-Shot Learning (ZSL)[20], similar to Jero[19].

We apply the vocabulary and embedding mechanism of BERT to convert descriptions and header fields into word embeddings. The special tokens [CLS] and [SEP] are added to each description to mark the start and end of the input sequence, respectively, while header fields are fed into the model directly. Both descriptions and header fields are tokenized and passed to the embedding layer, where each token is converted into a word embedding by joining its token, segment, and position embeddings, as in BERT. Since the output of the embedding layer is delivered to two different models, the position embeddings for header fields and descriptions are not consecutive; instead, each sequence starts from position zero.
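The input preparation described above can be illustrated with a minimal sketch. It is not the authors' implementation: `simple_tokenize` is a hypothetical whitespace tokenizer standing in for BERT's WordPiece vocabulary, and the choice to give header fields no special tokens is an assumption drawn from the statement that they are fed in directly. The point it shows is that each of the two sequences carries its own position ids starting at zero.

```python
from dataclasses import dataclass
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


@dataclass
class EncodedSequence:
    tokens: List[str]
    position_ids: List[int]


def simple_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for BERT's WordPiece tokenizer:
    # lowercase the text and split on whitespace.
    return text.lower().split()


def encode_description(description: str) -> EncodedSequence:
    # Descriptions are wrapped in [CLS] ... [SEP]; positions start at 0.
    tokens = [CLS] + simple_tokenize(description) + [SEP]
    return EncodedSequence(tokens, list(range(len(tokens))))


def encode_header_field(field: str) -> EncodedSequence:
    # Assumption: header fields get no special tokens. Their positions
    # restart at 0 instead of continuing from the description.
    tokens = simple_tokenize(field)
    return EncodedSequence(tokens, list(range(len(tokens))))


if __name__ == "__main__":
    desc = encode_description("Length of the header")
    field = encode_header_field("Header Length")
    print(desc.tokens, desc.position_ids)
    print(field.tokens, field.position_ids)
```

Keeping two independent position ranges mirrors the fact that the description goes to BERT while the header field goes to the RFC Domain Model.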

II. Fine-tuned BERT: We use BERT as part of our PEL-BERT. The word embeddings for descriptions, denoted as $E_{[CLS]}, E_1, E_2, \dots, E_n, E_{[SEP]}$, are fed into BERT. The output, denoted as $T_{[CLS]}, T_1, T_2, \dots, T_n, T_{[SEP]}$, represents the contextual information for descriptions. Let $info$, $info_{emb}$ and $HS_{info}$ be the descriptions, their embeddings, and their hidden states, respectively. This process can be formalized as follows:

$$info_{emb} = \mathrm{Embedding}(info), \qquad HS_{info} = \mathrm{BERT}(info_{emb})$$

The hidden state of the special token [CLS], denoted as $T_{[CLS]}$, is considered the aggregated sequence representation in the output of the BERT layer, and is then sent to the Adding Operation Unit. The following formula is used to compute $T_{[CLS]}$, where index 0 is the position of [CLS]:

$$T_{[CLS]} = HS_{info}[0]$$

III. RFC Domain Model: The word embeddings for header fields, denoted as $E'_1, E'_2, \dots, E'_m$, are fed into a non-linear layer. We consider BPNN, CNN, and Bi-GRU as possible non-linear layers. This non-linear layer converts the word embeddings of header fields into hidden states, which are used as the input for the Linear Aggregation Layer. Since BERT is pre-trained on large datasets of general knowledge, the semantic information carried by header fields is minimal compared to the large number of words BERT has seen. This would reduce the effect of header fields, because other, trivial words would overshadow them. Through the non-linear layer, we ensure that the semantic information of header fields is examined separately, thus preserving their valuable information. The output of the non-linear layer, denoted as $T'_1, T'_2, \dots, T'_m$, represents the header fields. Let $field$, $field_{emb}$ and $HS_{field}$ be the header fields, their embeddings, and their hidden states, respectively. This process can be formalized as follows:

$$field_{emb} = \mathrm{Embedding}(field), \qquad HS_{field} = \mathrm{NonLinear}(field_{emb})$$

To further incorporate the semantic information of header fields, we design a Linear Aggregation Layer that concatenates all the hidden states, i.e., the intermediate results of the non-linear layer, to fully exploit the heuristics inferred from header fields. The Linear Aggregation Layer provides the input for the subsequent Adding Operation Unit. Let $T_{field}$ be the final representation of the header fields, and let $HS_{field}$ denote the hidden states produced by the non-linear layer. This process can be formalized as follows:

$$T_{field} = \mathrm{Linear}(\mathrm{Concat}(HS_{field}))$$

IV. Fusion Layer: The fusion phase consists of an Adding Operation Unit and a softmax layer. The Adding Operation Unit transforms the header-field representation $T_{field}$ using an activation function. This transformation constructs a new representation of the header fields that can be combined element-wise with the [CLS] representation $T_{[CLS]}$. In this unit, we also weight $T_{field}$ against $T_{[CLS]}$, so that the result is still dominated by BERT but also integrates the heuristics of header fields. In this way, the header fields are involved in the fine-tuning process of BERT. We regard this as sharing heuristics between two different models. In our experimental setting, ReLU is used as the activation function. During backpropagation, the parameters of BERT and of the non-linear layer are mutually independent, which enables BERT to preserve most of its innate characteristics and maintain its high performance. The Adding Operation Unit combines $T_{[CLS]}$ and $T_{field}$ to produce a vector representation $O$ that is ready for the classification task. This process can be formalized into the following formulas:
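As an illustration only, the aggregation and fusion steps might be sketched as below. This is not the paper's exact formulation: the scalar `field_weight` is a hypothetical way to keep the result dominated by the BERT representation, and the tiny weight matrix is made up for the example. The sketch assumes that the header-field vector is passed through ReLU and then added element-wise to the [CLS] vector.

```python
from typing import List, Sequence

Vector = List[float]


def relu(vector: Sequence[float]) -> Vector:
    # Element-wise ReLU activation.
    return [max(0.0, x) for x in vector]


def linear_aggregate(hidden_states: Sequence[Sequence[float]],
                     weight: Sequence[Sequence[float]],
                     bias: Sequence[float]) -> Vector:
    # Concatenate all hidden states, then apply one linear layer so the
    # result has the same dimension as the [CLS] vector.
    flat = [x for state in hidden_states for x in state]
    return [sum(w * x for w, x in zip(row, flat)) + b
            for row, b in zip(weight, bias)]


def fuse(t_cls: Sequence[float], t_field: Sequence[float],
         field_weight: float = 0.5) -> Vector:
    # Assumed Adding Operation Unit: activate the header-field vector,
    # scale it down (hypothetical weight), and add it to the [CLS] vector.
    activated = relu(t_field)
    return [c + field_weight * f for c, f in zip(t_cls, activated)]


if __name__ == "__main__":
    hidden = [[1.0], [2.0]]                      # two 1-d hidden states
    weight = [[1.0, 1.0], [0.0, -1.0]]           # maps 2 inputs to 2 outputs
    t_field = linear_aggregate(hidden, weight, [0.0, 0.5])
    print(fuse([1.0, 1.0], t_field))
```

With a weight below one, negative or small header-field activations leave the BERT vector unchanged or only slightly shifted, which is one simple way to realise a fusion that BERT still dominates.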


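V. Classification: The output of the fusion layer, $O$, is processed by a softmax layer to obtain the semantic labels, denoted as $pred$, for the current inputs. We use Average Cross Entropy as our loss function. With $N$ samples, $C$ categories, and one-hot gold labels $y$, this process can be formalized as follows:

$$pred = \mathrm{softmax}(O), \qquad Loss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log pred_{i,c}$$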
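The classification step can be illustrated with a small, dependency-free sketch of softmax and average cross entropy. The function names are chosen for this example, and the score vectors are assumed to already have one entry per category (in practice a linear projection would normally produce them).

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores: Sequence[float]) -> int:
    # The predicted label is the index of the largest probability.
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def average_cross_entropy(batch_scores: Sequence[Sequence[float]],
                          gold_labels: Sequence[int]) -> float:
    # Mean negative log-probability assigned to the gold category.
    total = 0.0
    for scores, gold in zip(batch_scores, gold_labels):
        total -= math.log(softmax(scores)[gold])
    return total / len(gold_labels)


if __name__ == "__main__":
    batch = [[2.0, 0.1, -1.0], [0.0, 0.0, 3.0]]
    print([predict(scores) for scores in batch])
    print(average_cross_entropy(batch, [0, 2]))
```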


4 Experiment

4.1 Experimental Setup

Dataset: In our experiment, each training sample consists of a header field, a description of this header field, and the Knowledge Base entity to which the header field belongs, denoted as a triple: {Header Field, Description, Knowledge Base Entity} (Fig. 1). We infer Header Fields and Descriptions from RFCs and manually craft a set of 12 Knowledge Base Entities based on prior knowledge of computer networks and protocols.

We sample 71 RFC documents that contain header format descriptions in their catalogs. We collect a total of 507 samples and split them into a training set and a test set. After eliminating types that contain too few samples, we mainly consider the following feature set: label, length, content, boolean, address, enumeration set, version number, reserved field, and checksum. The distribution of samples over these features is shown in Table 1.

Category Size
Identifier-label 112
Length 70
Data 64
Boolean 62
Identifier-address 53
Enum 45
Version Number 36
Reserved 34
Checksum 31
Table 1: Summary of the Dataset
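For readers who want to work with data in this shape, the triple format and the counts in Table 1 can be captured in a few lines. The class and field names below are assumptions made for this sketch, and the example sample (including its category assignment) is hypothetical rather than taken from the annotated dataset.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Sample:
    header_field: str   # e.g. a field name parsed from a header diagram
    description: str    # text chunk that refers to the header field
    kb_entity: str      # category in the Protocol Knowledge Base


# Category sizes copied from Table 1.
CATEGORY_SIZES: Dict[str, int] = {
    "Identifier-label": 112,
    "Length": 70,
    "Data": 64,
    "Boolean": 62,
    "Identifier-address": 53,
    "Enum": 45,
    "Version Number": 36,
    "Reserved": 34,
    "Checksum": 31,
}


def category_shares(sizes: Dict[str, int]) -> Dict[str, float]:
    # Fraction of the dataset that falls into each category.
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    # Hypothetical sample; the label shown is illustrative only.
    example = Sample("IHL", "Internet Header Length", "Length")
    print(example)
    print(sum(CATEGORY_SIZES.values()))
    print(category_shares(CATEGORY_SIZES))
```

The nine category sizes add up to the 507 samples reported above, and the largest category holds roughly a fifth of the data, so the classes are noticeably imbalanced.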

Training: All of our models (PEL-BERT-a, PEL-BERT-b, PEL-BERT-c) have 12 transformer blocks, 768 hidden units, and 12 self-attention heads. For PEL-BERT, we first initialize it using BERT, then fine-tune the model for six epochs with a learning rate of 2e. During training and testing, the maximum text length is set to 10 tokens[28]. This limit is chosen because header fields in RFCs often consist of short phrases.

4.2 Baseline

In this subsection, we introduce the six baselines against which we evaluate PEL-BERT. We tune the hyperparameters of the baselines for a fair comparison, and all the baselines take the word embeddings of header descriptions as their input.

SVM: We use stochastic gradient descent to train the SVM model. The constraints are adjusted using L2 regularization, and the margin is set to 1.0. Other parameters are initialized with randomly assigned values.

BPNN[30]: All parameters are initialized with randomly assigned values. The dropout rate is set to 0.1 to avoid over-fitting. During training, we adopt an adaptive gradient descent strategy.

CNN: We use 3 kernels during convolution. The kernel size is set to 3 * 768. The size of the kernels in the max-pooling phase is 2 * 2. The batch size is 1. The dropout rate is set to 0.1. The output is sent into a linear layer and a softmax layer to make predictions.

Bi-GRU[32]: Bi-GRU acts similarly to the memory cell in the LSTM network. All parameters are initialized with randomly assigned values. We concatenate the outputs and use the same linear and softmax layers as the CNN to post-process them.

Adhikari et al.[27]: The model takes word embeddings as its input. We concatenate the outputs and feed them into a linear layer and a softmax layer. The size of the kernels in the max-pooling phase is 2 * 2. The dropout rate is set to 0.1.

DocBERT[18]: A sentence classifier based on BERT. The output of BERT is fed into a linear layer and a softmax layer.

To validate the impact that header fields have on the experiment, none of the baselines considers header information. Also, for SVM, BPNN, CNN, and Bi-GRU, we use 8000 iterations to approximate a single epoch of BERT training. Besides, we set the learning rate of all the baselines except DocBERT to 2e for faster convergence.

5 Evaluation and Analysis

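We report $Acc$, $Avg_P$, $Avg_R$, and $Avg_F$ for all categories in the schema using 10-fold cross-validation. $Acc$ indicates the accuracy on the training set. $Avg_P$, $Avg_R$, and $Avg_F$ indicate the average precision, recall, and F-measure across all the categories in the PKB, respectively. For a sample of category $a$, if our model makes the right decision, we increment $TP_a$; otherwise, we increment $FN_a$ and $FP_b$, where $b$ is the predicted category. Let $C$ be the total number of categories and $N$ be the total number of samples. We compute the above four criteria using the following formulas:

$$Acc = \frac{1}{N}\sum_{a=1}^{C} TP_a$$

$$Avg_P = \frac{1}{C}\sum_{a=1}^{C}\frac{TP_a}{TP_a + FP_a}, \qquad Avg_R = \frac{1}{C}\sum_{a=1}^{C}\frac{TP_a}{TP_a + FN_a}$$

$$Avg_F = \frac{2 \cdot Avg_P \cdot Avg_R}{Avg_P + Avg_R}$$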
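A minimal sketch of how these four criteria can be computed from gold and predicted labels is given below. The counter names (`tp`, `fp`, `fn`) and the choice to take the F-measure as the harmonic mean of the averaged precision and recall are assumptions of this sketch; the latter is consistent, to within rounding, with the values reported in Tables 2 to 4.

```python
from collections import Counter
from typing import Dict, Sequence


def evaluate(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    categories = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # right decision for category g
        else:
            fn[g] += 1          # gold category g was missed
            fp[p] += 1          # predicted category p was wrong

    def ratio(num: int, den: int) -> float:
        # Categories with an empty denominator contribute 0.
        return num / den if den else 0.0

    acc = sum(tp.values()) / len(gold)
    avg_p = sum(ratio(tp[c], tp[c] + fp[c]) for c in categories) / len(categories)
    avg_r = sum(ratio(tp[c], tp[c] + fn[c]) for c in categories) / len(categories)
    avg_f = ratio(2 * avg_p * avg_r, avg_p + avg_r)
    return {"Acc": acc, "Avg_P": avg_p, "Avg_R": avg_r, "Avg_F": avg_f}


if __name__ == "__main__":
    gold = ["Length", "Length", "Boolean", "Checksum"]
    pred = ["Length", "Boolean", "Boolean", "Checksum"]
    print(evaluate(gold, pred))
```

Because precision and recall are averaged per category rather than per sample, small categories such as Checksum count as much as large ones such as Identifier-label.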


5.1 Evaluation Between Different Implementations of PEL-BERT

Regarding possible implementations of PEL-BERT, we evaluate PEL-BERT based on three non-linear layers, namely, BPNN, CNN, and Bi-GRU. The statistics are shown in Table 2.

Exp. Group   Model                  Acc     Avg_P   Avg_R   Avg_F   Learning Rate
PEL-BERT     PEL-BERT-a (BPNN)      72.4%   73.9%   74.3%   74.1%   2e
             PEL-BERT-b (CNN)       49.6%   51.3%   53.0%   52.1%   2e
             PEL-BERT-c (Bi-GRU)    72.9%   73.7%   74.7%   74.2%   2e
Table 2: Detailed results of Acc, Avg_P, Avg_R, and Avg_F are shown. Best results are highlighted in bold font. Training is done on our manually annotated RFC dataset.

As the results show, PEL-BERT based on Bi-GRU achieves the best result on our dataset, reaching the highest accuracy of 72.9% and the highest $Avg_F$ of 74.2%. Since CNN is insufficient to capture contextual information, its result is inferior to those of BPNN and Bi-GRU, which make better use of the position embeddings in the word vector. Bi-GRU is derived from RNN, which is suitable for dealing with segmented information. Therefore, we choose Bi-GRU as the non-linear layer for the RFC Domain Model in PEL-BERT.

5.2 Ablation Study: Evaluation on performance of the Joint Model

This set of experiments is designed to evaluate the impact that header fields have on classification, namely, whether the RFC Domain Model contributes to increasing $Avg_F$. The statistics are shown in Table 3.

Exp. Group   Model                  Acc     Avg_P   Avg_R   Avg_F   Learning Rate
BERT         Fine-tuned BERT        69.8%   72.7%   72.2%   72.4%   2e
RFC Domain   BPNN + linearAGGR      47.3%   50.2%   48.5%   49.4%   2e
             CNN + linearAGGR       35.5%   33.5%   34.9%   34.2%   2e
             Bi-GRU + linearAGGR    60.8%   64.1%   64.1%   64.1%   2e
PEL-BERT     PEL-BERT-c (Bi-GRU)    72.9%   73.7%   74.7%   74.2%   2e

Inputs for RFC Domain Models are word embeddings for header descriptions.

Table 3: Detailed results of Acc, Avg_P, Avg_R, and Avg_F are shown. Best results are highlighted in bold font. Training is done on our manually annotated RFC dataset.

The accuracy of fine-tuned BERT is 69.8%. The individual RFC Domain Models achieve inferior results. When the RFC Domain Model is combined with BERT, however, the accuracy reaches 72.9%, which is higher than using either BERT or the RFC Domain Model alone. We can infer that fusing the domain model with the fine-tuned language model improves performance by injecting domain-specific knowledge. Therefore, we choose PEL-BERT-c (Bi-GRU) as our final model.

5.3 Comparison with Baselines

From the statistics shown in Table 4, our approach, the joint model of BERT and RFC Domain Model, achieves the best accuracy of 72.9%.

Exp. Group   Model                  Acc     Avg_P   Avg_R   Avg_F   Learning Rate
Baseline     SVM                    10.8%   10.4%   10.8%   10.6%   2e
             BPNN                   55.8%   47.8%   48.7%   48.2%   2e
             CNN                    48.0%   44.5%   45.2%   44.8%   2e
             Bi-GRU                 53.6%   44.9%   41.3%   43.0%   2e
             Adhikari[27]           57.6%   48.3%   48.3%   48.3%   2e
             DocBERT[18]            70.6%   71.8%   71.1%   71.4%   2e
PEL-BERT     PEL-BERT-c (Bi-GRU)    72.9%   73.7%   74.7%   74.2%   2e
Table 4: Detailed results of Acc, Avg_P, Avg_R, and Avg_F are shown. Best results are highlighted in bold font. Training is done on our manually annotated RFC dataset.

5.4 Analysis

We can draw an analogy between language models pre-trained on generic corpora and human beings. These models achieve decent results on descriptive texts because they manage to encode contextual information. However, our experimental results illustrate that they fail to comprehend abstract concepts such as header fields. For example, the header field "IHL" (Fig. 1) is an abstract concept in the protocol domain. Domain-specific information cannot be inferred from its lexical form, so a generic model is unable to link it to a specific category in the KB. In our approach, we not only use domain knowledge to fine-tune BERT but also design a domain model that explicitly learns domain-specific knowledge from these header fields. Finally, we combine these two models into our PEL-BERT model. Compared with the baselines, our model achieves the best result.

6 Discussion

Use RFC Domain Model to Deal with Header Fields: BERT is apt for texts with sequential relations. By merely appending header fields to the descriptions, we would attach false information that misleads BERT into regarding the header fields and descriptions as appearing in adjacent context, when they do not. This hinders BERT's performance. Therefore, concatenating header fields directly to descriptions is not appropriate, and we do not simply apply BERT to our EL task with header fields concatenated to descriptions as its input.

Use Neural Network to Handle Header Fields: We use a neural network, such as Bi-GRU or BPNN, rather than a pre-trained model, as the domain model that handles header fields independently, because our RFC dataset is tiny compared to the large corpora BERT is pre-trained on. The heuristic knowledge BERT acquired from generic corpora would largely overshadow the valuable domain-specific information contained in our dataset. By training a model explicitly for header fields, we manage to fully exploit the domain knowledge they contain.

7 Conclusion and Future Work

In this paper, we propose PEL-BERT to better fuse domain-specific knowledge into general-purpose language models. We use PEL-BERT to link RFC entities to the Protocol Knowledge Base. We give a comprehensive analysis of the performance of pre-trained language models on descriptive texts and abstract concepts. The experimental results demonstrate that PEL-BERT performs Entity Linking better than all the baselines. There are two points we want to address in future studies: (1) optimizing and extending our dataset to cover a broader range of RFCs; (2) evaluating our model on domain-specific datasets other than protocols to demonstrate its generality. Resolving these problems will lead to more comprehensive language understanding.