In recent years, the increasing emphasis on privacy, and the requirement to comply with regulations have led to the development of data protection systems. These systems have relied on pattern matching and other approaches to detect Personally Identifiable Information (PII), Sensitive Personal Information (SPI), and Protected Health Information (PHI) as per their requirements.
While these have served well so far, the introduction of the General Data Protection Regulation (GDPR) in the European Union (EU) has significantly expanded the scope of data protection systems.
Data protection systems involve the extraction of personal data entities (entity recognition), their classification (entity classification) and protection (e.g. encryption, de-identification) based on the sensitivity of the data. For each of these tasks, data protection systems require very high recall and reasonable precision. This is because false negatives could lead to loss of private data, while a slight loss in precision because of false positives might be acceptable. Pattern matching and dictionary based systems tend to have higher precision, but need to be continuously updated to achieve good recall. Further, each new type needs to be associated with its own rules and dictionaries, which ends up being expensive in terms of money, time and human labour.
Personal Data Entity
We can define Personal Data Entity (PDE) as any information about a person. Such information can be present in both the public domain as well as in personal data.
Roby was born in Montgomery, Alabama and attended New York University, where she received a bachelor of music degree.
The above sentence is from the publicly available Wikipedia page of an elected official. This sentence by itself cannot be considered as personal data. But it contains Personal Data Entities (PDEs), i.e. entities which are mentioned in a personal context. A news article may also contain such mentions about an elected official.
On the other hand, data in the private domain like emails, chat conversations, medical patient notes, transcripts of voice conversations, employee records can all be considered personal data. For the purpose of our discussion, a mention of a popular person (e.g. actor Matt Damon) in a private conversation should still be considered a PDE mention.
Montgomery - LOCATION
Alabama - LOCATION
New York University - ORGANIZATION
In the above examples, LOCATION and ORGANIZATION are the labels assigned by the Stanford Named Entity Recognizer (NER). These labels can be considered as coarse types of these entities. Conditional Random Fields (CRF) can be trained to assign a limited number of such labels. However, NER systems can also provide fine types like below.
Roby - PERSON
Montgomery - CITY
Alabama - STATE_OR_PROVINCE
These fine types are typically obtained by pattern matching with regular expressions, looking up dictionaries of people names and geographical data, and rule based systems.
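As a rough illustration of how such rule and dictionary based typing can work, the following Python sketch combines regular expressions with small lookup dictionaries. The patterns, the dictionary contents and the `annotate` function are hypothetical and are not taken from any of the annotators discussed in this paper.

```python
import re

# Hypothetical patterns and dictionaries, for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ZIP_CODE": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}
DICTIONARIES = {
    "CITY": {"Montgomery"},
    "STATE_OR_PROVINCE": {"Alabama"},
}


def annotate(text):
    """Return sorted (start, end, mention, label) tuples found by the rules."""
    spans = []
    # Regular expressions cover types with a recognizable surface form.
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), match.group(), label))
    # Dictionaries cover types that are enumerated rather than patterned.
    for label, entries in DICTIONARIES.items():
        for entry in entries:
            for match in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
                spans.append((match.start(), match.end(), match.group(), label))
    return sorted(spans)


sentence = "Roby was born in Montgomery, Alabama."
for start, end, mention, label in annotate(sentence):
    print(mention, "-", label)
```

Note that neither the patterns nor the dictionaries look at the surrounding words, which is why such systems need new rules for every new type and cannot tell whether a mention occurs in a personal context.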
The examples shown above happen to be PDEs. However some of the other labels from NERs like NUMBER, ORDINAL, and PERCENT cannot be considered PDE without knowing the context. In fact, even instances of coarse grained labels such as ORGANIZATION cannot be considered to be personal without observing the context in which the entity was mentioned.
In recent years, a number of Neural Fine Grained Entity Classification (NFGEC) models have been proposed, which assign fine grained labels to entities based on context. For example, New York University could be typed as /org/education.
However the focus of such systems has not been on PDEs. They do not treat the problem of identifying PDEs any different from other entities. For the purpose of GDPR and other regulations, it might be desirable to assign the label /bio/education/alma_mater to New York University and /bio/education/edu_degree to bachelor of music. In contrast to fine grained entity typing systems, standard coarse grained NER systems would have assigned the label /title to bachelor of music, and /organization to New York University.
In this work, we only discuss the classification of PDEs in unstructured data. Personal data entities also occur in structured data, as well as in multi-modal data, both of which are beyond the scope of this work. We also do not discuss genome and related biometric data, and leave them for future work. We can summarize our contributions in this work as follows:
We propose a set of 134 Personal Data Entity Types (PDET), which are fine-grained entity types related to personal data
We introduce 2 new datasets annotated with fine-grained PDETs, which can be used to evaluate PDE typing systems
We propose an approach to improve state of the art models for fine-grained entity classification, by using existing NER systems (hereafter called Personal Data Annotators) as side information
The rest of the paper is organized as follows. We discuss related work, then describe the personal data entity types (PDET) we have created, and explain how we annotated two datasets with personal data entities (PDEs). We then discuss our improvements to a state of the art neural model [Shimaoka et al.2017] by adding the output of Personal Data Annotators as additional contextual features. Later we briefly explain a PDE Classification pipeline which includes the personal data annotators, the neural model and a post processing step. We then point to future work and conclude with a summary of our findings.
Entity classification is a well known research problem in Natural Language Processing (NLP). [Ling and Weld2012] proposed the FIGER system for fine grained entity recognition. In recent years, [Yogatama, Gillick, and Lazic2015], [Shimaoka et al.2017] and [Choi et al.2018] have proposed different neural models for context dependent fine grained entity classification. [Abhishek, Anand, and Awekar2017] and [Xu and Barbosa2018] proposed improvements to such models using better loss functions.
[Dernoncourt et al.2017] proposed an RNN model for the de-identification of Protected Health Information. This model has very high F1 on the de-identification task. However, the number of PDE types that can be classified using this approach is limited, as structured prediction and sequence labelling models based on RNNs and CRFs have difficulty scaling up to a large number of classes.
[Ling and Weld2012] introduced the Wiki dataset that consists of 1.5M sentences sampled from Wikipedia articles. The OntoNotes dataset by [Weischedel et al.2013] consists of 13,109 news documents, of which 77 test documents are manually annotated [Gillick et al.2014]. The BBN dataset by [Weischedel and Brunstein2005] consists of 2,311 Wall Street Journal articles which are manually annotated using 93 types. [Murty et al.2017] have proposed a much larger label set based on Freebase.
Data Loss Prevention
Entity classification on personal data is much sought after in Big Data and Cloud services. Data Loss Prevention (DLP) systems have used rule based / pattern matching methods to identify personal data. [Wootton et al.2011] describe a rule based approach to categorize data in the cloud for DLP. Amazon Macie (https://aws.amazon.com/macie/), Google DLP (https://cloud.google.com/dlp/), IBM Security Guardium (https://www.ibm.com/in-en/security/data-security/guardium), and Microsoft Azure Information Protection (https://azure.microsoft.com/en-in/services/information-protection/) are some examples of DLP systems.
Personal Data Entity Types (PDET)
[Ling and Weld2012] proposed the FIGER entity type hierarchy with 112 types. [Gillick et al.2014] proposed the Google Fine Type (GFT) hierarchy and annotated 12,017 entity mentions with a total of 89 types from their label set. These two hierarchies are general purpose labels covering a wide variety of domains. Considering the requirements of GDPR compliance, we propose a larger set of Personal Data Entity Types with 134 entity types as shown in Figure 1.
In order to come up with this hierarchy, we started with the taxonomies proposed by various organizations for GDPR compliance. However, such taxonomies include multi-modal data and have substantially more labels than FIGER and GFT. Training a neural model with a very large number of class labels may not provide optimal results. Further, obtaining training data for each of the labels was also a concern. Hence we have incorporated a subset of these GDPR taxonomies in our hierarchy.
On the other hand, entity recognition and de-identification models typically have a limited number of entity types. We have incorporated all the PHI entity types except biometric entity types. We then considered several NERs, rule based systems and pattern matching systems, and incorporated the PDEs recognized by them. We discuss these systems in more detail in the next section. Finally, we included the labels in FIGER and GFT that are relevant to PDEs.
We created the Personal Data Entity Types hierarchy as a stand alone exercise, before considering datasets where entity mentions could be found for training models on this label set. This approach is similar to designing an ontology for a domain, although that is beyond the scope of this work.
Any system that assigns a label to a span of text can be called an annotator. In our case, these annotators assign an entity type to every entity mention. The annotators we have chosen are Stanford CoreNLP, and two enterprise (rule/pattern based) annotation systems, IBM BigInsights NER (https://www.ibm.com/support/knowledgecenter/en/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.text.doc/doc/ana_txtan_extractor-libraries.html) and IBM InfoSphere Information Server (https://www.ibm.com/analytics/information-server).
Stanford CoreNLP provides 23 labels, BigInsights provides 18 labels and InfoSphere provides 164 labels. However, the InfoSphere annotators were written for annotating structured data and need to be provided with the spans of entity mentions when dealing with unstructured data.
We use these personal data annotators in 3 ways:
To annotate the two datasets that we are introducing.
To generate the coarse entity types that are used as additional contextual features to our neural model.
As part of the Personal Data Classification pipeline, where for some of the classes the output of these PDAs is used directly as the entity type. These are types like email address, zip code and number, where rule-based systems provide coarse labels at high precision.
While neural networks have recently improved the performance of entity classification on general entity mentions, pattern matching and dictionary based systems continue to be used for identifying personal data entities in the industry.
We believe our proposed approach, consisting of modifications to state-of-the-art neural networks, will work on personal datasets for two reasons. [Yogatama, Gillick, and Lazic2015] showed that hand-crafted features help, and [Shimaoka et al.2017] have shown that performance varies based on training data domain. We have incorporated these observations into our model, by using coarse types from rule-based annotators as side information.
None of the existing fine-grained entity typing datasets have an emphasis on PDEs. As such, in order to evaluate the performance of our proposed approach and the PDE Classification pipeline, we create two new datasets with a focus on fine-grained PDETs. We plan to make these resources available to the community. In this section, we describe our method to create and annotate these datasets.
| | Elected Reps | Enron Emails |
| Unique entity mentions | 45686 | 24771 |
| Unique entity types | 91 | 47 |
As discussed in the introduction section, Personal Data Entities can occur in both publicly available data like Wikipedia pages, as well as in personal data like email conversations. Hence we have created a dataset each from public and personal data, both annotated with PDEs.
Elected Representatives Dataset
We have created this dataset from the Wikipedia page of US House of Representatives and the Members of the European Parliament. We obtained the names of 1196 elected representatives from the listings of these legislatures. These listings provide the names of the elected representatives and other details like contact information. However this semi-structured data by itself cannot be used for training a neural model on unstructured data.
Hence, we first obtained the Wikipedia pages of elected representatives. We then used Stanford CoreNLP to split the text into sentences and tokenize the sentences. We ran the Personal Data Annotators on these sentences, providing the bulk of the annotations that are reported in Table 1.
We then manually annotated about 300 entity mentions which require fine grained types like /profession. The semi-structured data obtained from the legislatures had name, date of birth, and other entity mentions. We needed a method to find these entity mentions in the Wikipedia text, and to assign their column names or manual labels as PDEs.
We used the method described in [Chiticariu et al.2010] to identify the spans of the above entity mentions in Wikipedia pages. This method requires the creation of dictionaries, each named after an entity type and populated with entity mentions. This approach does not take the context of the entity mentions into account while assigning labels, and hence the data is somewhat noisy. However, labels for name, email address, location and website do not suffer much from the lack of context, and hence we went ahead and annotated them.
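The idea of projecting semi-structured field values onto free text can be sketched in a few lines of Python. This is a plain exact-string match and not the method of [Chiticariu et al.2010]; the record, the text and the `label_mentions` function are hypothetical.

```python
def label_mentions(text, record):
    """Find every occurrence of each record value and label it with its column name."""
    annotations = []
    for column, value in record.items():
        start = text.find(value)
        while start != -1:
            annotations.append((start, start + len(value), column))
            # Continue searching after the current hit to catch repeated mentions.
            start = text.find(value, start + 1)
    return sorted(annotations)


# Hypothetical record from a legislature listing and a matching page text.
record = {"name": "Jane Doe", "location": "Springfield"}
text = "Jane Doe was born in Springfield. Jane Doe studied law."
for start, end, label in label_mentions(text, record):
    print(text[start:end], "-", label)
```

Because the match ignores context, a value such as a common surname would be labelled wherever it appears, which is the source of the noise mentioned above.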
Enron Emails Dataset
The Enron Corpus (https://www.cs.cmu.edu/~./enron/) is a database of emails from employees of the Enron Corporation, which was made public for research purposes. We converted 917 Enron emails from the dataset into an appropriate format. We treated the text of the emails in the same way as the Wikipedia pages above and annotated PDEs on them. We again used other fields like sender, receiver, timestamp etc. on the text of the email to further expand the size of the annotated dataset.
Neural Fine Grained Entity Classification
Similar to [Shimaoka et al.2017], [Abhishek, Anand, and Awekar2017], [Choi et al.2018], [Murty et al.2017], [Xu and Barbosa2018] and [Xin et al.2018], we pose fine-grained entity classification as a multi-class, multi-label classification problem, i.e. each sample can be assigned several of the many possible labels at once. As the backbone of our architecture, we use the neural network model from [Shimaoka et al.2017], which consists of an encoder for the left and right contexts of the entity mention, another encoder for the entity mention itself, and a logistic regression classifier working on the features from the aforementioned encoders. An illustration of the model is shown in Figure 2.
The major contribution of [Yogatama, Gillick, and Lazic2015] was showing the relevance of hand-crafted features for entity classification. [Shimaoka et al.2017] further showed that entity classification performance varies significantly based on the input dataset (more than usually expected in other NLP tasks).
The major drawback of the features used in [Shimaoka et al.2017] was the use of custom hand crafted features, tailored for the specific task, which makes generalization and transferability to other datasets and similar tasks difficult. Building on these ideas, we have attempted to augment neural network based models with low level linguistic features which are obtained cheaply to push overall performance. Below, we elaborate on some of the architectural tweaks we attempt on the base model.
Given an entity mention in a sentence, we can rewrite the input as $[l_1, \ldots, l_C]\,[m_1, \ldots, m_M]\,[r_1, \ldots, r_C]$, where $C$ is the window size for the left and right context, and $M$ is the number of words in the entity mention itself. Due to the position of the entity mention, the left or right context can end up being empty, in which case it is replaced with padding. Given this input, the classifier has to predict the labels $t \in \{0, 1\}^K$, where $K$ is the number of labels. We do this by computing a probability $y_k$ for each possible label $k$. At inference time, the label with the highest probability, as well as all other labels with $y_k > 0.5$, are predicted.
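The input windowing and the inference rule can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the padding token are hypothetical, and the threshold of 0.5 is assumed here following [Shimaoka et al.2017].

```python
PAD = "<pad>"  # hypothetical padding token


def split_context(tokens, start, end, window):
    """Split a sentence into fixed-size left context, mention, and right context."""
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    # Pad on the outer side so that both contexts always have `window` tokens.
    left = [PAD] * (window - len(left)) + left
    right = right + [PAD] * (window - len(right))
    return left, tokens[start:end], right


def predict_labels(probabilities, threshold=0.5):
    """Return the most probable label plus every label above the threshold."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    above = {k for k, p in enumerate(probabilities) if p > threshold}
    return sorted({best} | above)


tokens = "Roby attended New York University".split()
print(split_context(tokens, 2, 5, window=3))
print(predict_labels([0.9, 0.6, 0.2]))
```

Always including the top-scoring label guarantees that every mention receives at least one type, even when no probability clears the threshold.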
Similar to [Shimaoka et al.2017], we use two separate encoders for the entity mention and the left and right contexts. For the entity mention, we resort to using the average of the word embeddings for each word. For the left and right contexts, we employ the three different encoders mentioned in [Shimaoka et al.2017], viz.
The averaging encoder, which, like the mention encoder, uses the average of the word embeddings as the context representation
The RNN encoder, which runs an RNN over the context and takes the final state as the representation of the context
The attentive encoder, which runs a bidirectional RNN over the context, and employs self-attention to obtain scores for each word, which are in turn used to get a weighted sum of the states to use as the representation.
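The difference between the averaging and the attentive encoders comes down to how the per-word vectors are pooled. The sketch below shows both pooling steps on plain Python lists. It is a simplification: the recurrent network that would produce the states is omitted, and a dot product with a hypothetical `query` vector is swapped in for the learned scoring network of [Shimaoka et al.2017].

```python
import math


def average(vectors):
    """Averaging encoder: the mean of the word vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def attention_pool(states, query):
    """Attentive pooling: softmax-weighted sum of the states.

    Scores are dot products with `query`, an assumed stand-in for the
    learned attention scorer.
    """
    scores = [sum(s * q for s, q in zip(state, query)) for state in states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    pooled = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
    return pooled, weights


states = [[1.0, 0.0], [0.0, 1.0]]
print(average(states))
print(attention_pool(states, query=[2.0, 0.0]))
```

With a zero query the attention weights are uniform and the attentive pooling reduces to the plain average.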
Details of the different encoders can be found in [Shimaoka et al.2017], and we omit them here for brevity. The features from the mention encoder and the left and right context encoders are concatenated and passed to a logistic regression classifier. If we consider $v_l$ to be the representation of the left context, $v_r$ to be the representation of the right context, and $v_m$ to be the representation of the entity mention, each being $D$-dimensional, then these features are concatenated to form $v = [v_l; v_r; v_m] \in \mathbb{R}^{3D}$, which is passed to the logistic regression classifier, which in turn computes the function:
$$y = \sigma(W v) = \frac{1}{1 + \exp(-W v)}$$
where $W \in \mathbb{R}^{K \times 3D}$ is the set of weights that project the features from a $3D$-dimensional feature space to a $K$-dimensional output, where $K$ is the number of labels, and $\sigma$ is the element-wise sigmoid. Since the target output is a binary vector, we employ a binary cross entropy loss during training. Given the predictions $y$ and the ground truth $t \in \{0, 1\}^K$ for a sample, the loss is defined as:
$$L(y, t) = -\sum_{k=1}^{K} \left[ t_k \log y_k + (1 - t_k) \log (1 - y_k) \right]$$
We employ stochastic mini-batch gradient descent to optimize the above loss function, and the details are specified later in the experimental results section.
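The classifier and its loss can be written out in a few lines. The sketch below assumes a per-label sigmoid over a linear projection of the concatenated features and a binary cross entropy summed over labels; the function names and the toy numbers are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(weights, features):
    """Logistic regression: one sigmoid probability per label (one weight row per label)."""
    return [sigmoid(sum(w * f for w, f in zip(row, features))) for row in weights]


def bce_loss(predictions, targets, eps=1e-12):
    """Binary cross entropy summed over labels; eps guards against log(0)."""
    return sum(
        -t * math.log(max(y, eps)) - (1 - t) * math.log(max(1 - y, eps))
        for y, t in zip(predictions, targets)
    )


# Toy example: two labels over a concatenated feature vector of size three.
weights = [[1.0, 0.0, -1.0], [0.0, 0.5, 0.5]]
features = [0.2, 0.4, 0.1]
probabilities = classify(weights, features)
print(probabilities, bce_loss(probabilities, [1, 0]))
```

Each label gets its own independent probability, which is what allows several labels to be predicted for the same mention.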
The input to our model is a sequence of words, represented by their corresponding embeddings obtained from a look up table. Traditionally, pre-trained word embeddings such as GloVe [Pennington, Socher, and Manning2014] are used. Earlier work such as [Shimaoka et al.2017] kept the word embeddings frozen during training, but we update them, to account for words that might be present in our datasets but not in the GloVe vocabulary. Our main contribution comes in the form of augmented embeddings, wherein we concatenate embeddings for token level features to the word embedding. Each word can also be represented in a plethora of ways, such as using POS tags, dependency parse tags, NER tags, etc. We use a few of these cheaply available annotations, project them to a low dimensional embedding space, and concatenate the said embeddings to the word embedding. For a word $w$, whose word embedding is denoted by $e_w$, with features $f_1, \ldots, f_n$, whose embeddings are denoted by $e_{f_1}, \ldots, e_{f_n}$, the final embedding is given by $\hat{e}_w = [e_w; e_{f_1}; \ldots; e_{f_n}]$. A pipeline of how to construct the embeddings is shown in Figure 3.
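The construction of an augmented embedding amounts to a look up per feature followed by a concatenation. The sketch below uses tiny hand-written tables; all vectors, dimensions, tag names and the `augmented_embedding` function are hypothetical, and in a real model the tables would be learned or loaded from pre-trained vectors.

```python
# Hypothetical toy look up tables.
WORD_EMBEDDINGS = {"York": [0.1, 0.2, 0.3]}
FEATURE_EMBEDDINGS = {
    "pos": {"NNP": [1.0, 0.0]},
    "ner": {"ORGANIZATION": [0.0, 1.0]},
    "typ": {"NONE": [0.5, 0.5]},
}
# A fixed feature order keeps the layout of the final vector stable.
FEATURE_ORDER = ("pos", "ner", "typ")


def augmented_embedding(word, features):
    """Concatenate the word embedding with one embedding per token level feature."""
    vector = list(WORD_EMBEDDINGS[word])
    for name in FEATURE_ORDER:
        vector.extend(FEATURE_EMBEDDINGS[name][features[name]])
    return vector


print(augmented_embedding("York", {"pos": "NNP", "ner": "ORGANIZATION", "typ": "NONE"}))
```

The resulting vector has the word embedding dimension plus the sum of the feature embedding dimensions, so the encoders only need a wider input layer and no other change.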
Table 2. Columns: Dataset, # Test samples, # Labels.
For our experiments, we leverage the widely used OntoNotes dataset [Gillick et al.2014], as well as the Elected Representatives and Enron Emails datasets that we curated ourselves. Table 2 contains the details of the datasets, including train/test splits sizes, as well as number of fine-grained entities in each dataset.
Table 3. Columns: Encoder, Setting, Accuracy, Macro F1, Micro F1, GMean.
Performance of adding feature embeddings to word embeddings for the OntoNotes dataset. GMean denotes the geometric mean of accuracy, macro F1 and micro F1 scores. PDA Features refers to POS tags, NERs and annotations from rule-based annotators. Paper refers to the original numbers as reported in [Shimaoka et al.2017].
We used a standard set of hyperparameters for most of our experiments. Optimal values of learning rate and batch size were obtained by evaluating model performance on held out validation splits. In a departure from previous methods such as [Shimaoka et al.2017] and [Abhishek, Anand, and Awekar2017], which use large mini-batches, we use a smaller batch size of 512 samples, chosen after trying out several batch sizes. We use the Adam optimizer [Kingma and Ba2014] with the learning rate selected on the validation splits. Following [Shimaoka et al.2017], we use dropout [Srivastava et al.2014] as a regularizer on the encoders. The context window length is fixed, and padding is used if the context is smaller. Pre-trained GloVe vectors were used, and words not found in the GloVe vocabulary were initialized randomly and learnt during training. The RNN and attentive encoders use an LSTM with a fixed hidden size, and the feature embeddings are low dimensional. POS tags and NER features were obtained using Stanford CoreNLP, while Type tags were obtained using the rule based annotation system InfoSphere mentioned earlier. To evaluate performance, we follow [Ling and Weld2012] and use accuracy (strict F1 score), macro averaged F1 score, and micro averaged F1 score. To compare across different runs, we use the geometric mean of the 3 different metrics.
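The three evaluation metrics and their geometric mean can be sketched as below. The definitions assumed here are the strict, loose macro and loose micro measures of [Ling and Weld2012], applied to one gold and one predicted label set per mention; the function names are hypothetical and every label set is assumed to be non-empty.

```python
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def strict_accuracy(gold, pred):
    """Fraction of mentions whose predicted label set equals the gold set exactly."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def macro_f1(gold, pred):
    """Precision and recall are averaged over mentions before computing F1."""
    precision = sum(len(g & p) / len(p) for g, p in zip(gold, pred)) / len(gold)
    recall = sum(len(g & p) / len(g) for g, p in zip(gold, pred)) / len(gold)
    return f1(precision, recall)


def micro_f1(gold, pred):
    """Label counts are pooled over all mentions before computing F1."""
    overlap = sum(len(g & p) for g, p in zip(gold, pred))
    precision = overlap / sum(len(p) for p in pred)
    recall = overlap / sum(len(g) for g in gold)
    return f1(precision, recall)


def gmean(accuracy, macro, micro):
    """Geometric mean of the three metrics, used to compare runs."""
    return (accuracy * macro * micro) ** (1.0 / 3.0)


gold = [{"/person"}, {"/org", "/org/education"}]
pred = [{"/person"}, {"/org"}]
scores = strict_accuracy(gold, pred), macro_f1(gold, pred), micro_f1(gold, pred)
print(scores, gmean(*scores))
```

The strict measure penalizes the second mention fully for the missing fine label, while the two loose measures give partial credit for the correct coarse label.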
Influence of Token Level Features
Table 3 shows how our proposed architectural changes at the embedding level improve performance across all metrics when compared to the base model with plain word embeddings from [Shimaoka et al.2017]. For each encoder, the first row, titled Paper, gives the original results as reported by [Shimaoka et al.2017]. The next row gives the results from our re-implementation, which updates word embeddings during training and uses a smaller batch size of 512. The final row in each encoder section, titled PDA Features, shows the effect of concatenating token level features from the Personal Data Annotators. These features improve performance for every type of encoder.
Table 4 shows the effect of concatenating feature embeddings to the pre-trained word embeddings. Since we have 3 different types of features, viz. POS/NER/TYP, we perform a complete ablation analysis of the influence of each feature. We only display results from the averaging encoder for brevity, although similar trends were observed across all encoders. In the table, Pos refers to POS tags, Ner refers to coarse named entities, while Typ refers to annotations from rule-based annotators. The first row, with no added features, is the baseline, while the remaining rows show the effect of adding POS tags, NERs and Type tags to the pre-trained word embeddings. As can be seen, NER and Type tags have the highest influence on fine-grained entity classification. These results support our hypothesis that token level features, especially coarse grained NERs and Type tags from rule based systems, aid fine grained typing of entity mentions with context.
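A complete ablation over three features means one run per subset of the features, i.e. eight settings including the baseline. A small sketch of how such settings could be enumerated is given below; the feature names and the `feature_subsets` helper are hypothetical, and the training run itself is left out.

```python
from itertools import combinations

FEATURES = ("pos", "ner", "typ")


def feature_subsets(features):
    """Yield every subset of the features, from the empty baseline to the full set."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            yield subset


for subset in feature_subsets(FEATURES):
    setting = "+".join(subset) if subset else "baseline"
    print(setting)
```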
Performance on PDE datasets
Table 5. Columns: Dataset, Encoder, Setting, Accuracy, Macro F1, Micro F1, GMean.
The results on the Elected Representatives and Enron Emails datasets, which can be seen in Table 5, clearly show the same trend, i.e. adding token level features improves performance across the board, for all metrics, as well as for any choice of encoder. The important thing to note is that these token level features can be obtained cheaply, using off-the-shelf NLP tools to deliver linguistic features such as POS tags, or using existing rule based systems to deliver task or domain specific type tags. This is in contrast to previous work such as [Ling and Weld2012], [Yogatama, Gillick, and Lazic2015] and others, who resort to carefully hand crafted features.
Class Wise Performance
Table 6. Columns: Label, Baseline F1, Features F1.
In Table 6, we show the class-wise F1 scores for some select classes in the OntoNotes dataset. As can be seen, performance clearly improves with the addition of token-level Personal Data Annotator features. Similar trends can be observed for labels in the other PDE datasets as well. Note that the classes shown are all fine-grained classes, which highlights the efficacy of the proposed PDA features for the task of fine-grained personal data entity classification.
PDE Classification Pipeline
We have implemented a pipeline for Personal Data Entity Classification as shown in Figure 4. This pipeline consists of existing personal data annotators, the neural fine grained entity classification model described in the previous section, and a rule-based post processing step to combine the output of rule-based annotators and the neural model.
The input to our pipeline is a set of text sentences. We use existing entity recognizers to find mentions. The output is a list of fine grained entity types for each of the mentions. We have a rule based system to post process the results from both the Personal Data Annotators and the neural model.
In PDE classification, there are still a number of open problems. We mention some of them here. Using co-reference resolution or other approaches to determine, for example, who is the doctor and who is the patient in a medical patient note could be a useful addition to this work. A downstream anonymization solution can choose to redact the patient name, while leaving the doctor’s name intact. In many applications, the ability to peruse the document for analytics after anonymization is considered important.
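One possible shape of the post processing step is sketched below. The actual rules of the pipeline are not given in this paper, so the rule used here is an assumption: for a small set of types that rule-based annotators detect at high precision, the annotator label is used directly, and otherwise the labels of the neural model are kept. The type names and the `combine` function are hypothetical.

```python
# Assumed set of types for which the rule-based label is trusted as is.
HIGH_PRECISION_RULE_TYPES = {"EMAIL_ADDRESS", "ZIP_CODE", "NUMBER"}


def combine(rule_label, neural_labels):
    """Merge the annotator output and the neural output for one entity mention."""
    if rule_label in HIGH_PRECISION_RULE_TYPES:
        # Pattern based types are taken directly from the annotator.
        return [rule_label]
    # Context dependent types come from the neural model.
    return sorted(neural_labels)


print(combine("EMAIL_ADDRESS", {"/contact"}))
print(combine("ORGANIZATION", {"/bio/education/alma_mater"}))
```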
Another potential improvement is generalizing the model to work on any domain, as long as we have some rule-based coarse level annotators, and training data at fine grained level. For example, patient notes and other data in health care domain can be annotated with NLM Scrubber tool from [Kayaalp et al.2015].
In this work, we have focused only on unstructured data. This work can also be extended to PDE classification on structured data. This can be approached in two ways. Deep Learning for tabular data has recently begun to gain traction and can be attempted for the PDE classification task. Another approach could be to generate context from meta-data and other columns, similar to unstructured data.
We introduced Personal Data Entities (PDE) as a separate set of entities that need to be classified differently from entities in general fine grained entity classification. We introduced a hierarchy of 134 Personal Data Entity Types (PDET), and described two datasets annotated with PDEs. We then proposed an approach that uses existing rule-based annotators to generate additional context features for a state of the art neural model. Our experimental results show a substantial increase in accuracy, micro and macro F1 over the baseline model.
- [Abhishek, Anand, and Awekar2017] Abhishek, A.; Anand, A.; and Awekar, A. 2017. Fine-grained entity type classification by jointly learning representations and label embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers.
- [Chiticariu et al.2010] Chiticariu, L.; Krishnamurthy, R.; Li, Y.; Raghavan, S.; Reiss, F. R.; and Vaithyanathan, S. 2010. Systemt: an algebraic approach to declarative information extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
- [Choi et al.2018] Choi, E.; Levy, O.; Choi, Y.; and Zettlemoyer, L. 2018. Ultra-fine entity typing. arXiv preprint arXiv:1807.04905.
- [Dernoncourt et al.2017] Dernoncourt, F.; Lee, J. Y.; Uzuner, O.; and Szolovits, P. 2017. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association 24(3):596–606.
- [Gillick et al.2014] Gillick, D.; Lazic, N.; Ganchev, K.; Kirchner, J.; and Huynh, D. 2014. Context-dependent fine-grained entity type tagging. arXiv preprint arXiv:1412.1820.
- [Kayaalp et al.2015] Kayaalp, M.; Browne, A. C.; Dodd, Z. A.; Sagan, P.; and McDonald, C. J. 2015. An easy-to-use clinical text de-identification tool for clinical scientists: Nlm scrubber. In AMIA.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
- [Ling and Weld2012] Ling, X., and Weld, D. S. 2012. Fine-grained entity recognition. In AAAI.
- [Murty et al.2017] Murty, S.; Verga, P.; Vilnis, L.; and McCallum, A. 2017. Finer grained entity typing with typenet. arXiv preprint arXiv:1711.05795.
- [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In EMNLP.
- [Shimaoka et al.2017] Shimaoka, S.; Stenetorp, P.; Inui, K.; and Riedel, S. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers.
- [Srivastava et al.2014] Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
- [Weischedel and Brunstein2005] Weischedel, R., and Brunstein, A. 2005. Bbn pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.
- [Weischedel et al.2013] Weischedel, R.; Palmer, M.; Marcus, M.; Hovy, E.; Pradhan, S.; Ramshaw, L.; Xue, N.; Taylor, A.; Kaufman, J.; Franchini, M.; et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA.
- [Wootton et al.2011] Wootton, B.; Dandliker, R.; Tsibulya, A.; Bruening, O.; and Kessler, D. 2011. Methods and systems for normalizing data loss prevention categorization information. US Patent 8,060,596.
- [Xin et al.2018] Xin, J.; Lin, Y.; Liu, Z.; and Sun, M. 2018. Improving neural fine-grained entity typing with knowledge attention. In AAAI.
- [Xu and Barbosa2018] Xu, P., and Barbosa, D. 2018. Neural fine-grained entity type classification with hierarchy-aware loss. arXiv preprint arXiv:1803.03378.
- [Yogatama, Gillick, and Lazic2015] Yogatama, D.; Gillick, D.; and Lazic, N. 2015. Embedding methods for fine grained entity type classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).