Character-Level Feature Extraction with Densely Connected Networks

06/24/2018 ∙ Chanhee Lee, et al. ∙ Amazon ∙ Korea University

Generating character-level features is an important step for achieving good results in various natural language processing tasks. To alleviate the need for human labor in generating hand-crafted features, methods that utilize neural architectures such as Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN) to automatically extract such features have been proposed and have shown great results. However, CNN generates position-independent features, and RNN is slow since it needs to process the characters sequentially. In this paper, we propose a novel method of using a densely connected network to automatically extract character-level features. The proposed method does not require any language or task specific assumptions, and shows robustness and effectiveness while being faster than CNN- or RNN-based methods. Evaluating this method on three sequence labeling tasks - slot tagging, Part-of-Speech (POS) tagging, and Named-Entity Recognition (NER) - we obtain state-of-the-art performance with a 96.62 F1-score on slot tagging and 97.73% accuracy on POS tagging, and performance comparable to the state-of-the-art with a 91.13 F1-score on NER.


1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/

Effectively extracting character-level features from words is crucial in many Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and Slot tagging. Thus, most state-of-the-art methods for these tasks exploit some kind of character-level features [Huang et al.2015, Sarikaya and Deng2007, Kim et al.2011, Kim and Snyder2012, dos Santos and Zadrozny2014]. Recently, generating character-level features with neural architectures such as Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN) has drawn much attention, mainly because it doesn’t require human labor and shows superior performance [Ma and Hovy2016, dos Santos and Zadrozny2014]. However, CNN struggles at distinguishing anagrams, and RNN is inherently slow due to its sequential nature.

In this paper, we propose an effective and efficient way of extracting character-level features using a densely connected network. The key benefits of the proposed method can be summarized as follows. First, it does not require any hand-crafted features or data preprocessing. Each word is processed based on n-gram statistics of the training data and vectorized using bag-of-characters. Additional features are based on the numerical values assigned to characters by the character set (e.g. UTF-16) and the number of characters in the word. Second, it extracts effective character-level features while being efficient. State-of-the-art performance can be achieved using this method, and the feature extraction is done with a simple densely connected network with a single hidden layer. Third, it does not depend on features that are language or task specific, such as character type features or gazetteers (i.e. lists of known named entities such as cities or organization names). The only requirement for adopting this method is that the language should be processable as a sequence of words, each of which is made up of a sequence of characters. These benefits, combined with minimal requirements for application, make the proposed method an easy replacement for conventional methods such as CNN or RNN.


Our contributions are three-fold: 1) We propose an effective yet efficient method for character-level feature extraction; 2) We quantitatively show that the proposed method is superior to CNN and RNN via extensive evaluation; 3) We achieve state-of-the-art or comparable to state-of-the-art performance on three of the most popular and well-studied sequence tagging tasks - Slot tagging, Part-of-Speech (POS) tagging, and Named Entity Recognition (NER).

2 Related Work

Prior to the introduction of neural architectures for character-level feature generation, manually engineered features were designed by experts based on language and/or domain knowledge. One example is word shape, in which each word is mapped to a simplified representation that encodes information such as capitalization, numerals, and length (e.g. CoNLL-2003 to AaAAA-0000). finkel2005exploring combined this feature with other information such as n-grams and gazetteers to train a conditional Markov model for identification of gene and protein names in biomedical documents. huang2015bidirectional introduced more hand-crafted features utilizing punctuation or non-letters and used these as an input to a Bi-LSTM-CRF tagger for POS tagging, CoNLL-2000 chunking, and CoNLL-2003 NER. Even though these kinds of hand-crafted features showed strong empirical results, they are more expensive than our approach in that they require expert knowledge of the target domain and language.

In recent years, methods that utilize neural networks to automatically extract character-level features have been proposed. The most widely adopted and successful method for this is CNN. dos2014learning combined this approach with a window-based fully-connected neural network tagger to perform English and Portuguese POS tagging. This work achieved state-of-the-art results in Portuguese and near state-of-the-art results in English. In ma2016end, a Bi-LSTM-CRF model incorporated with a character-level CNN is trained in an end-to-end fashion. They evaluated this approach on English POS tagging and NER, achieving state-of-the-art performance on both tasks. However, feature vectors generated by CNN are position-independent due to the max-over-time pooling layer, and are more sensitive to model weight initialization compared to the method proposed in this paper.

Another effective way of generating feature vectors from a variable length sequence of characters is to use RNN. For instance, lample2016neural extracted character-level features using a bi-directional LSTM and used them with pre-trained word embeddings as word representations for another Bi-LSTM-CRF model. Evaluating this model for NER, they obtained state-of-the-art results for Dutch, German, and Spanish, and close to state-of-the-art results for English. Intuitively, character-level feature generation via RNN should be more effective than CNN, since RNN processes each character sequentially and thus should form a better model of character ordering. However, reimers2017reporting empirically showed that these two methods have no statistically significant difference in terms of performance. Furthermore, RNN has a higher time-complexity caused by its sequential nature, which makes it less favorable.

3 Proposed Method

The proposed method is built on the bag-of-characters (BOC) representation. However, BOC cannot distinguish anagrams and is thus susceptible to word collisions, i.e. different words having the same vector representation. The main focus of the proposed method is to minimize word collisions while maintaining the key benefits described above. To achieve this goal, we split each word into pieces, and each piece is vectorized using BOC. Then, two non-hand-crafted features are extracted from the word - character order and word length. These sparse vectors are concatenated and normalized to form the sparse character-level feature vector. For an n-dimensional vector v, normalization is done as follows:

\hat{v}_i = v_i / \sqrt{\sum_{j=1}^{n} v_j^2}, \quad i = 1, \ldots, n    (1)

This sparse vector is then fed into a densely connected network with a single hidden layer to obtain the final dense character feature vector. Note that the sparse vector representation of each word is fixed, so it can be cached for efficiency. Figure 1 illustrates the overall process.

Figure 1: Process of generating the character-level feature vector of a word using the proposed method.
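To make the overall flow concrete, the following minimal sketch (ours, not the authors' implementation) shows how the cached sparse vector of a word could be passed through a single hidden layer to obtain the dense character-level feature; the hidden layer size of 50 follows Appendix A, while the ReLU activation and Gaussian initialization are assumptions.

import numpy as np

class CharDenseExtractor:
    """Single-hidden-layer projection of a word's sparse character vector."""
    def __init__(self, sparse_dim, hidden_dim=50, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (sparse_dim, hidden_dim))
        self.b = np.zeros(hidden_dim)
        self.cache = {}  # word -> sparse vector; the sparse form of a word is fixed

    def dense_feature(self, word, sparse_fn):
        # sparse_fn builds the normalized sparse vector described in Sections 3.1-3.3
        if word not in self.cache:
            self.cache[word] = sparse_fn(word)
        v = self.cache[word]
        return np.maximum(v @ self.W + self.b, 0.0)  # hidden layer with ReLU

Caching the sparse vectors means the per-word BOC construction is paid only once per distinct word, which is part of why the method is faster than CNN- or RNN-based extraction.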

3.1 Splitting Words

Each word is split into pieces to reduce the number of word collisions. To maintain the ordering of pieces, concatenation is used instead of summation or averaging to merge the vectors. Word splitting is done based on n-gram frequency. First, n-gram statistics C(g) are collected from the training corpus, where C(g) is the number of times the n-gram g appears in the corpus. Then, the n-gram with the highest frequency is merged into a single piece, and this merging is repeated until only p pieces are left. The number of pieces p per word is a configurable hyperparameter. Finally, each piece is converted into a fixed-length vector using BOC. The detailed algorithm is presented in Algorithm 1. This process is similar to the byte-pair encoding method in sennrich2015neural, except that in the proposed method each word is always split into exactly p pieces, whereas byte-pair encoding produces an arbitrary number of pieces. Producing a fixed number of pieces is important, since concatenation is used to merge the vectors.

Input: word w, n-gram statistics C, number of pieces p
Output: pieces s_1, ..., s_p such that concatenating s_1, ..., s_p yields w

pieces <- sequence of characters of w
while |pieces| > p do
    g <- adjacent group of pieces whose concatenation has the highest count in C
    replace g in pieces with its concatenation
end while
while |pieces| < p do
    append the empty string to pieces
end while
return pieces

Algorithm 1: Splitting a word into p pieces
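A possible Python rendering of Algorithm 1 is given below (our sketch; we read the merging step as greedily joining the adjacent pieces whose concatenation is the most frequent n-gram in the training corpus, which is one plausible interpretation of the procedure).

def split_word(word, ngram_counts, p):
    """Split a word into exactly p pieces by greedy frequency-based merging,
    padding with empty strings when the word has fewer than p characters."""
    pieces = list(word)
    while len(pieces) > p:
        best_i, best_count = 0, -1
        for i in range(len(pieces) - 1):
            count = ngram_counts.get(pieces[i] + pieces[i + 1], 0)
            if count > best_count:
                best_i, best_count = i, count
        pieces[best_i:best_i + 2] = [pieces[best_i] + pieces[best_i + 1]]
    while len(pieces) < p:
        pieces.append('')
    return pieces

# Hypothetical usage: split_word('atlanta', {'ta': 120, 'at': 95, 'la': 40}, p=2)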

3.2 Character Order Feature

Every character that has a digital representation can be converted into a numerical value via some character set (e.g. UTF-16), which makes it possible to compare two characters numerically. Let c_1 c_2 ... c_L be a character sequence of length L, and let v(c_i) denote the numerical value of c_i. The sets of increasing and decreasing bigrams, B_inc and B_dec, and their counts N_inc and N_dec are defined as follows:

B_inc = { (c_i, c_{i+1}) : v(c_i) < v(c_{i+1}), 1 <= i < L },   N_inc = |B_inc|    (2)
B_dec = { (c_i, c_{i+1}) : v(c_i) > v(c_{i+1}), 1 <= i < L },   N_dec = |B_dec|    (3)

Bigrams in which the same character repeats are ignored. A sequence of characters can then be categorized into one of three classes: increasing (N_inc > N_dec), decreasing (N_inc < N_dec), and neutral (N_inc = N_dec). This category is calculated for each word piece, converted into a 3-dimensional vector using one-hot encoding, and concatenated to the sparse word-piece vector.
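A small sketch of this feature, following the reconstruction above (the exact class boundaries are our assumption; Python's ord() gives the Unicode code point, which coincides with the UTF-16 value for characters in the Basic Multilingual Plane):

def char_order_onehot(piece):
    """One-hot character-order class of a word piece; bigrams made of a
    repeated character are ignored."""
    n_inc = sum(ord(a) < ord(b) for a, b in zip(piece, piece[1:]) if a != b)
    n_dec = sum(ord(a) > ord(b) for a, b in zip(piece, piece[1:]) if a != b)
    if n_inc > n_dec:
        return [1, 0, 0]  # mostly increasing
    if n_inc < n_dec:
        return [0, 1, 0]  # mostly decreasing
    return [0, 0, 1]      # neutral (or too short to compare)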

3.3 Word Length Feature

To further reduce the number of word collisions, information about the word’s length is added into the model. One-hot encoding is used to store an integer from 0 to 20, and any word exceeding 20 characters is treated as being 20 characters long.
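Putting the three components together, the sparse feature vector of a word could be assembled roughly as follows (our sketch, reusing split_word and char_order_onehot from the earlier sketches; the character alphabet and the use of L2 normalization are assumptions):

import numpy as np

def build_sparse_vector(word, ngram_counts, alphabet, p=2, max_len=20):
    """Concatenate per-piece bag-of-characters and order one-hots with a
    word-length one-hot, then normalize the result."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    parts = []
    for piece in split_word(word, ngram_counts, p):                    # Section 3.1
        boc = np.zeros(len(alphabet))
        for ch in piece:
            if ch in index:
                boc[index[ch]] += 1
        parts.append(boc)
        parts.append(np.array(char_order_onehot(piece), dtype=float))  # Section 3.2
    length = np.zeros(max_len + 1)                                     # Section 3.3
    length[min(len(word), max_len)] = 1.0
    parts.append(length)
    v = np.concatenate(parts)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v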

4 Model

In this section, we describe the sequence tagging model’s architecture in detail. Figure 2 illustrates the model architecture.

Figure 2: Overview of model architecture for sequence tagging experiments. Question mark indicates that the component is optional.

4.1 Sequence Tagging with Bidirectional RNN

In sequence tagging tasks, such as POS tagging or NER, both future and past input tokens are available to the model. Bidirectional RNNs [Graves and Schmidhuber2005] can efficiently make use of future and past features over a certain time frame. We use Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber1997] for our RNN cell, which is better at capturing long-term dependencies than the vanilla RNN [Kim et al.2016]. The outputs of the forward and backward RNN layers are summed to form the feature vector of each time-step. Each word is tagged based on this feature vector, using either a softmax layer or a CRF layer. To capture a more abstract and higher-level representation in different layers, a densely connected layer can be added before and after the Bi-LSTM layers. The input to this network at each time-step is the concatenation of the character-level feature vector and a pre-trained word vector (described in Section 5).
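As a rough illustration of this architecture (not the authors' TensorFlow implementation), a Keras-style sketch with summed forward and backward outputs and a per-token softmax could look as follows; the layer sizes follow Appendix A, while the ReLU activations are assumptions.

import tensorflow as tf

def build_tagger(num_tags, word_dim=300, char_dim=50, rnn_units=350):
    word_in = tf.keras.Input(shape=(None, word_dim))  # pre-trained word vectors
    char_in = tf.keras.Input(shape=(None, char_dim))  # dense character-level features
    x = tf.keras.layers.Concatenate()([word_in, char_in])
    x = tf.keras.layers.Dense(350, activation='relu')(x)     # optional pre-RNN layer
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(rnn_units, return_sequences=True),
        merge_mode='sum')(x)                                  # forward + backward summed
    x = tf.keras.layers.Dense(350, activation='relu')(x)     # optional post-RNN layer
    out = tf.keras.layers.Dense(num_tags, activation='softmax')(x)  # per-token tags
    return tf.keras.Model([word_in, char_in], out)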

4.2 Conditional Random Field

Even though a Bi-LSTM layer can efficiently extract features for each time-step by utilizing past and future inputs, the prediction is made at each time-step independently of past and future tag outputs. The Conditional Random Field (CRF) layer overcomes this limitation by considering state transition probabilities, thereby decoding the most probable output tag sequence [Kim et al.2014, Kim et al.2015]. It has been shown that adding a CRF layer on top of a Bi-LSTM network can lead to statistically significant performance increases [Reimers and Gurevych2017]. We also test a variant of our model using CRF as the final layer to perform tag sequence prediction.
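For intuition, decoding with a CRF layer amounts to a Viterbi search over per-token emission scores and tag-to-tag transition scores. A minimal NumPy sketch of the decoding step (ours; training the transition parameters is not shown) is:

import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence for a T x K emission matrix
    and a K x K transition score matrix."""
    T, K = emissions.shape
    score = emissions[0].copy()                 # best score per tag at step 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j]: best path ending in tag i at t-1, then moving to tag j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    tags = [int(score.argmax())]                # best final tag
    for t in range(T - 1, 0, -1):               # follow back-pointers
        tags.append(int(backptr[t, tags[-1]]))
    return tags[::-1]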

4.3 Stacking RNNs with Residual Connection

Increasing the depth of a neural network has proven to be an effective way of improving performance. However, naively stacking layers can have adverse effects due to the degradation problem. Residual connections [He et al.2016] have been shown to be an effective way to tackle this issue by creating shortcuts between layers. We adopt the same strategy in our model when there is more than one Bi-LSTM layer, in which case the input of each Bi-LSTM layer is added to its output.
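A sketch of this residual stacking, in the same Keras style as the model sketch in Section 4.1 (our illustration; it requires the input and the summed Bi-LSTM output to have matching dimensions):

import tensorflow as tf

def stacked_residual_bilstm(x, depth=2, units=350):
    """Stack Bi-LSTM layers, adding each layer's input to its summed
    forward/backward output."""
    for _ in range(depth):
        h = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, return_sequences=True),
            merge_mode='sum')(x)
        x = tf.keras.layers.Add()([x, h])
    return x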

4.4 Dropout

Dropout is a popular and effective way of regularizing neural network models by randomly dropping nodes [Srivastava et al.2014]. In our model, inverted dropout is applied to all densely connected layers for regularization. For the Bi-LSTM layers, variational recurrent dropout [Gal and Ghahramani2016] is used, since naive dropout can degrade performance. The word embedding matrix is regularized using the method proposed in gal2016theoretically, i.e. dropping words at random.

5 Training Details

Pre-trained Word Embeddings Utilizing word embeddings pre-trained on large unlabeled text has been shown to be one of the most effective ways to increase performance on various NLP tasks. Our model uses the GloVe [Pennington et al.2014] 300-dimensional vectors trained on the Common Crawl corpus with 42B tokens as word-level features, as this resulted in the best performance in preliminary experiments. Words that do not appear in the training data are replaced with a special Out-of-Vocabulary (OOV) token. To train the vector of this token, we randomly swap words with the OOV token during training with a probability of 0.01, as in lample2016neural. The word vector is then concatenated with the character-level feature vector and fed into the subsequent layer.
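The OOV replacement step can be pictured as follows (a small sketch under our naming assumptions, not the authors' code):

import numpy as np

def swap_with_oov(token_ids, oov_id, p=0.01, rng=None):
    """Randomly replace input word ids with the OOV id during training so that
    the OOV embedding receives gradient updates."""
    rng = rng or np.random.default_rng()
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < p
    return np.where(mask, oov_id, token_ids)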

Freezing Embeddings It is common practice to fine-tune the pre-trained word vectors during training. However, preliminary experiments revealed that fine-tuning the word vectors results in lower performance than freezing them, especially in the early stages of training. We hypothesize that randomly initialized weights in the model act as noise and degrade the pre-trained word vectors. To circumvent this issue, the embeddings are frozen for the first phase of training so that they are not affected by untrained weights. We freeze the embeddings for the first 20% of training in all experiments.
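One way to realize this two-phase schedule is sketched below (our illustration; the variable-name filter is an assumption about how the embedding matrix is identified):

def variables_to_train(model, step, total_steps, freeze_fraction=0.2):
    """Exclude the word-embedding matrix from optimization during the first
    20% of training, then fine-tune everything."""
    if step < freeze_fraction * total_steps:
        return [v for v in model.trainable_variables if 'embedding' not in v.name]
    return list(model.trainable_variables)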

Dynamic Batch Size keskar2016large showed that small batch sizes lead to more global and flat minimizers, while large batch sizes lead to more local and sharp minimizers. Therefore, starting from a small batch size and increasing it during training should result in a minimizer that is more global, though sharper. While having a similar effect to learning rate decay, this strategy also has the benefit of accelerating training as the batch size grows [Smith et al.2017]. Adopting this method, we start from a fixed initial batch size and double it at each quarter of training.
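For example, with 100 epochs and an initial batch size of 16, the schedule below (our sketch) yields batch sizes of 16, 32, 64, and 128 over the four quarters of training:

def batch_size_for_epoch(epoch, num_epochs, initial_batch_size):
    """Double the batch size at each quarter of training."""
    quarter = max(num_epochs // 4, 1)
    return initial_batch_size * (2 ** min(epoch // quarter, 3))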

Tagging Scheme It has been reported that more complicated tagging schemes such as IOBES do not have a statistically significant advantage over the BIO scheme [Reimers and Gurevych2017]; thus, we adopt the BIO scheme for all experiments.

Parameter Optimization Our network is trained by minimizing the cross-entropy loss over the tags for the softmax model, or by maximizing the log-likelihood of the tag sequence for the CRF model. The objective function is optimized using the gradient-based optimization algorithm Adam [Kingma and Ba2014]. For all experiments, we implement the model using the TensorFlow [Abadi et al.2016] library.

Hyperparameter Tuning Most hyperparameters, with the following exceptions, are tuned on the development sets. Hyperparameters of the character-CNN and character-RNN models are adopted from ma2016end and lample2016neural, respectively. The chosen hyperparameters for all experiments are summarized in Appendix A.

6 Evaluation

We evaluate the effectiveness of the proposed method on three of the most well-studied and common English sequence tagging tasks - Slot tagging, POS tagging, and NER. Note that, to test the generalizability of the proposed method, we do not perform any preprocessing in any of the experiments. Details on each task and the baseline models are described in this section. Table 1 summarizes the statistics of each task.

6.1 Slot Tagging

For slot tagging, we use the Airline Travel Information System (ATIS) dataset. This dataset has 84 types of slot labels and 127 possible tags with BIO tagging scheme. Since this corpus lacks a development set, 20% of the training data is randomly sampled and used as the development set for tuning the hyperparameters. This task’s performance is measured in F1-score, which is calculated using the publicly available conlleval.pl script.

6.2 Part-of-Speech Tagging

For POS tagging, we use the Wall Street Journal (WSJ) portion of the Penn TreeBank dataset [Marcus et al.1993] and adopt the standard split for part-of-speech tagging experiments - sections 0-18 as training data, sections 19-21 as development data, and sections 22-24 as test data. This dataset contains 45 different POS tags. Model performance is measured by token-level accuracy.

6.3 Named Entity Recognition

For NER, the English portion of the CoNLL-2003 shared task [Tjong Kim Sang and De Meulder2003] is used for evaluation. This dataset contains four different types of named entities, which results in nine possible tags under the BIO tagging scheme, including the 'O' tag. As in slot tagging, the final performance is measured in F1-score using the same conlleval.pl script.

Dataset         ATIS                  PTB WSJ               CoNLL-2003
                Sentences  Tokens     Sentences  Tokens     Sentences  Tokens
Training        4978       56591      38219      912344     14987      204567
Development     -          -          5527       131768     3466       51578
Test            893        9198       5462       129654     3684       46666
Table 1: Corpus statistics of each task.

6.4 Baseline Models

Character-level CNN and character-level RNN are the most effective and widely adopted methods for character-level feature extraction, and thus serve as strong baselines. We implement these two methods and use them as baselines for comparison. The CRF layer has the effect of making the model robust to architectural differences [Reimers and Gurevych2017]. Since the goal of the baseline experiments is to evaluate the effect of the differences in character-level feature generation methods, we use the softmax layer instead of the CRF layer for these experiments. Every aspect of the sequence tagging model except the character-level feature generation method is identical across all baseline experiments.

7 Results and Discussion

7.1 Experimental Results

For a more in-depth analysis of the performance of the proposed method and the two baselines, we train each model 20 times with different randomly initialized parameters [Reimers and Gurevych2017]. Table 2 summarizes the mean performance with standard deviation in parentheses. The performance distributions are also visualized using violin plots in Figure 3.

7.1.1 Slot Tagging

On the task of tagging semantic slots using the ATIS dataset, the proposed method shows the best results in terms of both performance and variability. Our method has the highest mean F1-score of 96.28. Furthermore, it has the lowest standard deviation across all runs, which means it is robust to parameter initialization. On the contrary, both CNN and RNN models have lower performance and higher variability compared to the proposed method.
Analyzing the violin plot reveals that there are also differences in score distribution. While CNN models tend to have a low F1-score on average with occasional high peaks, RNN models have higher F1-score in general but suffer from a large performance drop with poor parameter initialization. This could be one of the reasons why models using CNN seem to have superior performance when only the best performance is reported. On the other hand, our model does not result in peaks or serious drops in performance with different seed values, which makes it more suitable for real-world applications.

Method              Slot              POS               NER
Char-CNN            96.22 (SD 0.08)   97.68 (SD 0.03)   89.08 (SD 0.20)
Char-RNN            96.25 (SD 0.09)   97.68 (SD 0.03)   90.15 (SD 0.14)
Char-Dense (Ours)   96.28 (SD 0.07)   97.69 (SD 0.02)   90.10 (SD 0.13)
Table 2: Comparison with baseline models.
Figure 3: Score distributions for all experiments: (a) Slot, (b) POS, (c) NER. Quartiles are marked with dashed lines.

Figure 4: Sentence processing speed in terms of number of sentences per second.

7.1.2 Part-of-Speech Tagging

The proposed method also achieves the best results on the POS tagging task. Similar to the slot tagging task, our method shows the highest mean accuracy of 97.69 with the lowest standard deviation of 0.02. For the baseline models, CNN and RNN performed on par.
CNN-based models again show higher variability with occasional peak performance on this task, as shown in the violin plot. As in the slot tagging task, our method shows the lowest variability, which supports its robustness.

7.1.3 Named Entity Recognition

On the NER task, the RNN-based model has a slightly better F1-score (90.15) than the proposed method (90.10). However, our method again shows the lowest standard deviation, 0.13, as in the other tasks. Analyzing the violin plot, we can see that the RNN again shows occasional performance drops for certain cases of poor weight initialization. Unlike the other two tasks, the model utilizing CNN has a relatively poor F1-score and does not show any peaks in performance.

7.2 Training Speed

To compare the efficiency of the three models, the average training speed (i.e. the number of sentences processed per second) is presented in Figure 4. All training runs are performed on a single GeForce GTX 1080 Ti GPU, and the RNN model is implemented using the highly efficient cuDNN LSTM API. It is clear that the proposed method has the highest training speed, followed by CNN and RNN. On average, our method processes around 867 sentences per second, which is 6.29% and 16.32% faster than CNN and RNN, respectively.

Slot                                 POS                                      NER
Approach                   F1        Approach                      Acc.       Approach                   F1
mesnil2015using            94.73     toutanova2003feature          97.24      ando2005framework          89.31
yao2014spoken              94.85     manning2011part               97.32      collobert2011natural       89.59
liu2015recurrent           94.89     shen2007guided                97.33      huang2015bidirectional     90.10
yao2014spoken              95.08     sun2014structure              97.36      chiu2015named              90.77
peng2015recurrent          95.25     moore2015improved             97.36      ratinov2009design          90.80
vu2016bi                   95.56     hajivc2009semi                97.44      lin2009phrase              90.90
vu2016sequential           95.61     sogaard2011semisupervised     97.50      passos2014lexicon          90.90
kurata2016leveraging       95.66     tsuboi2014neural              97.51      lample2016neural           90.94
zhu2017encoder             95.79     huang2015bidirectional        97.55      luo2015joint               91.20
zhai2017neural             95.86     choi2016dynamic               97.64      ma2016end                  91.21
Char-Dense w/o CRF (Ours)  96.36     Char-Dense w/o CRF (Ours)     97.73      Char-Dense w/o CRF (Ours)  90.28
Char-Dense w/ CRF (Ours)   96.62     Char-Dense w/ CRF (Ours)      97.65      Char-Dense w/ CRF (Ours)   91.13
Table 3: Comparison with state-of-the-art approaches in the literature.

7.3 Comparison with Published Results

For comparison with published results, we summarize the performance of our best models along with state-of-the-art approaches in Table 3. The proposed method surpasses the previous state-of-the-art result on the ATIS dataset by a large margin, even without the CRF layer. With the help of CRF, our method obtains a new state-of-the-art result with a 96.62 F1-score.
For the POS tagging task on the PTB WSJ dataset, we obtain a new state-of-the-art result of 97.73% accuracy with the model without a CRF layer. Interestingly, utilizing a CRF layer degraded performance on this task whereas it helped on the other two tasks. We hypothesize that this is because, unlike the other two tasks where there are many hard constraints between labels (e.g. an O tag cannot be followed by an I- tag), the label dependencies are "softer" in POS tagging. In such a case, naively taking label transition probabilities into account can have a negative impact on performance.
On the task of recognizing named entities, we obtain a result comparable to the state-of-the-art, with a 91.13 F1-score when a CRF layer is used. As in the slot tagging task, utilizing CRF led to a significant increase in performance. It is notable that all results from our method are achieved without depending on any hand-crafted or language/task-specific features (e.g. capitalization, character type, gazetteers), whereas most previous approaches utilize one or more types of such features. This supports the generalizability of the proposed method.

8 Conclusion and Future Work

In this paper, we proposed a fast and effective method of using a densely connected network to automatically generate character-level features. Through extensive evaluation, we showed that this method is robust to parameter initialization and has a higher processing speed than conventional methods such as CNN or RNN. The method also generalizes well, which is supported by the fact that we obtained superior performance without any task- or language-specific features.

We plan to explore the following directions as future work: 1) In this work, we focused on clean text with minimal semantic or syntactic errors. We would like to test the robustness of this method against such errors to evaluate whether it is suitable for real-world applications. 2) Applying the proposed method to other NLP tasks, such as neural machine translation or automatic text summarization, and analyzing its effectiveness there would also be worth investigating.

Acknowledgements

This research was supported by the MSIT (Ministry of Science and ICT), South Korea, under the ITRC (Information Technology Research Center) support program (”Research and Development of Human-Inspired Multiple Intelligence”) supervised by the IITP (Institute for Information & Communications Technology Promotion). Additionally, this work was supported by the National Research Foundation of Korea (NRF) grant funded by the South Korean government (MSIP) (No. NRF-2016R1A2B2015912).

References

Appendix A Hyperparameters

Group                      Hyperparameter                 Slot         POS          NER
Char-CNN                   Window size                    3            3            3
                           Number of filters              30           30           30
                           Character dimension            30           30           30
Char-RNN                   Layer size                     50           50           50
                           Character dimension            50           50           50
Char-Dense                 Layer size                     50           50           50
                           Number of pieces per word      2            2            2
Word-level                 Pre-trained word embeddings    GloVe 300d   GloVe 300d   GloVe 300d
                           RNN layer size                 350          350          350
                           RNN layer depth                2            3            3
                           Pre-RNN layer size             350          350          None
                           Post-RNN layer size            350          350          None
Dropout keep probability   Char-Dense                     0.7          0.7          0.7
                           Word feature                   0.9          0.9          0.9
                           Word-level RNN layer           0.5          0.5          0.5
                           Pre/post-RNN layers            0.5          0.5          0.5
Training                   Initial batch size             8            16           16
                           Number of epochs               100          100          100
Table 4: Chosen hyperparameters for all experiments.