Recently, methods based on attention mechanisms have been used in various fields of natural language processing (Bahdanau et al., 2014; Rush et al., 2015). In tasks that deal with binary relations taking two words as arguments, e.g., dependency parsing, many of these models have achieved high performance (Kiperwasser and Goldberg, 2016; Hashimoto et al., 2017; Dozat and Manning, 2016).
Biaffine transformation is a method proposed by Dozat and Manning (2016) to incorporate an attention mechanism into binary relations (following them, we call this method the biaffine classifier). They achieved state-of-the-art performance among graph-based dependency parsers on the English Penn Treebank. In addition, the state-of-the-art transition-based parser on the English Penn Treebank uses a biaffine classifier to evaluate the probability distribution over the word that comes onto the stack at each step (Ma et al., 2018).
While the biaffine transformation has rich expressiveness in modeling binary relations, the number of parameters in its weight matrix (the bilinear term) is O(d^2), where d is the number of dimensions. These redundant parameters give the model a high degree of freedom and can cause overfitting, especially when a large number of training samples is not available (Nickel et al., 2015).
In this paper, we attempt to reduce this redundancy by assuming either symmetry or circularity of the weight matrix in the biaffine classifier. With either assumption, we can vectorize the matrix and reduce the space complexity to O(d). Additionally, the time complexity becomes O(d) in the case of symmetry and O(d log d) in the case of circularity with the fast Fourier transform. Furthermore, while the expressiveness of the model based on symmetry is restricted [1], the model based on circularity is able to express asymmetric relations.
[1] Denote by s_ij = h_i^T W h_j the score of the bilinear term when a word pair (w_i, w_j) has a dependency relation. If W is symmetric, then s_ij = s_ji, so this term is not appropriate for expressing a directed edge.
In our experiments, we imposed constraints on the biaffine classifiers of the deep biaffine parser and examined the effect on the accuracy of the model. For our experiments, we used the dependency parsing dataset from the CoNLL 2017 shared task (Zeman et al., 2017). We chose four languages which have relatively rich training examples and four languages which have fewer. From our experiments, we found that the method with the circularity assumption outperformed the baseline in most of the languages.
2 Deep Biaffine Parser
The models introduced in this paper are based on the deep biaffine parser proposed by Dozat and Manning (2016), which achieved the state-of-the-art accuracy on the CoNLL 2017 shared task for Universal Dependencies (Zeman et al., 2017).
This model receives a sequence of words and POS tags, and calculates the probability of an arc between each pair of words as well as a syntactic function label for each arc. For evaluation of scores, it uses Long Short-Term Memories (LSTMs), Multi-Layer Perceptrons (MLPs) and biaffine classifiers.
In the following sections, we first explain the biaffine transformation, which is the essential part of a biaffine classifier, while skipping the explanation of LSTMs and MLPs for the sake of simplicity. Then, we give an overview of the model. It is worth noting that the structure of the model differs from that of Dozat and Manning (2016) in that it utilizes character-level information, because we used the updated version of the model made for the shared task (Dozat et al., 2017).
2.1 Biaffine Transformation
For the dependency parsing score functions, we use the biaffine transformation shown below to model binary relations:

s(h_i, h_j) = h_i^T W h_j + v^T (h_i ⊕ h_j) + b

Here, ⊕ stands for concatenation of vectors. The first term on the right side represents the relatedness score of h_i and h_j, and the second term the scores of h_i and h_j appearing independently. b is a bias term.
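As a concrete sketch, the biaffine transformation can be written in NumPy as follows. The names W, v, and b and the toy dimensionality are illustrative placeholders, not the parser's actual parameters:

```python
import numpy as np

def biaffine(h_dep, h_head, W, v, b):
    """Biaffine score of a (dependent, head) pair: a bilinear
    relatedness term, a linear term on the concatenated inputs,
    and a bias."""
    bilinear = h_dep @ W @ h_head                 # O(d^2) parameters in W
    linear = v @ np.concatenate([h_dep, h_head])  # independent scores
    return bilinear + linear + b

d = 4
rng = np.random.default_rng(0)
h_i, h_j = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, d))
v = rng.normal(size=2 * d)
score = biaffine(h_i, h_j, W, v, 0.5)
```

The bilinear term alone contains d^2 weights, which is the redundancy addressed in Section 3.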
2.2 Structure of the Model
We show the structure of this model below.
First, this model takes two sequences: words and POS tags. It uses a unidirectional LSTM to encode each word's character-level information into a vector, and then sums this vector with a separate token-level word embedding.
It then concatenates the vectors obtained above with POS embeddings and encodes them with a three-layer bidirectional LSTM. The output vector corresponding to the i-th word is made by concatenating the hidden states (not including cell states) from the LSTMs of both directions.
These outputs are then transformed with MLPs. Here, we use distinct MLPs for dependents and heads in the prediction of arcs and labels.
The scores of arcs between each word pair are calculated using a biaffine transformation.
It evaluates the score of assigning a label l (l = 1, ..., m, where m is the number of labels) to the arc between a dependent word and its predicted head word. The equation is defined below.
Here, a distinct weight matrix, weight vector, and bias are used for each label. The first term on the right side of equation (4) represents the score of assigning the label to the arc between the dependent and the head. The second term expresses the score of the label when the dependent and head are given independently.
In our experiment, the classifiers' weight matrices and vectors account for about 17% of the parameters in the deep biaffine parser. By reducing these parameters, we can expect not only improved memory efficiency but also less overfitting.
3 Proposed Methods
3.1 Symmetric Matrix Constraint
This method assumes that the weight matrices of the bilinear terms are symmetric matrices and thus are diagonalizable. As a result, we can transform the bilinear term of the score functions into a “triple inner product” of two input vectors and a weight vector.
3.1.1 Diagonalization of the weight matrix in the bilinear term
When a matrix W is symmetric, it can be diagonalized by an orthogonal matrix Q as follows:

W = Q diag(λ) Q^T

Here, λ consists of the eigenvalues of W, and diag(λ) represents the diagonal matrix whose diagonal elements are λ. With this property, we can rewrite the bilinear term as follows:

h_i^T W h_j = (Q^T h_i)^T diag(λ) (Q^T h_j) = <λ, x_i, x_j>

where x_i = Q^T h_i and x_j = Q^T h_j, assuming Q is learned implicitly. <λ, x_i, x_j> is a "triple inner product" of λ, x_i, and x_j, defined by <a, b, c> = Σ_k a_k b_k c_k. Consequently, the symmetry constraint on the matrix reduces the number of weight parameters from O(d^2) to O(d).
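This equivalence between the symmetric bilinear term and the triple inner product can be checked numerically. The following is a minimal sketch with random inputs, not the parser's code:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.normal(size=(d, d))
W = (A + A.T) / 2                  # a symmetric weight matrix
lam, Q = np.linalg.eigh(W)         # W = Q diag(lam) Q^T, Q orthogonal

h_i, h_j = rng.normal(size=d), rng.normal(size=d)
x_i, x_j = Q.T @ h_i, Q.T @ h_j    # rotate the inputs once

# triple inner product <lam, x_i, x_j>: d weights, O(d) per pair
triple = np.sum(lam * x_i * x_j)
assert np.isclose(h_i @ W @ h_j, triple)
```

The rotation by Q is applied once per input vector, so each pairwise score costs only O(d).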
3.1.2 Simultaneous Diagonalization
When a set of symmetric matrices forms a commuting family, they can be diagonalized by the same orthogonal matrix (Liu et al., 2017). We therefore assume that the weight matrices of the scoring functions (4) form a commuting family, namely that every pair of them commutes: W^(k) W^(l) = W^(l) W^(k) for all labels k and l.
With this assumption, all weight matrices can be diagonalized simultaneously. Therefore, the input vectors can be mapped by the same orthogonal matrix in all score functions.
3.1.3 Score Functions
Based on the above, under the assumption that all weight matrices are symmetric, we substitute the bilinear terms in the biaffine transformations with triple inner products. First, the score function for arcs is defined as follows:

Unlike in (3), the second term of this function contains the arc-dep vector, because we confirmed that this improves performance.
Scoring functions for labels are defined as follows:
We eliminate the bias term in (4) because we confirmed that it does not affect performance.
3.2 Circulant Matrix Constraint
Nickel et al. (2015) used a circulant matrix for the bilinear transformation in a knowledge graph completion model (Nickel et al., 2011) to reduce the number of parameters and improve computational efficiency. Following this method, we assume that the weight matrices of the bilinear terms in the biaffine transformations are circulant and propose new scoring functions based on this assumption.
3.2.1 Bilinear Transformation Using a Circulant Matrix
We define the circulant matrix C(w) for a vector w = (w_1, ..., w_d)^T by C(w)_ij = w_(((i−j) mod d)+1); that is, each column of C(w) is the previous column rotated down by one, so the matrix is fully determined by the d entries of w. Then, we replace the bilinear term with one whose weight matrix is a circulant matrix with d parameters:

h_i^T C(w) h_j    (9)
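As an illustration, a circulant matrix under this convention can be built in a few lines of NumPy. The function name is ours, and only the d entries of w are free parameters:

```python
import numpy as np

def circulant(w):
    """Circulant matrix C(w) with C[i, j] = w[(i - j) mod d]
    (0-indexed): column j is w rotated down by j positions."""
    d = len(w)
    return np.stack([np.roll(w, j) for j in range(d)], axis=1)

C = circulant(np.array([1.0, 2.0, 3.0]))
# first column is w itself; each later column is a downward rotation
```

Storing w instead of the full matrix is what brings the parameter count of the bilinear term down to O(d).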
3.2.2 Score Functions
We propose score functions that employ (9) as the bilinear term in the biaffine transformation. The score function for an arc is then defined as follows:
The score functions for labels are defined as follows:
3.2.3 Efficient Computation Using the Fast Fourier Transform
In this section, we explain how to compute (9) efficiently using the fast Fourier transform (FFT). We denote the d-point discrete Fourier transform (DFT) matrix by F. Then, any circulant matrix can be diagonalized as follows (Gray, 2006):

C(w) = F^{-1} diag(F w) F

Here, we write ŵ = F w, ĥ_i = F h_i, and ĥ_j = F h_j; all of them are d-dimensional complex vectors. Re(·) is the operator that takes the real part of its argument, and the bilinear term becomes

h_i^T C(w) h_j = Re( (1/d) <conj(ĥ_i), ŵ, ĥ_j> )

With this transformation, we can compute the bilinear transformation with a circulant matrix in O(d log d) time using an FFT.
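The FFT route can be checked against the naive O(d^2) computation. The sketch below uses numpy.fft's conventions and random vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
w = rng.normal(size=d)
h_i, h_j = rng.normal(size=d), rng.normal(size=d)

# naive O(d^2): materialize C(w) with C[a, b] = w[(a - b) mod d]
C = np.stack([np.roll(w, b) for b in range(d)], axis=1)
naive = h_i @ C @ h_j

# FFT route, O(d log d): C(w) @ h_j is a circular convolution, so
# h_i^T C(w) h_j = h_i . Re(ifft(fft(w) * fft(h_j)))
fast = h_i @ np.fft.ifft(np.fft.fft(w) * np.fft.fft(h_j)).real
assert np.allclose(naive, fast)
```

For the small d used by the parser's classifiers the constant factors matter, but the asymptotic advantage over materializing C(w) is clear.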
The DFT of a d-dimensional vector v, F v, is conjugate symmetric if and only if v is a real vector (Hayashi and Shimbo, 2017). In our experiment, we initialize ŵ with the DFT of a real vector and update it in the complex space. We update the "frequency domain" (complex space) vectors using only operations that have counterparts on "time domain" vectors. Thus, as described in Hayashi and Shimbo (2017), the conjugate symmetry of the vectors is preserved during learning because their initial values satisfy it.
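The conjugate-symmetry property invoked here is easy to verify with numpy.fft; this is a small sketch with a random real vector:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
w = rng.normal(size=d)    # a real "time domain" vector
f = np.fft.fft(w)         # its DFT, a complex vector

# conjugate symmetry: f[k] == conj(f[(d - k) mod d]) for all k
k = np.arange(d)
assert np.allclose(f, np.conj(f[(-k) % d]))

# conversely, the inverse DFT of a conjugate-symmetric spectrum is real
assert np.allclose(np.fft.ifft(f).imag, 0.0)
```

This is why updates that preserve conjugate symmetry keep the learned spectrum equivalent to a real circulant parameterization.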
3.2.4 Expressiveness of the Bilinear Transformation Using Circulant Matrices
In this section, we discuss the expressiveness of circulant matrices in relation to an arbitrary real matrix W. Trouillon et al. (2017) show that for any real square matrix W, there exists a normal matrix N such that Re(N) = W. Further, as with a symmetric matrix, a normal matrix can be diagonalized as follows:

N = U diag(λ) U^*

Here, U is a unitary matrix, U^* is the conjugate transpose of U, and λ is the complex vector consisting of the eigenvalues of N. The bilinear transformation whose weight matrix is replaced with this form can be transformed into the expression in Section 3.2.3. Since a unitary matrix defines a bijective map, the input vectors h_i and h_j are learned through their one-to-one correspondents U^* h_i and U^* h_j, assuming that the unitary matrix is learned implicitly. Note that to simultaneously diagonalize the normal matrices whose real parts are the weight matrices in the scoring functions for labels, we have to assume that they form a commuting family, as in the discussion in Section 3.1.2.
4 Related Work
In recent years, various graph-based parsers with attention mechanisms have been proposed.
Kiperwasser and Goldberg (2016) incorporated the attention mechanism used in machine translation (Bahdanau et al., 2014) into their graph-based parser. Their model receives vectors made by concatenating the LSTM outputs corresponding to each word and its head candidates. Similarly, Hashimoto et al. (2017) proposed a graph-based parser in their multi-task neural model where they substitute the MLP-based classifier of Kiperwasser and Goldberg (2016) with a bilinear one, although they still use the MLP-based classifier for label prediction. Dozat and Manning (2016) modified the model of Kiperwasser and Goldberg (2016) by using a biaffine classifier instead of an MLP-based one, which enables the model to express not only the probability of a word receiving a particular word as a dependent but also the prior probability of a word being a head.
Likewise, in the transition-based parsing literature, the state-of-the-art parser on the English Penn Treebank by Ma et al. (2018) uses an attention mechanism based on a biaffine classifier, which calculates the probability distribution over the next word to come onto the stack at each time step from the LSTM outputs corresponding to each word in the input sentence. The models proposed in this paper can be incorporated into these models.
Parameter Reduction in Neural Networks
Recently, numerous methods toward parameter reduction of neural networks have been proposed.
A similar approach to the proposed methods is to decompose a projection matrix into smaller matrices by low-rank approximation (Lu et al., 2016). In addition, Ishihara et al. (2018) reduced the parameters of neural tensor networks (Socher et al., 2013) and analyzed the effects of parameter reduction. Although Ishihara et al. (2018) are similar to the present paper in that their method addresses parameter reduction in bilinear terms, our work differs in that we apply it to the deep biaffine parser.
There are also methods that reduce the parameters in a projection matrix by sharing them. Cheng et al. (2015) used circulant matrices in fully connected layers; our models differ from theirs in that we use circulant matrices in bilinear terms. Chen et al. (2015) perform parameter reduction with a hash kernel, and Sindhwani et al. (2015) with structured matrices such as Toeplitz matrices. While these techniques can also be used for bilinear terms, the methods based on real diagonalization and circulant matrices are superior to them in computational efficiency.
Hinton et al. (2015) proposed a method called distillation and were able to train models more compact than the originals. However, it requires considerable training time because the compact model must be trained again for distillation. Hubara et al. (2016) achieved a significant reduction of parameters by quantization, but the reported accuracy is inferior to the original model. In principle, these methods can be combined with the proposed methods.
Table 1 (header only; the per-language rows were not recovered): Treebank, Baseline, Symmetry Matrix, Circulant Matrix.

| Reduced Samples | Baseline (UAS / LAS) | Symmetry Matrix (UAS / LAS) | Circulant Matrix (UAS / LAS) |
|---|---|---|---|
| 0 / 4 | 91.05 / 89.42 | 90.95 (-0.10) / 89.22 (-0.20) | 91.04 (-0.01) / 89.31 (-0.11) |
| 1 / 4 | 90.32 / 88.57 | 90.05 (-0.27) / 88.30 (-0.27) | 90.29 (-0.03) / 88.49 (-0.08) |
| 2 / 4 | 88.98 / 87.08 | 89.15 (+0.17) / 87.17 (+0.09) | 88.72 (-0.26) / 86.63 (-0.45) |
| 3 / 4 | 87.24 / 85.08 | 87.27 (+0.03) / 85.06 (-0.02) | 87.59 (+0.35) / 85.38 (+0.30) |
5.1 Dataset and Implementation
We compared the models described above on several languages from the CoNLL 2017 shared task dataset for Universal Dependency parsing. We chose four languages with relatively abundant training examples: UD_Chinese, UD_Czech, UD_English, and UD_German. We also selected four languages with fewer training examples: UD_French-ParTUT, UD_Galician-TreeGal, UD_Latin, and UD_Slovenian-SST.
As a baseline model, we used the dependency parser by Timothy Dozat222https://github.com/tdozat/Parser-v2 which achieved the highest accuracy on the shared task. The structures of the proposed models are based on the baseline model; we changed only the classifier part.
We only modified two hyper-parameters: we used no pretrained embeddings and initialized word embeddings with a uniform distribution. These settings remain the same throughout all experiments unless otherwise stated.
We used gold word segmentation and gold POS tags, although word segmentation and POS tagging are included in the shared task. We excluded these two tasks because the objective of this research is to show the effects of the proposed methods on biaffine classifiers, which are not used for those tasks.
We show the results of the baseline and the two proposed methods in Table 1. The method based on circulant matrices outperformed the others on almost all languages, except for English, where the baseline model achieved the best accuracy, and French-ParTUT, where the method based on symmetric matrices achieved the best UAS. Interestingly, the method based on symmetric matrices underperformed the baseline on most languages. This might be because of the restricted expressiveness of a symmetric weight matrix in comparison to a circulant one, especially in that the former is not appropriate for expressing asymmetric relations.
(Header residue from a parameter-count table: Number of Parameters, Percentage; its rows were not recovered.)

| | Baseline | Symmetry Matrix | Circulant Matrix |
|---|---|---|---|
| Sum with shared parts | 3,157,637 | 2,632,100 | 2,636,200 |
| Difference from the baseline | 0.0% | -16.64% | -16.51% |
In this section, we examine the robustness to overfitting of the proposed methods.
6.1 Relaxation of Overfitting
First, to further examine how the number of parameters affects the models, we conducted experiments on the UD_English treebank, reducing the number of training examples by a quarter at a time.
The results of this experiment are shown in Table 2. Our methods performed better on the smaller datasets, where the number of training examples is less than or equal to half of the original. This indicates that, through parameter reduction, our methods not only became robust to overfitting but also achieved high generalizability.
Second, we ran the baseline model with classifier dimensions reduced from 400 to 200 for the arc classifier and from 100 to 50 for the label classifier, and compared it with the proposed methods on the languages with small treebanks. As shown in Table 3, simply reducing the number of dimensions hindered accuracy in some languages, while the method based on circulant matrices consistently outperformed the baseline model with the original number of dimensions in all of these languages. This result indicates the effectiveness of the proposed method on smaller datasets.
6.2 Parameter Reduction
Table 4 shows the proportion of the total parameters that each part of the baseline model accounts for. While the LSTMs occupy the largest portion of the model, the second largest part is the classifiers, which account for about 17% of the parameters. Table 5 indicates that the proposed methods reduced the number of parameters by more than 16%.
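The reduction percentages reported in Table 5 follow directly from the parameter totals above:

```python
# parameter totals from Table 5 ("sum with shared parts")
baseline = 3_157_637
symmetry = 2_632_100
circulant = 2_636_200

for name, n in [("symmetry", symmetry), ("circulant", circulant)]:
    pct = 100.0 * (n - baseline) / baseline
    print(f"{name}: {pct:+.2f}%")   # symmetry: -16.64%, circulant: -16.51%
```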
6.3 Parsing Speed
To test the parsing speed, we parsed the test set of UD_English on an NVIDIA GTX 1080 GPU. As mentioned in Section 3, both proposed methods are superior to the baseline model in time complexity. In practice, the method based on symmetry took 13.86 seconds, the one based on circularity 15.06 seconds, and the baseline 17.76 seconds. These results accord with the theoretical time complexities.
In this paper, we reduced the number of parameters in the weight matrices of biaffine classifiers based on the assumption of symmetry or circularity and examined the effects on the CoNLL 2017 shared task dataset for Universal Dependency parsing. The method based on circulant matrices outperformed the baseline model in most of the languages with about 16% fewer parameters. As future work, the L1 regularization method for ComplEx (Trouillon et al., 2016) proposed by Manabe et al. (2018) may be integrated into our methods to further reduce the number of parameters in the bilinear function. The script for the bilinear functions proposed here is available on GitHub: https://github.com/TomokiMatsuno/PACLIC32/blob/master/my_linalg.py
We are grateful to Michael Wentao Li for proofreading the present paper. We thank the anonymous reviewers. This work was supported by JSPS KAKENHI Grant Number JP18K11457 and JST CREST Grant Number JPMJCR1513, Japan.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.
- Chen et al. (2015) Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.
- Cheng et al. (2015) Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2857–2865, 2015. doi: 10.1109/ICCV.2015.327. URL https://doi.org/10.1109/ICCV.2015.327.
- Chu and Liu (1965) Chu and Liu. On the shortest arborescence of a directed graph. Scientia Sinica, Vol. 14, 1965.
- Dozat and Manning (2016) Timothy Dozat and Christopher D. Manning. Deep biaffine attention for neural dependency parsing. CoRR, abs/1611.01734, 2016. URL http://arxiv.org/abs/1611.01734.
- Dozat et al. (2017) Timothy Dozat, Peng Qi, and Christopher D. Manning. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30, 2017.
- Edmonds (1967) Jack Edmonds. Optimum branchings. JOURNAL OF RESEARCH of the National Bureau of Standards, 71(4), 1967.
- Gray (2006) Robert M. Gray. Toeplitz and circulant matrices: A review. Foundations and Trends in Communications and Information Theory, 2(3):155–239, 2006.
- Hashimoto et al. (2017) Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. A joint many-task model: Growing a neural network for multiple NLP tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1923–1933. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/D17-1206.
- Hayashi and Shimbo (2017) Katsuhiko Hayashi and Masashi Shimbo. On the equivalence of holographic and complex embeddings for link prediction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pages 554–559, 2017. doi: 10.18653/v1/P17-2088. URL https://doi.org/10.18653/v1/P17-2088.
- Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
- Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations constrained to +1 or -1. CoRR, abs/1609.07061, 2016. URL http://arxiv.org/abs/1609.07061.
- Ishihara et al. (2018) Takahiro Ishihara, Katsuhiko Hayashi, Hitoshi Manabe, Masashi Shimbo, and Masaaki Nagata. Neural tensor networks with diagonal slice matrices. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 506–515, 2018. URL https://aclanthology.info/papers/N18-1047/n18-1047.
- Kiperwasser and Goldberg (2016) Eliyahu Kiperwasser and Yoav Goldberg. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327, 2016. URL http://aclweb.org/anthology/Q16-1023.
- Liu et al. (2017) Hanxiao Liu, Yuexin Wu, and Yiming Yang. Analogical inference for multi-relational embeddings. CoRR, abs/1705.02426, 2017. URL http://arxiv.org/abs/1705.02426.
- Lu et al. (2016) Zhiyun Lu, Vikas Sindhwani, and Tara N. Sainath. Learning compact recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5960–5964. IEEE, 2016.
- Ma et al. (2018) X. Ma, Z. Hu, J. Liu, N. Peng, G. Neubig, and E. Hovy. Stack-Pointer Networks for Dependency Parsing. ArXiv e-prints, May 2018.
- Manabe et al. (2018) Hitoshi Manabe, Katsuhiko Hayashi, and Masashi Shimbo. Data-dependent learning of symmetric/antisymmetric relations for knowledge base completion. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16211.
- Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011.
- Nickel et al. (2015) Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge graphs. CoRR, abs/1510.04935, 2015. URL http://arxiv.org/abs/1510.04935.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
- Sindhwani et al. (2015) Vikas Sindhwani, Tara N. Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3088–3096, 2015.
- Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 926–934, 2013.
- Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2071–2080. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045609.
- Trouillon et al. (2017) Théo Trouillon, Christopher R. Dance, Éric Gaussier, Johannes Welbl, Sebastian Riedel, and Guillaume Bouchard. Knowledge graph completion via complex tensor factorization. Journal of Machine Learning Research, 18:130:1–130:38, 2017. URL http://jmlr.org/papers/v18/16-563.html.
- Zeman et al. (2017) Daniel Zeman, Martin Popel, Milan Straka, Jan Hajic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinkova, Jan Hajic jr., Jaroslava Hlavacova, Václava Kettnerová, Zdenka Uresova, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria dePaiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonca, Tatiana Lando, Rattima Nitisaroj, and Josie Li. Conll 2017 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19. Association for Computational Linguistics, 2017. doi: 10.18653/v1/K17-3001. URL http://www.aclweb.org/anthology/K17-3001.