Ancient Chinese was used for thousands of years, and a huge number of books and articles were written in it. However, both the form and the grammar of ancient Chinese have changed over time. Chinese historians and littérateurs have made great efforts to translate such literature into contemporary Chinese, a large part of which is publicly available on the Internet. However, there is still a big gap between these texts and usable parallel corpora: most of them are only coarsely passage-aligned, and the orders of sentences differ. To train an automatic translation model, we first need to build a sentence-aligned corpus.
Translation alignment is an important preliminary step for machine translation. Most previous work focuses on applying supervised algorithms to this task using features extracted from text. Gale and Church (1993) and Haruno and Yamazaki (1997) proposed to use statistical or dictionary information to build alignment corpora. Resnik (1998, 1999) proposed to extract parallel corpora from the Internet with a system called Strands. Wang and Ren (2005) proposed to use a log-linear model for Chinese-Japanese clause alignment; besides features such as sentence lengths and matching patterns, Chinese character co-occurrence between Japanese and Chinese is also taken into consideration. Lin and Wang (2007) and Liu and Wang (2012) adapted this method to ancient-contemporary Chinese translation alignment, based on the observation that character co-occurrence also exists between ancient and contemporary Chinese.
The method above works well; however, these supervised algorithms require a large parallel corpus for training, which is not available in our setting. Moreover, previous algorithms did not take full advantage of the characteristics of the ancient-contemporary Chinese pair. To overcome these shortcomings, we design an unsupervised sentence alignment algorithm based on the observation that, unlike bilingual corpora, ancient-contemporary sentence pairs share many common characters in the same order. We evaluate our alignment algorithm on a small aligned parallel corpus; the experimental results show that our simple algorithm works very well (F1 score 99.4), even better than the supervised algorithms.
Deep learning has achieved great success in tasks like machine translation. Sutskever et al. (2014a) proposed a sequence-to-sequence (seq-to-seq) model that achieves good results on machine translation. Bahdanau et al. (2014) proposed the attention mechanism, which allows the decoder to extract phrase alignment information from the hidden states of the encoder. Most existing NMT systems are based on the seq-to-seq model (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014b) and the attention mechanism. Some of them use variant architectures to capture more information from the inputs (Su et al., 2016; Xiong et al., 2017; Tu et al., 2016), and some improve the attention mechanism (Luong et al., 2015b; Meng et al., 2016; Mi et al., 2016; Jean et al., 2015; Feng et al., 2016; Calixto et al., 2017), which also enhances the performance of the NMT model. Our experimental results show that a copy mechanism can remarkably improve the performance of the seq-to-seq model on this task; we report the results in the Experiment section. Other mechanisms have also been applied to improve the performance of machine translation (Lin et al., 2018; Ma et al., 2018).
Our contributions lie in the following two aspects:
We propose a simple yet effective unsupervised algorithm to build the sentence-aligned parallel corpora out of passage-aligned parallel corpora.
We propose to apply the sequence-to-sequence model with a copy mechanism to the translation task. Experimental results show that our method achieves BLEU scores of 26.41 (ancient to contemporary) and 35.66 (contemporary to ancient).
2 Proposed Method
2.1 Unsupervised Algorithm for Sentence Alignment
Given a pair of aligned passages, with source-language sentences s_1, …, s_n and target-language sentences t_1, …, t_m, the objective of sentence alignment is to extract a set of matching sentence pairs from the two passages. Each matching pair consists of several consecutive sentences from each side, like (s_i, …, s_i'; t_j, …, t_j'), which implies that s_i, …, s_i' and t_j, …, t_j' form a parallel pair.
Translating ancient Chinese into contemporary Chinese has the characteristic that the words of an ancient Chinese sentence tend to be translated into contemporary Chinese in order, and the translation usually retains the original characters. Therefore, the correct aligned pairs usually have the maximum sum of the lengths of the longest common subsequence (LCS) over all matching pairs.
Let LCS(p) be the length of the longest common subsequence of a matching pair p consisting of source-language sentences s_{i+1}, …, s_{i'} and target-language sentences t_{j+1}, …, t_{j'}. We use a dynamic programming algorithm to find the maximum total score and its corresponding alignment result. Let f(i, j) be the maximum score achievable when the first i source sentences and the first j target sentences have been aligned:

f(i, j) = max over allowed pattern sizes k, l ≥ 1 of { f(i − k, j − l) + LCS(s_{i−k+1}, …, s_i; t_{j−l+1}, …, t_j) }.
We only consider cases where one sentence is matched with no more than 5 sentences.
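The dynamic program above can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation; the names `lcs_len` and `align` are ours, and the tie-breaking rule (on equal total LCS, prefer more, i.e. finer, matching pairs) is our own assumption, since maximizing the LCS sum alone cannot distinguish a merged group from its split within a window.

```python
def lcs_len(a, b):
    # Length of the longest common subsequence of two strings, O(|a|*|b|).
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def align(src, tgt, max_group=5):
    """Align two passages by maximizing the total LCS length.

    src, tgt: lists of sentences (strings).
    Returns a list of ((i0, i1), (j0, j1)) half-open index ranges, meaning
    src[i0:i1] is aligned with tgt[j0:j1].  Groups contain at most
    max_group sentences per side, as in the paper.
    """
    n, m = len(src), len(tgt)
    best = [[None] * (m + 1) for _ in range(n + 1)]   # (total LCS, #pairs)
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0, 0)
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] is None:
                continue
            score, pairs = best[i][j]
            for di in range(1, max_group + 1):
                if i + di > n:
                    break
                for dj in range(1, max_group + 1):
                    if j + dj > m:
                        break
                    cand = (score + lcs_len("".join(src[i:i+di]),
                                            "".join(tgt[j:j+dj])),
                            pairs + 1)   # tie-break: prefer finer alignments
                    if best[i+di][j+dj] is None or cand > best[i+di][j+dj]:
                        best[i+di][j+dj] = cand
                        back[i+di][j+dj] = (i, j)
    # Recover the alignment by walking the backpointers from (n, m).
    result, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return result[::-1]
```

For example, when one contemporary sentence translates two ancient sentences, the DP merges exactly those two source sentences into one group rather than aligning one-to-one.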
2.2 Neural Machine Translation Model
The sequence-to-sequence model was first proposed to solve the machine translation problem. The model consists of two parts, an encoder and a decoder. The encoder takes in the source sequence and compresses it into hidden states. The decoder produces a sequence of target tokens based on the information embodied in the hidden states given by the encoder. Both encoder and decoder are implemented with recurrent neural networks (RNNs).
To deal with the ancient-contemporary translation task, we use the encoder to convert the variable-length character sequence x_1, …, x_T into a set of hidden representations h_1, …, h_T:

h_t = f(h_{t−1}, x_t),

where f is a function of the RNN family and x_t is the input at time step t. The decoder is another RNN, which generates a variable-length sequence token by token through a conditional language model:

p(y_t | y_1, …, y_{t−1}) = g(s_t, c_t, E[y_{t−1}]),

where E is the embedding matrix of target tokens and y_{t−1} is the last predicted token. In the decoder, the context vector c_t is calculated based on the hidden state s_t of the decoder at time step t and all the hidden states of the encoder, which is known as the attention mechanism. Instead of normal global attention, we apply local attention (Luong et al., 2015a; Tjandra et al., 2017). Because ancient and contemporary Chinese mostly have similar word order, when calculating the context vector c_t we first predict a pivot position in the hidden states of the encoder, and then compute the attention probabilities within a window around the pivot instead of over the whole sentence.
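The local attention step can be sketched as follows. This is a minimal pure-Python illustration in the style of Luong et al. (2015a)'s local-p attention, not the paper's implementation: the pivot `p` is passed in precomputed (in the model it is predicted from the decoder state), dot-product scoring is assumed, and the Gaussian weighting with sigma = D/2 follows Luong et al. (2015a).

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def local_attention(h_dec, H_enc, p, D=2):
    """Attend only to a window of encoder states around a pivot position.

    h_dec: decoder hidden state at the current step (list of floats)
    H_enc: encoder hidden states (list of lists, one per source position)
    p: pivot position (float), predicted from the decoder state in the model
    D: half window width; positions outside [p-D, p+D] get zero weight
    """
    S = len(H_enc)
    lo, hi = max(0, int(p) - D), min(S, int(p) + D + 1)
    scores = [dot(h, h_dec) for h in H_enc[lo:hi]]   # dot-product alignment
    a = softmax(scores)
    # Gaussian weighting centred on the pivot, sigma = D / 2.
    a = [w * math.exp(-((i - p) ** 2) / (2 * (D / 2) ** 2))
         for w, i in zip(a, range(lo, hi))]
    weights = [0.0] * S
    weights[lo:hi] = a
    d = len(h_dec)
    # Context vector: weighted sum of encoder states.
    context = [sum(weights[s] * H_enc[s][k] for s in range(S)) for k in range(d)]
    return context, weights
```

Restricting attention to the window reflects the observation that ancient and contemporary Chinese sentences mostly preserve word order, so the relevant source positions lie near the pivot.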
A machine translation model normally treats ancient and contemporary Chinese as two languages; however, in this task, contemporary and ancient Chinese share many common characters. Therefore, we treat ancient and contemporary Chinese as one language and share the character embedding between the source and target languages.
2.3 Copy Mechanism
As stated above, ancient and contemporary Chinese share many common characters, and most named entities use the same representation. The copy mechanism (Gu et al., 2016) is very suitable in this situation, where the source and target sequences share some of their tokens. We apply the pointer-generator framework in our model, which follows the same intuition as the copy mechanism. The output probability is calculated as

p(y_t) = p_gen · p_vocab(y_t) + (1 − p_gen) · Σ_{i: x_i = y_t} a_{t,i},

where p_gen is dynamically calculated based on the hidden state s_t, p_vocab is the same output distribution as in the traditional seq-to-seq model, and a_t are the attention scores at time step t.
The encoder and decoder networks are trained jointly to maximize the conditional probability of the target sequence, using cross entropy as the loss function. We use characters instead of words because characters have independent meanings in ancient Chinese, and the number of distinct characters is much smaller than the number of words, which makes the data less sparse and greatly reduces the number of OOV tokens.
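The mixing of the generation and copy distributions can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation: the function name `pointer_generator_dist` and the list-valued inputs are ours, and in the model `p_gen` and the attention weights come from the network rather than being passed in.

```python
def pointer_generator_dist(p_vocab, attn, src_ids, p_gen):
    """Mix the generation and copy distributions (pointer-generator sketch).

    p_vocab: decoder softmax over the shared vocabulary (sums to 1)
    attn: attention weights over source positions at this step (sums to 1)
    src_ids: vocabulary id of each source character
    p_gen: generation probability in (0, 1), predicted from the decoder state
    Returns the final output distribution over the vocabulary.
    """
    p = [p_gen * pv for pv in p_vocab]
    for a, i in zip(attn, src_ids):
        p[i] += (1 - p_gen) * a   # copy mass lands on the source-token ids
    return p
```

Because both input distributions sum to one, the mixture is again a valid probability distribution; attended source characters, such as rare proper nouns, gain probability mass even when the generator alone assigns them little.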
3 Experiments
3.1 Sentence Alignment
We crawl ancient Chinese passages and their corresponding contemporary Chinese versions from the Internet. After proofreading a sample of these passages, we find the quality of the passage-aligned corpus satisfactory. To evaluate the algorithm, we also crawl a relatively small sentence-aligned corpus consisting of 90 aligned passages with 4,544 aligned sentence pairs, which we proofread and correct to guarantee its correctness.
We implement a log-linear model for contemporary-ancient Chinese sentence alignment as a baseline. Following previous work, we implement this model with a combination of three features: sentence lengths, matching patterns, and Chinese character co-occurrence (Wang and Ren, 2005; Lin and Wang, 2007; Liu and Wang, 2012).
We split the data into a training set (2,999 pairs) and a test set (1,545 pairs) to train the log-linear model; our unsupervised method needs no training data. Both methods are evaluated on the test set. Our unsupervised model achieves an F1 score of 99.4%, better than the supervised baseline's 99.2% (shown in Table 1).
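The F1 evaluation over alignments can be computed as in this small sketch; the function name and the representation of an alignment as a set of (source-range, target-range) pairs are our own assumptions.

```python
def alignment_f1(predicted, gold):
    """Precision, recall, and F1 over sets of aligned sentence-pair groups.

    predicted, gold: iterables of hashable alignment pairs, e.g.
    ((i0, i1), (j0, j1)) half-open index ranges.
    """
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)                       # exactly matching pairs
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```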
We find that the small fraction of cases (0.6%) where our method makes mistakes is mainly caused by changes of punctuation. For example, in ancient Chinese a comma "，" follows "异哉" (How strange!), while in contemporary Chinese the corresponding "怪啊！" (How strange!) ends with an exclamation mark "！", which makes it an independent sentence. Since the sentence is short and shares no common character, our method fails to align it correctly. However, such problems also exist in supervised models.
3.2 Translating Result
We run experiments on the dataset built by our proposed unsupervised algorithm, which consists of 57,391 ancient-contemporary Chinese sentence pairs in total. We randomly split the sentence pairs into train/dev/test sets of 53,140/2,125/2,126 pairs respectively.
We run experiments in both translation directions and use the BLEU score to evaluate our model. Experimental results show that our model works well on this task (Table 3). Compared with the basic seq-to-seq model, the copy mechanism gives a large improvement, because the source and target sentences share some common representations; we give an example in the Case Study. Local attention gives a small improvement over traditional global attention, which can be attributed to shrinking the attention range, since most of the time the mapping between ancient and contemporary Chinese is clear. A more sophisticated attention mechanism may further improve the performance.
Table 3 (fragment): with local attention, the model achieves BLEU 26.95 (ancient to contemporary) and 36.34 (contemporary to ancient).
From Table 4 we can see that the results are very sensitive to the size of the training data. Therefore, our unsupervised method for building large sentence-aligned corpora is necessary.
Source: 六月辛卯，中山王焉薨。 (On Xinmao Day of the sixth lunar month, Yan, King of Zhongshan, passed away.)
Target: 六月十二日，中山王刘焉逝世。 (On twelfth of the sixth lunar month, Liu Yan, King of Zhongshan, passed away.)
Seq2seq: 六月十六日，中山王刘裕去世。 (On sixteenth of the sixth lunar month, Liu Yu, King of Zhongshan, died.)
Proposal: 六月二十二日，中山王刘焉逝世。 (On twenty-second of the sixth lunar month, Liu Yan, King of Zhongshan, passed away.)
Under most circumstances, our models can translate sentences between ancient Chinese and contemporary Chinese properly. For instance, our models can translate "薨 (pass away)" into "去世 (pass away)" or "逝世 (pass away)", which are the correct forms of expression in contemporary Chinese. Our models can even complete some omitted characters. For instance, the family name "刘 (Liu)" in "中山王刘焉 (Liu Yan, King of Zhongshan)" was omitted in ancient Chinese because "中山王 (King of Zhongshan)" was a hereditary peerage held by the "刘 (Liu)" family, and our model completes the family name "刘 (Liu)" when translating.
For proper nouns, the seq2seq baseline model sometimes fails, while the copy model can correctly copy them from the source language. For instance, the seq2seq baseline model translates "焉 (Yan)" into "刘裕 (Liu Yu, a more famous figure in history)" because "焉 (Yan)" is a relatively low-frequency character in ancient Chinese. The copy model, however, learns to copy such low-frequency proper nouns directly from the source sentences.
Translating dates between the ancient and contemporary Chinese calendars requires background knowledge of the ancient Chinese lunar calendar, and involves non-trivial calculation that even native speakers cannot perform correctly without training. In the example, "辛卯 (Xinmao Day)" is a date in the ancient form, and our model fails to translate it: unable to transform between the Gregorian calendar and the ancient Chinese lunar calendar, it generates a random date instead, which is expected given the difficulty of such problems.
4 Conclusion and Future Work
In this paper, we propose an unsupervised algorithm to construct sentence-aligned pairs out of a passage-aligned corpus, using the characteristic that sentences from the two styles of Chinese share many characters. With this algorithm, we build a large sentence-aligned corpus to train our translation model, which solves the low-resource problem for translating between ancient and contemporary Chinese. We propose to apply the sequence-to-sequence model with attention and a copy mechanism to automatically translate between the two styles of Chinese. The experimental results show that our method yields very good translation results.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Calixto et al. (2017) Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1913–1924.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014, pages 1724–1734.
- Feng et al. (2016) Shi Feng, Shujie Liu, Nan Yang, Mu Li, Ming Zhou, and Kenny Q. Zhu. 2016. Improving attention modeling with implicit distortion and fertility for machine translation. In COLING 2016, pages 3082–3092.
- Gale and Church (1993) William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19:75–102.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. CoRR, abs/1603.06393.
- Haruno and Yamazaki (1997) Masahiko Haruno and Takefumi Yamazaki. 1997. High-performance bilingual text alignment using statistical and dictionary information. Natural Language Engineering, 3(1):1–14.
- Jean et al. (2015) Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL 2015, pages 1–10.
- Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP 2013, pages 1700–1709.
- Lin et al. (2018) Junyang Lin, Xu Sun, Xuancheng Ren, Shuming Ma, Jinsong Su, and Qi Su. 2018. Deconvolution-based global decoding for neural machine translation. CoRR, abs/1806.03692.
- Lin and Wang (2007) Zhun Lin and Xiaojie Wang. 2007. Chinese ancient-modern sentence alignment. In International Conference on Computational Science (2), volume 4488 of Lecture Notes in Computer Science, pages 1178–1185. Springer.
- Liu and Wang (2012) Ying Liu and Nan Wang. 2012. Sentence alignment for ancient and modern Chinese parallel corpus. In Emerging Research in Artificial Intelligence and Computational Intelligence, pages 408–415. Springer, Berlin, Heidelberg.
- Luong et al. (2015a) Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025.
- Luong et al. (2015b) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. In EMNLP 2015, pages 1412–1421.
- Ma et al. (2018) Shuming Ma, Xu Sun, Yizhong Wang, and Junyang Lin. 2018. Bag-of-words as target for neural machine translation. CoRR, abs/1805.04871.
- Meng et al. (2016) Fandong Meng, Zhengdong Lu, Hang Li, and Qun Liu. 2016. Interactive attention for neural machine translation. In COLING 2016, pages 2174–2185.
- Mi et al. (2016) Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. 2016. Supervised attentions for neural machine translation. In EMNLP 2016, pages 2283–2288.
- Resnik (1998) Philip Resnik. 1998. Parallel strands: A preliminary investigation into mining the web for bilingual text. cmp-lg/9808003.
- Resnik (1999) Philip Resnik. 1999. Mining the web for bilingual text. In ACL. ACL.
- Su et al. (2016) Jinsong Su, Zhixing Tan, Deyi Xiong, and Yang Liu. 2016. Lattice-based recurrent neural network encoders for neural machine translation. CoRR, abs/1609.07730.
- Sutskever et al. (2014a) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014a. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- Sutskever et al. (2014b) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014b. Sequence to sequence learning with neural networks. In NIPS 2014, pages 3104–3112.
- Tjandra et al. (2017) Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2017. Local monotonic attention mechanism for end-to-end speech recognition.
- Tu et al. (2016) Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL 2016.
- Wang and Ren (2005) Xiaojie Wang and Fuji Ren. 2005. Chinese-Japanese clause alignment. In CICLing, volume 3406 of Lecture Notes in Computer Science, pages 400–412. Springer.
- Xiong et al. (2017) Hao Xiong, Zhongjun He, Xiaoguang Hu, and Hua Wu. 2017. Multi-channel encoder for neural machine translation. CoRR, abs/1712.02109.