Automatic Transferring between Ancient Chinese and Contemporary Chinese

03/05/2018 ∙ by Zhiyuan Zhang, et al. ∙ Peking University

Over its long history, the Chinese language has evolved a great deal, and native speakers now have difficulty reading sentences written in ancient Chinese. In this paper, we propose an unsupervised algorithm that constructs sentence-aligned ancient-contemporary pairs out of the abundant passage-aligned corpus. With this method, we build a large parallel corpus. We propose to apply the sequence-to-sequence model to automatically translate between ancient and contemporary Chinese sentences. Experiments show that both our alignment and translation methods produce very good results, except in some circumstances where even human translators would make mistakes without background knowledge.


1 Introduction

Ancient Chinese was used for thousands of years, and a huge number of books and articles were written in it. However, both the form and grammar of the language have changed since then. Chinese historians and literary scholars have made great efforts to translate such literature into contemporary Chinese, and a large part of these translations is publicly available on the Internet. However, there is still a big gap between this literature and a usable parallel corpus: most of the available corpora are only coarsely passage-aligned, and the order of sentences within aligned passages can differ. To train an automatic translation model, we first need to build a sentence-aligned corpus.

Translation alignment is an important preprocessing step for machine translation. Most previous work focuses on applying supervised algorithms to this task, using features extracted from the text.

Gale and Church (1993); Haruno and Yamazaki (1997) proposed to use statistical or dictionary information to build alignment corpora. Resnik (1998, 1999) proposed to extract parallel corpora from the Internet with a system called Strands. Wang and Ren (2005) proposed to use a log-linear model for Chinese-Japanese clause alignment: besides features such as sentence length and matching patterns, Chinese character co-occurrence between Japanese and Chinese is also taken into consideration. Lin and Wang (2007); Liu and Wang (2012) adapted this method to ancient-contemporary Chinese translation alignment, based on the observation that Chinese character co-occurrence also exists between ancient and contemporary Chinese.

The methods above work well; however, these supervised algorithms require a large parallel corpus for training, which is not available in our setting. Moreover, the previous algorithms did not make good use of the characteristics of ancient-contemporary Chinese pairs. To overcome these shortcomings, we design an unsupervised sentence alignment algorithm based on the observation that, unlike in bilingual corpora, ancient-contemporary sentence pairs share many common characters in order. We evaluate our alignment algorithm on a small aligned parallel corpus; the experimental results show that our simple algorithm works very well (F1 score of 99.4%), even better than the supervised algorithms.

Deep learning has achieved great success in tasks like machine translation. Sutskever et al. (2014a) proposed a sequence-to-sequence (seq-to-seq) model that produces good results on machine translation. Bahdanau et al. (2014) proposed an attention mechanism that allows the decoder to extract phrase alignment information from the hidden states of the encoder. Most of the existing NMT systems are based on the seq-to-seq model (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014b) and the attention mechanism. Some of them have variant architectures to capture more information from the inputs (Su et al., 2016; Xiong et al., 2017; Tu et al., 2016), and some improve the attention mechanism (Luong et al., 2015b; Meng et al., 2016; Mi et al., 2016; Jean et al., 2015; Feng et al., 2016; Calixto et al., 2017), which also enhances the performance of the NMT model. Our experimental results show that a copy mechanism can remarkably improve the performance of the seq-to-seq model on this task; we report these results in the Experiments section. Other mechanisms have also been applied to improve machine translation (Lin et al., 2018; Ma et al., 2018).

Our contributions lie in the following two aspects:

  • We propose a simple yet effective unsupervised algorithm to build sentence-aligned parallel corpora out of passage-aligned parallel corpora.

  • We propose to apply a sequence-to-sequence model with a copy mechanism to the translation task. Experimental results show that our method achieves BLEU scores of 26.41 (ancient to contemporary) and 35.66 (contemporary to ancient).

2 Proposed Method

2.1 Unsupervised Algorithm for Sentence Alignment

Given a pair of aligned passages (the source language sentences $s_1, \dots, s_n$ and the target language sentences $t_1, \dots, t_m$), the objective of sentence alignment is to extract a set of matching pairs out of the two passages. Each matching pair consists of a group of sentences from each side, like $(s_i \cdots s_{i+k-1},\ t_j \cdots t_{j+l-1})$, which implies that $s_i \cdots s_{i+k-1}$ and $t_j \cdots t_{j+l-1}$ form a parallel pair.

Translating ancient Chinese into contemporary Chinese has the characteristic that every word of ancient Chinese tends to be translated in order, into a contemporary Chinese word that usually includes the original character. Therefore, the correct aligned pairs usually maximize the sum of the lengths of the longest common subsequence (LCS) over all matching pairs.

Let $\mathrm{LCS}(s_{i-k+1} \cdots s_i,\ t_{j-l+1} \cdots t_j)$ be the length of the longest common subsequence of a matching pair of aligned sentence groups consisting of source language sentences $s_{i-k+1} \cdots s_i$ and target language sentences $t_{j-l+1} \cdots t_j$. We use a dynamic programming algorithm to find the maximum score and its corresponding alignment result. Let $f(i, j)$ be the maximum score that can be achieved with partly aligned sentence pairs up to $s_i$ and $t_j$, respectively:

$f(i, j) = \max_{k \ge 1,\ l \ge 1} \big\{ f(i-k,\ j-l) + \mathrm{LCS}(s_{i-k+1} \cdots s_i,\ t_{j-l+1} \cdots t_j) \big\}$    (1)

We only consider cases where one sentence is matched with no more than 5 sentences, i.e., $k, l \le 5$ in Equation 1.
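
For concreteness, here is a minimal Python sketch of this dynamic program. The function names (lcs_len, align) are our own, and restricting groups to one-to-many or many-to-one shapes (min(k, l) = 1) is our reading of the constraint above, not code from the paper.

    def lcs_len(a, b):
        """Length of the longest common subsequence of strings a and b."""
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    def align(src, tgt, max_group=5):
        """Sentence alignment by maximizing the total LCS score (Equation 1)."""
        n, m = len(src), len(tgt)
        NEG = float("-inf")
        f = [[NEG] * (m + 1) for _ in range(n + 1)]   # f[i][j]: best score up to (i, j)
        back = [[None] * (m + 1) for _ in range(n + 1)]
        f[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if f[i][j] == NEG:
                    continue
                for k in range(1, max_group + 1):
                    for l in range(1, max_group + 1):
                        # one sentence may match at most max_group sentences
                        if min(k, l) != 1 or i + k > n or j + l > m:
                            continue
                        score = f[i][j] + lcs_len("".join(src[i:i + k]),
                                                  "".join(tgt[j:j + l]))
                        if score > f[i + k][j + l]:
                            f[i + k][j + l] = score
                            back[i + k][j + l] = (k, l)
        pairs, i, j = [], n, m                         # walk back to recover the groups
        while i > 0 or j > 0:
            k, l = back[i][j]
            pairs.append((src[i - k:i], tgt[j - l:j]))
            i, j = i - k, j - l
        return list(reversed(pairs))

The quadratic table over sentence positions with a bounded group size keeps the search cheap even for long chapters, since each cell only looks back at most max_group sentences on each side.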

2.2 Neural Machine Translation Model

The sequence-to-sequence model was first proposed to solve the machine translation problem. The model consists of two parts, an encoder and a decoder. The encoder takes in the source sequence and compresses it into hidden states. The decoder produces a sequence of target tokens based on the information embodied in the hidden states given by the encoder. Both the encoder and the decoder are implemented with recurrent neural networks (RNNs).

To deal with the ancient-contemporary translation task, we use the encoder to convert the variable-length character sequence $x_1, \dots, x_T$ into a set of hidden representations $h_1, \dots, h_T$ with Equation 2,

$h_t = f(x_t,\ h_{t-1})$    (2)

where $f$ is a function of the RNN family and $x_t$ is the input at time step $t$. The decoder is another RNN, which generates a variable-length sequence token by token through a conditional language model,

$s_t = f(s_{t-1},\ W y_{t-1},\ c_t)$    (3)
$p(y_t \mid y_1, \dots, y_{t-1}) = \operatorname{softmax}(g(s_t))$    (4)

where $W$ is the embedding matrix of target tokens and $y_{t-1}$ is the last predicted token. In the decoder, the context vector $c_t$ is calculated based on the hidden state $s_t$ of the decoder at time step $t$ and all the hidden states in the encoder, which is also known as the attention mechanism. Instead of the normal global attention, we apply local attention (Luong et al., 2015a; Tjandra et al., 2017). Because ancient and contemporary Chinese usually have similar word order, when calculating the context vector $c_t$ we first compute a pivot position $p_t$ over the hidden states $h_1, \dots, h_T$ of the encoder, and then compute the attention probabilities within a window around the pivot instead of over the whole sentence.
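
As a rough NumPy sketch of this windowed attention: the variable names, the dot-product scoring, and the Gaussian re-weighting toward the pivot (in the spirit of Luong et al., 2015a) are our assumptions, not the paper's exact formulation.

    import numpy as np

    def local_attention(s_t, enc_h, p_t, D=3):
        """Context vector from encoder states in a window around pivot p_t.

        s_t:   decoder hidden state at this step, shape (d,)
        enc_h: encoder hidden states, shape (T, d)
        p_t:   predicted pivot position, an int in [0, T - 1]
        D:     half-width of the attention window
        """
        T = enc_h.shape[0]
        lo, hi = max(0, p_t - D), min(T, p_t + D + 1)
        window = enc_h[lo:hi]                    # (W, d) states inside the window
        scores = window @ s_t                    # dot-product attention energies
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                     # softmax over the window only
        pos = np.arange(lo, hi)
        alpha *= np.exp(-((pos - p_t) ** 2) / (2 * (D / 2) ** 2))
        alpha /= alpha.sum()                     # favor positions near the pivot
        return alpha @ window                    # context vector c_t, shape (d,)

Because the softmax only runs over the window, attention mass cannot leak to distant, unrelated parts of the sentence, which matches the roughly monotonic mapping between the two styles.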

A machine translation model treats ancient and contemporary Chinese as two languages; in this task, however, contemporary and ancient Chinese share many common characters. Therefore, we treat ancient and contemporary Chinese as one language and share the character embeddings between the source language and the target language.
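
Sharing the embeddings amounts to letting both sides index one table. A toy sketch of the idea: the two sentences come from Table 5, while the dimensions and initialization are illustrative only.

    import numpy as np

    ancient = ["六月辛卯,中山王焉薨。"]
    contemporary = ["六月十二日,中山王刘焉逝世。"]

    # One character vocabulary and one embedding table for both styles.
    chars = sorted(set("".join(ancient + contemporary)))
    vocab = {ch: i for i, ch in enumerate(chars)}
    E = np.random.randn(len(vocab), 256) * 0.01   # shared embedding matrix

    def embed(sentence):
        """Both encoder and decoder inputs are looked up in the same E, so
        characters copied verbatim keep the same representation on both sides."""
        return E[[vocab[ch] for ch in sentence]]  # (len(sentence), 256)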

2.3 Copy Mechanism

As stated above, ancient and contemporary Chinese share many common characters, and most named entities use the same representation. The copy mechanism (Gu et al., 2016) is very suitable in this situation, where the source and target sequences share some of their words. We apply the pointer-generator framework in our model, which follows the same intuition as the copy mechanism. The output probability is calculated as,

$p(y_t) = p_{\text{gen}}\ p_{\text{vocab}}(y_t) + (1 - p_{\text{gen}}) \sum_{i:\ x_i = y_t} \alpha_{t,i}$    (5)

where $p_{\text{gen}}$ is dynamically calculated based on the hidden state $s_t$, $p_{\text{vocab}}$ is the same as in the traditional seq-to-seq model, and $\alpha_{t,i}$ are the attention scores at the $t$-th time step.
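
In NumPy terms, Equation 5 scatters the attention mass onto the vocabulary ids of the source characters and mixes it with the generation distribution. A minimal sketch with our own names:

    import numpy as np

    def output_distribution(p_vocab, alpha, src_ids, p_gen):
        """Pointer-generator mixture of Equation 5.

        p_vocab: generation distribution over the vocabulary, shape (V,)
        alpha:   attention scores over source positions at step t, shape (T,)
        src_ids: vocabulary id of each source character, shape (T,)
        p_gen:   scalar in (0, 1), predicted from the decoder hidden state
        """
        p_copy = np.zeros_like(p_vocab)
        np.add.at(p_copy, src_ids, alpha)   # sum attention over i with x_i = y_t
        return p_gen * p_vocab + (1.0 - p_gen) * p_copy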

The encoder and decoder networks are trained jointly to maximize the conditional probability of the target sequence, with cross entropy as the loss function. We use characters instead of words because characters have independent meanings in ancient Chinese, and the number of distinct characters is much lower than the number of distinct words, which makes the data less sparse and greatly reduces the number of out-of-vocabulary (OOV) tokens.
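
A character-level vocabulary of this kind can be built in a few lines; the special tokens and the <unk> fallback below are our assumptions about how the remaining OOV characters are handled.

    from collections import Counter

    def build_char_vocab(sentences, specials=("<pad>", "<s>", "</s>", "<unk>")):
        """Every character seen in training gets an id; unseen test-time
        characters fall back to <unk>, which keeps the OOV rate low."""
        counts = Counter(ch for s in sentences for ch in s)
        itos = list(specials) + [ch for ch, _ in counts.most_common()]
        return {ch: i for i, ch in enumerate(itos)}

    def encode(sentence, stoi):
        unk = stoi["<unk>"]
        return [stoi.get(ch, unk) for ch in sentence]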

3 Experiments

3.1 Sentence Alignment

We crawl ancient Chinese passages and their corresponding contemporary Chinese versions from the Internet. After proofreading a sample of these passages, we find the quality of the passage-aligned corpus satisfactory. To evaluate the algorithm, we also crawl a relatively small sentence-aligned corpus consisting of 90 aligned passages with 4,544 aligned sentence pairs. We proofread them and correct mistakes to guarantee the correctness of this corpus.

We implement the log-linear model for contemporary-ancient Chinese sentence alignment as a baseline. Following previous work, we implement this model with a combination of three features: sentence lengths, matching patterns, and Chinese character co-occurrence (Wang and Ren, 2005; Lin and Wang, 2007; Liu and Wang, 2012).

We split the data into a training set (2,999 pairs) and a test set (1,545 pairs) to train the log-linear model; our unsupervised method needs no training data. Both methods are evaluated on the test set. Our unsupervised model achieves an F1 score of 99.4%, better than the 99.2% of the supervised baseline (shown in Table 1).
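
Precision, recall, and F1 here can be computed as set overlap between predicted and gold matching pairs; a small sketch (treating each aligned group as one hashable item is our assumption about the metric):

    def alignment_prf(pred_pairs, gold_pairs):
        """Precision/recall/F1 over sets of matching pairs, where each pair
        is a hashable (source_group, target_group) tuple."""
        pred, gold = set(pred_pairs), set(gold_pairs)
        tp = len(pred & gold)
        p = tp / len(pred)
        r = tp / len(gold)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1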

Model Precision Recall F1
Log-linear 99.2 99.1 99.2
Proposal 99.4 99.4 99.4
Table 1: Experimental results on sentence alignment. For the log-linear model, we choose the length, pattern and co-occurrence as the three most useful features.

We find that the small fraction of the data (0.6%) on which our method makes mistakes is mainly due to changes of punctuation. For example, in one ancient Chinese passage there is a comma "," after "异哉" (How strange!), while in the contemporary Chinese version, "怪啊!" (How strange!) ends with an exclamation mark "!", which makes it an independent sentence. Since the sentence is short and there is no common character, our method fails to align it correctly. However, such problems also exist in supervised models.

3.2 Translating Result

We run experiments on the data set built by our proposed unsupervised algorithm. The data set consists of 57,391 ancient-contemporary Chinese sentence pairs in total. We split the sentence pairs randomly into train/dev/test sets with sizes of 53,140/2,125/2,126 respectively.

Language Vocabulary Size OOV Rate
Ancient 5,870 1.37%
Contemporary 4,993 1.03%
Table 2: Vocabulary statistics. We include all the characters in the training set in the vocabulary.

We run experiments in both translation directions and use BLEU to evaluate our model. Experimental results show that our model works well on this task (Table 3). Compared with the basic seq-to-seq model, the copy mechanism gives a large improvement because the source and target sentences share some common representations; we give an example in the case study. Local attention gives a small improvement over traditional global attention, which can be attributed to shrinking the attention range, since most of the time the mapping between ancient and contemporary Chinese is clear. A more sophisticated attention mechanism may further improve the performance.

Method An-Con Con-An
Seq-to-Seq 23.10 31.20
+ Copy 26.41 35.66
+ Local Attention 26.95 36.34
Table 3: Evaluation results (BLEU) for translating between An (ancient) and Con (contemporary) Chinese on the test dataset.

Direction 5,000 10,000 20,000 53,140
An-Con 3.00 9.69 16.31 26.95
Con-An 2.40 10.14 18.47 36.34
Table 4: Evaluation results (BLEU) for translating between An (ancient) and Con (contemporary) Chinese with different numbers of training samples.

From Table 4 we can see that the results are very sensitive to the size of the training data. Therefore, our unsupervised method for building a large sentence-aligned corpus is necessary.

Source 六月辛卯,中山王焉薨。 (On Xinmao Day of the sixth lunar month, Yan, King of Zhongshan, passed away.)
Target 六月十二日,中山王刘焉逝世。 (On twelfth of the sixth lunar month, Liu Yan, King of Zhongshan, passed away.)
Seq2seq 六月十六日,中山王刘裕去世。 (On sixteenth of the sixth lunar month, Liu Yu, King of Zhongshan, died.)
Proposal 六月二十二日,中山王刘焉逝世。 (On twenty-second of the sixth lunar month, Liu Yan, King of Zhongshan, passed away.)
Table 5: Example of translating from An (ancient) to Con (contemporary) Chinese.

Under most circumstances, our models can properly translate sentences between ancient and contemporary Chinese. For instance, our models translate "薨 (pass away)" into "去世 (pass away)" or "逝世 (pass away)", which are the correct forms of expression in contemporary Chinese. Our models can even complete some omitted characters. For instance, the family name "刘 (Liu)" in "中山王刘焉 (Liu Yan, King of Zhongshan)" was omitted in ancient Chinese because "中山王 (King of Zhongshan)" was a hereditary peerage held by the "刘 (Liu)" family, and our model completes the family name "刘 (Liu)" when translating.

For proper nouns, the seq2seq baseline model sometimes fails while the copy model correctly copies them from the source language. For instance, the seq2seq baseline model translates "焉 (Yan)" into "刘裕 (Liu Yu, a more famous figure in history)" because "焉 (Yan)" is a relatively low-frequency word in ancient Chinese. The copy model, however, learns to copy these low-frequency proper nouns directly from the source sentence.

Translating dates between the ancient and contemporary Chinese calendars requires background knowledge of the ancient Chinese lunar calendar and involves non-trivial calculation that even native speakers cannot perform correctly without training. In the example, "辛卯 (Xinmao Day)" is a date written in the ancient form, and our model fails to translate it: unable to transform between the Gregorian calendar and the ancient Chinese lunar calendar, it generates a random date, which is expected given the difficulty of such problems.

4 Conclusion and Future Work

In this paper, we propose an unsupervised algorithm to construct sentence-aligned pairs out of a passage-aligned corpus, using the characteristic that sentences from the two styles of Chinese share many characters. Using this algorithm, we build a large sentence-aligned corpus to train our translation model, which solves the low-resource problem for translating between ancient and contemporary Chinese. We propose to apply a sequence-to-sequence model with attention and a copy mechanism to automatically translate between the two styles of Chinese sentences. The experimental results show that our method yields very good translation results.

References