Unsupervised Neural Dialect Translation with Commonality and Diversity Modeling

12/11/2019 ∙ by Yu Wan, et al.

As a special machine translation task, dialect translation has two main characteristics: 1) a lack of parallel training corpora; and 2) similar grammar on the two sides of the translation. In this paper, we investigate how to exploit the commonality and diversity between dialects in order to build unsupervised translation models that access only monolingual data. Specifically, we leverage pivot-private embedding, layer coordination, and parameter sharing to model commonality and diversity between source and target, ranging from the lexical, through the syntactic, to the semantic level. To examine the effectiveness of the proposed models, we collect a monolingual corpus of 20 million sentences for each of Mandarin and Cantonese, which are the official language and the most widely used dialect in China, respectively. Experimental results reveal that our methods outperform rule-based simplified and traditional Chinese conversion and conventional unsupervised translation models by more than 12 BLEU points.





Introduction

A dialect is a variant of a given language, which can be defined by regional speech patterns, social class, or ethnicity [23]. Besides pronunciation, a dialect is also distinguished by its textual expression [32]. For instance, Mandarin (Man) and Cantonese (Can) are the official language and the most widely used dialect of China, respectively [18]. As shown in Fig. 1, although the two sentences have exactly the same meaning, their textual expressions differ markedly. In this work, we therefore attempt to build an automatic translation system for dialects.

Figure 1: An example of Can-Man translation.

An intuitive approach is to leverage advanced machine translation systems, which have recently yielded human-level performance with the use of neural networks [8, 19]. Nevertheless, in contrast with traditional machine translation, dialect translation poses two main challenges. First, the success of supervised neural machine translation depends on large-scale parallel training data, which is not available for dialect translation. This places our task in the unsupervised learning category [1, 14, 16]. Second, dialects are closely related and, despite their differences, often share similar grammar, e.g. morphology and syntax [7]. Extracting commonality is beneficial to unsupervised mapping [15] and model robustness [10]; meanwhile, preserving the explicit diversity plays a crucial role in dialect translation. Consequently, the challenge is to balance commonality and diversity so as to improve translation performance.

We approach these problems by proposing an unsupervised neural dialect translation model, which is trained only on monolingual corpora and fully leverages the commonality and diversity of dialects. Specifically, we train an advanced NMT model, Transformer [30], with denoising reconstruction [31] and back-translation [27], which aim at building a common language model and mapping different attributes, respectively. We introduce several strategies into the translation model to balance commonality and diversity: 1) parameter sharing, which forces dialects to share the same latent space; 2) pivot-private embedding, which models similarities and differences at the lexical level; and 3) layer coordination, which enhances the interaction of features between the two sides of translation.

To evaluate the effectiveness of the proposed model, we build a monolingual dialect corpus consisting of 20 million colloquial sentences for each of Man and Can (for simplicity, we regard the official language as a dialect). The sentences are extracted from conversations and comments in forums and social media as well as subtitles, and are carefully filtered during data preprocessing. Our code and data are released at https://github.com/NLP2CT/Unsupervised_Dialect_Translation. Empirical results on both directions of the Man-Can translation task demonstrate that the proposed model significantly outperforms existing unsupervised NMT [16] with even fewer parameters. Quantitative and qualitative analyses verify the necessity of commonality and diversity modeling for dialect translation.


Preliminary

Neural machine translation (NMT) aims to use a neural network to build a translation model, which is trained to maximize the conditional probability of sentence pairs [3, 28, 30]. Given a source sentence $X = \{x_1, \dots, x_I\}$, the conditional probability of its corresponding translation $Y = \{y_1, \dots, y_J\}$ is defined as:

$$P(Y \mid X; \theta) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, X; \theta) \tag{1}$$

where $y_j$ indicates the $j$-th target token and $\theta$ denotes the parameters of the NMT model, which are optimized to minimize the following loss function over the training corpus $D$:

$$\mathcal{L} = \mathbb{E}_{(X, Y) \sim D}\big[-\log P(Y \mid X; \theta)\big] \tag{2}$$

This auto-regressive translation process is generally realized with the encoder-decoder framework [29]. Specifically, the inputs of the encoder and decoder are obtained by looking up source and target embeddings according to the input sentences $X$ and $Y$, respectively:

$$H_e^0 = \mathrm{Emb}_x(X) \in \mathbb{R}^{I \times d} \tag{3}$$

$$H_d^0 = \mathrm{Emb}_y(Y) \in \mathbb{R}^{J \times d} \tag{4}$$

where $d$ indicates the dimensionality. The encoder is composed of a stack of $N$ identical layers. Given the input layer $H_e^{n-1}$, the output of the $n$-th layer can be formally expressed as:

$$H_e^n = \mathrm{Layer}_e(H_e^{n-1}) \tag{5}$$

The decoder is also composed of a stack of $N$ identical layers. Unlike the encoder, which takes all tokens into account, the decoder only summarizes the preceding representations in the input layer at each decoding step, since the subsequent representations are invisible. In addition, the generation process considers the contextual information of the source sentence by attending to the top layer of the encoder, $H_e^N$. Accordingly, the $j$-th representation in the $n$-th decoding layer is calculated as:

$$H_{d,j}^n = \mathrm{Layer}_d\big(H_{d,\le j}^{n-1}, \mathrm{Att}(H_{d,j}^{n-1}, H_e^N)\big) \tag{6}$$

where $\mathrm{Att}(\cdot)$ indicates the attention model [3], which has recently become a basic module that allows a deep learning model to dynamically select related representations as needed. Finally, the conditional probability of the $j$-th target word is calculated using a non-linear function $\mathrm{softmax}(\cdot)$:

$$P(y_j \mid y_{<j}, X; \theta) = \mathrm{softmax}(H_{d,j}^N) \tag{7}$$
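As a small numerical illustration of the auto-regressive factorization and the negative log-likelihood loss described above, the following sketch sums per-token log-probabilities into a sentence score and averages the negated scores over a toy corpus. The function names and the probabilities are made up for illustration; in a real system the per-token probabilities would come from the decoder.

```python
import math

def sentence_log_prob(token_probs):
    """Log-probability of a translation: the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

def corpus_loss(corpus_token_probs):
    """Average negative log-likelihood over all sentence pairs in the corpus."""
    total = sum(sentence_log_prob(probs) for probs in corpus_token_probs)
    return -total / len(corpus_token_probs)

# Toy example: two target sentences with made-up per-token probabilities
# P(y_j | y_<j, X) as a model might assign them.
toy_corpus = [[0.5, 0.5], [1.0]]
print(corpus_loss(toy_corpus))
```

Working in log space turns the product over target positions into a sum, which is how such losses are usually computed to avoid numerical underflow on long sentences.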

Figure 2: Illustration of (a) the conventional NMT model and (b) the proposed model. We propose pivot-private embedding, which learns commonality and diversity at the lexical level. In addition, the decoder attends to source representations layer by layer, rather than only to the topmost layer.

In this section, we present unsupervised neural dialect translation. We first cast dialect translation as an unsupervised learning task to tackle the low-resource problem. Moreover, considering the commonality and diversity between dialects, we introduce pivot-private embedding and layer coordination to improve the dialect translation model.

Dialect Translation with Unsupervised Learning

Despite the success of NMT in recent years, the performance of an NMT model relies on a large-scale parallel training corpus [27, 1]. As a low-resource translation task, dialect translation cannot leverage the conventional training strategy, since parallel resources are normally inaccessible. The scarcity of bilingual corpora makes it extraordinarily challenging to build translation models for dialects. In contrast, monolingual corpora are relatively easy to collect. Partially inspired by recent studies on unsupervised NMT [14, 1, 16], we propose to build a dialect translation model with unsupervised learning that depends only on monolingual data. Generally, most features of dialects are similar, while only some of the surface information differs. We therefore divide the training process into two parts: 1) commonality modeling, which learns to capture the general features of all dialects; and 2) diversity modeling, which builds connections between different expressions.

Commonality Modeling

This procedure aims to give our model the ability to extract the universal features of the two dialects. Intuitively, commonality modeling can be trained by reconstructing both dialects with one model. Artetxe et al. [1] and Lample et al. [14] suggest that denoising autoencoding is beneficial to language modeling. More importantly, it prevents our model from simply copying the input sentence to the output. Unlike Artetxe et al. [1] and Lample et al. [14], who employ a distinct model for each language, we train one model for both dialects, so as to encourage different dialects to be modeled in a common latent space. Consequently, the loss function is defined as:

$$\mathcal{L}_c = \mathbb{E}_{X \sim D_x}\big[-\log P(X \mid \tilde{X}; \theta)\big] + \mathbb{E}_{Y \sim D_y}\big[-\log P(Y \mid \tilde{Y}; \theta)\big]$$

where $D_x$ and $D_y$ are the monolingual corpora of the two dialects, and $\tilde{X}$ and $\tilde{Y}$ denote the noised inputs. (We add noise to the inputs by swapping, dropping and blanking words following Lample et al. [14], except that we swap two words rather than three, which gives better empirical results in our experiments.) As seen, the two reconstruction models share the same parameters $\theta$.
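The noising step used for denoising reconstruction can be sketched roughly as follows. This is a minimal illustration of swapping, dropping and blanking tokens, not the authors' implementation: the function name, the default probabilities, the `<blank>` placeholder, and the reading of "swap two words" as a local shuffle with window `k=2` are all assumptions.

```python
import random

def add_noise(tokens, k=2, p_drop=0.1, p_blank=0.2, blank="<blank>", rng=None):
    """Return a noised copy of `tokens` (local swap, then drop, then blank)."""
    if not tokens:
        return []
    rng = rng or random.Random()
    # Local swap: perturb each position by a random offset in [0, k) and re-sort.
    # With k=2 a token can only trade places with a direct neighbour.
    keys = [i + rng.uniform(0, k) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: keys[i])
    shuffled = [tokens[i] for i in order]
    # Word dropout: remove each token with probability p_drop.
    kept = [t for t in shuffled if rng.random() >= p_drop]
    if not kept:  # keep at least one token so the input is never empty
        kept = [rng.choice(shuffled)]
    # Word blanking: replace each remaining token by a placeholder with probability p_blank.
    return [blank if rng.random() < p_blank else t for t in kept]

# Character-level usage, matching the character-based setting of the experiments.
print(add_noise(list("abcdefgh"), rng=random.Random(0)))
```

The model is then trained to reconstruct the clean sentence from the noised one, so it cannot succeed by copying its input token by token.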

Diversity Modeling

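Although the differences between dialects are marginal, transferring diversity is the key problem of dialect translation. In contrast to a supervised NMT model, which learns the relevance between source and target from parallel data, a dialect translation model cannot directly establish the functional mapping from the source latent space to the target one. An alternative is to exploit back-translation [27, 9]. Specifically, sentences $X$ and $Y$ from the two monolingual corpora $D_x$ and $D_y$ are first translated to their candidate translations $\hat{Y}$ and $\hat{X}$, respectively. The mapping between the cross-dialect latent spaces can then be learned by minimizing:

$$\mathcal{L}_d = \mathbb{E}_{X \sim D_x}\big[-\log P(X \mid \hat{Y}; \theta)\big] + \mathbb{E}_{Y \sim D_y}\big[-\log P(Y \mid \hat{X}; \theta)\big]$$

Finally, the loss function in Equation 2 is modified as:

$$\mathcal{L} = \lambda_c \mathcal{L}_c + \lambda_d \mathcal{L}_d$$

where $\mathcal{L}_c$ is the denoising reconstruction loss of commonality modeling, and $\lambda_c$ and $\lambda_d$ are hyper-parameters balancing the importance of commonality and diversity modeling, respectively.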
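The data flow of back-translation and the weighted combination of the two objectives can be sketched as below. This is only a schematic under stated assumptions: `translate` stands for whatever inference routine the current model provides, and the string-reversing stand-in is a hypothetical placeholder.

```python
def back_translation_pairs(sentences, translate):
    """Pair each synthetic translation (model input) with its original sentence (target)."""
    return [(translate(s), s) for s in sentences]

def total_loss(loss_c, loss_d, lam_c, lam_d):
    """Weighted sum of the commonality (denoising) and diversity (back-translation) losses."""
    return lam_c * loss_c + lam_d * loss_d

# Hypothetical stand-in for the current model's translation function.
toy_translate = lambda s: s[::-1]

# Each pair is (synthetic source, genuine target); the model is trained on such
# pairs as if they were parallel data, in both translation directions.
pairs = back_translation_pairs(["abc", "de"], toy_translate)
print(pairs)
print(total_loss(loss_c=2.0, loss_d=3.0, lam_c=0.5, lam_d=1.0))
```

Because the target side of every pair is a genuine monolingual sentence, the supervision signal stays clean even when the synthetic source is imperfect.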

Pivot-Private Embedding

An open problem in unsupervised NMT is the initialization of the translation model, which plays a crucial role in iterative training [14, 1] and affects the final performance of unsupervised learning [16]. For two languages with different vocabularies, a usual solution in recent studies is to map identical tokens, which then serve as seeds for aligning other words [1, 14]. For example, Artetxe et al. [1] employ unsupervised bilingual word embeddings [2], while Lample et al. [16] utilize the representations of shared tokens [24] in different languages to initialize the lookup tables. Fortunately, dialect translation sidesteps this problem, since most tokens are shared among dialects. Therefore, we propose pivot and private embeddings: the former learns to share part of the features, while the latter captures the word-level characteristics of each dialect.

Pivot Embedding

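Since the vocabularies of different dialects are almost the same, we join the monolingual corpora of the two dialects and extract all the tokens in them. To build connections between source and target, we assign a pivot embedding with $d_c$ dimensions as the initial alignment. For a source sentence $X$ of length $I$ and a target sentence $Y$ of length $J$:

$$E_c^x = \mathrm{Emb}_c(X) \in \mathbb{R}^{I \times d_c}, \qquad E_c^y = \mathrm{Emb}_c(Y) \in \mathbb{R}^{J \times d_c}$$

where the embedding lookup function $\mathrm{Emb}_c(\cdot)$ shares its parameters across dialects.

Private Embedding

Besides the common features, there also exist differences between dialects. We argue that such differences mainly lie in word-level surface information. To this end, we introduce a private embedding with $d_p$ dimensions for each translation side to distinguish and maintain the characteristics of each dialect:

$$E_p^x = \mathrm{Emb}_p^x(X) \in \mathbb{R}^{I \times d_p}, \qquad E_p^y = \mathrm{Emb}_p^y(Y) \in \mathbb{R}^{J \times d_p}$$

Contrary to the pivot embedding, $\mathrm{Emb}_p^x(\cdot)$ and $\mathrm{Emb}_p^y(\cdot)$ are assigned distinct parameters. Thus, the final input embeddings of the encoder and decoder in Equations 3 and 4, denoted $H_e^0$ and $H_d^0$, are modified as:

$$H_e^0 = [E_c^x; E_p^x], \qquad H_d^0 = [E_c^y; E_p^y]$$

where $[\cdot\,;\cdot]$ is the concatenation operator. Note that, since each token has $d_c$ and $d_p$ dimensions for the associated pivot and private embeddings, with $d_c + d_p = d$, the final input is still a $d$-dimensional vector.

$\mathrm{Emb}_c$, $\mathrm{Emb}_p^x$ and $\mathrm{Emb}_p^y$ are all pretrained and then co-optimized under the translation objective. In this way, we expect the pivot embedding to enhance the commonality of the translation model, while the private embedding raises its ability to capture the diversity of different dialects [21].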
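The lookup described above, one table shared by both dialects plus one private table per dialect, concatenated per token, can be sketched as follows. The class and argument names are hypothetical, and the tables are randomly initialized here purely for illustration, whereas the paper pretrains them.

```python
import random

class PivotPrivateEmbedding:
    """Shared pivot table plus one private table per dialect, concatenated per token."""

    def __init__(self, vocab_size, d_pivot, d_private, dialects=("can", "man"), seed=0):
        rng = random.Random(seed)

        def table(dim):
            return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

        self.pivot = table(d_pivot)                               # shared across dialects
        self.private = {d: table(d_private) for d in dialects}   # distinct per dialect

    def lookup(self, token_ids, dialect):
        private = self.private[dialect]
        # List concatenation plays the role of the concatenation operator:
        # each token vector has d_pivot + d_private dimensions.
        return [self.pivot[t] + private[t] for t in token_ids]

emb = PivotPrivateEmbedding(vocab_size=10, d_pivot=3, d_private=2)
can_vecs = emb.lookup([1, 4], "can")
man_vecs = emb.lookup([1, 4], "man")
print(len(can_vecs[0]))  # 5 dimensions per token
```

For the same token, the first `d_pivot` dimensions are identical on both sides while the remaining `d_private` dimensions differ, which is the intended split between commonality and diversity at the lexical level.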

Layer Coordination

Recent studies have pointed out that different neural network layers are able to capture different types of syntactic and semantic information [25, 20]. For example, Peters et al. [25] demonstrate that higher-level layer states capture the context-dependent aspects of word meaning while lower-level states model aspects of syntax, and that simultaneously exposing all of these signals is highly beneficial. To let these features interact sufficiently, an alternative is to perform attention from each decoder layer to its corresponding encoder layer, rather than only to the topmost layer. Accordingly, the $n$-th decoding layer (Equation 6) is changed to:

$$H_{d,j}^n = \mathrm{Layer}_d\big(H_{d,\le j}^{n-1}, \mathrm{Att}(H_{d,j}^{n-1}, H_e^n)\big)$$

where $H_{d,j}^n$ is the $j$-th representation in the $n$-th decoding layer and $H_e^n$ is the output of the $n$-th encoder layer.

This technique has been proven effective [12, 34, 11] on NMT tasks by shortening the path of gradient propagation, thus stabilizing the training of an extremely deep model. However, the improvements on traditional translation tasks become marginal when layer coordination is applied to models with fewer than 6 layers [12]. We attribute this to the fact that directly interacting lexical- and syntactic-level information between different languages affects their diversity modeling, since it forces the two languages to share the same latent space layer by layer. Unlike prior studies, our work focuses on a pair of languages with extremely similar grammar. We examine whether layer coordination is conducive to the commonality modeling of dialects and to translation quality.
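The difference in wiring between the conventional decoder and the layer-coordinated one can be illustrated with a schematic sketch. The layers below are toy callables operating on numbers rather than Transformer layers on tensors, and all names are hypothetical; only the choice of which encoder state each decoder layer receives is the point.

```python
def encode(x, enc_layers):
    """Run the encoder stack and keep the output of every layer."""
    states, h = [], x
    for layer in enc_layers:
        h = layer(h)
        states.append(h)
    return states

def decode(y, enc_states, dec_layers, coordinated):
    """Run the decoder stack; each layer receives one encoder state as attention memory."""
    h = y
    for n, layer in enumerate(dec_layers):
        # Conventional: every decoder layer attends to the top encoder layer.
        # Coordinated: the n-th decoder layer attends to the n-th encoder layer.
        memory = enc_states[n] if coordinated else enc_states[-1]
        h = layer(h, memory)
    return h

# Toy layers (stand-ins for Transformer layers): numbers instead of tensors.
enc_layers = [lambda h: h + 1] * 3
dec_layers = [lambda h, memory: h + memory] * 3

states = encode(0, enc_layers)
print(decode(0, states, dec_layers, coordinated=False))
print(decode(0, states, dec_layers, coordinated=True))
```

This sketch assumes the encoder and decoder have the same number of layers, so that every decoder layer has a corresponding encoder layer.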


In this section, we first introduce the Can and Man datasets collected for our experiments, and then present basic statistics of the training corpora.

Monolingual Corpora

The lack of Can monolingual corpora with strong colloquial features is a serious obstacle to our research. Existing Can corpora, such as HKCanCor [22] and CANCORP [18], have the following shortcomings: 1) they were collected rather long ago, so their linguistic features differ from current ones due to language evolution; and 2) they are too small for data-intensive unsupervised training. Since colloquial corpora exhibit more distinctive linguistic features of Can, we collect Can sentences from scratch from domains including talks, comments and dialogues (sources: https://www.wikipedia.org, https://www.cyberctm.com, http://discuss.hk and https://lihkg.com). To keep the training sets consistent, the Man corpora are derived from the same domains as Can, from ChineseNlpCorpus and Large Scale Chinese Corpus for NLP (https://github.com/brightmart/nlp_chinese_corpus and https://github.com/SophonPlus/ChineseNlpCorpus).

Dialect | # Sents | Vocab size | Unique chars
Can     | 20M     | 9,025      | 541
Man     | 20M     | 8,856      | 372
Table 1: Statistics of the two monolingual corpora after preprocessing. Experiments are conducted at the character level, and the joint vocabulary size is exactly 9,397.

Parallel Corpus

We collect parallel corpora for the development and evaluation of models. Parallel sentence pairs from dialogues are manually selected by native Can and Man speakers. In total, 1,227 and 1,085 sentence pairs are selected as the development and test sets, respectively.

Data Preprocessing & Statistics

As there is no well-performing Can word segmentation toolkit, we conduct all experiments at the character level. To share the commonality of both languages and reduce the vocabulary size, we convert all texts into simplified Chinese. (We also attempted to convert all texts into traditional characters. This does not work well, since some simplified characters have multiple corresponding traditional characters, and such one-to-many mappings result in ambiguity and data sparsity.) For computational efficiency, we keep sentences whose length lies between 4 and 32 and remove sentences containing low-frequency characters. Finally, each of the Man and Can monolingual training corpora consists of 20M sentences. The statistics of the training sets are summarized in Tab. 1. As seen, Can has a larger vocabulary and more unique characters than Man. To identify the commonality and diversity of Can and Man, we compute Spearman's rank correlation coefficient [37] between the two vocabularies ranked by frequency within each corpus. The coefficient over the two full vocabularies indicates that the overall relation is significantly strong, whereas the coefficient over the 250 most frequent tokens indicates that the relation is significantly weak. These results support our hypothesis that dialects share considerable commonality with each other, but exhibit diversity among the most frequent tokens.
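The frequency-rank comparison described above can be sketched in plain Python as follows. This is an illustrative reconstruction, not the authors' script: how the joint vocabulary is formed, how the most frequent tokens are selected (here, by combined frequency), and the tie handling (average ranks) are all assumptions.

```python
from collections import Counter

def average_ranks(values):
    """Ranks (1 = largest value); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length lists (undefined if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def frequency_spearman(corpus_a, corpus_b, top_k=None):
    """Spearman correlation between character-frequency rankings of two corpora."""
    count_a = Counter(ch for sent in corpus_a for ch in sent)
    count_b = Counter(ch for sent in corpus_b for ch in sent)
    # Joint vocabulary, most frequent (by combined count) first.
    vocab = sorted(set(count_a) | set(count_b), key=lambda ch: -(count_a[ch] + count_b[ch]))
    if top_k is not None:
        vocab = vocab[:top_k]
    freq_a = [count_a[ch] for ch in vocab]
    freq_b = [count_b[ch] for ch in vocab]
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return pearson(average_ranks(freq_a), average_ranks(freq_b))

corpus_x = ["aaab", "abc"]   # toy corpora; real input would be lists of sentences
corpus_y = ["cccb", "cba"]
print(frequency_spearman(corpus_x, corpus_y))
```

Calling the function once without `top_k` and once with `top_k=250` corresponds to the two comparisons reported in the text.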

Model | Can→Man | Man→Can | # Params (M)
Character-level Rule-based Transition | 42.18 | 42.27 | -
Unsupervised Style Transfer [13] | 41.97 | 42.03 | 14.40
Unsupervised PB-SMT [16] | 42.12 | 42.20 | -
Unsupervised NMT [16] | 42.90 | 42.39 | 39.08
Layer Coordination | 48.45 | 43.11 | 39.08
Pivot-Private Embedding | 52.74 | 46.69 | 36.65
Pivot-Private Embedding + Layer Coordination | 54.95 | 47.45 | 36.65
Table 2: Experimental results on unsupervised neural dialect translation. # Params (M): number of parameters in millions. Layer coordination improves over the baseline in both directions, and pivot-private embedding improves the result further, by almost 10 BLEU points over the baseline on Can→Man. Combining layer coordination and pivot-private embedding gives the best result, exceeding the baseline NMT system by 12 and 5 BLEU points in the two directions, respectively.


Experimental Setting

We use Transformer [30] as our model architecture and follow the base model setting for the model dimensionalities. We follow the parameter settings of Lample et al. [16] and implement our approach on top of their source code (https://github.com/facebookresearch/UnsupervisedMT).

We use the BLEU score as the evaluation metric. The training of each model is early-stopped to maximize the BLEU score on the development set.

All embeddings are pretrained using fastText [5] (https://github.com/facebookresearch/fastText), and the pivot embeddings are derived from the concatenated training corpora. During training, the diversity (back-translation) loss weight $\lambda_d$ is set to 1.0, while the commonality (denoising) loss weight $\lambda_c$ is linearly decayed from 1.0 at the beginning to 0.0 at step 200k.
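The linear decay of a loss weight from 1.0 to 0.0 over the first 200k steps can be written as a small schedule function. This is a sketch with hypothetical names; that the weight stays at 0.0 after step 200k is an assumption, since the text only specifies the two end points.

```python
def linear_decay(step, start=1.0, end=0.0, decay_steps=200_000):
    """Linearly interpolate from `start` at step 0 to `end` at `decay_steps`, then hold."""
    frac = min(max(step, 0), decay_steps) / decay_steps
    return start + (end - start) * frac

for step in (0, 100_000, 200_000, 300_000):
    print(step, linear_decay(step))
```

Decaying one objective while keeping the other fixed gradually shifts the training signal from the decayed loss to the constant one.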


We compare our model with four systems:

  • We collect simple Can-Man conversion rules and regard character-level transition as one of our baseline systems.

  • Since our model is built upon unsupervised NMT methods, we choose one of the most widely used architectures [16] as a baseline system.

  • Moreover, unsupervised phrase-based statistical MT [16] performs comparably to its NMT counterpart. Therefore, we also take the unsupervised PB-SMT model into account.

  • For reference, we also examine whether a style transfer system [13] can handle the dialect translation task.
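The first baseline, character-level rule-based transition, amounts to a per-character lookup. The sketch below uses a tiny hypothetical rule table built from a few well-known Cantonese-Mandarin character correspondences; the rules actually collected for the paper are not listed in the text, and the example is written in traditional characters purely for illustration.

```python
# Hypothetical, tiny Can->Man rule table (illustrative only; not the paper's rule set).
CAN_TO_MAN = {"係": "是", "唔": "不", "佢": "他", "嘅": "的", "喺": "在"}

def convert(sentence, rules):
    """Replace each character that has a rule; copy every other character unchanged."""
    return "".join(rules.get(ch, ch) for ch in sentence)

print(convert("佢唔係學生", CAN_TO_MAN))
```

A lookup of this kind cannot resolve one-to-many correspondences or reorder words, which is one plausible reason why such a system is treated only as a baseline.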

Overall Performances

Tab. 2 lists the experimental results. As seen, the character-level rule-based translation system performs comparably with the conventional unsupervised NMT system. This is in accord with Lample et al. [16], who find that the training process of unsupervised NMT is vulnerable, because no alignment information between languages is available for model training. By comparison, the character transition rules offer adequate aligned references to yield fair results. Besides, the unsupervised PB-SMT model performs slightly worse than the NMT system; a possible reason is that it is hard for the model to extract a good phrase table from colloquial data [17]. We also evaluate a style transfer system [13]. The model underperforms the unsupervised NMT baseline, indicating that, to some extent, style transfer is not adequate for dialect translation.

As to our proposed methods, layer coordination improves performance by more than 5 BLEU points in the Can→Man direction, showing that sharing coordinate information at the same semantic level among dialects is effective. Using pivot-private embedding gives a larger increase of nearly 10 BLEU points while also reducing the model size, verifying that jointly modeling the commonality and diversity of both dialects is both effective and efficient. Furthermore, combining the two yields an improvement of more than 12 BLEU points over the baseline NMT system, revealing that pivot-private embedding and layer coordination are complementary to each other. In the Man→Can direction, we also observe improvements from our proposed methods. Translating Man to Can is more difficult, since it involves more one-to-many character-level transition cases than the reverse direction. Despite this, our best approach still gains 5 BLEU points over the baseline systems on Man→Can translation, revealing the general effectiveness of our proposed method.

Model | Can→Man | Man→Can
Baseline | 1.80 ± 0.44 | 2.57 ± 0.50
Our Model | 2.50 ± 0.87 | 3.16 ± 0.61
Table 3: Human assessment of our experimental results. The improvement of our model over the baseline is strongly significant.

Human Assessment

Since the BLEU metric may be insufficient to reflect the quality of colloquial sentences, we randomly extract 50 Can→Man and 50 Man→Can examples from the test set for human evaluation. Each example contains the source sentence and the translations produced by the unsupervised NMT model (“baseline”) and by our proposed model. Each native speaker is asked to give each translation a score from 1 to 4 indicating its quality. Each reported result is the average score assigned by 10 native speakers. As seen in Tab. 3, our proposed method significantly outperforms the baseline NMT system in both the Can→Man and Man→Can directions.

Effectiveness of Pivot-Private Embedding

Figure 3: Model performance with various pivot embedding dimensionalities on the dev set. # Params (M): number of parameters in millions. Assigning an adequate dimensionality to the pivot embedding is more effective than sharing no dimensions between the two dialects (dimensionality 0) or sharing all dimensions (dimensionality 512).

To investigate the effectiveness of pivot-private embedding, we further examine the dimensionality of the pivot embedding. As seen in Fig. 3, sharing an adequate part of the word embedding among dialects greatly improves performance, while using two independent sets of embeddings for the dialects, or sharing all embedding dimensions, leads to poor results. This indicates the importance of balancing commonality and diversity for dialect translation. Moreover, the more dimensions are assigned to the pivot embedding, the fewer parameters the model requires. We argue that pivot-private embedding is not only an efficient way to strengthen the ability of a dialect translation system to model diversity, but also offers an alternative way to relieve the effect of over-parameterization.

Compared to the model with dimensionality 128, the model with 256 pivot embedding dimensions yields comparable results in both translation directions while requiring fewer parameters. Consequently, we use 256 as the default pivot embedding dimensionality.
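Why a larger pivot dimensionality reduces the parameter count can be checked with a short calculation. The sketch below counts only embedding-table parameters and assumes one shared pivot table plus one private table per dialect, with the joint vocabulary size of 9,397 and a model dimensionality of 512; the function name is hypothetical and the authors' exact parameter accounting is not specified in the text.

```python
def embedding_params(vocab_size, d_model, d_pivot, n_dialects=2):
    """Embedding parameters with a shared pivot part and per-dialect private parts."""
    d_private = d_model - d_pivot
    return vocab_size * d_pivot + n_dialects * vocab_size * d_private

for d_pivot in (0, 128, 256, 512):
    print(d_pivot, embedding_params(vocab_size=9397, d_model=512, d_pivot=d_pivot))
```

Under these assumptions, moving from no sharing to 256 pivot dimensions saves 9397 × 256 = 2,405,632 parameters, which is close to the 2.43M gap between the 39.08M and 36.65M models reported in Tab. 2.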

Effectiveness of Layer Coordination

Figure 4: Learning curves of models on the dev set. The model with layer coordination (w) converges at around step 240k, while the model without it (w/o) converges at around step 200k. Applying layer coordination improves the performance of the dialect translation model and significantly stabilizes the training process.
Figure 5: Experiments on the number of shared encoder/decoder layers on the dev set, for (a) Man→Can and (b) Can→Man. Here w and w/o denote with and without layer coordination, respectively. Even without any shared layer, the model with layer coordination remains trainable, unlike the model without it. Models without layer coordination gain significant improvement from sharing an adequate number of layers between the two dialects, while their performance decreases if all layers are shared. With the proposed layer coordination, the more layers are shared between the two dialects, the higher the performance.

Layer coordination lets features from both dialects interact, helping the model capture the commonality of linguistic features at coordinate semantic levels [25]. He et al. [12] reveal that layer coordination offers more aligned features at the same level, from lexical, through syntactic, to semantic. In this section, we investigate how layer coordination affects translation quality.

Stability Analysis

We first visualize the convergence of models with and without layer coordination. From Fig. 4, we observe that the model with layer coordination trains steadily, whereas the training of the model without layer coordination is fragile; in particular, its BLEU score on the dev set drops by nearly 5 points midway through training. We attribute this to the fact that layer coordination provides coordinate semantic information [12], which benefits commonality modeling in our dialect translation task. Since the two dialects share similar features, each decoder layer can leverage more fine-grained information from the source side at the same semantic level, instead of exploiting only top-level representations.

Parameter Sharing

For further investigation, we also analyze the effect of shared layers. As visualized in Fig. 5, the baseline system performs worse when no layer is shared, and models with 3 shared layers perform better. This is consistent with the findings of Lample et al. [16], who suggest sharing the higher 3 layers of the encoder and the lower 3 layers of the decoder. For the proposed model, sharing more layers is beneficial for Can-Man translation in both directions, and the model with all layers shared gives the best performance in both directions. This demonstrates that Can and Man share more linguistic characteristics than distant languages do [1, 14], and that layer coordination also contributes to the balance of commonality and diversity modeling in the dialect translation task.

Related Work

In this section, we give an account of related research.

Dialect Translation

To the best of our knowledge, related studies on dialect translation have been carried out for many languages. For example, for Arabic [4] and Indian [6] dialects, applying syllable symbols is effective for sharing information across languages. Compared to these tasks, our work mainly focuses on the problems of the Can-Man translation task. Can and Man have little syllable information in common, as even the same character can differ widely in pronunciation [18, 32]. Moreover, a set of Can characters is rarely seen in Man, because Can is a dialect without formal regulation of written characters [18]. In addition, younger Can speakers tend to use phonetic labels (e.g. “d” for “di”) or homophonic character symbols instead of the original characters, which raises intractable issues when building the translation model.

Unsupervised Learning

Our work builds on a substantial body of research on unsupervised machine translation [14, 1, 16], which devises well-designed training schedules for unsupervised translation tasks. The difference between our research and theirs mainly lies in the similarity of the languages involved: the dialects in our research are far more similar to each other than the languages in those unsupervised NMT tasks.

Moreover, our research is closely related to studies on style transfer [13, 26]. There are two main differences between our task and style transfer. First, the source and target sides in style transfer belong to the same language and differ mainly in style, e.g. sentiment [13], whereas dialect translation has to keep the semantics of the two sides identical. Second, there are more commonalities between source and target in style transfer than in dialect translation. The former focuses on the transition between different styles, and the two sides can sometimes be distinguished by only a few words. In contrast, dialects have wide discrepancies, ranging from vocabulary and word frequency to syntactic structure.

Methodologically, compared to the studies mentioned above, we are motivated by the similarities and differences between dialects and propose pivot-private embedding and layer coordination to jointly balance commonality and diversity.

Conclusions and Future Work

In this study, we investigate the feasibility of building a dialect machine translation system. Due to the lack of a parallel training corpus, we approach the problem with unsupervised learning. Considering the characteristics of dialect translation, we further improve our translation model with pivot-private embedding and layer coordination, thus enriching the sharing of mutual linguistic information across dialects (Can-Man). Our experimental results confirm that our improvements are universally effective and complementary to each other. Our main contributions are:

  • We propose the dialect translation task and construct a large-scale collection of monolingual sentences for the spoken dialects Man and Can;

  • We apply an unsupervised learning algorithm to the Can-Man dialect translation task, and leverage commonality and diversity modeling, including pivot-private embedding and layer coordination, to strengthen translation between dialects;

  • Our approach outperforms the conventional unsupervised NMT system by more than 12 BLEU points, achieving considerable performance and setting a new benchmark for the proposed Can-Man translation task.

In the future, it would be interesting to validate our principles, i.e. commonality and diversity modeling, on other tasks, such as conventional machine translation and style transfer. Another promising direction is to incorporate linguistic knowledge into the unsupervised learning procedure, e.g. phrasal patterns [33], word order information [35] and syntactic structure [36].


This work was supported in part by the National Natural Science Foundation of China (Grant No. 61672555), the Joint Project of the Science and Technology Development Fund, Macau SAR and National Natural Science Foundation of China (Grant No. 045/2017/AFJ), the Science and Technology Development Fund, Macau SAR (Grant No. 0101/2019/A2), and the Multi-year Research Grant from the University of Macau (Grant No. MYRG2017-00087-FST). We thank all the reviewers for their insightful comments.


  • [1] M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2018) Unsupervised Neural Machine Translation. In ICLR, Cited by: Introduction, Dialect Translation with Unsupervised Learning, Pivot-Private Embedding, Parameter Sharing, Unsupervised Learning.
  • [2] M. Artetxe, G. Labaka, and E. Agirre (2017) Learning Bilingual Word Embeddings With (Almost) No Bilingual Data. In ACL, Cited by: Pivot-Private Embedding.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, Cited by: Preliminary, Preliminary.
  • [4] L. H. Baniata, S. Park, and S. Park (2018) A Neural Machine Translation Model for Arabic Dialects That Utilizes Multitask Learning (MTL). Computational intelligence and neuroscience 2018. Cited by: Dialect Translation.
  • [5] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching Word Vectors With Subword Information. TACL 5. Cited by: Experimental Setting.
  • [6] S. Chakraborty, A. Sinha, and S. Nath (2018) A Bengali-Sylheti Rule-Based Dialect Translation System: Proposal and Preliminary System. In I3CS, Cited by: Dialect Translation.
  • [7] J. K. Chambers and P. Trudgill (1998) Dialectology. 2 edition, Cambridge Textbooks in Linguistics, Cambridge University Press. External Links: Document Cited by: Introduction.
  • [8] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, Y. Wu, and M. Hughes (2018) The Best Of Both Worlds: Combining Recent Advances In Neural Machine Translation. In ACL, Cited by: Introduction.
  • [9] S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding Back-Translation at Scale. In EMNLP, Cited by: Diversity Modeling.
  • [10] O. Firat, K. Cho, and Y. Bengio (2016) Multi-Way, Multilingual Neural Machine Translation With A Shared Attention Mechanism. In NAACL, Cited by: Introduction.
  • [11] J. Hao, X. Wang, B. Yang, L. Wang, J. Zhang, and Z. Tu (2019) Modeling Recurrence for Transformer. In NAACL, Cited by: Layer Coordination.
  • [12] T. He, X. Tan, Y. Xia, D. He, T. Qin, Z. Chen, and T. Liu (2018) Layer-Wise Coordination Between Encoder and Decoder for Neural Machine Translation. In NIPS, Cited by: Layer Coordination, Stability Analysis.
  • [13] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward Controlled Generation of Text. In ICML, Cited by: Table 2, 4th item, Overall Performances, Unsupervised Learning.
  • [14] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2018) Unsupervised Machine Translation Using Monolingual Corpora Only. In ICLR, Cited by: Introduction, Dialect Translation with Unsupervised Learning, Pivot-Private Embedding, Parameter Sharing, Unsupervised Learning.
  • [15] G. Lample, A. Conneau, M. Ranzato, L. Denoyer, and H. Jégou (2018) Word Translation Without Parallel Data. In ICLR, Cited by: Introduction.
  • [16] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato (2018) Phrase-Based & Neural Unsupervised Machine Translation. In EMNLP, Cited by: Introduction, Introduction, Dialect Translation with Unsupervised Learning, Pivot-Private Embedding, Table 2, 2nd item, 3rd item, Unsupervised Learning.
  • [17] F. P. D. T. O. Laurens, P. McFetridge, J. D. N. P. M. Maricela, C. L. Pidruchney, and S. MacDonald (1997) A Lexicalist Approach to the Translation of Colloquial Text. In TMI, Cited by: Overall Performances.
  • [18] T. Lee and C. Wong (1998) CANCORP: The Hong Kong Cantonese Child Language Corpus. Cahiers de Linguistique Asie Orientale 27. Cited by: Introduction, Monolingual Corpora, Dialect Translation.
  • [19] J. Li, Z. Tu, B. Yang, M. R. Lyu, and T. Zhang (2018) Multi-Head Attention with Disagreement Regularization. In EMNLP, Cited by: Introduction.
  • [20] J. Li, B. Yang, Z. Dou, X. Wang, M. R. Lyu, and Z. Tu (2019) Information Aggregation for Multi-Head Attention with Routing-by-Agreement. In NAACL, Cited by: Layer Coordination.
  • [21] X. Liu, D. F. Wong, Y. Liu, L. S. Chao, T. Xiao, and J. Zhu (2019) Shared-Private Bilingual Word Embeddings for Neural Machine Translation. In ACL, Cited by: Private Embedding.
  • [22] K. K. Luke and M. L. Wong (2015) The Hong Kong Cantonese Corpus: Design and Uses. Journal of Chinese Linguistics 25. Cited by: Monolingual Corpora.
  • [23] J. Lyons (1981) Language and Linguistics. Cambridge University Press. Cited by: Introduction.
  • [24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed Representations of Words and Phrases and Their Compositionality. In NIPS, Cited by: Pivot-Private Embedding.
  • [25] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep Contextualized Word Representations. In NAACL, Cited by: Layer Coordination, Effectiveness of Layer Coordination.
  • [26] S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black (2018) Style Transfer Through Back-Translation. In ACL, Cited by: Unsupervised Learning.
  • [27] R. Sennrich, B. Haddow, and A. Birch (2016) Improving Neural Machine Translation Models With Monolingual Data. In ACL, Cited by: Introduction, Diversity Modeling, Dialect Translation with Unsupervised Learning.
  • [28] R. Sennrich, B. Haddow, and A. Birch (2016) Neural Machine Translation of Rare Words with Subword Units. In ACL, Cited by: Preliminary.
  • [29] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to Sequence Learning With Neural Networks. In NIPS, Cited by: Preliminary.
  • [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. In NIPS, Cited by: Introduction, Preliminary, Experimental Setting.
  • [31] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and Composing Robust Features with Denoising Autoencoders. In ICML, Cited by: Introduction.
  • [32] T. Wong and J. Lee (2018) Register-Sensitive Translation: A Case Study of Mandarin and Cantonese (Non-Archival Extended Abstract). In AMTA, Cited by: Introduction, Dialect Translation.
  • [33] M. Xu, D. F. Wong, B. Yang, Y. Zhang, and L. S. Chao (2019) Leveraging Local and Global Patterns for Self-Attention Networks. In ACL, Cited by: Conclusions and Future Work.
  • [34] B. Yang, J. Li, D. Wong, L. S. Chao, X. Wang, and Z. Tu (2019) Context-Aware Self-Attention Networks. In AAAI, Cited by: Layer Coordination.
  • [35] B. Yang, L. Wang, D. F. Wong, L. S. Chao, and Z. Tu (2019) Assessing the Ability of Self-Attention Networks to Learn Word Order. In ACL, Cited by: Conclusions and Future Work.
  • [36] B. Yang, D. F. Wong, L. S. Chao, and M. Zhang (2019) Improving Tree-based Neural Machine Translation with Dynamic Lexicalized Dependency Encoding. Knowledge-Based Systems, pp. 105042. Cited by: Conclusions and Future Work.
  • [37] V. Zhelezniak, A. Savkov, A. Shen, and N. Hammerla (2019) Correlation Coefficients and Semantic Textual Similarity. In NAACL, Cited by: Data Preprocessing & Statistics.