Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

09/02/2017
by Hassan Sajjad, et al.

Word segmentation plays a pivotal role in any Arabic NLP application, and considerable research effort has gone into improving its accuracy. Off-the-shelf tools, however, are i) complicated to use and ii) domain- and dialect-dependent. We explore three language-independent alternatives to morphological segmentation: i) data-driven sub-word units, ii) characters as the unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of machine translation and POS tagging, we found that these methods achieve performance close to, and occasionally surpassing, the state of the art. Our analysis shows that a neural machine translation system is sensitive to the ratio of source to target tokens, and that a ratio close to 1 or greater gives optimal performance.
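
To make the token-ratio observation concrete, the following is a minimal sketch of how such a ratio could be measured for a parallel corpus under two segmentation schemes. It is an illustration only, not the authors' code: the function names, the `_` word-boundary symbol, the toy sentence pair, and the choice to segment only the source side are all assumptions made here, and the ratio is taken to mean total source tokens divided by total target tokens.

```python
# Illustrative sketch only; names and the boundary symbol are hypothetical.

def segment_words(sentence):
    """Whitespace tokenization: each word is one token."""
    return sentence.split()


def segment_chars(sentence, boundary="_"):
    """Character-level units, with an assumed boundary token between words."""
    tokens = []
    for i, word in enumerate(sentence.split()):
        if i > 0:
            tokens.append(boundary)
        tokens.extend(word)  # one token per character
    return tokens


def token_ratio(src_sents, tgt_sents, src_seg, tgt_seg):
    """Total source tokens divided by total target tokens."""
    src_total = sum(len(src_seg(s)) for s in src_sents)
    tgt_total = sum(len(tgt_seg(s)) for s in tgt_sents)
    return src_total / tgt_total


# Toy pair: one morphologically rich Arabic word against five English words.
src = ["وسيكتبونها"]
tgt = ["and they will write it"]

word_ratio = token_ratio(src, tgt, segment_words, segment_words)
char_ratio = token_ratio(src, tgt, segment_chars, segment_words)
print(f"word-level ratio: {word_ratio:.2f}")
print(f"char-level ratio: {char_ratio:.2f}")
```

With unsegmented words the single Arabic token faces five English tokens, so the ratio falls well below 1; splitting the source into smaller units raises it to 1 or above. A data-driven sub-word segmenter could be passed in as `src_seg` in the same way.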

