In the US alone, there are approximately 900,000 hearing-impaired people whose primary mode of conversation is sign language. For these people, communication with non-signers is a daily struggle, and they are often disadvantaged when it comes to finding a job, accessing health care, etc. There are a few emerging technologies designed to translate sign language to English in real time, but most of the current research attempts to convert raw signs into English words. This aspect of the translation is certainly necessary, but it does not take into account the grammatical differences between signed languages and spoken languages. In this paper, we outline our bidirectional translation system that converts sentences from American Sign Language (ASL) to English, and vice versa.
To perform machine translation between ASL and English, we utilize a generative approach. Specifically, we employ an adjustment to the IBM word-alignment model 1 (IBM WAM1)[ibm] where we define language models for English and ASL, as well as a translation model, and attempt to generate a translation that maximizes the posterior distribution defined by these models. Using these models, we are able to quantify the concepts of fluency and faithfulness of a translation between languages.
2 Related Work
In examining papers and projects related to this topic, we have discovered that translation between signed and spoken languages is still largely an open problem. The few emerging technologies that attempt to tackle this issue are only capable of translating single words or short phrases of ASL into English. In fact, most approaches focus on analyzing videos of signs and converting these into words, which a number of recent CS229 projects have done. However, several teams at the University of Pennsylvania have taken a more grammatical approach [zhao2000machine] to the problem of English-to-ASL translation but not in the other direction. There has also been an Italian team that has designed a tentative translation system for Italian sign language,[mazzei2013deep] but there are very few parallels with respect to grammatical structures that we can draw for this project. In short, there has been relatively little emphasis in the literature on the conversion between the grammars of the languages, which is what we examine in this project.
3 Task Definition
Let us formalize the notation for this task. For the sake of simplicity, we will use ASL as our “source language” and English as our “target language.” In general, $S = (s_1, \ldots, s_m)$ will be used to refer to the sentence in ASL, where $s_1, \ldots, s_m$ is a sequence of signs and $m$ is the length of the sentence. Each sign is represented by a word with uppercase letters. We also include special “gesture tokens,” which represent gestures or hand movements (e.g., [point], [head_shake]) that are not signs in and of themselves but rather modify other signs in the sentence. As a special case, which we address later, commas are also treated as individual sign tokens. Similarly, we will use $E = (e_1, \ldots, e_n)$ to refer to the English sentence, where $e_1, \ldots, e_n$ is a sequence of words and $n$ is the length of the sentence.
In the backward direction, the input is a grammatically correct English sentence and the output is a sequence of signs represented by words. An example for this input is
To evaluate translations generated by our system as well as the baselines in both directions, we use the BLEU-2 score,[papineni2002bleu] which takes in a predicted sentence and a reference sentence, both in the target language. Let $p_n$ be the ratio of the number of shared $n$-grams to the total number of $n$-grams in the predicted sentence. Then for a predicted sentence $\hat{E}$ and reference sentence $E$, the BLEU-N score is given by
$$\text{BLEU-}N(\hat{E}, E) = \exp\left(\min\left(0,\ 1 - \frac{|E|}{|\hat{E}|}\right)\right) \left(\prod_{n=1}^{N} p_n\right)^{1/N}.$$
Thus BLEU-2 is essentially the ratio of shared unigrams and bigrams between the predicted translation and the gold-standard translation to the number that appear in the predicted translation. The exponential factor at the beginning of the above expression is called the brevity penalty. Because the rest of the BLEU expression measures the precision of the prediction, it does not penalize the prediction for leaving out words. The role of the brevity penalty is to penalize predicted sentences that are shorter than their reference counterparts. As a concrete example, consider translating to the English with the correct reference . The BLEU-2 score of this prediction is
The BLEU-2 metric is used primarily to rate translations at a corpus level, so it does not perform as well when used to evaluate translations of single sentences. Despite that, the BLEU-2 score should be a sufficient evaluation metric for the purposes of this project, since it is still the most common machine translation evaluator and is relatively simple to interpret.
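To make the metric concrete, the following is a minimal sketch of a sentence-level BLEU-N computation. It assumes the common formulation: a brevity penalty of exp(min(0, 1 - r/c)), where r and c are the reference and predicted lengths, multiplied by the geometric mean of clipped n-gram precisions. The function names `ngrams`, `ngram_precision`, and `bleu` are illustrative and are not taken from our system.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(predicted, reference, n):
    """Fraction of predicted n-grams that also occur in the reference (clipped counts)."""
    pred_counts = Counter(ngrams(predicted, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return shared / total


def bleu(predicted, reference, max_n=2):
    """Sentence-level BLEU-N: brevity penalty times the geometric mean of precisions."""
    if not predicted:
        return 0.0
    # Penalize predictions that are shorter than the reference.
    brevity = math.exp(min(0.0, 1.0 - len(reference) / len(predicted)))
    precisions = [ngram_precision(predicted, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity * geometric_mean
```

Because every n-gram precision is computed over the predicted sentence only, a prediction that drops the last word of the reference keeps perfect precision and is penalized solely through the brevity term.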
Our data consists of 579 ASL/English translation pairs scraped from a repository from lifeprint.com,[aslcorpus] a website dedicated to ASL education. Each pair is a single accurate translation of an ASL sentence or question into English. Many of the ASL sentences contain gesture tokens. These pairs are not specifically designed to be translations from English to ASL and in general, translation between languages is not symmetric. However, we make the simplifying assumption that this translation is symmetric primarily because we do not have a thorough enough understanding of ASL to translate from English to ASL.
To test our algorithms, we split the data into two main sets: a training set (80% of original data) and a testing set (remaining 20%). We also have a third set, called the development set, that we use to tune our hyperparameters. This set is constructed by pulling out 10 examples from the training set and treating them as our development set. A more standard convention is to use a 70%-30% split for training and testing data, but we decided to use a larger training set size because we use these signs as our entire corpus for our ASL language model. We also use small subsets of the data to run tests to assess the performance of specific aspects of our system. For example, we construct a test set from the original test set to judge the performance of the “comma-trigram” feature of our ASL language model. We construct another test set to handle gesture tokens. To assess these tests, we use the BLEU-2 metric described in the task definition.
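The split described above can be sketched as follows. This is only an illustration under stated assumptions: the shuffling, the fixed random seed, and the function name `split_pairs` are hypothetical choices, and the text does not specify how the examples were ordered before splitting.

```python
import random


def split_pairs(pairs, test_fraction=0.2, dev_size=10, seed=0):
    """Split translation pairs into train/dev/test sets.

    The test set takes `test_fraction` of the data; the development set is
    then pulled out of the remaining training examples.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]
    dev, train = remaining[:dev_size], remaining[dev_size:]
    return train, dev, test
```

With 579 pairs, this yields 116 test pairs, 10 development pairs, and 453 training pairs.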
5 Technical Approach
Below, we detail the baselines, oracles, and the IBM word alignment method for this problem.
5.1 Baseline and Oracle
Because of the bidirectional nature of the problem, we have two baselines. For a baseline in the English-to-ASL direction, we use a unigram cost function over English to detect the “most important” words in the sentence and directly translate each word into ASL, preserving the order of these words. To pick the most important words, our baseline selects the highest cost words above some threshold based on our unigram cost function. In this case, a higher cost implies a lower frequency in the English language, suggesting higher importance. For our baseline in the other direction, we directly translate each sign into English and insert small helper words (such as “a,” “the,” “and,” etc.), treating this as a search problem that tries to minimize the cost of the English sentence using a bigram cost function. To test these baselines, we utilized the same BLEU-2 score as we did for the results generated by our IBM word alignment system. This approach allowed us to easily compare the baseline results and IBM WAM results, which are reported in the Results section.
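A minimal sketch of the English-to-ASL baseline is given below. It assumes that the unigram cost is an add-one smoothed negative log relative frequency and that direct translation is a lexicon lookup that falls back to uppercasing the word; `unigram_cost`, `baseline_english_to_asl`, `counts`, and `lexicon` are hypothetical names introduced for illustration.

```python
import math


def unigram_cost(word, counts, total):
    """Negative log relative frequency (add-one smoothed); rarer words cost more."""
    return -math.log((counts.get(word, 0) + 1) / (total + len(counts) + 1))


def baseline_english_to_asl(sentence, counts, lexicon, threshold):
    """Keep only high-cost (rare) words and translate them sign by sign, in order."""
    total = sum(counts.values())
    signs = []
    for word in sentence.lower().split():
        if unigram_cost(word, counts, total) > threshold:
            # Assumption: unknown words are glossed by uppercasing them.
            signs.append(lexicon.get(word, word.upper()))
    return signs
```

For example, with toy counts in which “the” and “is” are frequent and “dog” and “brown” are rare, the sentence “The dog is brown” is reduced to the sign sequence DOG BROWN.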
For our oracles in both directions, we have someone familiar with both languages translate each sentence into the other language. These translations are adequate oracles because they make the most sense from a human standpoint. Furthermore, the oracles serve as appropriate gold standard translations that can be used in the BLEU score calculation.
5.2 Custom WAM
The IBM WAM1 is a common machine translation model that we customize to take advantage of common constructions in ASL and allow us to modify how the translation model and language models are constructed. The custom model is meant to give us flexibility in our translations.
We now describe the general workflow of our model in the ASL to English direction. This workflow applies to the other direction as well, except with $S$ switched with $E$: 1) For each potential translation $E$, we calculate $P(E)$. 2) Then, given this channel input, we calculate $P(S \mid E)$ in the “noisy channel.” 3) Finally, we choose the $E$ that maximizes $P(E) P(S \mid E)$. Here, $P(E)$ is represented by the language model while $P(S \mid E)$ is represented by the translation model. We will derive this formulation below.
For concreteness, we will use a running translation example in the descriptions of our models and algorithms. Consider the following gold-standard translation from ASL to English:
To derive the models that we will use in the translation, we consider the following optimization problem (note that once again we are writing this derivation in the ASL to English direction but the same derivation applies to the other direction):
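Given a sentence $S$ in ASL, we want to find an English sentence $\hat{E}$ such that:
$$\hat{E} = \arg\max_{E} P(E \mid S) = \arg\max_{E} \frac{P(E)\, P(S \mid E)}{P(S)} = \arg\max_{E} P(E)\, P(S \mid E).$$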
The probability $P(E)$ represents the fluency of $E$ in English, i.e. how much the translation makes sense to a native speaker, while $P(S \mid E)$ represents the faithfulness of the translation from $E$ to $S$, i.e. if the translation reflects the actual meaning of what is being said. We determine $P(E)$ using a language model for English and $P(S \mid E)$ using a translation model from English to ASL. (Note that in our overall translation from ASL to English, we model the probability of a given translation from English to ASL.)
5.2.2 Language Model
To compute $P(E)$ for a given English sentence $E$, we use an $n$-gram cost function. For instance, in our example sentence above in the case where $n = 3$,
where $c$ is a 3-gram cost function. We have imported $n$-gram cost functions from http://www.ngrams.info/ [ngrams] for $n = 1, 2, 3$.
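As an illustration of how an n-gram cost function scores a sentence, the sketch below sums the costs of all n-grams in a start-padded sentence. It assumes that each cost is a negative log probability, so that the sentence probability is the exponential of the negated total cost; the `<s>` padding token, the `unseen_cost` penalty, and the function names are hypothetical choices rather than details of the imported cost functions.

```python
import math


def sentence_cost(words, ngram_cost, n=3, unseen_cost=20.0):
    """Sum the costs of every n-gram in the sentence, padding the start with <s>."""
    padded = ["<s>"] * (n - 1) + list(words)
    total = 0.0
    for i in range(n - 1, len(padded)):
        gram = tuple(padded[i - n + 1:i + 1])
        # Assumption: n-grams missing from the table get a fixed large cost.
        total += ngram_cost.get(gram, unseen_cost)
    return total


def sentence_probability(words, ngram_cost, n=3, unseen_cost=20.0):
    """Convert a total cost into a probability, treating cost as -log P."""
    return math.exp(-sentence_cost(words, ngram_cost, n, unseen_cost))
```

A lower total cost corresponds to a more fluent sentence under the model, which is why the baselines and the decoder can treat fluency as a cost to minimize.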
There are a couple of caveats with the above approach when trying to create a language model for ASL, which is necessary for translation in the other direction (i.e. to calculate $P(S)$). The first is that our ASL language model is basically a unigram model that has additional components. In addition to each sign having an associated probability, which is a simple unigram model, we add a comma trigram construction. This construction incorporates into the language model the following grammatical structure that is commonly found in ASL,
In other words, an infrequent (uncommon) noun will be followed by a comma, which is followed by a phrase that generally describes or refers to the noun. The phrase after the comma usually starts with a more frequent (common) word. The way the comma trigram encapsulates this is by increasing the unigram probability of the second word and decreasing the unigram probability of the first word. This effectively favors a big difference between the word before the comma and the word after the comma, which in turn favors an uncommon word being translated before the comma and a common word after the comma. The second caveat is that we incorporate gesture tokens, which are contained in brackets, into our corpus. Our language model does not, however, treat these gesture tokens any differently than regular signs.
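One possible reading of the comma trigram adjustment is sketched below. The exact form of the adjustment is not specified in the text, so this is an assumption: for every pattern (first sign, comma, second sign), the log-score is shifted by a bonus proportional to the log-probability of the second sign minus that of the first, which rewards an uncommon sign before the comma and a common sign after it. The names `asl_log_prob`, `comma_bonus`, and `floor` are hypothetical.

```python
import math


def asl_log_prob(signs, unigram_prob, comma_bonus=1.0, floor=1e-6):
    """Unigram log-probability of a sign sequence with a comma-trigram adjustment."""

    def logp(sign):
        # Assumption: unseen signs receive a small floor probability.
        return math.log(unigram_prob.get(sign, floor))

    # Plain unigram score; gesture tokens are treated like any other sign.
    score = sum(logp(sign) for sign in signs)
    for i in range(1, len(signs) - 1):
        if signs[i] == ",":
            # Favor a rare sign before the comma and a common sign after it.
            score += comma_bonus * (logp(signs[i + 1]) - logp(signs[i - 1]))
    return score
```

Setting `comma_bonus` to zero recovers the plain unigram model, which makes it straightforward to measure the effect of the construction in isolation.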
5.2.3 Translation Model
Now, we wish to estimate the probability $P(S \mid E)$ given a trained translation model. We represent this model as a mapping from “sign-English” pairs to probabilities (of those pairs) using a dictionary in Python. Training this model requires us to introduce a set of alignment variables $A = (a_1, \ldots, a_m)$, where $a_j$ represents the index of the word in the English sentence that translates to the $j$th word in $S$. We also allow these variables to take on a value of 0, which represents a “null” word in the English sentence. This null word allows for the possibility that there is no word in the English sentence that directly translates to some sign in $S$. In the case of our example above, if $A$ specifies that “learning” in $E$ translates to “LEARN” in $S$, then $a_3 = 4$, since “LEARN” is the third word in $S$ and “learning” is in the fourth position of $E$. Thus, $A$ can be thought of as a many-to-one mapping from words in $E$ to signs in $S$. (In general, this mapping would need to be many-to-many since we sometimes require phrase-to-phrase translation, but for our purposes the many-to-one assumption is reasonable because generally speaking, multiple English words map to only one sign.) Then, to compute $P(S \mid E)$, we marginalize out the alignment variables
$$P(S \mid E) = \sum_{A} P(S, A \mid E) = \sum_{A} P(A \mid E)\, P(S \mid A, E).$$
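Now, let $n$ be the number of words in $E$. For fixed $n$, we assume that each possible alignment for each length $m$ is equally likely, which gives us that $P(A \mid E) = \frac{1}{(n+1)^m}$. This probability can be thought of as a normalization factor for the probability $P(S \mid A, E)$. Indeed, $P(S \mid A, E)$ is given by,
$$P(S \mid A, E) = \prod_{j=1}^{m} t(s_j \mid e_{a_j}),$$
where $t(s_j \mid e_{a_j})$ is the probability of translating $e_{a_j}$ as $s_j$. Therefore, this gives us the following for $P(S \mid E)$,
$$P(S \mid E) = \frac{1}{(n+1)^m} \sum_{A} \prod_{j=1}^{m} t(s_j \mid e_{a_j}).$$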
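The marginalization over alignments can be checked numerically. The sketch below computes the translation-model probability of a sign sequence given an English sentence in two ways: by enumerating every alignment (each sign aligned to one English position or to the null word), and by using the standard IBM Model 1 identity that, under uniform alignment probabilities, the sum over alignments factorizes into a product over signs of a sum over English positions. The dictionary `t`, keyed by (sign, English word) pairs with `None` standing for the null word, and both function names are assumptions made for illustration.

```python
from itertools import product


def p_s_given_e_bruteforce(signs, english, t):
    """Sum over all (n+1)^m alignments, each weighted uniformly."""
    words = [None] + list(english)  # position 0 is the null word
    m, n = len(signs), len(english)
    total = 0.0
    for alignment in product(range(n + 1), repeat=m):
        prob = 1.0
        for j, a_j in enumerate(alignment):
            prob *= t.get((signs[j], words[a_j]), 0.0)
        total += prob
    return total / (n + 1) ** m


def p_s_given_e(signs, english, t):
    """Equivalent factorized form: one sum over English positions per sign."""
    words = [None] + list(english)
    prob = 1.0
    for sign in signs:
        prob *= sum(t.get((sign, word), 0.0) for word in words) / len(words)
    return prob
```

The factorized form costs on the order of m(n+1) operations instead of (n+1)^m, which is what makes scoring and training tractable.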
After training on our dataset, constructing a language model for each language using the corpora we have processed, and generating a parameterized translation model using the EM algorithm, we can use our decoder to find the optimal translation. Both the EM and decoding algorithms are described in detail in the next section.
The two primary algorithms used in our implementation are an EM algorithm to train our translation model, which defines $t(s \mid e)$, and our decoding algorithm. The decoding algorithm is a modified version of the decoding in IBM WAM1, which uses a variant of beam search to find
$$\hat{E} = \arg\max_{E} P(E)\, P(S \mid E).$$
First we describe the EM algorithm. Our translation model maintains a mapping from pairs $(s, e)$, where $s$ is a sign and $e$ is an English word, to probabilities $t(s \mid e)$. Given our training data containing a list of translation pairs $(S^{(k)}, E^{(k)})$, we can rewrite our estimate of a single translation probability as
$$t(s \mid e) = \frac{\sum_{k} \sum_{j} \mathbf{1}\left[s^{(k)}_j = s,\ e^{(k)}_{a^{(k)}_j} = e\right]}{\sum_{k} \sum_{j} \mathbf{1}\left[e^{(k)}_{a^{(k)}_j} = e\right]},$$
where $A^{(k)} = (a^{(k)}_1, a^{(k)}_2, \ldots)$ is the alignment vector for the $k$th training example. To estimate $t(s \mid e)$ given the data, we need to find the maximum likelihood estimate of this probability. However, with the introduction of these latent variables, it is not possible to find this estimate in closed form. Therefore, we use the EM algorithm to approximate it by computing the probability of each possible alignment given the current distribution $t(s \mid e)$ for each $s$ and $e$ in the E-step and then adjusting this distribution based on these alignment probability estimates in the M-step. The algorithm runs as follows:
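A minimal sketch of this EM procedure, following the standard IBM Model 1 training recipe, is given below. It is not a transcription of our implementation: the uniform initialization, the fixed number of iterations, the use of `None` for the null word, and the name `train_ibm1` are assumptions made for illustration.

```python
from collections import defaultdict

NULL = None  # stands for the "null" English word


def train_ibm1(pairs, iterations=10):
    """Estimate t(sign | english word) from (signs, english_words) pairs with EM."""
    sign_vocab = {sign for signs, _ in pairs for sign in signs}
    # Start from a uniform distribution over signs for every English word.
    t = defaultdict(lambda: 1.0 / len(sign_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (sign, word) alignments
        total = defaultdict(float)  # expected count of alignments to each word
        for signs, english in pairs:
            words = [NULL] + list(english)
            for sign in signs:
                # E-step: posterior probability that `sign` aligns to each word.
                norm = sum(t[(sign, word)] for word in words)
                for word in words:
                    delta = t[(sign, word)] / norm
                    count[(sign, word)] += delta
                    total[word] += delta
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(s, e): c / total[e] for (s, e), c in count.items()})
    return t
```

On a toy corpus such as (BOOK, “the book”), (HOUSE, “the house”), and (BOOK HOUSE, “book house”), the estimate of t(BOOK | book) increases with each iteration, while t(BOOK | the) stays at one half because “the” co-occurs equally often with both signs.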
Our decoding algorithm is a modification of the conventional decoding algorithm used in IBM WAM1. We frame the problem of finding the best English translation as a type of search problem. Let $m$ be the length of $S$. The algorithm is a variant of beam search that maintains a list of priority queues $Q_0, \ldots, Q_m$ of hypotheses. Each hypothesis $h$ consists of a list of English words each with a corresponding sign from $S$. Initially, all priority queues are empty, with the exception of $Q_0$, which contains the empty hypothesis. Then we iterate through the priority queues, and for the top hypothesis $h$ in the current queue, we generate some number of new hypotheses by adding a single English word as the next word in $h$. We place each new hypothesis $h'$ in $Q_i$, where $i$ is the number of signs in $S$ that have been translated to English words in $h'$. Each hypothesis is prioritized according to
$$P(E_h)^{W}\, P(S \mid E_h),$$
where $E_h$ is the English sentence contained in $h$ and $W$ is the language model weight whose purpose is explained below. The aspect related to beam search is that we only consider the first $k$ hypotheses in each $Q_i$ before moving on to $Q_{i+1}$. Thus, we effectively prune hypotheses that do not appear among the first $k$ for a given number of translated signs.
where newhyp($h$, $s$, $e$) is the new hypothesis formed by translating $s$ to $e$ and newhyp is the maximum according to the priority measure. Once the algorithm reaches the final priority queue, all signs have been translated. This implies that the top hypothesis in the priority queue is the hypothesis that translates all signs in $S$ to English (possibly to the NULL word) and maximizes $P(E)^{W} P(S \mid E)$, i.e., the hypothesis with English sentence
$$\hat{E} = \arg\max_{E} P(E)^{W}\, P(S \mid E).$$
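The decoding procedure can be illustrated with a simplified stack decoder. This sketch makes several assumptions that differ from the full algorithm: each sign is translated exactly once (so a hypothesis never stays in the same queue), the priority is computed in log space as W times the language-model log-probability plus the translation log-probability, and the candidate English words for each sign are supplied by a hypothetical `candidates` dictionary. The toy bigram table, the probabilities in `t`, and all names are invented for illustration only.

```python
import heapq
import math


def decode(signs, candidates, t, lm_logprob, weight=0.1, beam=20):
    """Beam-search decoder: queues[i] holds hypotheses that cover i signs."""
    m = len(signs)
    # Each queue maps (covered sign indices, English words) -> translation log-prob.
    queues = [dict() for _ in range(m + 1)]
    queues[0][(frozenset(), ())] = 0.0

    def priority(item):
        (_, words), tm_logp = item
        return weight * lm_logprob(words) + tm_logp

    for i in range(m):
        # Only the `beam` best hypotheses in each queue are expanded.
        for (covered, words), tm_logp in heapq.nlargest(beam, queues[i].items(), key=priority):
            for j in range(m):
                if j in covered:
                    continue
                for e in candidates.get(signs[j], [None]):
                    p = t.get((signs[j], e), 0.0)
                    if p <= 0.0:
                        continue
                    # Translating to the null word (None) adds no English word.
                    new_words = words if e is None else words + (e,)
                    key = (covered | {j}, new_words)
                    new_logp = tm_logp + math.log(p)
                    if new_logp > queues[i + 1].get(key, -math.inf):
                        queues[i + 1][key] = new_logp
    if not queues[m]:
        return None
    (_, words), _ = max(queues[m].items(), key=priority)
    return list(words)


# Toy example with invented probabilities.
BIGRAM = {("<s>", "i"): 0.5, ("i", "learn"): 0.4, ("learn", "sign"): 0.4}


def toy_lm_logprob(words):
    """Bigram log-probability with a small default for unseen bigrams."""
    prev, total = "<s>", 0.0
    for word in words:
        total += math.log(BIGRAM.get((prev, word), 0.01))
        prev = word
    return total


toy_t = {("I", "i"): 0.9, ("LEARN", "learn"): 0.6, ("LEARN", "learning"): 0.3, ("SIGN", "sign"): 0.8}
toy_candidates = {"I": ["i"], "LEARN": ["learn", "learning"], "SIGN": ["sign"]}
best = decode(["I", "LEARN", "SIGN"], toy_candidates, toy_t, toy_lm_logprob)
```

Because only the best `beam` hypotheses per queue are expanded, the result is not guaranteed to be the global maximizer; in this toy case the search space is small enough that nothing is pruned.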
5.2.5 Algorithms Commentary
One of the most important aspects of the algorithm is that a hypothesis $h$ can generate a new hypothesis that is placed in the same queue as $h$ if it is generated by translating a sign already translated in $h$. This is important because it allows us to translate a single sign to multiple English words. In particular, this allows the translated English sentence to be longer than the input sign sentence, which is the case with most English translations.
Another important note concerns the language model weight. If this weight is set to 1, the priority is simply $P(E) P(S \mid E)$, i.e., the value we are trying to maximize. However, we find that if we try to maximize this value, the language model probabilities “outweigh” the probabilities of the translation model. In effect, the sign sentence is likely to be translated as a set of common English n-grams unrelated to the original sentence. For instance, when we attempt to translate as , our top result is . However, when we introduce this language model weight with , we produce the translation . Although this translation might not be as common in English as , it certainly preserves the meaning of the original sentence much better.
Just as with beam search, our decoding algorithm is not guaranteed to converge to the optimal result (in fact, this problem is NP-complete), but in many cases it provides good results.
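Table 1: Test-set BLEU-2 scores (rows include the Comma Trigram Test).
Table 2: ASL to English, bigram English model. Columns: Queue Size | Language Model Weight | BLEU-2 Score.
Table 3: ASL to English, trigram English model. Columns: Queue Size | Language Model Weight | BLEU-2 Score.
Table 4: English to ASL. Columns: Queue Size | Language Model Weight | BLEU-2 Score.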
As mentioned in the Data section, we split the dataset into three separate sets: a training set, a development set, and a test set. For both directions, we train our models on a training set, tune the hyperparameters (i.e. the language model weight, the queue size $k$, and the type of n-gram model we are using for the language model [unigram, bigram, trigram]) on a development set, and finally find the average BLEU-2 score on a test set. Furthermore, our metric for measuring the accuracy of translation on the test set or the development set is the average of individual BLEU-2 scores of the translations. Thus, any BLEU-2 score that is reported in tables or figures will be the mean BLEU-2 score.
6.1 Hyperparameter Tuning Results
To find the optimal hyperparameters, we ran several experiments. The initial set of experiments was run to find the optimal hyperparameters for translation from ASL to English. First, we found the BLEU-2 score on the development set by keeping the queue size constant and adjusting the language model weight using the bigram English model. Next, we performed the same experiment except using the trigram English model. Results of these experiments are reported in Tables 2 and 3. We find that the optimal language model weight is 0.1, the optimal queue size for ASL to English translation is 20, and the optimal language model type is a trigram model. Using this combination gives us a BLEU-2 score of 0.276 (see Table 3), which is the highest BLEU-2 score that we obtained while tuning hyperparameters.
To provide some intuition on the choice of experiments, we reasoned that we could converge on the optimal hyperparameter combination using a method similar to coordinate ascent, where we optimize a single parameter while keeping the other parameters constant. Then, after finding the optimal value for that parameter, we repeat the process with the other parameters while still maintaining the optimal values for the parameters that have already been processed. Using the (20, 0.1, “Trigram”) combination, we ran our system on the test set; the BLEU-2 score, reported in Table 1, is 0.1202.
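The coordinate-ascent style search can be sketched as follows. The `evaluate` argument is a hypothetical stand-in for a function that would decode the development set with the given hyperparameters and return the mean BLEU-2 score; here it is replaced by a synthetic toy score so that the example runs on its own. The grid values other than the reported optimum are invented for illustration and are not the values used in our experiments.

```python
def coordinate_ascent(evaluate, grid, start, sweeps=2):
    """Optimize one hyperparameter at a time while holding the others fixed."""
    best = dict(start)
    best_score = evaluate(best)
    for _ in range(sweeps):
        for name, values in grid.items():
            for value in values:
                trial = dict(best, **{name: value})
                score = evaluate(trial)
                if score > best_score:
                    best, best_score = trial, score
    return best, best_score


def toy_score(params):
    """Synthetic objective (not real BLEU-2) that peaks at (20, 0.1, trigram)."""
    score = -abs(params["queue_size"] - 20) / 100 - abs(params["lm_weight"] - 0.1)
    return score + (0.05 if params["lm_order"] == "trigram" else 0.0)


grid = {
    "queue_size": [5, 10, 20],
    "lm_weight": [0.1, 0.5, 1.0],
    "lm_order": ["bigram", "trigram"],
}
start = {"queue_size": 5, "lm_weight": 1.0, "lm_order": "bigram"}
best_params, best_value = coordinate_ascent(toy_score, grid, start)
```

This search evaluates far fewer combinations than a full grid search, at the cost of possibly missing an optimum when the hyperparameters interact strongly.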
The subsequent experiments involved finding the optimal hyperparameters for translation from English to ASL. Using a similar methodology to the one used in the other translation direction, we ran one set of experiments where the queue size was kept constant while the language model weight was adjusted. Because we only used a unigram ASL language model while translating in this direction, running only this set of experiments is sufficient to find the optimal hyperparameter pair. Results of the experiment are reported in Table 4. We find that the optimal language model weight is 0.1 and the optimal queue size is 20. Using the (20,0.1) combination and running the system on the test set, we obtain a BLEU-2 score, reported in Table 1, of 0.1802.
6.2 Experiments for ASL Constructions
The last two sets of experiments test how our system performs on translating specific ASL constructions. First, we perform an experiment on translating only the common comma construction described above in Modeling, by filtering for these test examples in the test set and using the optimal ASL to English hyperparameters when running on the filtered test set. The BLEU-2 score for this test is 0.1201 (see Table 1).
Finally, the last experiment tests how our system translates sign sequences with gesture tokens. We create a test set for this experiment in a similar fashion to what we did for comma constructions, by filtering the original test set to only include sign sequences with gesture tokens. The BLEU-2 score for this test is 0.1232 (see Table 1).
As a sanity check for our results, it is important to note that our translation system outperforms the baseline implementations in both directions. We also observe a difference in performance between the two translational directions. The ASL-to-English direction produces an average BLEU-2 score of 0.12, while the English-to-ASL gives a higher score of 0.18. The behavior is consistent with our expectations going into the project since in going from English to ASL, we essentially filter out unnecessary information, whereas the other direction requires generating information in effect.
After running our tuning experiments on the hyperparameters, we noticed a few interesting trends. First, instances in which we use a trigram model for English tend to produce higher BLEU-2 results than those with corresponding bigram English models, particularly when the maximum queue size in the decoding algorithm is large. This makes sense because the trigram model inherently captures more information than the bigram model. Furthermore, we see from the two graphs above that for large maximum queue sizes, instances with small language model weights perform better. We can explain this as follows: for large maximum queue sizes, after the first few iterations of the decoding algorithm, we will see hypotheses with common words. If we do not sufficiently dampen the influence of the language model, we will begin to see hypotheses with high priorities resulting from common n-grams, and these will steer the sentence generation away from the original meaning of the sign sentence. Therefore, we conclude that models with high maximum queue sizes and low language model weights generally perform best.
Although it is useful to look at how changing these hyperparameters changes the model, it is perhaps even more important to recognize the shortcomings of our current model. The first issue is with our corpus, which by most measures is far too small to perform accurate machine translation. We also lack large databases containing sign language as text, which limited our language model for ASL. Unfortunately, our ASL language model was far too crude to produce accurate results. We oversimplified the model in several ways, in particular with our comma trigram structure and treatment of gestures as equivalent to other signs. Because we are not fluent in ASL and do not fully understand its underlying structure, we had difficulty in designing its language model. In fact, to our knowledge, no one else has designed a text-based language model for any sign language. Finally, machine translation for single sentences is inherently a difficult task, since the sentences lack context. Despite these many challenges and difficulties, our system still translates many sentences effectively between the two languages in both directions.
8 Conclusion & Future Work
While we would like our translations to be as accurate as possible, it is important to note that it is virtually impossible, even for an oracle, to come up with a perfect translation. In any language, words and sentences carry contextual meaning that might be impossible to express exactly in another language. The problem of translating between signed languages and spoken languages is extremely difficult, and we are likely the first to address it exactly this way. Although we faced numerous challenges, we have shown that this approach is a reasonably effective improvement over our baseline algorithms. With a more thorough understanding of ASL and its grammatical structure, along with a larger training corpus, our approach has the potential to be an effective system of translation between ASL and English.