Improving Multilingual Semantic Textual Similarity with Shared Sentence Encoder for Low-resource Languages

by   Xin Tang, et al.
Carnegie Mellon University

Measuring the semantic similarity between two sentences (Semantic Textual Similarity, STS) is fundamental to many NLP applications. Despite the remarkable results in supervised settings with adequate labeling, little attention has been paid to this task in low-resource languages with insufficient labeling. Existing approaches mostly rely on machine translation (MT) to translate sentences into a rich-resource language. These approaches either introduce language biases or are impractical in industrial applications, where spoken-language input is common and strict efficiency is required. In this work, we propose a multilingual framework to tackle the STS task in a low-resource language (e.g., Spanish, Arabic, Indonesian and Thai) by utilizing the rich annotation data of a rich-resource language (e.g., English). Our approach extends a basic monolingual STS framework with a shared multilingual encoder, pretrained on a translation task, to incorporate rich-resource language data. By exploiting the nature of the shared multilingual encoder, one sentence can have multiple representations, one for each target translation language, which are used in an ensemble model to improve similarity evaluation. We demonstrate the advantage of our method over other state-of-the-art approaches on the SemEval STS task, where it significantly improves on non-MT methods, and on an online industrial product, where the MT method fails to beat the baseline while our approach still yields consistent improvements.





Semantic Textual Similarity (STS) is a fundamental task in many Natural Language Processing applications such as question answering, machine translation and semantic search [Cer et al.2017]. Given its importance, there has been growing interest from both academia and industry in developing solutions for the task. In particular, deep learning techniques have been used extensively for STS in supervised settings [Yin et al.2016, Pang et al.2016]. The common approach is to take advantage of pretrained word embeddings such as Word2Vec [Mikolov et al.2013], on top of which a deep neural network extracts the sentence representations as well as the interactions between them. Subsequently, a final Multi-Layer Perceptron (MLP) is trained on the representations and interactions to fit the STS label. Despite achieving outstanding performance, this approach requires large amounts of labeled data, which restricts its applicability in settings with insufficient labeling, such as low-resource languages like Spanish, Arabic and Thai.

Existing approaches to STS in low-resource languages mostly leverage machine translation (MT) techniques. One possible approach is to translate the target sentences into a resource-rich language for which a well-trained semantic similarity model is available [Tian et al.2017]. Even though this MT-based approach provided strong baselines in past SemEval tasks, it has several drawbacks. First, the translation quality depends highly on the quality of the input sentence. SemEval data sets are collected from formal writing sources such as books and newspapers, which can be translated from English to other languages quite accurately. In practice, however, it is common to observe sentences in informal writing styles that include typos, slang or abbreviations, and the translation quality of such sentences often degrades [Belinkov and Bisk2018]. Second, besides the possible semantic loss introduced by translation, an MT-based method clearly reduces efficiency in online services. Another approach [Tian et al.2017] incorporates various language-independent features such as sentence length and lexical similarity into an ensemble model, which also consumes more computing resources at inference time.

In this work, we propose a shared encoder framework to perform STS in a target language with insufficient labeling by utilizing annotated data in rich-resource languages. More specifically, we expand a basic monolingual framework for STS to a multilingual one, in which a single encoder is shared by both languages. To alleviate language discrepancy, and inspired by machine translation techniques [Artetxe et al.2018], we train the shared encoder on a bi-directional translation task, together with a shared decoder for both languages. This translation framework also allows self-translation, which is similar to a denoising auto-encoder and helps preserve the original semantics. Thanks to the shared encoder, one sentence can be encoded into different language semantic spaces by prepending the target language token to the sentence. Finally, a shared Multi-Layer Perceptron (MLP) is trained to fit the pair similarity.

We conduct experiments on the public off-line SemEval data set and on industrial data sets from a real-world online service. On the SemEval data set, we calculate the similarity of Spanish and Arabic pairs, and our method consistently beats other state-of-the-art non-MT approaches. For the online service in Thai and Indonesian, our method has been deployed and verified on a spoken-language data set, where it outperforms the MT-based method.

Our main contributions are as follows:

  • We propose a multilingual shared encoder framework to perform the STS task in low-resource settings, which effectively leverages annotated data in rich-resource languages and achieves promising results with little supervision.

  • We employ a bi-directional machine translation task to obtain a multilingual encoder. This approach alleviates language discrepancies and captures general-purpose semantic distributions.

  • We provide a data-augmentation-like technique that exploits the shared encoder to obtain sentence representations in different language semantic spaces, and we show its effectiveness in further improving semantic similarity evaluation.


In this paper, we focus on the Semantic Textual Similarity (STS) task: given two sentences, the goal is to predict their similarity. Our approach includes two stages of training. We first learn a shared multilingual sentence encoder through a translation task on a large parallel corpus. Then, the semantic textual similarity task is trained on top of the pretrained multilingual encoder with labeled data from both the rich-resource language and the low-resource language.

Pretrained Multilingual Encoder

Our shared multilingual encoder architecture depends on a shared encoder machine translation model. In this section, we describe how we design and train the shared encoder translation model.

Shared Encoder Model Architecture

We adopt the state-of-the-art transformer architecture [Vaswani et al.2017] for our translation model. To simplify the explanation, Figure 1 shows a special case of the shared encoder translation model: a bidirectional translation model. Inspired by [Johnson et al.2017], the model consists of a shared encoder and a shared decoder. The encoder uses a shared vocabulary for both languages, while the decoder uses two separate vocabularies, one for each language.

Given a sentence to be translated, the model first encodes the sentence with the shared encoder. The shared decoder then chooses a vocabulary to generate the translation, based on the language token at the beginning of the input sentence.

As expected, the bidirectional translation model supports not only source-to-target and target-to-source translation, but also source-to-source and target-to-target self-translation (the following section describes how we train the model to acquire this ability). With the ability of self-translation, the shared encoder model can learn consistent distributions for different languages.

Figure 1: A special case of the shared encoder translation model: the bidirectional model. Given an input sentence in either of the two languages, the model first uses a shared vocabulary to get the embedding of each word in the sentence; the shared encoder then encodes the input sentence to obtain its semantic representation. Lastly, the shared decoder chooses one of the two language-specific vocabularies based on the language token in the input sentence and generates the final translation.

Shared Encoder Model Training

As in the previous section, we introduce our shared encoder model training mechanism using a bidirectional model as an example. We follow the state-of-the-art NMT training scheme. The major differences between our shared encoder training scheme and that of the Transformer model [Vaswani et al.2017] lie in the usage of bilingual training data, the subword technique [Sennrich, Haddow, and Birch2015], the source- and target-side vocabularies, and the training loss. Suppose we need to train a shared encoder model for two languages, a source language and a target language:

  • Training data. We divide the training data of the bidirectional model into four portions: source-to-target, target-to-source, source-to-source, and target-to-target. For each portion, we add an extra language token at the beginning of each source-side sentence to indicate which language the input sentence will be translated into. In addition, we add noise to the inputs of the source-to-source and target-to-target data (a denoising setup) to make sure the model will not "simply copy the input" [Vincent et al.2010].

  • Subword. We train the encoder-side byte pair encoding (BPE) model on the mixture of the source-language and target-language data. For the decoder side, we train a BPE model separately for each language.

  • Vocabularies. Following the scheme of subword training, we build one encoder-side vocabulary and two separate decoder-side vocabularies for a bidirectional translation model.

  • Training loss. The training loss of the shared encoder model is a linear combination of the losses of the different translation directions. The number of directions is 4 in the bidirectional case, with one loss each for source-to-target, target-to-source, source-to-source, and target-to-target translation. Each loss is calculated by a cross-entropy based loss function [Cho et al.2014, Sutskever, Vinyals, and Le2014], which is widely used in neural machine translation.
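The training-data and loss items above can be illustrated with a minimal sketch. It is not the authors' implementation: the `<2xx>` token format, the token-dropping noise, and the uniform loss weights are all assumptions made for illustration, as are the function names.

```python
import random


def lang_token(lang):
    """Hypothetical language-token format; the paper does not specify one."""
    return f"<2{lang}>"


def add_noise(tokens, drop_prob=0.1, rng=None):
    """Randomly drop tokens so that self-translation is not a plain copy task.

    Token dropping is an assumed noise model, chosen only for illustration.
    """
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept or tokens[:1]  # never return an empty sentence


def build_examples(src_sent, tgt_sent, src_lang="en", tgt_lang="es", rng=None):
    """Return the four (input tokens, output tokens) pairs for one parallel pair."""
    src, tgt = src_sent.split(), tgt_sent.split()
    return [
        ([lang_token(tgt_lang)] + src, tgt),                      # source-to-target
        ([lang_token(src_lang)] + tgt, src),                      # target-to-source
        ([lang_token(src_lang)] + add_noise(src, rng=rng), src),  # source-to-source
        ([lang_token(tgt_lang)] + add_noise(tgt, rng=rng), tgt),  # target-to-target
    ]


def combined_loss(direction_losses, weights=None):
    """Linear combination of per-direction losses (uniform weights assumed)."""
    n = len(direction_losses)
    weights = weights or [1.0 / n] * n
    return sum(w * loss for w, loss in zip(weights, direction_losses))
```

Each input starts with the token of the language it should be decoded into, so the same encoder and decoder serve all four directions.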

Semantic Textual Similarity Network

We encode each sentence into a sequence of hidden states of the same length as the input, using the pretrained sentence encoder described in the previous section. We then apply intra-sentence and inter-sentence attention to obtain sentence representations.

Figure 2: An overview of our shared encoder STS model architecture.

Intra-sentence Attention

We experiment with several aggregation methods, namely max pooling, mean pooling and self-attention, and find that self-attention works best. For self-attention aggregation, we adopt the attention mechanism from [Wang et al.2017].
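As a rough illustration of self-attention aggregation, the sketch below pools a sequence of hidden states into one vector using softmax weights. The scoring function of the cited work is not reproduced in this text, so the dot product with a learned `query` vector is an assumption, and `attention_pool` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Aggregate hidden states (list of vectors) into a single vector.

    Each state is scored by its dot product with `query` (assumed scoring),
    and the output is the softmax-weighted sum of the states.
    """
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * state[d] for w, state in zip(weights, hidden_states))
            for d in range(dim)]
```

With an all-zero query every position receives the same weight, so the result reduces to mean pooling, one of the alternatives mentioned above.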


Inter-sentence Attention

Inter-sentence attention follows the approach described in [Wang, Mi, and Ittycheriah2016]. We first calculate the semantic matching vectors for each sentence pair by soft-aligning the elements of one sentence to the other; the function used in this alignment is a feed-forward network.

After the semantic matching phase, each sentence vector is compared with its semantic matching vector and decomposed into a similar component and a dissimilar component. The decomposition is orthogonal: it splits the sentence vector into components parallel and perpendicular to its matching vector.

The similar and dissimilar components are then passed through a comparison function to obtain comparison representations, which are aggregated by an average pooling layer into the inter-sentence representations.
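The matching and decomposition steps can be sketched as follows. This is a simplified illustration, not the paper's network: the alignment here scores pairs with a plain dot product instead of the feed-forward network mentioned above, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_align(a_states, b_states):
    """For each vector in a_states, return an attention-weighted sum of b_states."""
    dim = len(b_states[0])
    matched = []
    for a in a_states:
        scores = [dot(a, b) for b in b_states]  # dot-product scoring is an assumption
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        matched.append([sum(w * b[d] for w, b in zip(weights, b_states))
                        for d in range(dim)])
    return matched


def orthogonal_decompose(vec, match):
    """Split `vec` into parts parallel (similar) and perpendicular (dissimilar) to `match`."""
    denom = dot(match, match)
    if denom == 0.0:
        return [0.0] * len(vec), list(vec)
    scale = dot(vec, match) / denom
    parallel = [scale * m for m in match]
    perpendicular = [v - p for v, p in zip(vec, parallel)]
    return parallel, perpendicular
```

The two returned parts always sum back to the original vector and are orthogonal to each other, so no information is lost by the split.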


Representation Layer

Given the intra- and inter-sentence representations, we concatenate the following terms as the sentence pair representation: the absolute difference and the element-wise product of the two intra-sentence representations, and the two inter-sentence representations.
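A minimal sketch of this concatenation, using plain Python lists as vectors. The order of the four parts and the name `pair_representation` are assumptions; the text only lists which terms are concatenated.

```python
def pair_representation(intra_a, intra_b, inter_a, inter_b):
    """Concatenate |a - b|, a * b (element-wise) and both inter-sentence vectors."""
    abs_diff = [abs(x - y) for x, y in zip(intra_a, intra_b)]
    product = [x * y for x, y in zip(intra_a, intra_b)]
    return abs_diff + product + list(inter_a) + list(inter_b)
```

The absolute difference and the product are symmetric in the two sentences, so swapping the inputs only changes the order of the two inter-sentence parts.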


Output Layer

The output layer is a fully connected neural network with two layers. The first layer has 300 units with a tanh activation function. The second layer has K output units combined with a softmax to produce a probability distribution over the similarity labels. The loss for the STS task is computed as the KL divergence between the predicted and the true probability distributions, which [Tai, Socher, and Manning2015] found to perform better than a squared error objective. The predicted rating is reconstructed from the predicted distribution by multiplying it with the vector of rating levels.
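The sketch below shows one way to build the true distribution, compute the KL loss, and recover a rating, in the spirit of [Tai, Socher, and Manning2015]. It assumes six integer rating levels 0 to 5 for SemEval; that choice, and all function names, are assumptions rather than details confirmed by this text.

```python
import math


def target_distribution(score, num_levels=6):
    """Spread a real-valued score over the two nearest integer levels.

    Assumes levels 0 .. num_levels - 1; scores outside that range are clamped.
    """
    score = min(max(score, 0.0), float(num_levels - 1))
    low = int(math.floor(score))
    frac = score - low
    dist = [0.0] * num_levels
    dist[low] = 1.0 - frac
    if frac > 0.0:
        dist[low + 1] = frac
    return dist


def kl_divergence(true_dist, pred_dist, eps=1e-12):
    """KL(true || pred), skipping levels where the true probability is zero."""
    return sum(p * math.log(p / (q + eps))
               for p, q in zip(true_dist, pred_dist) if p > 0.0)


def expected_rating(pred_dist):
    """Reconstruct a scalar rating as the dot product with the levels 0, 1, 2, ..."""
    return sum(level * prob for level, prob in enumerate(pred_dist))
```

For example, a gold score of 3.6 puts probability 0.4 on level 3 and 0.6 on level 4, and `expected_rating` maps that distribution back to 3.6.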


Ensemble with Multilingual Sentence Representations

For the STS task in a low-resource language, one approach to augmenting the data is to translate the whole data set into another language, especially a rich-resource language. Different models can then be trained on the data sets in the different languages and ensembled to achieve better performance. This approach has been shown to bring significant improvement on the STS task in [Duma and Menzel2017], where sentence representations are trained and calculated in different languages and the average of their cosine similarities is used to measure semantic similarity. However, this approach requires calling an external translation service multiple times to translate sentences into different languages, and thus incurs a large amount of extra response time in practice.

One benefit of the pretrained shared encoder is that multiple output representations can easily be obtained for each sentence, without calling an external translation service, by simply prepending a target language token to the sentence. For example, with the token of one language prepended, the encoder output lies in that language's semantic space; with the token of the other language prepended, the output lies in the other language's semantic space.


Hence, besides the intra- and inter-sentence representations, we can further exploit this property of the shared encoder to obtain language-wise features. We use the representations in the different language semantic spaces to compute one predicted probability distribution over similarity levels per language, and ensemble these predictions by linear combination into the final prediction. The number of combined predictions equals the number of languages that the shared encoder supports. Finally, we calculate the KL divergence between the ensembled prediction and the true distribution as the similarity task loss.
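The linear combination of per-language predictions can be sketched as below. The combination weights are not recoverable from this text, so uniform weights are assumed by default; `ensemble_prediction` is a hypothetical helper name.

```python
def ensemble_prediction(distributions, weights=None):
    """Linearly combine per-language probability distributions.

    `distributions` holds one distribution per language semantic space.
    Uniform weights are an assumption; weights that sum to 1 keep the
    result a valid probability distribution.
    """
    n = len(distributions)
    weights = weights or [1.0 / n] * n
    num_levels = len(distributions[0])
    return [sum(w * dist[i] for w, dist in zip(weights, distributions))
            for i in range(num_levels)]
```

Because every representation comes from the same encoder with a different prepended token, the ensemble needs extra encoder passes but no call to a translation service.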

Figure 3: Ensemble of predictions from sentence representations in different language semantic spaces.


Data Set

We first verify our approach on the public benchmark data set of SemEval 2017 task 1, track 1 (ar-ar) and track 3 (es-es). We then validate it on Indonesian and Thai data sets from an industrial application.

SemEval Data Set

SemEval-2017 task 1 requires measuring the relatedness of two sentences as a score ranging from 0 for no meaning overlap to 5 for meaning equivalence. Table 1 shows the statistics of this data set. There are more than 10,000 annotated pairs for English, collected from past SemEval STS tasks (2012-2015), but only around 1,000 pairs each for Arabic and Spanish, which represent the low-resource cases. Since SemEval provides no dev set for model evaluation and selection, we randomly select 20% of the pairs from the Spanish and Arabic training data as dev sets and exclude them from training.

All the training data is simply lower-cased, tokenized by white space and then byte pair encoded [Sennrich, Haddow, and Birch2015] to reduce the vocabulary size needed for translation pretraining. No hand-crafted feature is added.

Language Pair   Train   Dev    Test
AR-AR             864    217    250
ES-ES            1244    311    250
EN-EN           12000   1592    250
Table 1: Statistics of SemEval 2017 data set

Industrial Data Set

We also examine this approach in an e-commerce chatbot scenario in which the target language Indonesian has about 4,000 labeled similarity pairs and Thai has about 10,000, while the English instance has already accumulated more than 0.2 million. The data set was constructed from the online chat log of a QA chatbot. If the chatbot's answer is correct, the user query and the knowledge title of the answer are labeled as similar; otherwise, they are labeled as dissimilar. We split the data set by date, holding out 3 days of data as the development set and 3 days of data as the test set, so that the data set reflects the actual online data distribution.

The key difference between the industrial data set and the SemEval data set is that the industrial data set contains a large number of abbreviations, spelling errors and grammar errors. Table 2 shows a sentence from the Indonesian data set and its translations by different translators. The translation results show that the translators introduce errors such as leaving an abbreviation untranslated or translating a misspelled word into a different meaning.

Example sentences
Original Bgm cara sy memesa n bgm pembayran nya mks
Bing BGM how sy n his pembayran bgm ordering mks
Google How do I manage how to pay it?
Human How can I order and how to pay it?
Table 2: Comparison of example Indonesian user query and its translations

Machine Translation Data Set

We used ParaCrawl data (Provision of Web-Scale Parallel Corpora for Official European Languages), which contains about 16 million parallel sentence pairs, for English-Spanish model training. The OpenSubtitles 2018 portion of OPUS [Tiedemann2012], which contains about 31.9 million parallel sentence pairs, is used for English-Arabic model training. Each test set consists of 1,000 bilingual sentence pairs randomly sampled from the corresponding training data. For both models, we used 50,000 BPE operations for the source vocabularies and 30,000 for the target vocabularies, and we kept the top 50,000 and 30,000 tokens for the source and target vocabularies, respectively.

Parameter Setting and Evaluation Metrics

The Transformer sentence encoder uses the transformer-base setting from [Vaswani et al.2017]: 512 hidden size, 512 embedding size, 2048 filter size, 8 heads for multi-head attention, a 6-layer encoder and a 6-layer decoder. We train all the models with a batch size of 4096 tokens and a learning rate of 0.0003 with the Adam optimizer. We use case-insensitive BLEU-4 as our evaluation metric for machine translation.

When training the semantic similarity task, the parameters of the transformer encoder are fixed and only the STS task-specific parameters are trainable. All feed-forward neural networks have two layers with tanh as the activation function. The Adam optimizer is used with a learning rate of 0.0003 and a batch size of 16. Early stopping is applied by monitoring the evaluation metric on the development set. We choose the model that performs best on the development set for later evaluation on the test set.

The evaluation metric for the SemEval task is the Pearson correlation coefficient between the gold ratings and the predicted scores.
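For reference, the Pearson correlation coefficient can be computed as in the following standard-library sketch. It is the textbook definition, not code from the paper, and it assumes neither input is constant.

```python
import math


def pearson(gold, predicted):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(predicted) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, predicted))
    std_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (std_g * std_p)  # undefined (division by zero) for constant input
```

The metric is invariant to shifting and positive rescaling of the predictions, so it rewards getting the ranking and relative spacing of the scores right more than their absolute values.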


We include two Bag of Words (BoW) models as baselines, since BoW models are well known to perform strongly on the semantic similarity task because they capture word identity information: (i) one-hot embedding average, where each dimension of the sentence representation indicates whether an individual word appears in the sentence; (ii) fastText word embedding average, which uses pretrained fastText embeddings [Bojanowski et al.2017]. Both methods use the cosine of the two sentence representations to measure similarity.
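The one-hot baseline can be sketched in a few lines. For binary word-presence vectors, the cosine reduces to the number of shared words divided by the geometric mean of the two vocabulary sizes. Lower-casing and whitespace tokenization follow the preprocessing described earlier; the function name is hypothetical.

```python
import math


def one_hot_cosine(sent_a, sent_b):
    """Cosine similarity of binary bag-of-words vectors of two sentences."""
    words_a = set(sent_a.lower().split())
    words_b = set(sent_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    shared = len(words_a & words_b)
    return shared / math.sqrt(len(words_a) * len(words_b))
```

Because the score depends only on exact word overlap, this baseline is sensitive to the misspellings and abbreviations found in informal text.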

Machine Translation Results

We first describe our results on machine translation. Table 3 shows the MT evaluation on the different language pairs. As the table shows, all the models obtain a BLEU score above 0.95 on self-translation, which means that our shared model can effectively retain the information of the input sentence. Compared with single models (each trained for one specific translation direction), our shared models achieve comparable BLEU scores on bidirectional translation, and a slightly higher one on AR→EN.

Language Pairs  Source  Target  BLEU-shared  BLEU-single
EN-ES           EN      ES      0.4722       0.4744
                ES      EN      0.3592       0.3753
                EN      EN      0.9678       NIL
                ES      ES      0.9884       NIL
EN-AR           EN      AR      0.0978       0.1201
                AR      EN      0.3474       0.3415
                EN      EN      0.9890       NIL
                AR      AR      0.9683       NIL
Table 3: Machine translation results. For each language pair, there are 4 translation directions. For example, EN-ES represents the 4 translation directions related to English and Spanish: EN→ES, ES→EN, EN→EN, and ES→ES. BLEU-shared and BLEU-single are the BLEU scores of our shared translation model and of the single translation model baseline on each translation direction, respectively. As it is meaningless to train a self-translation single model, self-translation results for single models are not shown in this table.

Comparison with Unsupervised Methods

In our experiments, the BoW models give strong baselines on the SemEval 2017 ar-ar, es-es and en-en test sets. One-hot embedding performs better than fastText embedding on all three languages, which means that lexical overlap is quite a strong feature for judging similarity on the SemEval task. The BoW baseline also outperforms the non-MT HCTI method on Arabic and Spanish, since these two languages have very little training data (about 1,000 pairs). The full model with the translation-pretrained encoder obtains significantly higher results than the two BoW baselines in all languages. Among all the tasks, however, Arabic shows the smallest improvement, from 0.604 to only 0.650. We argue that this may be because the discrepancy between English and Arabic is larger than that between English and Spanish, as suggested by the lower BLEU score of the translation task for En-Ar compared to En-Es.

Comparison with Supervised Methods

In this work, we focus on methods that do not require translating the low-resource language into a rich-resource language at inference time, since doing so increases response time and makes the result depend heavily on the translation quality of a third-party system. From published results, however, the MT-based method achieves the highest scores on the Arabic and Spanish tasks, since it benefits from the large amount of training data in the resource-rich language, which is English for the SemEval task. Our model reaches 0.825 for Spanish, which is very close to the 0.826 of the MT method. For Arabic, however, the gap is still large (0.650 versus 0.713). The non-MT HCTI method cannot even beat the BoW baseline on the Arabic and Spanish test data, which reaffirms that the Arabic and Spanish training data is insufficient to train a supervised model from scratch. With the same insufficient Spanish and Arabic training data, our approach improves the Pearson correlation over non-MT HCTI by roughly 0.15 to 0.2 by incorporating more training data from English through the MT-pretrained encoder. From the above comparison, we conclude that our approach can successfully transfer knowledge from a resource-rich language to a low-resource language, especially when the two languages are close to each other and can easily be translated into each other.

Method           ar-ar   es-es   en-en
BoW Baseline
  One-hot        0.604   0.711   0.727
  FastText       0.549   0.686   0.559
MT Method
  HCTI           0.713   0.826   0.811
non-MT Method
  HCTI           0.437   0.671   0.815
  Our model      0.650   0.825   0.817
Table 4: Pearson correlation coefficients comparison on STS 2017 task 1 ar-ar, es-es and en-en. The HCTI model [Shao2017] includes both word embeddings and hand-crafted features as its input to the model.

Method                    Indonesian  Thai
FastText                  0.617       0.598
MT Method
  Decomposable Attention  0.507       0.528
Our Model
  w/o Pretrained          0.663       0.696
  Pretrained              0.758       0.782
Table 5: Results on the industrial Indonesian and Thai data sets, using AUC as the evaluation metric. Decomposable Attention refers to [Parikh et al.2016].

Industrial Application Result

The results show the limitation of the MT-based method, which translates the low-resource language into the resource-rich language and then predicts. The MT method performs well on the SemEval task because that data set contains no abbreviations, spelling errors or grammar errors, which makes it easier to translate. However, as illustrated in Table 2, industrial data contains a large number of informal and incorrect words, which strongly affect the quality of translation. Table 5 shows that the MT method performs even much worse than the word-average baseline. For our approach with MT pretraining, the AUC score improves over that baseline from 0.617 to 0.758 for Indonesian and from 0.598 to 0.782 for Thai. We obtain consistent improvements on both industrial data sets.

Impact of Number of Finetuned Layers

In the above experiments, the parameters of the pretrained sentence encoder are fixed when training the semantic similarity task. We also examine the impact of making different numbers of sentence encoder layers trainable. We unfreeze layers starting from the last layer, then the last 2 layers, and finally all 6 layers, since the last layer contains the least general knowledge [Yosinski et al.2014]. The results show that the performance fluctuates, and that it decreases considerably if all layers are trainable. Unfreezing the last several layers can improve the performance in some settings, but there is no consistent pattern across languages. This observation is not consistent with the one in [Radford et al.2018], where supervised finetuning of the pretrained language model on the target task brings extra benefits as the number of finetuned layers increases. We argue that this is because the STS task has a relatively small data set (about 10k pairs) and is therefore more prone to overfitting.

Figure 4: Effects of number of finetuned transformer encoder layers. The variable number means that we unfreeze the layers starting from no layer, the last layer, and finally to all layers.

Ablation Study

We perform an ablation study to assess the contribution of the different components of our methodology. We train multiple models, each omitting some of the training data or part of the model: rich-resource training data, low-resource training data, and multilingual sentence representations.

Without pretraining the sentence encoder, the Pearson score for the Spanish-only training setting is very low compared to the English-only setting, since the small amount of Spanish training data is inadequate for training a supervised model. Without pretraining, the performance also does not improve when English and Spanish training data are combined. In every setting, pretraining brings a significant improvement, especially for Spanish-only training data, where the score rises from 0.188 to 0.727 and exceeds the non-MT HCTI method. This confirms that the MT pretraining task helps when supervised training data is insufficient. For the pretrained model, adding English data and multilingual representations further improves the performance.

Training data        w/o Pretrained  Pretrained
En only              0.668           0.766
Es only              0.188           0.727
En + Es              0.661           0.796
+multilingual repr   0.6738          0.825
Table 6: Ablation study on SemEval 2017 task es-es

Related Works

In this paper, we focus on the problem of semantic textual similarity, which is widely applied in many scenarios. However, most research addresses supervised settings in a single language. [He, Gimpel, and Lin2015] adopts a CNN to capture features at multiple granularities and compares sentence representations using multiple similarity metrics. In [Wang, Mi, and Ittycheriah2016], not only the similar parts of the two input sentences but also the dissimilar parts are taken into account, by decomposing and composing lexical semantics over sentences. Although they outperform many traditional methods, these prior works rarely consider the impact of the other sentence when deriving a sentence representation. In [Yin et al.2016], an attention-based model is proposed that uses the content of one sentence to guide the representation of the other, learning an attention feature matrix that influences the convolution filters. In contrast to the previous methods, [Pang et al.2016] takes into account the rich interaction structures in the text matching process, since these interaction structures are compositional hierarchies in which higher-level signals are obtained by composing lower-level signals. All these methods are based on convolutional neural networks and trained with large-scale labeled data. In our scenario, however, only a small amount of labeled data is available, and we have to solve the low-resource problem.

Universal sentence encoders have also been a popular research topic in recent years. Most works are based on a multi-task learning framework. In [Cer et al.2018], two variants of encoding models, Transformer [Vaswani et al.2017] and Deep Averaging Network [Iyyer et al.2015], allow for trade-offs between accuracy and efficiency on diverse tasks such as sentiment analysis and natural language inference. [Subramanian et al.2018] exploits the effectiveness of inductive biases in the context of a simple one-to-many multi-task learning framework. In their work, a single recurrent sentence encoder is shared across multiple tasks: skip-thoughts, machine translation, natural language inference and constituency parsing. As shown in the above works, these sentence-based approaches are universal across tasks but not across languages.

In recent years, more and more studies [Peters et al.2018, Howard and Ruder2018, Radford et al.2018] have shown that pretraining a universal encoder on large-scale unlabeled data and then finetuning a task-specific network with supervision is effective for most tasks. All these studies use a universal language model as the unsupervised pretraining model to capture linguistic information that is useful for many downstream tasks. Our method is closest to these frameworks, but there are two differences. First, we aim to build a multilingual sentence encoder by taking a shared encoder translation model as the pretraining model. Second, our framework can generate multiple sentence representations, which are naturally ensembled to fit the target task. To the best of our knowledge, this paper is the first to study a multilingual sentence encoder for semantic textual similarity.


In this paper, we propose a solution to improve multilingual semantic textual similarity in low-resource languages by using a shared sentence encoder. The shared encoder is pretrained on a bi-directional translation and self-denoising task to make it multilingual. With this shared encoder, we can obtain several representations of a sentence in different language-specific semantic spaces and use them in an ensemble model for better performance in similarity evaluation. Experimental results show that our model consistently beats state-of-the-art non-MT approaches and nearly matches the performance of MT-based methods on the Spanish task. It is noteworthy that our framework is a generic approach to constructing multilingual sentence representations that requires no language-specific preprocessing or hand-crafted features.



  • [Artetxe et al.2018] Artetxe, M.; Labaka, G.; Agirre, E.; and Cho, K. 2018. Unsupervised neural machine translation. In International Conference on Learning Representations.
  • [Belinkov and Bisk2018] Belinkov, Y., and Bisk, Y. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.
  • [Bojanowski et al.2017] Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.
  • [Cer et al.2017] Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. SemEval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
  • [Cer et al.2018] Cer, D.; Yang, Y.; Kong, S.; Hua, N.; Limtiaco, N.; John, R. S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; Sung, Y.; Strope, B.; and Kurzweil, R. 2018. Universal sentence encoder. CoRR abs/1803.11175.
  • [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • [Duma and Menzel2017] Duma, M.-S., and Menzel, W. 2017. SEF@UHH at SemEval-2017 task 1: Unsupervised knowledge-free semantic textual similarity via paragraph vector. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 170–174.
  • [He, Gimpel, and Lin2015] He, H.; Gimpel, K.; and Lin, J. J. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 17-21, 2015, 1576–1586.
  • [Howard and Ruder2018] Howard, J., and Ruder, S. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 1, 328–339.
  • [Iyyer et al.2015] Iyyer, M.; Manjunatha, V.; Boyd-Graber, J.; and Daumé III, H. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, volume 1, 1681–1691.
  • [Johnson et al.2017] Johnson, M.; Schuster, M.; Le, Q. V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5:339–351.
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In 27th Annual Conference on Neural Information Processing Systems, December 5-8, 2013, Lake Tahoe, Nevada, United States, 3111–3119.
  • [Pang et al.2016] Pang, L.; Lan, Y.; Guo, J.; Xu, J.; Wan, S.; and Cheng, X. 2016. Text matching as image recognition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 2793–2799.
  • [Parikh et al.2016] Parikh, A.; Täckström, O.; Das, D.; and Uszkoreit, J. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2249–2255.
  • [Peters et al.2018] Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proc. of NAACL.
  • [Radford et al.2018] Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.
  • [Sennrich, Haddow, and Birch2015] Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv:1508.07909.
  • [Shao2017] Shao, Y. 2017. HCTI at SemEval-2017 task 1: Use convolutional neural network to evaluate semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 130–133.
  • [Subramanian et al.2018] Subramanian, S.; Trischler, A.; Bengio, Y.; and Pal, C. J. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. CoRR abs/1804.00079.
  • [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems.
  • [Tai, Socher, and Manning2015] Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. Beijing, China: Association for Computational Linguistics.
  • [Tian et al.2017] Tian, J.; Zhou, Z.; Lan, M.; and Wu, Y. 2017. ECNU at SemEval-2017 task 1: Leverage kernel-based traditional NLP features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 191–197. Vancouver, Canada: Association for Computational Linguistics.
  • [Tiedemann2012] Tiedemann, J. 2012. Parallel data, tools and interfaces in OPUS. In Calzolari, N.; Choukri, K.; Declerck, T.; Dogan, M. U.; Maegaard, B.; Mariani, J.; Odijk, J.; and Piperidis, S., eds., Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). Istanbul, Turkey: European Language Resources Association (ELRA).
  • [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 6000–6010.
  • [Vincent et al.2010] Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research.
  • [Wang et al.2017] Wang, W.; Yang, N.; Wei, F.; Chang, B.; and Zhou, M. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 189–198. Vancouver, Canada: Association for Computational Linguistics.
  • [Wang, Mi, and Ittycheriah2016] Wang, Z.; Mi, H.; and Ittycheriah, A. 2016. Sentence similarity learning by lexical decomposition and composition. In 26th International Conference on Computational Linguistics, December 11-16, 2016, Osaka, Japan, 1340–1349.
  • [Yin et al.2016] Yin, W.; Schütze, H.; Xiang, B.; and Zhou, B. 2016. ABCNN: attention-based convolutional neural network for modeling sentence pairs. TACL 4:259–272.
  • [Yosinski et al.2014] Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, 3320–3328.