1 Introduction
In modern search engines, cross-lingual information retrieval tasks are becoming prevalent and important. For example, when searching for products on an English-language shopping website, immigrants often form queries in their native languages and expect to see the most relevant products, which are described in English. Another example arises in international trading: investors might describe their product in English but search online forums from different countries for sentiment expressed in other languages, in order to understand customers' attitudes towards it. Although these tasks can be naturally formulated as information retrieval (IR) tasks and resolved by monolingual methods, the rising need for cross-lingual IR techniques also requires robust models that can handle queries and documents from different languages.
Early studies on document retrieval mostly rely on lexical matching, which is error-prone in cross-lingual tasks: vocabularies and language styles usually change across languages, and contextual information is largely lost. With the recent surge of deep neural networks (DNNs), researchers are able to go beyond lexical matching by building neural architectures that represent the textual information of queries and documents as vector representations via nonlinear transformations, which have shown great success in many applications (Salakhutdinov and Hinton, 2009; Huang et al., 2013; Shen et al., 2014; Palangi et al., 2016). Despite these advances, several difficulties have not been well addressed, e.g., exploding gradients and convergence guarantees. The most widely used optimization method for training DNNs is stochastic gradient descent, which updates model parameters by taking gradients of the loss with respect to the weights. Meanwhile, cosine similarity, a commonly used measure of relevance between queries and documents, is not stable: its gradient may go to infinity when the Euclidean norm of the representation vector is close to zero, leading to exploding gradients and unstable training. Figure 1 demonstrates the gradients of cosine similarity and our proposed smooth cosine similarity. In addition, in most previous work, the loss functions are either heuristically designed or margin-based (Rennie and Srebro, 2005), tailored to particular properties of the model but lacking interpretability and convergence guarantees. These scenarios pose a new challenge: the robustness of the cross-lingual retrieval process needs to be well addressed.

To tackle these issues, we introduce an end-to-end robust framework that achieves high accuracy on the cross-lingual information retrieval task across different document languages. Particularly, for each query from language A, we are given a set of documents from a different language B together with their degrees of relevance, which typically belong to ordered classes. The goal of learning-to-rank is to return a subset of the documents, ordered or unordered depending on the evaluation metric, that are more relevant to the query. The most common evaluation metrics, such as the Normalized Discounted Cumulative Gain (NDCG) and Precision, however, are discontinuous and thus cannot be directly optimized. As a result, researchers usually assume an unknown continuous relevance score, where a higher value indicates higher relevance between the query and the document. To optimize the model, a surrogate loss function is typically used, which is easier to optimize with respect to the relevance score. Subsequently, a subset of the documents is chosen by ordering the scores from high to low, to serve the ranking purpose.

Our contributions in this paper can be summarized as follows:


First, we propose a novel measure of relevance between queries and documents, Smooth Cosine Similarity (SCS), whose gradient is bounded such that exploding gradients can be avoided, stabilizing the model training.

Second, we propose a smooth loss function, Smooth Ordinal Search Loss (SOSL), and provide theoretical guarantees on the generalization error bound for this proposed framework.

Third, we empirically show significant gains of our approach over other document search models under commonly used ranking metrics on the cross-lingual document retrieval task, by conducting experiments in a variety of languages.
2 Related Work
Document Retrieval
Researchers have applied machine learning methods to a variety of document retrieval tasks.
Deerwester et al. (1990) proposed LSI, which maps a query and its relevant documents into the same semantic space, where they are close, by grouping terms appearing in similar contexts into the same cluster. Salakhutdinov and Hinton (2009) proposed a semantic hashing (SH) method that uses nonlinear deep neural networks to learn features for information retrieval. Siamese neural networks were first introduced by Bromley et al. (1993), where two identical neural architectures receive different types of input vectors (e.g., query and document vectors in information retrieval tasks). Huang et al. (2013) introduced deep structured semantic models (DSSM), which project the query and the document into a common low-dimensional space using feedforward neural networks and use the cosine similarity between a query vector and a document vector as the relevance score. Shen et al. (2014) and Palangi et al. (2016) extended the feedforward structure in DSSM to convolutional neural networks (CNNs) and recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM) cells. Differing from previous works that used click-through data with only two classes, Nigam et al. (2019) proposed a loss function that differentiates three classes (relevant, partially relevant, and irrelevant) in document search with a huge amount of real commercial data.

Cross-Lingual Information Retrieval

Traditionally, Cross-Lingual Information Retrieval (CLIR) is conducted as a two-step pipeline: machine translation followed by monolingual information retrieval (Nie, 2010). However, this approach requires a well-trained translation model and usually suffers from translation ambiguity (Zhou et al., 2012). Error propagation from machine translation may even deteriorate the retrieval results. As an alternative, pretrained word embeddings (Mikolov et al., 2013; Pennington et al., 2014), which learn word representations of different languages from large-scale text corpora, have led to a surge of improved performance on many language tasks. Nevertheless, the training objective of these embeddings differs from that of IR tasks, so their direct application may be limited.
Generalization Error
There has been previous work on the generalization error of learning-to-rank models. Lan et al. (2008) analyzed the stability of pairwise models and gave query-level generalization error bounds. Lan et al. (2009) provided a theoretical framework for ranking algorithms and proved generalization error bounds for three listwise losses: ListMLE, ListNet and RankCosine. Chapelle and Wu (2010) introduced an annealing procedure to find the optimal smoothing factor for an extension of the surrogate loss SoftRank, and derived a generally applicable bound on the generalization error of query-level learning-to-rank algorithms. Tewari and Chaudhuri (2015) proved that several loss functions used in learning-to-rank, such as the cross-entropy loss, suffer no degradation in generalization ability as document lists become longer. However, theoretical analysis of search models with neural architectures is still very limited.
3 Smooth Neural Document Retrieval
In this section, we propose a novel smooth cross-lingual document retrieval framework, which consists of three parts. First, we use neural models to encode queries and documents from different languages and represent them by low-dimensional vectors. Second, we propose a smooth cosine similarity as the relevance score, which avoids gradient explosion and therefore stabilizes the training process. Finally, we introduce the Smooth Ordinal Search Loss for optimization.
3.1 Text Representation
We use embeddings, i.e., low-dimensional dense vectors, to represent both queries and documents. Differing from monolingual tasks, in cross-lingual document retrieval one can rarely observe common tokens between queries and documents. Therefore, cross-lingual document retrieval usually requires embeddings built from different vocabularies, as queries and documents come from two different languages (Sasaki et al., 2018). This requirement naturally enlarges the parameter set of the model if we regard the embeddings as model parameters and fine-tune them during training, so the retrieval model itself demands greater stability.
Queries and documents are represented numerically. A query is tokenized into a list of words; for example, the tokenization of “Apple is a fruit” is [“Apple”, “is”, “a”, “fruit”], of length 4. A document from the document language is expressed in the same way. For simplicity, we use A and B to denote the query language and the document language, respectively. We also choose the same dimension for the word embeddings of both languages. We then encode the query and the document by embedding matrices, where the i-th column of the query matrix is the embedding of the token in the i-th position of the query, and the j-th column of the document matrix is the embedding of the token in the j-th position of the document.
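The embedding lookup above can be sketched as follows; the toy vocabularies, the tiny embedding dimension, and the random values are illustrative assumptions (the paper itself uses 64-dimensional Polyglot embeddings):

```python
import numpy as np

# Illustrative sketch of the embedding lookup described above. The toy
# vocabularies and 4-dim random embeddings are assumptions for illustration.
EMB_DIM = 4

rng = np.random.default_rng(0)
vocab_A = {"apple": 0, "is": 1, "a": 2, "fruit": 3}    # query language A
vocab_B = {"pomme": 0, "est": 1, "un": 2, "fruit": 3}  # document language B
E_A = rng.normal(size=(len(vocab_A), EMB_DIM))         # embedding table for A
E_B = rng.normal(size=(len(vocab_B), EMB_DIM))         # embedding table for B

def encode(tokens, vocab, table):
    """Stack word embeddings column-wise: the i-th column is the embedding
    of the token in the i-th position."""
    return np.stack([table[vocab[t]] for t in tokens], axis=1)

Q = encode(["apple", "is", "a", "fruit"], vocab_A, E_A)  # shape (EMB_DIM, 4)
D = encode(["pomme", "est", "un"], vocab_B, E_B)         # shape (EMB_DIM, 3)
```

Because the two tables are separate, the query and document vocabularies share no parameters, matching the cross-lingual setup above.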
3.2 Neural Model Architecture
With the embedding matrices, which can be of different sizes due to the different numbers of tokens in the query and the document, we apply neural models to obtain vectors of the same size. Carefully designed neural models perform dimension reduction along the sequence length and thus project the raw texts of both queries and documents into the same Euclidean space, regardless of the number of tokens in them. We can then quantify the relevance between a query and a document, as they live in the same space, by computing standard metrics such as cosine similarity. For example, a query and two documents of different lengths are represented by embedding matrices of different widths, which live in different spaces; this makes quantitative comparison of the relevance of the two query-document pairs difficult. To mitigate this difficulty, a well-designed neural model transforms every matrix into a vector of a fixed common dimension.
To achieve this goal, we propose to use average pooling over the columns of the embedding matrices, followed by the nonlinear activation function tanh, for both the query and document models. For a query, the query model takes the embedding matrix as input and outputs a final vector representation; similarly, the document model generates a final representation in the same space. There are other modeling choices such as LSTMs and CNNs; we compare the performance of different neural models in the experiments. The benefits of this model are twofold: on one hand, despite the nonlinearity, the models are smooth and Lipschitz continuous with respect to the embedding parameters, and therefore benefit from convergence of the generalization error during training; on the other hand, average pooling introduces no extra parameters, which simplifies the model space and reduces tuning effort during training, in accordance with the findings of Nigam et al. (2019).
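A minimal sketch of this encoder, assuming the representation is simply tanh applied to the column-wise mean of the embedding matrix:

```python
import numpy as np

# Minimal sketch of the query/document encoder described above: average
# pooling over the columns of the embedding matrix, followed by tanh.
def encode_text(E):
    """E: (emb_dim, n_tokens) embedding matrix -> (emb_dim,) representation."""
    return np.tanh(E.mean(axis=1))

rng = np.random.default_rng(0)
q_emb = rng.normal(size=(64, 4))   # a query of 4 tokens
d_emb = rng.normal(size=(64, 37))  # a document of 37 tokens

q_vec = encode_text(q_emb)  # both outputs land in the same 64-dim space,
d_vec = encode_text(d_emb)  # regardless of sequence length
```

Because pooling has no weights, the only trainable parameters are the embeddings themselves, consistent with the parameter-free pooling argument above.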
3.3 Smooth Cosine Similarity
Cosine similarity has been widely used to measure the relevance between queries and documents in information retrieval. Given two vectors of the same size in Euclidean space, cosine similarity measures the angle between them, irrespective of their norms. However, the norms of the vectors play a crucial role when calculating the gradient. More specifically, the gradient goes to infinity as the norms approach zero, resulting in unstable weight updates during training; this phenomenon is also known as exploding gradients. The intuition is that for a vector of small norm, a slight disturbance can greatly change the angle between it and another vector, i.e., the cosine similarity. The use of cosine similarity can thus lead to exploding gradients regardless of the model structure whenever gradient descent methods are used for optimization. Most recent semantic matching and learning-to-rank models are built on neural architectures and optimized by gradient descent, so these retrieval models suffer greatly from this issue.
To increase the stability of model training, we further propose Smooth Cosine Similarity (SCS) in place of the regular cosine similarity. We define the SCS between a query representation v_q and a document representation v_d as

SCS(v_q, v_d) = <v_q, v_d> / ((‖v_q‖ + b)(‖v_d‖ + b)),    (1)

where ‖·‖ is the Euclidean norm and b > 0 is a smoothness hyperparameter. Under SCS, the gradient with respect to v_q and v_d is upper bounded over the whole space, which stabilizes the training procedure. Moreover, by introducing this additional smoothness hyperparameter into the norms of the feature representation vectors, the similarity score measures not only the angle between the vectors but also carries information about their norms. As a result, SCS is not order-preserving with respect to cosine similarity, i.e., a higher cosine similarity does not necessarily imply a higher SCS. The choice of b is flexible, and the model performance is not sensitive to it, as analyzed further in the experiments.
Another common method to avoid exploding gradients is gradient clipping, i.e., rescaling gradients whose norm exceeds a given threshold. Our proposed SCS does not exclude gradient clipping; in fact, the two complement each other. In our pilot experiments, we observed that gradient clipping alone is not sufficient in our cross-lingual document retrieval setup, while adding SCS improves performance over using gradient clipping alone.
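The contrast between the two similarities can be checked numerically. The sketch below assumes the smoothness term b is added to each Euclidean norm in the denominator, which is our reading of the definition:

```python
import numpy as np

# Numerical check of the exploding-gradient argument above. The SCS form
# (smoothness term b added to each norm in the denominator) is an assumption.
def cosine(q, d):
    return q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

def scs(q, d, b=1.0):
    return q @ d / ((np.linalg.norm(q) + b) * (np.linalg.norm(d) + b))

def grad_norm(f, q, d, eps=1e-6):
    """Finite-difference norm of the gradient of f with respect to q."""
    g = np.zeros_like(q)
    for i in range(q.size):
        e = np.zeros_like(q)
        e[i] = eps
        g[i] = (f(q + e, d) - f(q - e, d)) / (2 * eps)
    return np.linalg.norm(g)

d = np.array([1.0, 0.0])
q_small = np.array([1e-4, 1e-4])      # query vector with near-zero norm
big = grad_norm(cosine, q_small, d)   # blows up as the norm shrinks
small = grad_norm(scs, q_small, d)    # stays bounded thanks to b > 0
```

Shrinking q_small further makes the cosine gradient grow without bound, while the SCS gradient stays near 0.5 in this example.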
3.4 Smooth Ordinal Search Loss
In a search ranking model, it is critical to define a surrogate loss function, since ranking metrics such as NDCG and Precision are not continuous and are therefore difficult to optimize. When choosing a proper loss function, one has to consider two criteria: first, minimizing the surrogate loss on the training set should imply a small surrogate loss on the test set; second, a small surrogate loss on the test set should imply good ranking metrics on the test set (Chapelle et al., 2011). To address the first criterion, we formulate the search ranking model as an ordinal regression problem. For the second, we propose the Smooth Ordinal Search Loss (SOSL) as the surrogate loss.
Recall that each query-document pair has an ordered relevance level, and the goal of a search ranking model is to select a subset of documents such that more relevant documents are ranked higher and less relevant documents lower. Taking a three-class ranking problem as a concrete example, the pairs can be grouped into relevant, partially relevant and irrelevant. If a relevant pair must be misranked, it is preferable to rank a partially relevant document above an irrelevant one; that is, not all mistakes are equal.
Inspired by the immediate-threshold construction with a margin penalty function (Rennie and Srebro, 2005), we formally propose our loss function, the Smooth Ordinal Search Loss (SOSL), as follows:

SOSL(s, y) = 1{s < θ_{y−1}} φ(θ_{y−1} − s) + 1{s > θ_y} φ(s − θ_y),    (2)

where s is the smooth relevance score between a query and a document, y is the ordered class label denoting the general relevance degree, the θ's are the thresholds, 1{·} is the indicator function, and φ is the penalty function. Differing from the margin penalty function used in Rennie and Srebro (2005)'s work, we choose a smooth function φ. The thresholds lie within (−1, 1) in our setup, instead of the whole real line, due to the boundedness of the smooth cosine similarity.
The interpretation of this loss function is intuitive, as shown in Figure 2: if the relevance score falls into the correct segment, i.e., between the two thresholds of the pair's true ordered class, then the loss is zero; otherwise, the loss is the degree to which the relevance score violates the threshold.
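A minimal sketch of this immediate-threshold loss; the squared violation penalty and the threshold values below are illustrative assumptions, not necessarily the paper's exact choices:

```python
# Illustrative sketch of the immediate-threshold ordinal loss described
# above. The squared violation penalty and the threshold values are
# assumptions for illustration.
def sosl(score, label, thresholds):
    """Ordered classes 0..K over segments of [-1, 1] cut by the thresholds.
    Zero loss when the score falls inside its class's segment; otherwise a
    smooth penalty on the amount by which the nearest threshold is violated."""
    cuts = [-1.0] + sorted(thresholds) + [1.0]
    lo, hi = cuts[label], cuts[label + 1]
    loss = 0.0
    if score < lo:
        loss += (lo - score) ** 2
    if score > hi:
        loss += (score - hi) ** 2
    return loss

thresholds = [-0.2, 0.4]  # NR below -0.2, SR in between, MR above 0.4 (assumed)
print(sosl(0.1, 1, thresholds))  # SR score inside its segment -> 0.0
print(sosl(0.6, 0, thresholds))  # NR scored far too high -> positive loss
```

Note that a larger threshold violation is penalized more, so ranking a partially relevant document above an irrelevant one costs less than the reverse, matching the "not all mistakes are equal" intuition.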
4 Theoretical Analysis
Generalization error measures the difference between training error and test error. Typically, if the generalization error of a model is bounded and converges to zero, then minimizing the empirical loss on the training set implies that the expected loss on the unseen test set is also minimized. Although previous work gave generalization bounds for surrogate losses in learning-to-rank models (Lan et al., 2008, 2009; Tewari and Chaudhuri, 2015), to the best of our knowledge, no theoretical result has been derived for search models with neural architectures. Here, we prove a generalization error bound for the commonly used SGD procedure. This bound suggests that the generalization gap at any training step converges to zero as the number of training query-document pairs goes to infinity. We show the detailed proof in the appendix.
The following Proposition 1 and Lemma 2 show that SOSL is both smooth and Lipschitz continuous with respect to not only the relevance score but also the embedding parameters.

Proposition 1.
SOSL is smooth and Lipschitz continuous with respect to the relevance score.

Lemma 2.
Let the loss be smooth and Lipschitz continuous with respect to the relevance score; then, in our search ranking model, the loss is also smooth and Lipschitz continuous with respect to the model parameters, i.e., the embeddings.
From Lemma 2, we can assume the loss to be Lipschitz continuous and smooth. Next, suppose the training set is sampled from the data distribution with a given sample size, and assume the unseen test data comes from the same distribution. Queries and documents are encoded by the neural models described above. Defining the mean training error and the expected error on the test set, the following theorem establishes an upper bound on the generalization error, i.e., their difference.
Theorem 3.
(Generalization error bound) Let the loss be a smooth function bounded by a constant. Suppose we run SGD for T steps with a suitable step size; then, with probability at least 1 − δ over the draw of the training set, the generalization error is upper bounded by a term that vanishes as the sample size grows.
Note that the bound in Theorem 3 does not depend on the thresholds. We ignore the dependency on the Lipschitz and smoothness constants, as both are constants. Theorem 3 suggests that, for any fixed number of SGD steps T, the generalization error converges as the sample size increases. For a proper step size, we can even allow T to grow with the sample size while the generalization error still converges.
5 Experiments
Datasets
We use the publicly available large-scale Cross-Lingual Information Retrieval (CLIR) dataset built from Wikipedia (Sasaki et al., 2018) for our experiments. All queries are in English, extracted as the first sentences of English pages with the title words removed. The Relevant (MR) documents are the foreign-language pages having an interlanguage link to the English page; the Partially Relevant (SR) documents are those having mutual links to and from the relevant documents. Additionally, we randomly sample 40 other pages per query as Irrelevant (NR) documents. To provide a comprehensive study, we use the document sets of two high-resource languages, French (fr) and Italian (it), and two low-resource languages, Swahili (sw) and Tagalog (tl). Queries are randomly split into training, validation and test sets with a 3:1:1 ratio. We include the data statistics in Table 1.
Language  #Query  #SR/Q  #Documents 

French  25000  12.6  1894000 
Italian  25000  11.7  1347000 
Swahili  22793  1.5  37000 
Tagalog  25000  0.6  79000 
Evaluation Metrics
We evaluate with commonly used ranking metrics, including Precision, NDCG, MAP and MRR. For each query, the corresponding documents are sorted by their relevance scores, and the metrics are averaged over all queries. P@1 is the precision of the MR document at the top-1 position. P@5 is the precision of MR and SR documents combined within the top 5 positions. NDCG@5 is the Normalized Discounted Cumulative Gain over the top 5 documents. MAP is the mean of the per-query average precision scores. MRR is the mean reciprocal rank of the MR document.
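These metrics can be sketched for a single query as follows; the relevance grades (2 = MR, 1 = SR, 0 = NR) and the example ranking are made up:

```python
import numpy as np

# Sketch of the ranking metrics named above, for a single query.
# Relevance grades: 2 = MR, 1 = SR, 0 = NR.
def dcg(rels):
    return sum(r / np.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(rels, k):
    ideal = sorted(rels, reverse=True)
    return dcg(rels[:k]) / dcg(ideal[:k])

def precision_at_k(rels, k, relevant=(1, 2)):
    return sum(r in relevant for r in rels[:k]) / k

def mrr(rels, target=2):
    """Reciprocal rank of the first MR document (0 if absent)."""
    for i, r in enumerate(rels):
        if r == target:
            return 1.0 / (i + 1)
    return 0.0

ranking = [1, 2, 0, 1, 0]          # relevance grades in predicted order
p5 = precision_at_k(ranking, 5)    # 3 of the top 5 are MR or SR -> 0.6
rr = mrr(ranking)                  # MR sits at rank 2 -> 0.5
n5 = ndcg_at_k(ranking, 5)
```

In practice each metric is computed per query and then averaged over all queries, as described above.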
Experimental Setup
In our experiments, we use the pretrained Polyglot (AlRfou et al., 2013) embeddings of dimension 64 as the initialization for the corresponding languages. These embeddings are fine-tuned during training. We also observe improved performance from shuffling the training set. We select the thresholds for SOSL via cross-validation using grid search. We use the Adam optimizer in all experiments, with a fixed learning rate of 0.01. We set the batch size to 128 and stop after 30 epochs for all languages, resulting in about 310,000 training steps on French and 240,000 on Tagalog, with the other two languages in between. We release our code and experimental setup to benefit the community and promote research on this topic.¹

¹ https://github.com/JiapengL/multi_ling_search

Language  Loss function  MAP  

French  
0.411  0.763  0.560  0.754  0.766  0.565  0.889  
MSE  0.253  0.700  0.603  0.727  0.792  0.443  0.854  
0.254  0.704  0.604  0.729  0.795  0.445  0.856  
Italian  
0.385  0.748  0.572  0.751  0.768  0.545  0.883  
MSE  0.231  0.699  0.618  0.731  0.803  0.427  0.862  
0.232  0.705  0.619  0.734  0.806  0.430  0.863  
Swahili  
0.599  0.907  0.314  0.771  0.737  0.732  0.827  
MSE  0.351  0.851  0.360  0.724  0.738  0.558  0.771  
0.351  0.855  0.362  0.726  0.740  0.560  0.772  
Tagalog  
0.583  0.925  0.251  0.790  0.760  0.730  0.785  
MSE  0.455  0.892  0.251  0.732  0.709  0.639  0.726  
0.463  0.896  0.252  0.737  0.715  0.645  0.730 
5.1 Results
SOSL and Other Loss Functions
We first present the results of using different loss functions across languages with smooth cosine similarity in Table 2. We compare with three commonly used losses: Mean Squared Error (MSE), the Proportional Odds loss (PO) (McCullagh, 1980) and the 3-part L2 loss (Nigam et al., 2019). The smoothness hyperparameter b is fixed to 1 for all losses. We observe that SOSL outperforms the other loss functions in all languages and on all metrics. We attribute this success to two factors: first, SOSL encourages smoothness over the optimization of parameters and thus guarantees convergence of the generalization error; second, SOSL adds no penalty when the relevance score falls into the correct segment. Note that the performance on the low-resource languages (sw and tl) is better than on the high-resource languages (fr and it) for some metrics, because the former have fewer SR documents and fewer total documents per query, which makes it easier to distinguish MR and NR documents and thus to rank the MR document at the top.

To investigate why SOSL performs best, we plot the density curves of the relevance scores predicted under each loss function on the training set in Figure 3. All three classes are well separated in the top left with SOSL. For the 3-part L2 loss, however, we notice a large overlap between the NR and SR groups. In the plots of the PO and MSE losses, few MR documents are classified correctly; moreover, MR and SR documents are mixed together, with a portion of MR incorrectly classified as NR, which is undesirable. This analysis showcases the power of SOSL to distinguish different types of documents at the end of training.
Average Pooling and Other Neural Architectures
We also compare average pooling with other popular neural architectures: the DSSM-CNN model (Shen et al., 2014) and the DSSM-LSTM model (Palangi et al., 2016), which are widely used in information retrieval. After tuning these two models, we specify their hyperparameters as follows: the initial learning rate is 0.001 with an exponential decay rate of 0.95 for both DSSM-CNN and DSSM-LSTM. We use a batch size of 128 for DSSM-CNN, stopping after 30 epochs, and a batch size of 64 for DSSM-LSTM, stopping after 15 epochs.

We set the window size to 3, i.e., word 3-grams, for DSSM-CNN, with 300 filters in the convolutional layer. A max-pooling layer and a fully connected layer with output size 64 are stacked after the convolutional layer. In the DSSM-LSTM model, we use a bidirectional LSTM with 64 hidden units, and concatenate the first hidden state of the forward LSTM and the last hidden state of the backward LSTM as the output of the LSTM layer. This is followed by a fully connected layer with output size 64 as the final output. To regularize DSSM-CNN and DSSM-LSTM, we apply dropout (Srivastava et al., 2014) on the word embedding layer with a rate of 0.4. We observe that using separate modules for queries and documents improves performance compared to sharing the same model, so two modules with the same structure are used for the CNN and the LSTM.

In Table 3, we compare the results of the average pooling architecture with the DSSM-CNN and DSSM-LSTM models. Both original studies used data containing only two classes, positive and negative, and designed binary loss functions accordingly. For a fair comparison, all neural architectures are followed by the same SOSL with the same thresholds used in Table 2. Across all languages and all evaluation metrics, average pooling performs best among the three. We attribute the success of average pooling to two reasons: first, queries typically have fewer words than documents and do not tend to have long-range dependencies; second, the original approaches only deal with two classes, lacking the flexibility to handle more. In addition, CNNs and LSTMs are more difficult to optimize, as they require more computational resources and training time and tend to overfit. Our results are in accordance with the findings of Nigam et al. (2019).
Language  Model  MAP  

French  Average Pooling  
DSSMCNN  0.262  0.656  0.542  0.570  0.709  0.437  0.812  
DSSMLSTM  0.335  0.718  0.560  0.716  0.748  0.503  0.846  
Italian  Average Pooling  
DSSMCNN  0.217  0.620  0.547  0.656  0.709  0.394  0.805  
DSSMLSTM  0.248  0.675  0.551  0.675  0.714  0.434  0.814  
Tagalog  Average Pooling  
DSSMCNN  0.450  0.861  0.235  0.699  0.668  0.624  0.693  
DSSMLSTM  0.526  0.907  0.242  0.748  0.709  0.688  0.736  
Swahili  Average Pooling  
DSSMCNN  0.457  0.869  0.310  0.700  0.680  0.628  0.743  
DSSMLSTM  0.527  0.897  0.319  0.738  0.713  0.685  0.772 
5.2 Impact of Hyperparameters
The smoothness factor b determines the smoothness of the cosine similarity. When b is 0, the loss function is non-smooth with respect to the weights of the neural models, and generalization cannot be guaranteed. To show the improvement from adding smoothness and to illustrate the effect of b, we vary b from 0 to 2 for SOSL on the French documents. For each b, we use grid search to find the best thresholds. The results are visualized in Figure 4. Model performance improves once the model becomes smooth; moreover, within a large range of b (from 0.25 to 1.75), the smooth cosine similarity surpasses the non-smooth one. We see a concave curve for all metrics, but performance is relatively insensitive to the choice of b. We suggest taking b within this range for good output and stability.
5.3 Different Numbers of Negative Samples
In real industrial search systems, ranking is usually run after a document “filtering” stage, e.g., the matching stage, which can greatly reduce the number of documents to be ranked. In Figure 5, we explore the effect of different numbers of irrelevant (NR) documents. We create 9 datasets in which the number of NR documents per query varies from 20 to 100, with the high-resource language French as the document language. We sample the same numbers of queries for the training, validation and test sets as in the experiments discussed earlier. The average number of SR documents per query varies but remains close to 12.6. The red curve shows that, as the number of NR documents increases, the data become noisier and it is more difficult to rank and predict the MR document correctly. On the other hand, NDCG, MRR and MAP are relatively resistant to the increased number of NR documents, decreasing only about 10% as the number of NR documents grows. This also validates the stability of our proposed framework.
6 Conclusion
In this study, we propose a smooth learning framework for the cross-lingual information retrieval task. We first introduce a novel measure of relevance between queries and documents, Smooth Cosine Similarity (SCS), whose bounded gradient avoids exploding gradients and makes training more stable. Additionally, we propose a smooth loss function, the Smooth Ordinal Search Loss (SOSL), and provide theoretical guarantees on the generalization error bound of the whole framework. Further, we conduct extensive experiments comparing our approach with existing document search models, showing significant improvements on commonly used ranking metrics for the cross-lingual document retrieval task in several languages. Both the theoretical and the empirical results suggest potentially wide application of this smooth learning framework.
References
 Agarwal [2008] Shivani Agarwal. Generalization bounds for some ordinal regression algorithms. In International Conference on Algorithmic Learning Theory, pages 7–21. Springer, 2008.
 AlRfou et al. [2013] Rami AlRfou, Bryan Perozzi, and Steven Skiena. Polyglot: Distributed word representations for multilingual nlp. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
 Bousquet and Elisseeff [2002] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.

Bromley et al. [1993] Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.
 Chapelle and Wu [2010] Olivier Chapelle and Mingrui Wu. Gradient descent optimization of smoothed information retrieval metrics. Information retrieval, 13(3):216–235, 2010.
 Chapelle et al. [2011] Olivier Chapelle, Yi Chang, and TieYan Liu. Future directions in learning to rank. In Proceedings of the Learning to Rank Challenge, pages 91–100, 2011.
 Deerwester et al. [1990] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990.
 Hardt et al. [2015] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
 Huang et al. [2013] PoSen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proc. of the Conference on Information and Knowledge Management (CIKM), 2013.
 Lan et al. [2008] Yanyan Lan, TieYan Liu, Tao Qin, Zhiming Ma, and Hang Li. Querylevel stability and generalization in learning to rank. In Proceedings of the 25th international conference on Machine learning, pages 512–519, 2008.
 Lan et al. [2009] Yanyan Lan, TieYan Liu, Zhiming Ma, and Hang Li. Generalization analysis of listwise learningtorank algorithms. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 577–584, 2009.
 McCullagh [1980] Peter McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42(2):109–127, 1980.
 Mikolov et al. [2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS), pages 1–9, 2013.
Nie [2010] Jian-Yun Nie. Cross-language information retrieval. Synthesis Lectures on Human Language Technologies, 3(1):1–125, 2010.
 Nigam et al. [2019] Priyanka Nigam, Yiwei Song, Vijai Mohan, Vihan Lakshman, Weitian Ding, Choon Hui Teo, Hao Gu, Bing Yin, and Ankit Shingavi. Semantic Product Search. In Proc. of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2019.
Palangi et al. [2016] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(4):694–707, 2016.

Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2014.
Rennie and Srebro [2005] Jason DM Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, volume 1. Kluwer, Norwell, MA, 2005.
Salakhutdinov and Hinton [2009] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
Sasaki et al. [2018] Shota Sasaki, Shuo Sun, Shigehiko Schamoni, Kevin Duh, and Kentaro Inui. Cross-Lingual Learning-to-Rank with Shared Representations. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2018.
Shen et al. [2014] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proc. of the Conference on Information and Knowledge Management (CIKM), 2014.
 Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
 Tewari and Chaudhuri [2015] Ambuj Tewari and Sougata Chaudhuri. Generalization error bounds for learning to rank: Does the length of document lists matter? In International Conference on Machine Learning, pages 315–323, 2015.
 Zhou et al. [2012] Dong Zhou, Mark Truran, Tim Brailsford, Vincent Wade, and Helen Ashman. Translation techniques in crosslanguage information retrieval. ACM Computing Surveys (CSUR), 45(1):1–44, 2012.
Appendix A Appendices
A.1 Proof of Lemma 2
Definition A.1.
A function $f$ is $L$-Lipschitz if for all $x, y$ in its domain, we have
$\|f(x) - f(y)\| \le L \|x - y\|.$
Lemma A.1.
If the functions $f$ and $g$ are $L_1$-Lipschitz and $L_2$-Lipschitz respectively, then $f \circ g$ is $L_1 L_2$-Lipschitz.
Here we use $\circ$ to denote the function composition operator, i.e. $(f \circ g)(x) = f(g(x))$. It is easy to prove the lemma by the definition of Lipschitz continuity.
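Assuming $f$ is $L_1$-Lipschitz and $g$ is $L_2$-Lipschitz, the composition bound follows in one line from the definition:

```latex
\|(f \circ g)(x) - (f \circ g)(y)\|
  = \|f(g(x)) - f(g(y))\|
  \le L_1 \|g(x) - g(y)\|
  \le L_1 L_2 \|x - y\|.
```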
Definition A.2.
A function $f$ is $\beta$-smooth if for all $x, y$ in its domain, we have
$\|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|.$
Lemma A.2.
If the function $f$ is $L_1$-Lipschitz and $\beta_1$-smooth, and $g$ is $L_2$-Lipschitz and $\beta_2$-smooth, then $f \circ g$ is $(L_2^2 \beta_1 + L_1 \beta_2)$-smooth.
Proof.
We have the following chain of inequalities:
$\|\nabla (f \circ g)(x) - \nabla (f \circ g)(y)\| = \|\nabla g(x)^\top \nabla f(g(x)) - \nabla g(y)^\top \nabla f(g(y))\|$
$\le \|\nabla g(x)\| \, \|\nabla f(g(x)) - \nabla f(g(y))\| + \|\nabla f(g(y))\| \, \|\nabla g(x) - \nabla g(y)\|$
$\le L_2 \beta_1 \|g(x) - g(y)\| + L_1 \beta_2 \|x - y\| \le (L_2^2 \beta_1 + L_1 \beta_2) \|x - y\|.$
In the second inequality, we use the Lipschitz property that $\|\nabla g(x)\| \le L_2$ and $\|\nabla f(g(y))\| \le L_1$. ∎
Proof of Proposition 1
Proof.
Let $K$ be the number of classes and set $\theta_0 = -\infty$, $\theta_K = +\infty$. In Immediate Threshold with logistic loss,
$\mathcal{L}(f, \theta; x, y) = \ell\big(f(x) - \theta_{y-1}\big) + \ell\big(\theta_y - f(x)\big),$
where
$\ell(z) = \log(1 + e^{-z}).$
It is easy to see that
$|\ell'(z)| = \frac{1}{1 + e^{z}} \le 1$
for a given $z$, so
$\big|\partial \mathcal{L} / \partial f(x)\big| \le 2$
for any $x$. Also
$|\ell''(z)| = \frac{e^{z}}{(1 + e^{z})^{2}} \le \frac{1}{4}$
for any $z$ and $y$, which means $\mathcal{L}$ is $2$-Lipschitz and $\frac{1}{2}$-smooth with respect to $f(x)$.
∎
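As a quick numerical illustration (not part of the original proof), assuming the Immediate Threshold construction with the logistic loss $\ell(z) = \log(1 + e^{-z})$, the first derivative of $\ell$ stays in $[-1, 0]$ and the second derivative in $[0, 1/4]$, matching the bounds above:

```python
import math

def ell(z):
    # logistic loss log(1 + e^{-z}), written in a numerically stable form
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)

def ell_prime(z):
    # first derivative of the logistic loss: -1 / (1 + e^z)
    return -1.0 / (1.0 + math.exp(z))

def ell_second(z):
    # second derivative: e^z / (1 + e^z)^2, maximized at z = 0 with value 1/4
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# gradient and curvature are uniformly bounded over a wide range of scores
grid = [x / 10.0 for x in range(-500, 501)]
assert all(abs(ell_prime(z)) <= 1.0 for z in grid)
assert all(0.0 <= ell_second(z) <= 0.25 for z in grid)
print("max |l'(z)|  =", max(abs(ell_prime(z)) for z in grid))
print("max |l''(z)| =", max(ell_second(z) for z in grid))
```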
Proof of Lemma 2
To give a high-level view of the proof, we decompose the loss function into three functions and show that each of them is Lipschitz continuous and smooth with respect to its domain.
Proof.
Let $h(q) := s(q, d)$ be the similarity score viewed as a function of the query representation $q$ when the document representation $d$ is fixed. For a fixed $d$, the gradient $\nabla_q h(q)$ exists and its norm is bounded by a constant.
To show that $h$ is also smooth, we calculate the Hessian matrix of $h$. Writing $\nabla_q^2 h(q)$ in terms of $q$, $d$, and the $n \times n$ identity matrix $I$, it is easy to see that $\|\nabla_q^2 h(q)\| \le C$, where $C$ is a constant that depends only on $d$. Therefore, $h$ is Lipschitz and smooth with respect to $q$. By symmetry, the same holds with respect to $d$.
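For contrast, this boundedness fails for the plain cosine similarity discussed in the introduction: its gradient with respect to $q$ is $\nabla_q \cos(q, d) = \frac{d}{\|q\|\|d\|} - \frac{(q^\top d)\, q}{\|q\|^3 \|d\|}$, whose norm grows without bound as $\|q\| \to 0$. A small numerical sketch (illustrative toy vectors only):

```python
import math

def cos_grad_norm(q, d):
    # norm of the gradient of cos(q, d) with respect to q
    nq = math.sqrt(sum(x * x for x in q))
    nd = math.sqrt(sum(x * x for x in d))
    dot = sum(a * b for a, b in zip(q, d))
    grad = [d_i / (nq * nd) - dot * q_i / (nq ** 3 * nd)
            for d_i, q_i in zip(d, q)]
    return math.sqrt(sum(g * g for g in grad))

d = [1.0, 2.0, 2.0]
# shrink q toward the origin: the gradient norm scales like 1 / ||q||
for scale in (1.0, 1e-2, 1e-4):
    q = [scale * 0.5, scale * 1.0, scale * -1.0]
    print(f"scale {scale:>6}: |grad| = {cos_grad_norm(q, d):.3e}")
```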
Deep Neural Networks normally do not enjoy Lipschitz continuity or smoothness due to their expressive power, which makes it difficult to analyze their theoretical performance. We now show that our particular models are Lipschitz continuous and smooth, for both the query and the document encoders, with respect to the embedding parameters.
For a text query $q$ with token length $T$, we represent it by $X_q = (x_1, \dots, x_T)$, where $x_t$ is the one-hot vector whose entry at the vocabulary index of the $t$-th token of the text is 1 and whose remaining entries are 0. The output of the query encoding network is then $\phi(q) = \tanh(W_e X_q)$, where $W_e$ is the embedding matrix.
Note that the hyperbolic tangent is smooth and Lipschitz, and $W_e X_q$ is linear in terms of $W_e$. We can thus claim that $\phi(q)$ is smooth and Lipschitz continuous with respect to $W_e$.
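As an illustrative sketch (the mean pooling over one-hot token vectors is an assumption here, not necessarily the paper's exact architecture), an encoder of the form $\tanh(W_e \bar{x})$ is linear in $W_e$ before the smooth, 1-Lipschitz $\tanh$ nonlinearity:

```python
import math

VOCAB, DIM = 8, 4

def one_hot(idx, size=VOCAB):
    v = [0.0] * size
    v[idx] = 1.0
    return v

def encode(token_ids, W):
    # mean of one-hot vectors, then the embedding lookup W @ x_bar
    # (linear in W), then elementwise tanh (smooth, 1-Lipschitz)
    T = len(token_ids)
    x_bar = [sum(one_hot(t)[j] for t in token_ids) / T for j in range(VOCAB)]
    pre = [sum(W[i][j] * x_bar[j] for j in range(VOCAB)) for i in range(DIM)]
    return [math.tanh(p) for p in pre]

# toy embedding matrix (DIM x VOCAB), deterministic for reproducibility
W = [[0.1 * (i + 1) * ((-1) ** j) for j in range(VOCAB)] for i in range(DIM)]
print(encode([1, 3, 3, 5], W))
```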
A.2 Proof of Theorem 3
In the Ordinal Regression stage, let $x = (q, d)$ represent an input query-document pair and let $y \in \mathcal{Y} = \{1, \dots, K\}$, where $K$ is the number of relevance levels. Given a sequence of samples $S = ((x_1, y_1), \dots, (x_m, y_m))$, the goal is to learn a mapping $f: \mathcal{X} \to \mathbb{R}$ and a set of thresholds $\theta_1 \le \dots \le \theta_{K-1}$, then predict the label based on the relevance score. Let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and let $\mathcal{D}$ be a distribution on $\mathcal{Z}$; also let $\ell(f, \theta; (x, y))$ be the error of a sample pair $(x, y)$ for a function $f$ and thresholds $\theta$. We define the expected error and the empirical error as follows:
$R_\ell[f, \theta] = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(f, \theta; (x, y))\big],$ (3)
$\hat{R}_\ell[f, \theta] = \frac{1}{m} \sum_{i=1}^{m} \ell(f, \theta; (x_i, y_i)).$ (4)
We can prove the following theorem based on Agarwal [2008].
Theorem A.3.
Let $\mathcal{A}$ be an ordinal regression algorithm which, given as input a training sample $S$, learns a real-valued function $f_S$ and a threshold vector $\theta_S$. Let $\ell$ be any loss function in this setting such that $0 \le \ell(f_S, \theta_S; (x, y)) \le B$ for all training samples $S$ and all $(x, y) \in \mathcal{Z}$, and let $\beta$ be such that $\mathcal{A}$ has loss stability $\beta$ with respect to $\ell$. Then for any $0 < \delta < 1$ and for any distribution $\mathcal{D}$ on $\mathcal{Z}$, with probability at least $1 - \delta$ over the draw of $S$,
$R_\ell[f_S, \theta_S] \le \hat{R}_\ell[f_S, \theta_S] + \beta + (2m\beta + B)\sqrt{\frac{\ln(1/\delta)}{2m}}.$
Note that the original theorem in Agarwal [2008] is stated for instances $x$ on the real line $\mathbb{R}$, but it trivially extends to $x$ defined over $\mathbb{R}^n$. It is also easy to verify that the bound $\ell \le B$ holds, since our Ordinal Regression loss is bounded. If $\beta$ has a rate of $O(1/m)$, then the generalization bound goes to zero as $m$ goes to infinity. Next, we give the definition of stability and show that DNNs optimized by SGD satisfy this requirement.
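To make the last remark concrete: with a bound of the standard stability form $R_\ell \le \hat{R}_\ell + \beta + (2m\beta + B)\sqrt{\ln(1/\delta)/(2m)}$ and a stability rate $\beta = c/m$ for some constant $c$, the excess term is

```latex
\frac{c}{m} + (2c + B)\sqrt{\frac{\ln(1/\delta)}{2m}}
  = O\!\left(\sqrt{\frac{\ln(1/\delta)}{m}}\right)
  \xrightarrow[m \to \infty]{} 0,
```

so the gap between the expected and the empirical error vanishes as the sample size grows.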
A.3 Stability
Uniform Stability measures how much an algorithm is affected by the removal of one sample from the training set. Let $S = \{z_1, \dots, z_m\}$ be the training set and let $S^{\setminus i}$ represent the set obtained by removing the $i$-th element from $S$.
The formal definition was first proposed by Bousquet and Elisseeff [2002] and is stated as follows.
Definition A.3.
(Uniform Stability) An algorithm $\mathcal{A}$ has uniform stability $\beta$ with respect to the loss function $\ell$ if the following holds:
$\forall S \in \mathcal{Z}^m, \ \forall i \in \{1, \dots, m\}: \quad \sup_{z} \big|\ell(\mathcal{A}_S; z) - \ell(\mathcal{A}_{S^{\setminus i}}; z)\big| \le \beta.$
Recall that our goal is to learn a function $f$ that is parameterized by DNNs and optimized by SGD. The following theorem (Hardt et al. [2015], Theorem 3.8) gives the stability bound for convex loss minimization via SGD.
Theorem A.4.
Assume that the loss function $\ell(\cdot; z)$ is convex, $\beta$-smooth, and $L$-Lipschitz for every $z$. Suppose that we run SGD for $T$ steps with monotonically non-increasing step sizes $\alpha_t \le 2/\beta$. Then SGD satisfies uniform stability with
$\epsilon_{\mathrm{stab}} \le \frac{2L^2}{m} \sum_{t=1}^{T} \alpha_t.$
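A toy numerical illustration of the phenomenon behind the theorem (hypothetical 1-D data, and a replace-one perturbation rather than the remove-one form of Definition A.3): running SGD with the same sample order on two training sets that differ in a single example keeps the final iterates close.

```python
import math, random

def grad(w, x, y):
    # gradient of the 1-D logistic loss log(1 + exp(-y * w * x)) in w
    return -y * x / (1.0 + math.exp(y * w * x))

def sgd(data, steps, lr, seed=0):
    rng = random.Random(seed)
    w = 0.0
    order = [rng.randrange(len(data)) for _ in range(steps)]
    for t in order:
        x, y = data[t]
        w -= lr * grad(w, x, y)
    return w

# two training sets differing in exactly one example
S  = [(0.5, 1), (1.0, -1), (-0.8, 1), (0.3, -1)]
S_ = [(0.5, 1), (1.0, -1), (-0.8, 1), (0.9, 1)]   # last example replaced

# same seed => same sample order; the two runs differ only through the
# perturbed example, and the final iterates stay close (stability)
w1 = sgd(S, steps=200, lr=0.05)
w2 = sgd(S_, steps=200, lr=0.05)
print(f"w(S) = {w1:.4f}, w(S') = {w2:.4f}, |diff| = {abs(w1 - w2):.4f}")
```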