As the core of multimedia understanding, semantic matching serves as a bridge connecting different forms of content, such as text, image, video and audio. Before semantic matching came into existence, conventional keyword matching methods had long been dominant, for example in Information Retrieval (IR) . They fail, however, to capture the query term's fine-grained contextual information. The missing contextual information results in the term-mismatching problem due to word ambiguity. To deal with this problem, a variety of neural IR models, often called semantic matching models, have been proposed to incorporate context information through embedded representations . Some methods consider the whole document as a global context and embed it into one vector. The query term is embedded into a similar vector, and these vectors are used to calculate the relevance between term and document. Other methods consider a certain scope around the keyword as the local context; only this local context is encoded into embedding vectors and used to compute the relevance . Both lines of work have made important contributions to semantic matching, but we believe that the retrieved documents can fit the query terms even better. The global context methods fail to capture the individual interactions between the query and the document terms, since the whole document is encoded into one vector. The local context methods do not have this problem, but because they still anchor the context on exact keyword occurrences, they leave the mismatching problem unsolved.
To remedy the shortcomings of the previous methods, in this paper we propose a salient-context-based semantic matching model, with which we improve relevance ranking in IR. Fig. 1 explains the concept of salient context with an example: the query terms are “robot technology” and the corpus contains three documents. The three boxes in the figure correspond to these three documents. The vertical lines indicate positions in the documents that are salient with respect to the two query terms, and thus give the locations of the salient context.
We can observe that the highly relevant terms are clustered in the first two boxes, while they are scattered in the third. As the two corresponding documents are labeled as relevant to “robot technology” by a human, this clustering indicates that the closer together the query-related terms are located, the more relevant the document is to the query. This behavior leads us to define the locations of these clusters as the salient context. Our goal is to find the most salient context and embed it into vectors that represent the document. In this way, we eliminate the risk of single-keyword mismatching, thus addressing the shortcomings of the models mentioned earlier.
To locate the most salient context, we define a measure of contextual salience. It is based on the semantic similarity between the query and the salient context, and is designed such that it is neither influenced by terms with low query relevance nor dominated by a single term. In addition, we use the BM25 relevance score as a representation of the global context in the final relevance function.
The contribution of this paper is threefold. Firstly, we analyze and demonstrate the aggregation phenomenon of highly query-related terms in relevant documents, and define the new concept of salient context. Secondly, we propose a way to measure contextual salience so as to locate the most salient context dynamically. Thirdly, rather than using the context surrounding a keyword, we propose to use the most salient context as the representation of a document, thereby eliminating the mismatching problem.
II-A Term-level Semantic Matching
Generally, it is important that each keyword is exactly matched, particularly when the keywords are new or rare. However, traditional keyword matching may fail to capture fine-grained contextual information and semantically related terms. As illustrated by the example in Fig. 2, semantic relevance matching is able to highlight the terms with a high semantic relevance to the query “robotic technology”, with dark green being most relevant. We can see that semantic matching emphasizes semantically related terms such as “robot”, “industrial” and “application”.
Distributed representations of text, i.e. word embeddings, encapsulate useful contextual information and effectively represent the semantic information of a word. Models that use pre-trained word embeddings [5, 6, 7] have shown better performance than those which use term co-occurrence counting between query and documents. Inspired by this, we utilize pre-trained word embeddings as the basis of our semantic representation to model the query-document matching interaction. From the embedded vectors, we apply cosine similarity to capture the word-level semantic matching:

$s_{ij} = \cos(\vec{q}_i, \vec{d}_j) = \frac{\vec{q}_i \cdot \vec{d}_j}{\|\vec{q}_i\| \, \|\vec{d}_j\|}$

where $\vec{q}_i$ and $\vec{d}_j$ represent the vectors for the i-th query term and the j-th document term, respectively.
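As a concrete sketch of the word-level matching step (the paper specifies only the cosine formula; the function and variable names here are our own), the full query-by-document similarity matrix can be computed with NumPy:

```python
import numpy as np

def cosine_similarity_matrix(query_vecs, doc_vecs):
    """Pairwise cosine similarity between query and document term vectors.

    query_vecs: (n_q, d) array, one row per query term embedding.
    doc_vecs:   (n_d, d) array, one row per document term embedding.
    Returns an (n_q, n_d) matrix whose entry (i, j) is cos(q_i, d_j).
    """
    # Normalize each row to unit length, then a dot product gives the cosine.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T
```

In practice the rows would come from the pre-trained GloVe vectors mentioned later in the paper.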
II-B Contextual Salience
According to the query-centric assumption proposed in , the local context surrounding the location of a found query term in a document is relevant when deciding whether the document matches the query. In Fig. 2, relevant terms cluster around the first two sentences, and in Fig. 1 we can see that these clusters may appear at the beginning, middle, or end of a document. Thus, the position of the salient context changes from document to document, and our salience measure must be able to handle that shift. We use a sliding window that moves over the document from start to end. For a given position of the window, terms which are highly related to the query are found, and thus that part of the document will stand out. The window context for the i-th query term is described as:

$S_i(w) = \{ s_{ij} \mid d_j \in D_w \}, \quad i \in Q$

where $s_{ij}$ is the cosine similarity between the i-th query term and the j-th document term in the window, $Q$ is the set of query terms, $D_w$ is the set of document terms in the window, and $S_i(w)$ represents the cosine relevance between the i-th query term and the document terms that fall inside the window.
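A minimal sketch of the sliding-window scan described above, assuming a precomputed query-by-document cosine-similarity matrix (the window width, stride, and all names are illustrative, not the paper's code):

```python
import numpy as np

def window_similarities(sim_matrix, start, width):
    """Similarities between each query term and the document terms
    inside the window [start, start + width).

    sim_matrix: (n_q, n_d) cosine-similarity matrix (query x document terms).
    Returns the (n_q, w) slice for this window position.
    """
    return sim_matrix[:, start:start + width]

def slide_windows(sim_matrix, width, stride=1):
    """Yield (start, window slice) for every window position in the document."""
    n_d = sim_matrix.shape[1]
    for start in range(0, max(n_d - width + 1, 1), stride):
        yield start, window_similarities(sim_matrix, start, width)
```

Each yielded slice corresponds to one candidate local context whose salience can then be scored.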
This approach is different from the deep learning models. As stated above, the deep learning models combine all terms in a document into one single document representation. Our representation only takes the relevant parts of the document and embeds those into a document representation. Often, only a few terms with a high window relevance score contribute to the final document relevance. In order to filter away text noise and counteract semantic drift, we choose to take only the window contextual salience of the top-$n$ semantic relevance matches into account. Let $\max_n(\cdot)$ denote the procedure for getting the $n$ maximums of a set. The set of the $n$-maximum members of $S_i(w)$, the set of similarities between the i-th query term and the document terms in the window, is then $\max_n(S_i(w))$, and the window contextual salience is

$\mathrm{CS}_i(w) = \alpha \sum_{s \in \max_n(S_i(w))} s$

where $n$ is decided by the window width $m$, and $\alpha$ is the influence factor that balances the weighting of the semantic interactions in the window context.
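Combining the window scan with the top-n filtering, a sketch of locating the most salient window might look like the following (the values of n and alpha, and the function names, are assumptions for illustration):

```python
import numpy as np

def window_salience(window_sims, n, alpha=1.0):
    """Contextual salience of one window: keep only the top-n cosine
    similarities per query term (filtering text noise), sum them,
    and scale by the influence factor alpha."""
    top_n = np.sort(window_sims, axis=1)[:, -n:]  # n largest per query term
    return alpha * top_n.sum()

def most_salient_window(sim_matrix, width, n, alpha=1.0):
    """Return (start position, salience) of the most salient window.

    sim_matrix: (n_q, n_d) cosine-similarity matrix (query x document terms).
    """
    n_d = sim_matrix.shape[1]
    best_start, best_score = 0, -np.inf
    for start in range(max(n_d - width + 1, 1)):
        score = window_salience(sim_matrix[:, start:start + width], n, alpha)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score
```

The returned window is the document region whose terms, taken together, match the query best.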
Queries used in IR are short and lack complex grammatical structures. Consequently, we need to take term importance into account. The compositional relation between query terms is usually a simple “and” relation when searching. Take the query “arrested development” for example: a relevant document should refer to both “arrested” and “development”, where the term “arrested” is more important than “development”. Many previous studies on retrieval models have shown the importance of term discrimination . In the proposed model, we introduce an aggregation weight $g_i$ for each query term, which controls how much the relevance score on that query term contributes to the final relevance score:

$g_i = \frac{\exp(\vec{w}_i^{\top}\vec{q}_i)}{\sum_{j=1}^{k} \exp(\vec{w}_j^{\top}\vec{q}_j)} \qquad (7)$

where $\vec{w}_i$ denotes the weight vector of the i-th query term vector $\vec{q}_i$, and $k$ is the query length. In our model, we set the weight vectors equal to their respective query term vectors, i.e. $\vec{w}_i = \vec{q}_i$. Putting this into Equ. 7, we get:

$g_i = \frac{\exp(\vec{q}_i^{\top}\vec{q}_i)}{\sum_{j=1}^{k} \exp(\vec{q}_j^{\top}\vec{q}_j)} \qquad (8)$
Here, $\vec{q}_i^{\top}\vec{q}_i$ squares each element of $\vec{q}_i$ before summing them together. As $\vec{q}_i \in \mathbb{R}^d$, with $d$ being the dimension of the weight vector, the resulting scalar is positive and equal to the square of the magnitude of $\vec{q}_i$. Equ. 8 is the normalized exponential, or softmax, function, with $\sum_{i=1}^{k} g_i = 1$. It returns a scalar that is proportional to the normalized magnitude of the term vector, but with an emphasis on the vectors with the largest magnitudes. Thus, it regularizes the relevance score.
II-C Relevance Aggregation
Different from semantic matching based on distributional word embeddings, exact keyword matching avoids the risk posed by rare or new words in the query. Hence, we linearly combine the exact keyword matching score with the semantic matching score as a compensation for semantic matching. BM25 , a classical weighting function employed by the Okapi system, is a representative traditional IR model. As shown by previous TREC experimentation, BM25 usually provides very effective retrieval performance on the TREC collections. In BM25, the relevance score is based on the within-document term frequency and the query term frequency. We utilize BM25 to model document-level relevance matching with the query terms, and we aggregate the exact keyword matching interactions by integrating BM25 linearly via a parameter $\lambda$. We also take the co-occurrence of query terms within the document into consideration in the weighting function for the contextual salience of the document. The two formulas are defined as below:

$rel(q, d) = \lambda \cdot \mathrm{BM25}(q, d) + (1 - \lambda) \cdot \mathrm{CS}(q, d)$

$rel_{co}(q, d) = \lambda \cdot \mathrm{BM25}(q, d) + (1 - \lambda) \cdot (1 + c \cdot co) \cdot \mathrm{CS}(q, d)$
where $\lambda$ is the influence factor that balances BM25 and decides the effect of BM25 in relevance scoring. When $\lambda$ is 0, only contextual salience contributes to the relevance score; when $\lambda \in (0, 1)$, the contextual salience and BM25 contribute to the relevance score together. $co$ is the co-occurrence of query terms within the document, and $c$ is a constant that balances the parameter $co$.
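As an illustrative sketch of the aggregation step (the exact functional form is only partially specified above, so this simple linear interpolation is an assumption), the final relevance score could combine BM25 and contextual salience as:

```python
def relevance(bm25_score, salience_score, lam):
    """Final relevance: linear interpolation between exact keyword
    matching (BM25) and the contextual-salience score.

    lam = 0.0 -> only contextual salience contributes;
    lam in (0, 1) -> both scores contribute together.
    """
    assert 0.0 <= lam <= 1.0, "lam must lie in [0, 1]"
    return lam * bm25_score + (1.0 - lam) * salience_score
```

In practice `lam` would be tuned per collection, and the salience score could first be scaled by the co-occurrence weighting described above.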
III Data Sets and Evaluation
We evaluate the proposed approach on five standard TREC collections , which are different in their sizes, contents, and topics. The TREC tasks and topic numbers associated with each collection are summarized in Table I.
| Collection Name | Topics | Topics Num. | Docs |
For all the test collections used in our experiments, we apply pre-trained GloVe word vectors (https://nlp.stanford.edu/projects/GloVe/) trained on a collection of 6 billion tokens (Wikipedia 2014 plus Gigaword 5), since reliable term representations are better acquired from large-scale unlabeled text collections than from the limited ground-truth data of the IR task. We use the TREC retrieval evaluation script (https://trec.nist.gov/trec_eval/), focusing on MAP, RP (recall precision), P@5, P@20, NDCG@5, and NDCG@20 in our experiments. We provide the source code for the model as well as the trained word vectors (https://github.com/YuanyuanQi/CSSM_IR/).
Table II shows the performance comparisons between the baseline model BM25 and the new model CSSM on five collections over MAP, RP, P@5, P@20, NDCG@5 and NDCG@20. The percentage by which our model outperforms BM25 is also listed. With regard to MAP and RP, our model in general performs better than the baseline BM25 on all five collections, especially on the WT2G, Robust04 and Blog06 collections. This demonstrates the importance of semantic relevance matching and shows that emphasizing contextual salience helps locate the most relevant local context through highly semantic relevance matches. Comparing the results of the linear function and the co-occurrence weighting function, three datasets show improvements, indicating that the co-occurrence information of query terms in a document connects positively with contextual salience in the model. The experimental results show that our model can encode the critical contextual semantic information in our relevance ranking function for IR.
Table III shows the performance on the Robust04 collection in comparison with deep learning based methods recently proposed in [5, 6, 7]. Our performance is better than DRMM, PACRR and PACRR-DRMM, and slightly better than ABEL-DRMM and ABEL-DRMM+MV while requiring less extra model training data. Compared with POSIT-DRMM and POSIT-DRMM+MV, which encode multiple views (MV) of terms (context-sensitive term encodings, pre-trained term embeddings, and one-hot term encodings), our model utilizes pre-trained term embeddings alone. We made this choice for two main reasons. First, according to our scoring function, directly applying multiple views of terms makes it hard to balance the differences in input dimensions, since a one-hot vector is high-dimensional and sparse compared with a term embedding. Second, explicitly tuning context-sensitive term encodings in the model requires training data and sacrifices efficiency. In addition, since our model needs no parameter tuning, its retrieval time cost is lower than that of all the supervised deep learning based models in the table, and it works as efficiently as BM25.
In this paper, we propose a semantic-matching based method to locate the most salient context for understanding a piece of multimedia content. We propose to prioritize the action of locating the semantically salient context in the relevance calculation. On the basis of this prioritization, we define a measure of contextual salience to quantify the relevance of a document towards a query. Furthermore, we apply the proposed method in IR, where it shows promising improvements over the strong BM25 baseline and several neural relevance matching models. Extensive comparisons between several neural relevance matching models and our approach suggest that explicitly modelling the salient query-related context in a document helps improve the effectiveness of relevance ranking for IR. Our idea of understanding content by locating the most salient context provides a new perspective on multimedia content analysis, and the proposed semantic-matching based method offers an efficient and explainable relevance ranking solution that can be generalized to other forms of multimedia content.
This work was supported by Beijing Natural Science Foundation (4174098), National Natural Science Foundation of China (61702047), National Natural Science Foundation of China (61703234) and the Fundamental Research Funds for the Central Universities (2017RC02).
-  C. D. Manning, P. Raghavan and H. Schütze, “Introduction to information retrieval,” Natural Language Engineering, vol. 16, no. 1, pp. 100–103, 2010.
-  K. D. Onal, Y. Zhang et al., “Neural Information Retrieval: At the End of the Early Years,” Information Retrieval Journal, vol. 21, pp. 111–182, 2018.
-  Y. Shen, X. He, J. Gao, L. Deng and G. Mesnil, “Learning semantic representations using convolutional neural networks for web search,” in Proceedings of the International Conference on World Wide Web, Seoul, pp. 373–374, 2014.
-  K. Hui, A. Yates, K. Berberich and G. de Melo, “Co-PACRR: A Context-Aware Neural IR Model for Ad-hoc Retrieval,” in Proceedings of the ACM International Conference on Web Search and Data Mining, Los Angeles, 2018.
-  J. Guo, Y. Fan, Q. Ai and W. B. Croft, “A deep relevance matching model for ad-hoc retrieval,” in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, Indianapolis, pp. 55–64, 2016.
-  K. Hui, A. Yates, K. Berberich and G. de Melo, “Position-Aware Representations for Relevance Matching in Neural Information Retrieval,” in Proceedings of the 26th International Conference on World Wide Web Companion, Perth, pp. 799–800, 2017.
-  R. McDonald, G.-I. Brokos and I. Androutsopoulos, “Deep relevance ranking using enhanced document-query interactions,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, pp. 1849–1860, 2018.
-  H. C. Wu, R. W. P. Luk, K. F. Wong and K. L. Kwok, “A retrospective study of a hybrid document-context based retrieval model,” Information Processing & Management, vol. 43, no. 5, pp. 1308–1331, 2007.
-  H. Fang, T. Tao and C. Zhai, “Diagnostic evaluation of information retrieval models,” ACM Transactions on Information Systems, vol. 29, no. 2, pp. 7:1–7:42, 2011.
-  S. Robertson and H. Zaragoza, “The probabilistic relevance framework: BM25 and beyond,” Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.