Machine learning plays a role in many aspects of modern IR systems, and deep learning is applied in all of them. The fast pace of modern-day research has given rise to many different approaches for many different IR problems. The amount of information available can be overwhelming both for junior students and for experienced researchers looking for new research topics and directions. Additionally, it is interesting to see what key insights into IR problems the new technologies are able to give us. The aim of this full-day tutorial is to give a clear overview of current tried-and-trusted neural methods in IR and how they benefit IR research. It covers key architectures, as well as the most promising future directions.
Prompted by the advances of deep learning in computer vision research, neural networks have resurfaced as a popular machine learning paradigm in many other areas of research as well, including information retrieval. Recent years have seen neural networks being applied to all key parts of the typical modern IR pipeline, such as core ranking algorithms (Szummer and Yilmaz, 2011; Huang et al., 2013; Mitra et al., 2017), click models (Borisov et al., 2016a, b), knowledge graphs (Bordes et al., 2011; Lin et al., 2015), text similarity (Kenter et al., 2016; Severyn and Moschitti, 2015), entity retrieval (Van Gysel et al., 2016b, a), language modeling (Bengio et al., 2003), question answering (Weston et al., 2016; Hewlett et al., 2016), and dialogue systems (Li et al., 2016; Vinyals and Le, 2015).
A key advantage that sets neural networks apart from many earlier learning strategies is their ability to work from raw input data. Given enough training data, well-designed networks can become feature extractors themselves, e.g., incorporating in their initial layers basic input characteristics such as term frequency (tf) and inverse document frequency (idf) that used to be pre-calculated offline. Where designing features used to be a crucial aspect and contribution of newly proposed IR approaches, the focus has shifted to designing network architectures instead. As a consequence, many different architectures and paradigms have been proposed, such as auto-encoders, recursive networks, recurrent networks, convolutional networks, various embedding methods, deep reinforcement learning and deep Q-learning, and, more recently, generative adversarial networks, most of which have been applied in IR settings. The aim of the neural networks for IR (NN4IR) tutorial is to provide a clear overview of the main network architectures currently applied in IR and to show explicitly how they relate to previous work. The tutorial covers methods applied in industry and academia, with in-depth insights into the underlying theory, core IR tasks, applicability, key assets and handicaps, scalability concerns, and practical tips and tricks.
We expect the tutorial to be useful both for academic and industrial researchers and practitioners who either want to develop new neural models, use them in their own research in other areas or apply the models described here to improve actual IR systems.
The material in the tutorial covers a broad range of IR applications. It is structured as follows:
Preliminaries (60 minutes)
The recent surge of interest in deep learning has given rise to a myriad of architectures. Different though the inner structures of neural networks can be, there are many concepts common to all of them. This first session covers the preliminaries; we briefly recapitulate the basic concepts involved in neural systems, such as back propagation (Rumelhart et al., 1988), distributed representations/embeddings (Mikolov et al., 2013), convolutional layers (Krizhevsky et al., 2012), recurrent networks (Mikolov et al., 2010), sequence-to-sequence models (Sutskever et al., 2014), dropout (Srivastava et al., 2014), loss functions, and optimization schemes such as Adam (Kingma and Ba, 2015).
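To make the preliminaries concrete, the following is a minimal feedforward network trained with hand-coded back propagation. This is a toy sketch in NumPy; the data, layer sizes, and learning rate are arbitrary illustrative choices, not material from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                           # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)   # toy binary targets

W1 = rng.normal(scale=0.5, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, history = 0.5, []
for step in range(500):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Cross-entropy loss (small epsilon guards against log(0)).
    history.append(float(-np.mean(y * np.log(p + 1e-12)
                                  + (1 - y) * np.log(1 - p + 1e-12))))
    # Backward pass: the chain rule applied layer by layer.
    dlogits = (p - y) / len(X)              # gradient at the pre-sigmoid output
    dW2, db2 = h.T @ dlogits, dlogits.sum(axis=0)
    dh = dlogits @ W2.T * (1 - h ** 2)      # tanh'(a) = 1 - tanh(a)^2
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    # Plain gradient-descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Modern toolkits compute these gradients automatically, but this chain-rule structure is what every such framework implements under the hood.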
Semantic matching I: supervised learning (60 minutes)
The problem of matching items based on their textual descriptions arises in many IR systems. The traditional approach involves counting query term occurrences in the description text (e.g., BM25 (Robertson et al., 1995)). However, to bridge the lexical gap caused by vocabulary and linguistic differences, many latent semantic models have been proposed (Deerwester et al., 1990; Hofmann, 1999; Blei et al., 2003; Wei and Croft, 2006), and more recently neural embedding methods (Mikolov et al., 2013). In this session we focus on semantic matching settings where a supervised signal is available. The signal can be explicit, such as a label for learning task-specific latent representations (Lu and Li, 2013; Huang et al., 2013; Shen et al., 2014; Hu et al., 2014; Severyn and Moschitti, 2015; Kenter and de Rijke, 2015), or relevance labels and, more implicitly, clicks for neural IR methods (Mitra et al., 2016; Diaz et al., 2016; Grbovic et al., 2015; Kusner et al., 2015; Mitra et al., 2017).
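For reference, the lexical baseline that semantic matching is contrasted against can be written down in a few lines. This is a toy BM25 scorer using one common idf variant; the corpus and parameters are illustrative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query (textbook form)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N      # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for t in set(query_terms):
        df = sum(1 for d in docs if t in d)    # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score

docs = [["neural", "networks", "for", "ir"],
        ["click", "models", "for", "web", "search"],
        ["bm25", "term", "weighting"]]
scores = [bm25_score(["neural", "ir"], d, docs) for d in docs]
```

Because BM25 only rewards exact term matches, documents that describe the query topic in different words score zero, which is precisely the lexical gap that latent and neural semantic models aim to bridge.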
Semantic matching II: Semi- and unsupervised learning (60 minutes)
How can semantics be learned in the absence of relevance labels or user interaction signals? Depending on the available resources, one can opt for semi-supervised or unsupervised matching models.
Unsupervised semantic matching methods can be categorized into two groups. The first uses pre-trained word embeddings, e.g., combining traditional retrieval models with an embedding-based translation model (Zuccon et al., 2015; Ganguly et al., 2015), using pre-trained embeddings for query expansion to improve retrieval (Zamani and Croft, 2016), and representing documents as a Bag of Word Embeddings (BoWE) (Guo et al., 2016; Kenter and de Rijke, 2015). The second learns representations from scratch, e.g., learning representations of words and documents (Le and Mikolov, 2014; Kenter et al., 2016) and employing them in retrieval tasks (Ai et al., 2016b, a), or learning representations in an end-to-end neural model for a specific task such as entity ranking for expert finding (Van Gysel et al., 2016b) or product search (Van Gysel et al., 2016a).
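The Bag-of-Word-Embeddings idea can be sketched in a few lines, with tiny hand-made vectors standing in for pre-trained embeddings (all values below are invented for illustration):

```python
import numpy as np

# Tiny hand-made vectors standing in for pre-trained word embeddings.
emb = {
    "car":    np.array([1.0, 0.1, 0.0]),
    "auto":   np.array([0.9, 0.2, 0.1]),
    "engine": np.array([0.8, 0.0, 0.3]),
    "flower": np.array([0.0, 1.0, 0.2]),
    "petal":  np.array([0.1, 0.9, 0.3]),
}

def bowe(tokens):
    # Represent a text as the mean of its word vectors (Bag of Word Embeddings).
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = bowe(["car", "engine"])
sim_a = cosine(query, bowe(["auto", "engine"]))    # related document
sim_b = cosine(query, bowe(["flower", "petal"]))   # unrelated document
```

Note that the related document is matched via "auto", a term the query never uses, which is exactly what a purely lexical scorer cannot do.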
In semi-supervised learning, on the other hand, queries (without relevance labels) or prior knowledge about document similarity can be used to induce pseudo-relevance labels. Furthermore, it is possible to use heuristic methods to generate weak supervision signals and to go beyond their quality by employing proper learning objectives and network designs (Dehghani et al., 2017).
Learning to rank (45 minutes)
Capturing the notion of relevance for ranking requires accounting for different aspects of the query, the document, and their relationship. Neural ranking methods can either combine manually crafted query and document features with respect to a ranking objective, or learn latent representations for them in situ.
Irrespective of how the query and the documents are featurized, a neural learning to rank model can be designed for different scenarios, each with its own appropriate loss function. An example is the point-wise versus pair-wise paradigm, each of which has a different objective that calibrates either absolute scores or the relative ranking of documents, given a query. Neural learning to rank models can also be designed for different levels of supervision during training—unsupervised (Van Gysel et al., 2016b, a; Salakhutdinov and Hinton, 2009), semi/weakly-supervised (Dehghani et al., 2017; Szummer and Yilmaz, 2011), or fully-supervised using labeled (Mitra et al., 2017) or click data (Huang et al., 2013).
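The point-wise versus pair-wise distinction can be illustrated with a linear scorer on synthetic features. This is only a sketch: real models replace the linear function with a deep network, and the features and labels here are made up.

```python
import numpy as np

# Point-wise vs. pair-wise objectives on a linear scorer s(x) = w . x.
rng = np.random.default_rng(1)
x_rel = rng.normal(size=(5, 4))     # features of relevant documents
x_non = rng.normal(size=(5, 4))     # features of paired non-relevant documents
w = rng.normal(size=4)

def pointwise_loss(w):
    # Calibrates absolute scores against labels 1 (relevant) and 0 (non-relevant).
    return float(np.mean((x_rel @ w - 1.0) ** 2 + (x_non @ w) ** 2))

def pairwise_loss(w, margin=1.0):
    # Hinge on score differences: only the relative order per pair matters.
    return float(np.mean(np.maximum(0.0, margin - (x_rel - x_non) @ w)))

# A few subgradient steps on the pair-wise objective.
loss_before = pairwise_loss(w)
for _ in range(200):
    diffs = (x_rel - x_non) @ w
    active = diffs < 1.0                               # pairs violating the margin
    grad = -(x_rel - x_non)[active].sum(axis=0) / len(diffs)
    w = w - 0.1 * grad
loss_after = pairwise_loss(w)
```

The point-wise objective penalizes any deviation from the absolute label, while the pair-wise objective is satisfied as soon as each relevant document outscores its non-relevant partner by the margin, regardless of the absolute score values.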
Modeling user behavior (45 minutes)
Modeling user browsing behavior plays an important role in the development of modern IR systems. Accurately interpreting user clicks is difficult due to various types of bias. For example, users tend to click more on results ranked at top positions (position bias) and on visually salient results (attention bias). The traditional way to account for these biases is to design a Probabilistic Graphical Model (PGM) that explains relationships between click/skip events (observed variables) and examination (unobserved variables). Over the last decade many PGM-based click models have been proposed (see Chuklin et al. (2015) for an overview). However, these click models can only capture patterns that are explicitly encoded in their PGMs. Recently, it was shown that recurrent neural networks can learn to account for biases in user clicks directly from click-through data, i.e., without the need for a predefined set of rules as is customary for PGM-based click models (Borisov et al., 2016a). Similar biases exist in click dwell times, which the neural approach can account for as well.
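The examination hypothesis behind many PGM-based click models, and the position bias it explains, can be simulated in a few lines. The propensity and attractiveness values below are assumed purely for illustration.

```python
import random

# Examination hypothesis: P(click) = P(examined at rank) * P(attractive).
random.seed(0)
prop = [1.0, 0.5, 0.25]                 # examination probability per rank (assumed known)
attr = 0.8                              # true attractiveness, identical for all docs
ranks = {"d1": 0, "d2": 1, "d3": 2}     # d1 is always shown on top

clicks = {d: 0 for d in ranks}
n = 20000
for _ in range(n):
    for d, r in ranks.items():
        if random.random() < prop[r] * attr:
            clicks[d] += 1

ctr = {d: clicks[d] / n for d in ranks}                   # raw CTR: favors rank 1
debiased = {d: ctr[d] / prop[ranks[d]] for d in ranks}    # inverse-propensity correction
```

Although all three documents are equally attractive, the raw click-through rate of the top document is far higher; dividing by the examination propensity recovers roughly equal estimates. Learning such propensities (and richer dependencies) from data is what click models do.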
Generating responses (45 minutes)
Recent inventions such as smart home devices, voice search and virtual assistants provide new ways of accessing information. They require a different response format than the classic ten blue links. Targeting this newly emerging demand, some models have been proposed to respond by generating natural language replies on the fly, rather than by (re)ranking a fixed set of items or extracting passages from existing pages.
Examples are conversational and dialog systems (Li et al., 2016; Vinyals and Le, 2015; Bordes and Weston, 2016) or machine reading and question answering tasks where the model either infers the answer from unstructured data, like textual documents that do not necessarily feature the answer literally (Hermann et al., 2015; Weston et al., 2016; Hewlett et al., 2016; Serban et al., 2016), or generates natural language given structured data, like data from knowledge graphs or from external memories (Ahn et al., 2016; Lebret et al., 2016; Mei et al., 2015; Miller et al., 2016; Graves et al., 2014).
Outlook (30 minutes)
In this session, open research questions and future directions are discussed. One of the big challenges for IR at the moment is how to process full document text using neural networks. On a higher level, it is probably desirable to learn all components of a full IR system in an end-to-end fashion.
Another challenge is maintaining long-term (multi-day) search sessions or conversations, which naturally leads to an additional open problem: how to evaluate (neural) conversational systems.
Finally, we cover recent advances, like Generative Adversarial Networks (Goodfellow et al., 2014).
Summing up, the objectives of the NN4IR tutorial are as follows:
Give an extensive overview of neural network architectures currently employed in IR, both in academia and industry.
Provide theoretical background, thereby equipping participants with the necessary means to form intuitions about various neural methods and their applicability.
Identify the IR lessons learned by employing neural methods.
Give practical tips and tricks regarding network design, optimization, and hyperparameter values, based on industry best practices.
Discuss promising future research directions.
The target audience consists of researchers and developers in information retrieval who are interested in gaining an in-depth understanding of neural models across a wide range of IR problems. The tutorial will be useful as an overview for anyone new to the deep learning field as well as for practitioners seeking concrete recipes. The tutorial aims to provide a map of the increasingly rich landscape of neural models in IR.
By the end of the tutorial, attendees will be familiar with the main architectures of neural networks as applied in IR and they will have informed intuitions of their key properties and of the insights they bring into core IR problems. We aim to provide an overview of the main directions currently employed, together with a clear understanding of the underlying theory and insights, illustrated with examples.
3. Format and detailed schedule
Table 1 gives an overview of the time schedule of the tutorial. Below we provide the details for each session.
| Session | Duration |
| --- | --- |
| Preliminaries | 60 minutes |
| Semantic Matching I | 60 minutes |
| Semantic Matching II | 60 minutes |
| Learning to Rank | 45 minutes |
| Modeling User Behavior | 45 minutes |
| Generating Responses | 45 minutes |
| Outlook | 30 minutes |
| Wrap up | 15 minutes |
We bring a team of six lecturers, each with their own areas of specialization. Each session will be presented jointly by two expert lecturers. The initials below refer to the lecturers of this tutorial.
Preliminaries — 60 minutes (TK, MdR)
Back propagation – Given a standard feedforward network, we show the math and the intuition behind back propagation. We also briefly touch on dropout.
Distributed representations – We show what a distributed representation is, and how distributed representations can be used.
Embedding methods – We detail how word2vec works and how it can be applied to different settings.
Convolutional networks – CNNs are primarily employed in computer vision, but can be beneficial in text classification tasks too.
Optimization schemes – Standard back propagation with a fixed learning rate is typically replaced by more sophisticated schemes that handle learning rate annealing.
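As a taste of the distributed representations segment, the sketch below builds word vectors from raw co-occurrence counts: words used in similar contexts end up with similar vectors. Methods like word2vec learn dense, low-dimensional versions of such vectors rather than raw counts; the corpus here is invented for illustration.

```python
import numpy as np

corpus = [
    "cat chases mouse", "dog chases cat", "cat eats fish", "dog eats bone",
    "car needs fuel", "truck needs fuel", "car has engine", "truck has engine",
]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a +/-1 word window; row i is the
# "distributed representation" of word i.
C = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    toks = s.split()
    for i in range(len(toks) - 1):
        C[idx[toks[i]], idx[toks[i + 1]]] += 1
        C[idx[toks[i + 1]], idx[toks[i]]] += 1

def sim(a, b):
    # Cosine similarity between the context-count vectors of two words.
    u, v = C[idx[a]], C[idx[b]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In this toy corpus, "car" and "truck" occur in identical contexts and therefore get identical vectors, while "car" and "cat" share no contexts at all.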
Semantic matching I: supervised — 60 minutes (AB, BM)
Short text similarity – Given two short texts, e.g., queries or sentences, how can we predict if they are semantically similar?
Word embeddings for matching – Learning embeddings from click data.
Deep neural architectures for matching – Deep Structured Semantic Model (DSSM) (Huang et al., 2013).
Learning to match using local and distributed representations – Using both local and distributed representations for query-document matching.
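One concrete ingredient of DSSM that can be shown in isolation is its word-hashing input layer, which decomposes words into letter trigrams so that spelling variants share features. The deep layers of the model are omitted here; this is only the input representation.

```python
import math
from collections import Counter

def letter_trigrams(word):
    # DSSM-style word hashing: a word becomes a bag of letter trigrams.
    padded = f"#{word}#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm

# A misspelling still shares most trigrams with the correct form,
# whereas an unrelated word shares none.
sim_typo = cosine(letter_trigrams("retrieval"), letter_trigrams("retreival"))
sim_other = cosine(letter_trigrams("retrieval"), letter_trigrams("ranking"))
```

This sub-word robustness, combined with a fixed-size trigram vocabulary, is what lets DSSM cope with huge and noisy web-query vocabularies.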
Semantic matching II: Semi- and unsupervised semantic matching — 60 minutes (MD, CVG)
Semi-supervised semantic matching – We cover how to model pseudo-labeling using prior knowledge like document similarity, or by employing heuristic methods as weak supervision signals.
Unsupervised semantic matching using pre-trained word embeddings – We show how different IR tasks benefit from using pre-trained word embeddings, either by pre-estimating representations for queries and documents, or as a warm start for representation learning during training.
Learning unsupervised representations from scratch for semantic matching – We explain how to learn representations of words and documents that satisfy the requirements of IR problems, in an unsupervised manner without any relevance labels.
Learning to rank — 45 minutes (CVG, BM)
Feature-based models for representation learning – We explain how to train a ranker using featurized input, and how to feed the network with raw data to have it learn representations jointly with a downstream task.
Ranking objectives and loss functions – We describe point-wise and pair-wise settings for the ranking task and the proper loss functions for each setting.
Training under different levels of supervision – We cover how to train a neural ranker in an unsupervised way, with weak, semi-, or full supervision, and discuss the requirements and concerns of each setting.
Modeling user behavior — 45 minutes (AB, MdR)
Biases and PGM-based click models – We introduce notions of bias in user behavior and explain how to account for them using probabilistic graphical models (PGMs).
Neural click models – We discuss weaknesses of PGM-based approaches and present an alternative based on recurrent neural networks.
Hybrid approach – We describe recent work on modeling biases in times between user actions (e.g., click dwell time) using ideas exploited in PGM-based and neural click models.
Generating responses — 45 minutes (MD, TK)
Machine reading/question answering – How the sequence-to-sequence paradigm is applied in neural QA systems.
Conversational IR/dialogue systems – Unlike QA systems, conversational systems should maintain the state of a session.
General chatbots – Chatbots bring their own set of challenges. How to stay consistent throughout the course of a conversation? How to maintain a persona?
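As a contrast to the generative systems covered in this session, the following is a deliberately simple retrieval-based baseline that selects a stored reply by token overlap; generative models replace this lookup with a sequence-to-sequence decoder. The response pairs are invented for illustration.

```python
# A minimal retrieval-based response baseline (not a neural generator):
# pick the stored reply whose context best overlaps the incoming message.
pairs = [
    ("hello there", "hi , how can i help ?"),
    ("what is your name", "i am a demo bot ."),
    ("thanks bye", "goodbye !"),
]

def jaccard(a, b):
    # Token-set overlap between two whitespace-tokenized strings.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def respond(message):
    context, reply = max(pairs, key=lambda p: jaccard(p[0], message))
    return reply
```

Such a baseline can only ever return canned replies and keeps no session state, which motivates both generative decoders and the state-tracking and consistency challenges listed above.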
Outlook — 30 minutes (all)
Open research questions, current challenges
Wrap up — 15 minutes (TK)
Overview of material presented and conclusion
4. Type of support materials to be supplied to attendees
Slides will be made publicly available on http://nn4ir.com.
An annotated compilation of references will list all work discussed in the tutorial and should provide a good basis for further study.
Apart from the various open-source neural toolkits (TensorFlow, Theano, Torch), many of the methods presented come with implementations released under an open-source license. These will be discussed as part of the presentation of the models and algorithms. We provide a list of pointers to the available code bases.
- Ahn et al. (2016) S. Ahn, H. Choi, T. Pärnamaa, and Y. Bengio. A neural knowledge language model. arXiv preprint arXiv:1608.00318, 2016.
- Ai et al. (2016a) Q. Ai, L. Yang, J. Guo, and W. B. Croft. Analysis of the paragraph vector model for information retrieval. In ICTIR, pages 133–142. ACM, 2016a.
- Ai et al. (2016b) Q. Ai, L. Yang, J. Guo, and W. B. Croft. Improving language estimation with the paragraph vector model for ad-hoc retrieval. In SIGIR, pages 869–872. ACM, 2016b.
- Bahdanau et al. (2014) D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2014.
- Bengio et al. (2003) Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
- Blei et al. (2003) D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
- Bordes and Weston (2016) A. Bordes and J. Weston. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683, 2016.
- Bordes et al. (2011) A. Bordes, J. Weston, R. Collobert, and Y. Bengio. Learning structured embeddings of knowledge bases. In AAAI, 2011.
- Borisov et al. (2016a) A. Borisov, I. Markov, M. de Rijke, and P. Serdyukov. A neural click model for web search. In WWW, pages 531–541. International World Wide Web Conferences Steering Committee, 2016a.
- Borisov et al. (2016b) A. Borisov, I. Markov, M. de Rijke, and P. Serdyukov. A context-aware time model for web search. In SIGIR, pages 205–214. ACM, 2016b.
- Cho et al. (2014) K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
- Chuklin et al. (2015) A. Chuklin, I. Markov, and M. de Rijke. Click Models for Web Search. Morgan & Claypool, 2015.
- Deerwester et al. (1990) S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.
- Dehghani et al. (2017) M. Dehghani, H. Zamani, A. Severyn, J. Kamps, and W. B. Croft. Neural ranking models with weak supervision. In Proc. SIGIR, 2017.
- Diaz et al. (2016) F. Diaz, B. Mitra, and N. Craswell. Query expansion with locally-trained word embeddings. arXiv preprint arXiv:1605.07891, 2016.
- Ganguly et al. (2015) D. Ganguly, D. Roy, M. Mitra, and G. J. Jones. Word embedding based generalized language model for information retrieval. In SIGIR, pages 795–798. ACM, 2015.
- Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
- Graves et al. (2014) A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
- Grbovic et al. (2015) M. Grbovic, N. Djuric, V. Radosavljevic, F. Silvestri, and N. Bhamidipati. Context-and content-aware embeddings for query rewriting in sponsored search. In SIGIR, pages 383–392. ACM, 2015.
- Guo et al. (2016) J. Guo, Y. Fan, Q. Ai, and W. B. Croft. Semantic matching by non-linear word transportation for information retrieval. In CIKM, pages 701–710, New York, NY, USA, 2016. ACM.
- Hermann et al. (2015) K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In NIPS, pages 1693–1701, 2015.
- Hewlett et al. (2016) D. Hewlett, A. Lacoste, L. Jones, I. Polosukhin, A. Fandrianto, J. Han, M. Kelcey, and D. Berthelot. WIKIREADING: A novel large-scale language understanding task over Wikipedia. In ACL, 2016.
- Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
- Hofmann (1999) T. Hofmann. Probabilistic latent semantic indexing. In Proc. SIGIR, pages 50–57. ACM, 1999.
- Hu et al. (2014) B. Hu, Z. Lu, H. Li, and Q. Chen. Convolutional neural network architectures for matching natural language sentences. In NIPS, pages 2042–2050, 2014.
- Huang et al. (2013) P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In Proc. CIKM, pages 2333–2338. ACM, 2013.
- Kenter and de Rijke (2015) T. Kenter and M. de Rijke. Short text similarity with word embeddings. In CIKM, pages 1411–1420. ACM, 2015.
- Kenter et al. (2016) T. Kenter, A. Borisov, and M. de Rijke. Siamese CBOW: Optimizing word embeddings for sentence representations. In ACL, 2016.
- Kingma and Ba (2015) D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- Kusner et al. (2015) M. J. Kusner, Y. Sun, N. I. Kolkin, K. Q. Weinberger, et al. From word embeddings to document distances. In ICML, volume 15, pages 957–966, 2015.
- Le and Mikolov (2014) Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014.
- Lebret et al. (2016) R. Lebret, D. Grangier, and M. Auli. Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771, 2016.
- Li et al. (2016) J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. In EMNLP, 2016.
- Lin et al. (2015) Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187, 2015.
- Lu and Li (2013) Z. Lu and H. Li. A deep architecture for matching short texts. In NIPS, pages 1367–1375, 2013.
- Mei et al. (2015) H. Mei, M. Bansal, and M. R. Walter. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. arXiv preprint arXiv:1509.00838, 2015.
- Mikolov et al. (2010) T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur. Recurrent neural network based language model. In Interspeech, 2010.
- Mikolov et al. (2013) T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Miller et al. (2016) A. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston. Key-value memory networks for directly reading documents. In EMNLP, 2016.
- Mitra et al. (2016) B. Mitra, E. Nalisnick, N. Craswell, and R. Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
- Mitra et al. (2017) B. Mitra, F. Diaz, and N. Craswell. Learning to match using local and distributed representations of text for web search. In WWW ’17, 2017.
- Robertson et al. (1995) S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al. Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995.
- Rumelhart et al. (1988) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3), 1988.
- Salakhutdinov and Hinton (2009) R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
- Serban et al. (2016) I. V. Serban, A. García-Durán, C. Gulcehre, S. Ahn, S. Chandar, A. Courville, and Y. Bengio. Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. arXiv preprint arXiv:1603.06807, 2016.
- Severyn and Moschitti (2015) A. Severyn and A. Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In SIGIR, pages 373–382. ACM, 2015.
- Shen et al. (2014) Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, pages 101–110. ACM, 2014.
- Srivastava et al. (2014) N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
- Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- Szummer and Yilmaz (2011) M. Szummer and E. Yilmaz. Semi-supervised learning to rank with preference regularization. In CIKM, pages 269–278. ACM, 2011.
- Van Gysel et al. (2016a) C. Van Gysel, M. de Rijke, and E. Kanoulas. Learning latent vector spaces for product search. In CIKM ’16, pages 165–174, 2016a.
- Van Gysel et al. (2016b) C. Van Gysel, M. de Rijke, and M. Worring. Unsupervised, efficient and semantic expertise retrieval. In WWW ’16, pages 1069–1079, 2016b.
- Vinyals and Le (2015) O. Vinyals and Q. Le. A neural conversational model. In ICML, 2015.
- Wei and Croft (2006) X. Wei and W. B. Croft. LDA-based document models for ad hoc retrieval. In Proc. SIGIR, pages 178–185. ACM, 2006.
- Weston et al. (2016) J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR, 2016.
- Zamani and Croft (2016) H. Zamani and W. B. Croft. Estimating embedding vectors for queries. In ICTIR, pages 123–132. ACM, 2016.
- Zuccon et al. (2015) G. Zuccon, B. Koopman, P. Bruza, and L. Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS. ACM, 2015.