Reward-free Policy Imitation Learning for Conversational Search
Existing conversational search studies mainly focused on asking better clarifying questions and/or improving search result quality. These works aim at retrieving better responses according to the search context, and their performances are evaluated on either single-turn tasks or multi-turn tasks under naive conversation policy settings. This leaves some questions about their applicability in real-world multi-turn conversations where realistically, each and every action needs to be made by the system itself, and search session efficiency is often an important concern of conversational search systems. While some recent works have identified the need for improving search efficiency in conversational search, they mostly require extensive data annotations and use hand-crafted rewards or heuristics to train systems that can achieve reasonable performance in a restricted number of turns, which has limited generalizability in practice. In this paper, we propose a reward-free conversation policy imitation learning framework, which can train a conversation policy without annotated conversation data or manually designed rewards. The trained conversation policy can be used to guide the conversational retrieval models to balance conversational search quality and efficiency. To evaluate the proposed conversational search system, we propose a new multi-turn-multi-response conversational evaluation metric named Expected Conversational Reciprocal Rank (ECRR). ECRR is designed to evaluate entire multi-turn conversational search sessions towards comprehensively evaluating both search result quality and search efficiency.
READ FULL TEXT