In this paper, we explore a new approach for automated chess commentary generation, which aims to generate chess commentary texts in different categories (e.g., description, comparison, planning, etc.). We introduce a neural chess engine into text generation models to help with encoding boards, predicting moves, and analyzing situations. By jointly training the neural chess engine and the generation models for different categories, the models become more effective. We conduct experiments on 5 categories in a benchmark Chess Commentary dataset and achieve inspiring results in both automatic and human evaluations.
With games exploding in popularity, the demand for Natural Language Generation (NLG) applications for games is growing rapidly. Related research on generating real-time game reports Yao et al. (2017), comments Jhamtani et al. (2018); Kameko et al. (2015), and tutorials Green et al. (2018a, b) provides people with entertainment and learning materials. Among these, chess commentary is a typical task. As illustrated in Figure 1, commentators need to understand the current board and move. They then comment on the current move (Description), their judgment of the move (Quality), the game situation for both sides (Contexts), their analysis (Comparison), and their guesses about the player’s strategy (Planning). The comments provide valuable information about what is going on and what will happen. Such information not only makes the game more enjoyable for viewers, but also helps them learn to think and play. Our task is to design an automated generation model that addresses all 5 sub-tasks (Description, Quality, Comparison, Planning, and Contexts) of single-move chess commentary.
Automatically generating chess comments has drawn attention from researchers for a long time. Traditional template-based methods Sadikov et al. (2007) are precise but limited in template variety. With the development of deep learning, data-driven methods using neural networks have been proposed to produce comments with high quality and flexibility. However, generating insightful comments (e.g., explaining why a move is better than the others) is still very challenging. Current neural approaches Kameko et al. (2015); Jhamtani et al. (2018) get semantic representations from raw boards, moves, and evaluation information (threats and scores) from external chess engines. Such methods can easily ground comments to current boards and moves, but they cannot provide sufficient analysis of what will happen next in the game. Although external features are provided by powerful chess engines, the features are not in a continuous space, which may make them less suitable for context modeling and commentary generation.
It is common knowledge that professional game commentators are usually game players, and expert players can usually provide more thorough analysis than amateurs. Inspired by this, we argue that for chess commentary generation, the generation model needs to know how to think and play in order to provide better outputs. In this paper, we introduce a neural chess engine into our generation models. The chess engine is pre-trained on supervised expert games collected from the FICS Database (https://www.ficsgames.org/) and on unsupervised self-play games Silver et al. (2017a, b), and then jointly trained with the generation models. It is able to get board representations, predict reasonable move distributions, and give continuous predictions by self-play. Our generation models are designed to imitate commentators’ thinking process by using the representations and predictions from the internal chess engine. The models then ground commentary texts to the thinking results (semantics). We perform our experiments on 5 categories (Description, Quality, Contexts, Comparison, Planning) in the benchmark Chess Commentary dataset provided by Jhamtani et al. (2018). We tried models with chess engines of different playing strength. Both automatic and human evaluation results show the efficacy and superiority of our proposed models.
The contributions are summarized as follows:
To the best of our knowledge, we are the first to introduce a compatible neural chess engine into chess comment generation models and jointly train them, which enables the generation models to benefit from internal representations of game playing and analysis.
On all the 5 categories in the Chess Commentary dataset, our proposed model performs significantly better than previous state-of-the-art models.
Our code for models and data processing will be released on GitHub (https://github.com/zhyack/SCC). Experiments can be easily reproduced and extended.
The most relevant work is Jhamtani et al. (2018). The authors released the Chess Commentary dataset together with the state-of-the-art Game Aware Commentary (GAC) generation models. Their models generate comments with features extracted from powerful search-based chess engines. We follow their work to further explore better solutions for the different sub-tasks (categories) in their dataset. Another relevant study, on commentary generation for Shogi (a board game similar to chess), is from Kameko et al. (2015). They rely on external tools to extract key words first, and then generate comments with respect to the key words. In contrast to these works, we argue that an internal neural chess engine can provide better information about the game states, options, and developments. We design models and experiments to support this proposal.
Chess engines have been studied for decades Levy and Newborn (1982); Baxter et al. (2000); David et al. (2017); Silver et al. (2017a). Powerful chess engines have already achieved much greater playing strength than human beings Campbell et al. (2002); Silver et al. (2017a). Traditional chess engines are based on rules and heuristic searches Marsland (1987); Campbell et al. (2002). They are powerful, but limited by their human-designed value functions. In recent years, neural models Silver et al. (2016, 2017b); David et al. (2017) have shown great potential in board games. Several models have been proposed that can easily beat the best human players in Go, Chess, Shogi, etc. Silver et al. (2017a). Compared to the traditional engines, the hidden states of neural engines can provide rich information about the game and have the potential to be compatible with NLG models. We follow these techniques and design our own neural chess engine. Apart from learning to play the game, our engine is designed to make game states compatible with semantic representations, which bridges the game state space and the human language space. To realize this, we deploy multi-task learning Collobert and Weston (2008); Sanh et al. (2018) in our proposed models.
Data-to-text generation is a popular track in NLG research. Recent work mainly addresses generating biographies Sha et al. (2018), market comments Murakami et al. (2017), and game reports Li and Wan (2018) from structured data. Here we aim to ground the commentary to the game data (boards and moves). Addressing content selection Wiseman et al. (2017) is one of the top considerations in our designs.
The overview of our approach is shown in Figure 2. Apart from the text generation models, there are three crucial modules in our approach: the internal chess engine, the move encoder, and the multi-choices encoder. We first introduce our solution to all the sub-tasks of chess commentary generation with the modules as black boxes, and then describe them in detail.
In Figure 2, an example is presented with model structures to demonstrate how our models solve all the sub-tasks. The process is driven by the internal chess engine. Given the current board and move, the engine emulates the game and provides the current and next board states together with the winning rates of the players. The engine also predicts another optional move from the current board for comparison with the played move. A series of long-term moves and boards is then further predicted by the engine in a self-play manner Silver et al. (2017a, b) for deep analysis. With the semantics provided by the engine, the generation models are able to predict with abundant and informative contexts. We first detail the different semantic contexts with respect to the models for the 5 different sub-tasks, and then summarize the common decoding process for all the models.
Description Model: Descriptions of the current move intuitively depend on the move itself. However, playing the same move could have different motivations under different contexts. For example, e2e4 is the classic King’s Pawn Opening at the start of a game, but in the middle of the game it can be part of forming a pawn defense structure. Different from previous works on chess commentary generation Jhamtani et al. (2018); Kameko et al. (2015), we find all kinds of latent relationships in the current board vital for analyzing the current move. Therefore, our description model takes the representations of both the current board and the current move from the move encoder as semantic contexts to produce the description comment. The description model is formulated as Eq.1.
Quality Model: Jhamtani et al. (2018) find that winning rate features benefit the generation models on the Quality category. Inspired by this, we concatenate the current board state, the next board state, and the winning rate difference as semantic contexts for the decoder. To model the value of the winning rate difference, we introduce a weight matrix that maps the board state-value pair to the same semantic space as the other contexts by Eq.2. Our quality model is formulated as Eq.3, whose target is the comment about quality.
Comparison Model: Usually, there are more than 10 possible moves on a given board, but not all of them are worth considering. Kameko et al. (2015) observe an interesting phenomenon in chess commentary: when expert commentators comment on a bad move, they usually explain why the move is bad by showing the right move, not another bad move. Inspired by this, we only consider the true move and the potential best move (decided by the internal chess engine) as options for the comparison model. The semantic contexts for the options are encoded by the multi-choices encoder. We define the comparison model as Eq.4, which applies the multi-choices encoder to the boards obtained by executing the true move and the potential best move on the current board, and whose target is the comment about comparison.
Planning Model: Commentators often try to predict what will happen as if they were playing the game themselves, and then give analysis according to their simulations. Our internal chess engine is able to simulate and predict the game in a similar way (self-play). We realize our model for planning by imitating the human commentators’ behavior. Predicted moves and boards are processed by our multi-choices encoder to identify the potential big moments in the future, and we use the multi-choices encoder to produce the semantic contexts for the decoder. The process of generating the planning comment is described in Eq.5.
Contexts Model: To analyze the situation of the whole game, the model should know about not only the current state, but also the future. Similar to the planning model, the contexts model takes a series of long-term moves and boards produced by self-play predictions as inputs. In this way, the model comments on the game from a god-like perspective. The semantic contexts are also processed by the multi-choices encoder for generating the contexts comment as Eq.6.
We use the cross-entropy loss function for training. The function is formalized as Eq.7 and is computed against the gold-standard outputs.
We treat the semantic contexts as a set of raw context vectors, characterized by the number of such vectors and the dimension of each vector. Although the semantic contexts for different generation models are different as described before, we regard all of the board states, winning rates, and move representations as general semantic contexts. We use the attention mechanism Bahdanau et al. (2015); Luong et al. (2015) to gather information from the contexts. For example, given a hidden vector drawn from the LSTM units, to decode with the semantic contexts, we use the score function of Luong attention Luong et al. (2015) to calculate the attention weights for the context vectors, with a transformation function applied for the attentional context vectors. The scores are further normalized by a softmax function. We then compute the weighted sum of the context vectors with the normalized weights to produce the attentional context vector for word decoding.
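The attention equations did not survive in this copy of the text, so the following is only a minimal sketch of the three steps described above (score, softmax normalization, weighted sum). The function name `luong_attention`, the dot-product score, and the `transform` argument are assumptions made for illustration; the actual score function and transformation used in the paper may differ.

```python
import math

def luong_attention(hidden, contexts, transform):
    """Return (weights, context_vector) for one decoding step.

    hidden:    decoder hidden vector, a list of d floats
    contexts:  raw context vectors, a list of n lists of d floats
    transform: assumed transformation applied to each context vector
               before scoring (hypothetical; identity is used in the demo)
    """
    # Step 1: score every (transformed) context vector against the hidden state.
    scores = [sum(h * c for h, c in zip(hidden, transform(vec)))
              for vec in contexts]
    # Step 2: softmax normalization, shifted by the max for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 3: weighted sum of the raw context vectors.
    dim = len(contexts[0])
    context_vector = [sum(w * vec[k] for w, vec in zip(weights, contexts))
                      for k in range(dim)]
    return weights, context_vector

weights, ctx = luong_attention([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], lambda v: v)
```

In the demo call, the first context vector is aligned with the hidden state and therefore receives the larger attention weight.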
The internal chess engine is in charge of mapping the board to a semantic representation, predicting the probability distribution over valid moves, and evaluating the winning rate for the players. In previous works Jhamtani et al. (2018); Kameko et al. (2015), researchers use discrete information (threats, game evaluation scores, etc.) analyzed by an external chess engine to build semantic representations. Simply mapping these independent features limits the capability of the representations. Our internal chess engine is able to mine deeper relations and semantics with the raw board as input. It can also make predictions in a continuous semantic space, increasing the capability and robustness of generation.
Following advanced research in neural chess engines David et al. (2017); Silver et al. (2017a), we split the input raw board into 20 feature planes for the sake of machine understanding. There are 12 planes for the positions of each player’s pieces (pawn, rook, knight, bishop, queen, king), 4 planes for white’s repetitions, black’s repetitions, total moves, and moves with no progress, and 4 planes for the 2 castling choices of each player. The feature planes are encoded by several CNN layers to produce sufficient information for the semantic representation. As in previous research on chess engines, this representation is used to predict the move probability distribution and the winning rate through fully connected layers. But unlike those pure engines, we share the board state with the generation models in a multi-task manner Collobert and Weston (2008). The engine is designed not only for playing, but also for expressing. Our generation models use the board state as part of their inputs to get a better understanding of the game states.
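To make the 20-plane layout concrete, here is a small sketch of how such an input could be assembled. The plane ordering, the piece-letter convention, and the use of constant-valued planes for the counters and castling rights are assumptions for illustration and are not confirmed by the text.

```python
# Piece letters: uppercase = white, lowercase = black (assumed convention).
PIECES = "PRNBQKprnbqk"

def constant_plane(value):
    """An 8x8 plane filled with a single value."""
    return [[float(value)] * 8 for _ in range(8)]

def board_to_planes(pieces, white_reps, black_reps, total_moves,
                    no_progress, castling):
    """Build 20 feature planes of size 8x8.

    pieces:   dict mapping (rank, file), both in 0..7, to a letter in PIECES
    castling: 4 booleans (white king-side, white queen-side,
              black king-side, black queen-side)
    """
    planes = []
    # 12 binary planes: one per piece type and player.
    for symbol in PIECES:
        plane = constant_plane(0)
        for (rank, file), piece in pieces.items():
            if piece == symbol:
                plane[rank][file] = 1.0
        planes.append(plane)
    # 4 planes: white's repetitions, black's repetitions,
    # total moves, moves with no progress.
    for counter in (white_reps, black_reps, total_moves, no_progress):
        planes.append(constant_plane(counter))
    # 4 planes: one per castling choice.
    for right in castling:
        planes.append(constant_plane(1 if right else 0))
    return planes

planes = board_to_planes({(0, 4): "K", (7, 4): "k", (1, 4): "P"},
                         0, 0, 1, 0, (True, True, True, True))
```

The resulting stack of planes is the kind of tensor a CNN encoder would consume.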
Given tuples from game replays, each consisting of a board, the corresponding move, and the ground-truth winning rate, we optimize the engine’s policy and value function at the same time, as Eq.11 shows. When the engine grows stronger, we let it produce data by itself in a self-play manner Silver et al. (2017a). In addition, the engine is jointly optimized when training the generation models.
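Eq.11 is not available in this copy, so the sketch below only illustrates one common way to optimize a policy and a value function at the same time: cross-entropy on the played move plus squared error on the winning rate, in the style of self-play engines. The function name `engine_loss`, the equal weighting of the two terms, and the absence of regularization are assumptions.

```python
import math

def engine_loss(move_probs, played_move, predicted_value, true_value):
    """Joint loss for one training tuple (board, move, winning rate).

    move_probs:      dict mapping each valid move to its predicted probability
    played_move:     the move actually played in the replay
    predicted_value: winning rate predicted by the engine
    true_value:      ground-truth winning rate
    """
    # Policy term: negative log-likelihood of the played move.
    policy_loss = -math.log(move_probs[played_move])
    # Value term: squared error between predicted and true winning rate.
    value_loss = (predicted_value - true_value) ** 2
    return policy_loss + value_loss

loss = engine_loss({"e2e4": 0.5, "d2d4": 0.5}, "e2e4", 0.6, 1.0)
```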
Apart from understanding the board, commentators also need to know the semantics of the move. Besides using the chess engine to produce board representations, the move encoder prepares move embeddings as attention contexts for the text decoders. We feed the features of the move (the starting cell, the ending cell, the piece at the starting cell, the piece at the ending cell, the promotion state, and the checking state) as a sequential input to a bi-directional RNN Schuster and Paliwal (1997). When a decoder requests attention contexts for a hidden state, the encoder offers its outputs to build the attentional context vector following Eq.9 and Eq.10.
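As an illustration of the six-element feature sequence listed above, the following sketch derives it from a move string. The UCI-like move format (e.g. `e2e4`, `e7e8q`), the string tokens, and the helper name `move_features` are assumptions; the paper's actual feature encoding is not shown in this copy.

```python
def move_features(board, move, gives_check=False):
    """Return the six move features as a token sequence.

    board:       dict mapping squares such as "e2" to piece letters
    move:        UCI-like string, e.g. "e2e4" or "e7e8q" (promotion suffix)
    gives_check: whether the move gives check (supplied by the caller)
    """
    start, end = move[:2], move[2:4]
    promotion = move[4:] or "none"
    return [
        "from:" + start,                           # starting cell
        "to:" + end,                               # ending cell
        "moving:" + board.get(start, "empty"),     # piece at the starting cell
        "target:" + board.get(end, "empty"),       # piece at the ending cell
        "promotion:" + promotion,                  # promotion state
        "check:" + ("yes" if gives_check else "no"),  # checking state
    ]

features = move_features({"e2": "P", "e7": "p"}, "e2e4")
```

Each token would then be embedded and fed, in order, to the sequence encoder.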
For Comparison, Planning, and Contexts, there are multiple moves derived from variations and predictions, and the model needs to find the bright spots to describe. To encode these moves and offer precise information to the generation models, we propose a multi-choices encoder. Human commentators usually choose different aspects to comment on according to their experience. We use a global vector to store our models’ experience and choose important moves to comment on. Note that this vector is learned. In module (c) of Figure 2, each choice is represented by the output vectors of the corresponding move encoder, the board state of the corresponding board, and the embedding of the winning rate of that board. To model the winning rate value, we introduce a mapping matrix and map the state-value pair to the value embedding.
Then we calculate the soft weights of the choices with respect to the board states by Eq.13. For a hidden state vector from the decoder, the attention weight matrix is scaled by these soft weights via Eq.14. We finally obtain the attentional context vector from the scaled attention weights by Eq.15. This approach enables the generation models to generate comments with attention to intriguing board states. The attention weights can become more accurate as the global vector accumulates abundant experience in training.
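Eq.13 to Eq.15 are missing from this copy, so the following is a loose sketch of the idea rather than the paper's formulation: a global vector scores each choice, the scores are normalized into soft weights, and those weights scale the per-choice attention before the context vector is summed. The dot-product scoring, the `choice_keys` input (assumed to combine board state and value embedding), and all names are hypothetical.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multi_choice_context(global_vec, choice_keys, move_outputs, hidden):
    """Return (choice_weights, context_vector).

    global_vec:   the learned global vector (a fixed list in this demo)
    choice_keys:  one vector per choice, assumed to summarize its board
                  state and winning-rate embedding
    move_outputs: for each choice, the list of move-encoder output vectors
    hidden:       decoder hidden state
    """
    # Soft weights over the choices, driven by the global vector.
    choice_weights = softmax([dot(global_vec, key) for key in choice_keys])
    dim = len(hidden)
    context = [0.0] * dim
    for c_weight, outputs in zip(choice_weights, move_outputs):
        # Attention inside one choice, then scaled by that choice's weight.
        local = softmax([dot(hidden, vec) for vec in outputs])
        for a, vec in zip(local, outputs):
            for k in range(dim):
                context[k] += c_weight * a * vec[k]
    return choice_weights, context

choice_weights, context = multi_choice_context(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
    [[[1.0, 0.0]], [[0.0, 1.0]]], [0.0, 0.0])
```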
We conduct our experiments on the recently proposed Chess Commentary dataset (https://github.com/harsh19/ChessCommentaryGeneration/) Jhamtani et al. (2018). In this dataset, Jhamtani et al. (2018) collect and process 11,578 annotated chess games from the large social forum GAMEKNOT (https://gameknot.com). There are 298K aligned data pairs of game moves and commentaries. The dataset is split into training, validation, and test sets in a 7:1:2 ratio with respect to the games. As GAMEKNOT is a free-speech forum, the comments can be very freewheeling in grammar and morphology. The informal language style and unpredictable expression tendency pose a big challenge for data-driven neural generation models. To narrow down the expression tendency, Jhamtani et al. (2018) classify the dataset into 6 categories: Description, Quality, Comparison, Planning, Contexts, and General. The General category is usually about the player and tournament information, which needs external knowledge irrelevant to game analysis. We do not conduct experiments on this last category.
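Splitting "with respect to the games" means that all moves of one game land in the same partition. A minimal sketch of such a 7:1:2 split follows; the shuffling, the seed, and the function name `split_games` are assumptions, since the dataset ships with its own split.

```python
import random

def split_games(game_ids, seed=0):
    """Split game identifiers into train/validation/test at a 7:1:2 ratio."""
    ids = sorted(set(game_ids))
    random.Random(seed).shuffle(ids)   # deterministic shuffle for the demo
    n_train = len(ids) * 7 // 10
    n_valid = len(ids) // 10
    train_ids = ids[:n_train]
    valid_ids = ids[n_train:n_train + n_valid]
    test_ids = ids[n_train + n_valid:]
    return train_ids, valid_ids, test_ids

train_ids, valid_ids, test_ids = split_games(range(100))
```

Move-comment pairs are then assigned to a partition according to the game they come from.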
For training the chess engine, we collect all of the standard chess game records from the past 10 years in the FICS Games Database. We remove the games in which any player’s rating is below 2,000. After cleaning, there are 36M training data (single move steps).
We train our neural chess engine using mixed data consisting of supervised FICS data and unsupervised self-play data. The number of self-play games is set to 0 initially, and it is increased by 1 whenever the trained model beats the previous best version (with a winning rate larger than 0.55 in 20 games). During 400 iterations of training, we pick one strong engine and one weak engine for further experiments. The strong engine loses 1 game and draws 55 games against the weak engine in 100 games. As mentioned in Section 3.2, when training the generation models, we use the pre-trained chess engine and fine-tune it with the generation models.
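The promotion rule described above can be sketched as a small gating function. How draws count toward the winning rate is not specified in the text, so this sketch assumes the rate is simply wins divided by games; the function name `update_self_play` is hypothetical.

```python
def update_self_play(num_self_play_games, wins, games=20, threshold=0.55):
    """Return (new_self_play_count, promoted) after an evaluation match.

    The candidate replaces the previous best version only if its winning
    rate over the match is strictly larger than the threshold. Draws are
    counted as non-wins here, which is an assumption.
    """
    promoted = wins / games > threshold
    return num_self_play_games + (1 if promoted else 0), promoted

count, promoted = update_self_play(0, wins=12)
```

With 12 wins out of 20 the candidate is promoted and one more self-play game is mixed into the training data; with 10 wins the count stays unchanged.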
Here we introduce our models and baselines in the experiments. We call our models the Skilled Chess Commentator (SCC) as they have the skills of playing chess.
SCC-weak: The generation models are integrated with the weak engine mentioned above, and they are trained independently for each of the 5 categories in the Chess Commentary dataset.
SCC-strong: The model is similar to SCC-weak, but integrated with the strong engine.
SCC-mult: This is a multi-task learning model where the generation models for different categories share the strong chess engine, the move encoder, the multi-choices encoder, and the value mapping matrix.
GAC: The state-of-the-art method proposed by Jhamtani et al. (2018). Their models incorporate the domain knowledge provided by external chess engines. Their models only work for the first 3 categories: Description, Quality, and Comparison. We compare our results with GAC on these categories.
KWG: Another state-of-the-art method for game commentary generation Kameko et al. (2015). It is a pipeline method based on keyword generation. We compare the results on all data categories.
Re: This is a retrieval-based baseline method. For each input in the test set, we find the most matched datum in the training set by numbers of matched input board and move features.
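The retrieval baseline can be pictured as a nearest-neighbour lookup by feature overlap. The sketch below assumes that each datum is represented as a set of board and move features and that ties are broken by order in the training set; both are illustrative assumptions, as is the name `retrieve_comment`.

```python
def retrieve_comment(query_features, training_set):
    """Return the comment of the training datum sharing the most features.

    query_features: set of board and move features of the test input
    training_set:   list of (feature_set, comment) pairs
    """
    best = max(training_set,
               key=lambda item: len(query_features & item[0]))
    return best[1]

training_set = [
    ({"from:e2", "to:e4", "moving:P"}, "White opens with the king's pawn."),
    ({"from:g1", "to:f3", "moving:N"}, "White develops the knight."),
]
comment = retrieve_comment({"from:g1", "to:f3", "moving:N", "check:no"},
                           training_set)
```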
We develop both automatic evaluations and human evaluations to compare the models.
For automatic evaluations, we use BLEU Papineni et al. (2002) and METEOR Denkowski and Lavie (2014) to evaluate the generated comments against the ground-truth outputs. BLEU evaluates the modified precision between the predicted texts and gold-standard references at the corpus level. Evaluating with 4-grams (BLEU-4, https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) is the most popular choice in NLG research. However, for tasks like dialogue systems Li et al. (2016), story telling generation Jain et al. (2017), and chess commentary Jhamtani et al. (2018), the outputs can be rather short and free-form. Under such circumstances, the brevity penalty for 4-grams can be too strict and make the results unbalanced. We use BLEU-2 (https://github.com/harsh19/ChessCommentaryGeneration/blob/master/Code/methods/category_aware/BLEU2.perl) to show steadier results with the BLEU evaluation algorithm. We also use METEOR as a metric, whose results are closer to a normal distribution Dobre (2015).
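For readers unfamiliar with the metric, the following is a simplified corpus-level BLEU-2 with one reference per hypothesis: clipped (modified) n-gram precision for unigrams and bigrams, a geometric mean, and the brevity penalty. It is a sketch for intuition only and is not the Perl script used in the experiments; tokenization and smoothing details are omitted.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=2):
    """Corpus-level BLEU with a single reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the modified precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)

score = corpus_bleu(["white takes the pawn".split()],
                    ["white takes the knight".split()])
```

In the demo, the unigram precision is 3/4 and the bigram precision is 2/3, so the score is the square root of their product.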
We also conduct human evaluation to make more convincing comparisons. We recruit 10 workers on Amazon Mechanical Turk (https://www.mturk.com) to evaluate 150 groups of samples (30 from each category). Each sample is assigned to exactly 2 workers. The workers rate 8 shuffled texts (for Ground Truth, Temp, Re, GAC, KWG, and SCC models) on the following 4 aspects using a 5-pt Likert scale (https://en.wikipedia.org/wiki/Likert_scale).
Fluency: Whether the comment is fluent and grammatical.
Accuracy: Whether the comment correctly describes current board and move.
Insights: Whether the comment makes appropriate predictions and thorough analysis.
Overall: The annotators’ overall impression about comments.
We present the automatic evaluation results in Table 1. Our SCC models outperform all of the baselines and previous state-of-the-art models. Temp is limited by the variety of templates. It is competitive with the neural models on Description and Quality due to the limited expressions in these tasks, but it performs poorly on Comparison, Planning, and Contexts. Re keeps flexibility by copying sentences from the training set, but it does not perform well either. The ability of Re is limited by the sparse search space: there are 90,743 data in the training set, while the number of possible boards (https://en.wikipedia.org/wiki/Shannon_number) for a chess game is vastly larger. KWG and GAC provide competitive results. With the help of external information from powerful chess engines, GAC shows good performance on Quality and Comparison. Although our internal chess engine is no match for the external engines that GAC uses at playing chess, it turns out that our models, which use internal information directly, can better bridge the semantic spaces of the chess game and the comment language. As for the comparisons within our models, SCC-strong turns out to be better than SCC-weak, which supports our assumption that better skills enable more precise predictions, resulting in better comments. Training with multi-task learning seems to hurt the overall performance a little, but SCC-mult still achieves state-of-the-art performance. More importantly, it can handle all sub-tasks as a whole.
The human annotators are required to be good at playing chess. That is to say, they are the true audience of commentator research and applications. By introducing human evaluations, we further reveal the performance from the perspective of the audience. We show the average scores and significance test results in Table 2. They further demonstrate the efficacy of our models, which achieve significantly better overall performance than the retrieval-based model and the previous state-of-the-art ones. It is worth noting that the evaluations of Accuracy and Insights show that our models can produce more precise and thorough analysis owing to the internal chess engine. SCC-mult and SCC-strong perform better than SCC-weak in Accuracy and Overall scores. This also supports the point that our commentary model can be improved with a better internal engine.
in a two-tail T-test (p<0.01).
To have a better view of comparisons among model outputs, we present and analyze some samples in Figure 3. In these samples, our model refers to SCC-mult.
For the first example, black can exchange white’s e3 knight and e4 pawn with the b4 bishop if white takes no action. But white chooses to protect the e3 knight with the g1 knight. All the models generate comments about Description. Temp directly describes the move without explanation. Re finds a similar situation in the training set and explains the move as defense and development. KWG is right about development, but wrong about the position of the knight and the threats. GAC produces a safe comment about development. Our model has a better understanding of the board. It annotates the move correctly and even gives the reason why white plays this move.
For the second example, the game is at the 3rd turn. White gives up the pawn on d5 and chooses to push the queen’s pawn. Re and KWG both make a mistake and recognize the move d2d4 as the Queen Pawn Opening. Temp thinks white is going to win because white has the advantage of one more pawn. However, Temp cannot predict that white will lose the advantage in the next move. Our model is able to predict the future moves via self-play, and it concludes that pushing the queen’s pawn can open up lines for the queen and bishop for future planning.
In this work we propose a new approach for automated chess commentary generation. We put forward the idea that models capable of playing chess will generate good comments, and that models with better playing strength will perform better in generation. By introducing a compatible chess engine into comment generation models, we obtain models that can mine deeper information and ground more insightful comments to the input boards and moves. Comprehensive experiments demonstrate the effectiveness of our models.
Our experimental results point toward further developing state-of-the-art chess engines to improve generation models. Another interesting direction is to extend our models to multi-move commentary generation tasks. Unsupervised approaches that leverage the massive amount of chess comments in social media are also worth exploring.
This work was supported by National Natural Science Foundation of China (61772036) and Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank the anonymous reviewers for their helpful comments. Xiaojun Wan is the corresponding author.
Collobert and Weston (2008). A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of ICML 2008, pp. 160–167.
Silver et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR abs/1712.01815.