Evaluation Function Approximation for Scrabble

01/25/2019 ∙ by Rishabh Agarwal, et al.

The current state-of-the-art Scrabble agents are not learning-based but depend on truncated Monte Carlo simulations, and the quality of such agents is contingent upon the time available for running the simulations. This thesis takes steps towards building a learning-based Scrabble agent using self-play. Specifically, we try to find a better function approximation for the static evaluation function used in Scrabble, which scores the goodness of a move at a given board configuration. In this work, we experimented with evolutionary algorithms and Bayesian optimization to learn the weights for an approximate feature-based evaluation function. However, these optimization methods were not very effective, which led us to explore the given problem from an imitation learning point of view. To this end, we imitate the ranking of moves produced by the Quackle simulation agent using supervised learning with a neural network function approximator that takes a raw representation of the Scrabble board as input instead of using only a fixed number of handcrafted features.


1.1 Scrabble

Scrabble [wiki:scrabble] is a board game in which two players alternate forming words on a 15x15 board by placing tiles, each bearing a single letter. The tiles must form words that, in crossword fashion, read left to right in rows or downwards in columns, and that are defined in a standard dictionary or lexicon. Also, at least one of the tiles must be placed next to an existing tile on the board. The board is also marked with “premium” squares, which multiply the number of points awarded for a given word.

The game begins with a total of 100 tiles, placed in a bag in which the letters are not visible. The set of tiles that a player holds is called the rack, and each player starts by drawing seven tiles from the bag. The set of tiles remaining after a player has moved is called the rack leave, and for every letter placed on the board a replacement is drawn from the bag. The game ends when one player is out of tiles and no tiles are left to draw, or when both players pass. The interested reader can find further details about the rules of Scrabble in [wiki:scrabble].

Figure 1.1: An example Scrabble board with the rack of a particular player. Note that squares other than the premium squares are colored green, while the other colors correspond to different types of premium squares. Source: http://www.bilder-hochladen.net/files/big/jcry-7r-bf1b.png

1.2 Why Scrabble?

Scrabble is a game of imperfect information, i.e. the current player is unaware of the opponent’s rack, which makes it very hard to predict the opponent’s next move until the end of the game. There is also inherent randomness in Scrabble, as random letters are drawn from the bag to the current player’s rack each round. The state space of Scrabble is also quite complex because the tiles are marked with specific letters, as opposed to being simply black and white. All these factors contribute to the difficulty of creating a perfect AI agent for Scrabble.

1.3 Outline

As a prerequisite to our exposition, Chapter 2 provides an overview of the current state-of-the-art Scrabble agents together with a high-level description of the techniques they employ. Chapter 3 gives a detailed discussion of our approach to improving the evaluation function using black-box optimization methods. In Chapter 4, we describe our current framework, which uses supervised learning to approach the problem. We conclude with a brief summary and directions for future work in Chapter 5.

2.1 Maven

Maven’s [sheppard2002world] game play is divided into three phases:

  1. Mid-game: This phase lasts from the beginning of the game until there are nine or fewer tiles left in the bag. In this phase, all possible moves from the given rack are generated and then sorted using simple heuristics. The most promising of these moves are evaluated using a 2-ply Monte-Carlo simulation in which both players are assumed to draw tiles from the bag on each turn and to play the best move according to the aforementioned heuristics (see the sketch after this list). The move evaluated as best after the simulations is played during the game.

  2. Pre-endgame: This phase is similar to the mid-game phase and is designed to attempt to yield a good endgame situation.

  3. Endgame: During this phase, there are no tiles left in the bag and thus Scrabble becomes a game of perfect information. Maven uses the B* search algorithm to analyze the game tree during the endgame phase.
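The exact mid-game procedure is not spelled out here, but the idea of a short truncated simulation can be sketched as follows. This is a minimal illustration only: `board`, `bag`, `generate_moves`, `heuristic`, and the `.play`/`.score` attributes are hypothetical placeholders, and the real engines use far more refined rollouts.

```python
import random

def two_ply_value(board, move, bag, generate_moves, heuristic, n_rollouts=50):
    """Minimal sketch of a 2-ply truncated Monte-Carlo move evaluation."""
    total = 0.0
    for _ in range(n_rollouts):
        tiles = list(bag)
        random.shuffle(tiles)
        after_move = board.play(move)              # board after our candidate move
        opp_rack = tiles[:7]                       # opponent draws a random rack
        opp_reply = max(generate_moves(after_move, opp_rack), key=heuristic)
        # The score differential, averaged over rollouts, approximates move quality.
        total += move.score - opp_reply.score
    return total / n_rollouts

# The candidate with the highest simulated value is the move that gets played:
# best = max(candidates, key=lambda m: two_ply_value(board, m, bag, gen, h))
```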

2.2 Quackle

Quackle [katz-brown_o'laughlin] uses an approach similar to Maven’s. Quackle’s heuristic function (henceforth referred to as the static evaluation function) used in the mid-game phase is the sum of the move score and the value of the rack leave. This leave value is computed using a precomputed database: it favors both the chance of playing a Bingo on the next turn and leaves that garnered high scores when they appeared on a player’s rack in a large sample of Quackle-versus-Quackle games. The win-percentage of each move is estimated, and the move with the best win-percentage is played. Figure 2.1 presents a high-level flow chart of the algorithm used in Quackle.

Figure 2.1: Quackle’s AI in a Block Diagram. Source: http://people.csail.mit.edu/jasonkb/quackle/doc/blockdiagram.png
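As a concrete illustration of the static evaluation just described, the sketch below scores a move as its board score plus a precomputed leave value. The `LEAVE_VALUES` table and its entries are hypothetical stand-ins for Quackle’s precomputed leave database, not the actual values.

```python
# Illustrative entries only; Quackle's real leave database is precomputed
# from a large sample of Quackle-versus-Quackle games.
LEAVE_VALUES = {"": 0.0, "?S": 25.0, "QW": -10.0}

def static_value(move_score, rack_leave):
    """Static evaluation: move score plus the value of the rack leave."""
    return move_score + LEAVE_VALUES.get("".join(sorted(rack_leave)), 0.0)

def best_static_move(moves):
    """Pick the move with the highest static value.

    `moves` is assumed to be a list of (move, move_score, rack_leave) tuples.
    """
    return max(moves, key=lambda m: static_value(m[1], m[2]))
```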

There are mainly two types of players present in the Quackle open-source code, which we used for our experiments:

  1. Speedy Player: This player uses only static evaluations throughout the game and no Monte-Carlo simulations. The move predicted to be best by the static evaluation function is played at a given board configuration.

  2. Championship Player: This player uses simulations in addition to static evaluations, along with a perfect endgame player. It can be provided with a particular amount of think-time to run the truncated Monte-Carlo simulations for a single turn. For example, a “Five Minute Championship Player” can take as long as five minutes to decide upon the move to play in a single turn. (The win-rate of the second player when the first player is a “Twenty Second Championship Player”, i.e. one with a think-time of 20 seconds/move, and the second player is a “Speedy Player” was calculated using 50000 games.)

Please note that the win-rate of the second player when both players are Speedy Players, calculated using 50000 self-play games, serves as the baseline for the comparisons below.

3.1 Fitness Function

3.1.1 True Fitness Function

The true fitness function for a given evaluation function is the win-percentage of the Scrabble agent using that evaluation function against another agent (usually a Speedy Player) that uses a fixed evaluation function. The win-percentage should be estimated over a large number of games in order to keep its error bars small, leading to a reasonable estimate of this fitness function.

Using Hoeffding’s inequality[wiki:hoeffding]

, the true mean of a Bernoulli random variable that has been sampled N times will lie, with probability at least 1 -

, within the empirical mean . The win-rate approximated using N self-play games is estimated correctly with an error bound given by the above inequality. Using , and N = 5000 leads to a error bar of . Further, increasing N to 50000 games still leads to an error of . We can achieve slightly tighter error bounds based on KL Divergence[kaufmann2013information] given by equations 3.1 and 3.2.

(3.1)
(3.2)

where p denotes the true mean of the Bernoulli variable and N is the number of self-play games. These equations give only a slightly smaller error for the same number of games and the same confidence level.
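For concreteness, the Hoeffding error bar can be computed as below. The confidence level δ used in the thesis is not recoverable from the text above, so δ = 0.05 is only an assumed value.

```python
import math

def hoeffding_error(n_games, delta=0.05):
    """Half-width of a (1 - delta) confidence interval for a win-rate
    estimated from n_games games: sqrt(ln(2/delta) / (2 * n_games))."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_games))

# With the assumed delta = 0.05 this gives roughly a 1.9% error bar for
# 5000 games and about 0.6% for 50000 games.
print(hoeffding_error(5000), hoeffding_error(50000))
```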

3.1.2 Other fitness functions

The idea of imitating the simulation agent (“Championship Player”) in order to improve the static evaluation function also seemed quite reasonable. To explore this idea further, we experimented with a fitness function that utilizes the ranking of moves produced by the “Five-Minute Championship Player”. We generated 70000 board configurations and, for each board configuration b, calculated a ranked list of at most 10 moves out of all possible moves using the “Championship Player”. For each board configuration, a list of the top (at most 10) moves was also generated using the given static evaluation function. The player was then awarded a score based on the index k of the best move among those 10 moves, and the ranking-based fitness is given by the sum of all such scores.
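A sketch of this fitness is given below. The exact per-configuration scoring formula is not fully specified above, so the scheme used here (higher score when the Championship Player’s best move sits nearer the top of the static list) is an assumed stand-in, and the input dictionaries are hypothetical.

```python
def ranking_fitness(boards, champ_best_move, static_top10):
    """Sketch of the ranking-based fitness.

    champ_best_move[b]: the Championship Player's best move for board b.
    static_top10[b]: the static evaluation's ranked top-10 move list for b.
    Both are hypothetical inputs; the per-board score below is an assumed
    stand-in for the exact formula used in the thesis.
    """
    total = 0
    for b in boards:
        ranked = static_top10[b]
        if champ_best_move[b] in ranked:
            k = ranked.index(champ_best_move[b])   # 0 means full agreement
            total += len(ranked) - k               # assumed scoring scheme
    return total
```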

We also tested a fitness function given by the sum of final score differences for a “Speedy Player”, summed over all games played between a fixed Quackle player and a “Speedy Player” using the given static evaluation function. However, the preliminary results obtained with this score-difference fitness were worse than those obtained with the ranking-based fitness, so we did not use it in any further experiments. (In preliminary runs with CMA-ES using 6 generations and a population size of 50, the two fitness functions yielded win-rates of 46% and 44.5%; note that these results are accurate only to within the error implied by evaluating the fitness over 5000 games.)

3.2 CMA-ES

We used an evolutionary algorithm [spears1993overview], CMA-ES (Covariance Matrix Adaptation Evolution Strategy) [hansen2016cma], to find the optimal weights for a feature-based evaluation function. An evolutionary algorithm is broadly based on the principle of biological evolution: in each generation, λ new individuals are generated by stochastic variation of the current parental individuals, and the top μ individuals among them are then selected to become the parents of the next generation based on their fitness values. In this manner, individuals with better and better fitness values are generated over the generation sequence. Figure 3.1 presents a high-level overview of the CMA-ES algorithm.

Figure 3.1: CMA-ES flowchart. In each iteration (generation), a weighted combination of the μ best out of λ new candidate solutions is used to update the distribution parameters m, σ, and C, where N(m, σ²C) is the sampling distribution and f is the fitness function. Source: http://ascelibrary.org/cms/attachment/9942/197821/figure3.gif

For our experiments, we ran 4-6 generations of CMA-ES over the evaluation-function weights, using the win-rate fitness function calculated through self-play games between two “Speedy Players” as described in section 3.1.
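A minimal sketch of this optimization loop is shown below, assuming the `cma` (pycma) package; the text does not state which CMA-ES implementation was used, and `play_self_play_games` is a hypothetical stand-in for evaluating the win-rate fitness through self-play.

```python
import cma

def play_self_play_games(weights, n_games):
    """Placeholder: in the real experiment this plays n_games self-play games
    between a Speedy Player using `weights` and the baseline Speedy Player,
    and returns the resulting win-rate."""
    raise NotImplementedError

def negative_win_rate(weights):
    # CMA-ES minimizes, so the win-rate fitness is negated.
    return -play_self_play_games(weights, n_games=5000)

x0 = [1.0, 1.0, 0.0, 0.0, 0.0]   # illustrative initial feature weights
es = cma.CMAEvolutionStrategy(x0, 0.5, {"popsize": 50, "maxiter": 6})
while not es.stop():
    candidates = es.ask()                                   # sample one generation
    es.tell(candidates, [negative_win_rate(w) for w in candidates])
best_weights = es.result.xbest
```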

3.2.1 Linear Evaluation Function

The linear static evaluation function scores a move based on state-action features generated from the current board configuration and the move to be evaluated. The computed score is the dot product of the feature values with a weight vector. Note that state-only features are not used, as they are the same for all moves at a given state and thus would not affect the ranking under a linear function. The move with the highest value of the evaluation function is played. The feature set consists of the following features:

  1. Move Score: The score obtained by playing the given move on the board.

  2. Leave Value: A precomputed value for each rack leave, generated by the Quackle code.

  3. Leave Playability: A value computed for each rack leave, giving the expected move score obtainable after sampling letters from the bag to complete the rack and using that rack to form a word.

  4. The difference between the numbers of consonants and vowels left on the rack.

  5. The number of blanks left on the rack after playing the move.

The current static evaluation function in Quackle uses only the top two features mentioned above, i.e. the move score and the leave value, with a weight of 1 for each of the two features (a sketch of this linear evaluation is given below).
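A minimal sketch of the linear evaluation, assuming the feature vectors have already been computed for each candidate move:

```python
import numpy as np

def linear_value(features, weights):
    """Value of a move = dot product of its feature vector with the weights."""
    return float(np.dot(features, weights))

def pick_move(candidates, weights):
    """`candidates` is a list of (move, feature_vector) pairs; feature
    extraction itself (move score, leave value, ...) is not shown here."""
    return max(candidates, key=lambda mf: linear_value(mf[1], weights))[0]

# Quackle's current evaluation corresponds to the weight vector (1, 1, 0, 0, 0)
# over (move score, leave value, leave playability, consonant-vowel difference,
# blanks left on the rack).
quackle_weights = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
```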

Figure 3.2: Result of applying CMA-ES using only the first 5 features for a linear evaluation function, with the weight of the first feature always fixed to 1. The upper-left plot shows the variation of the best (minimum) fitness value in each generation. The lower figures show the square roots of the eigenvalues (left) and of the diagonal elements (right) of the covariance matrix C. The actual sample distribution has covariance matrix σ²C.

Figure 3.2 shows the results we obtained after running CMA-ES with the above-mentioned features. As shown in the upper-left plot of this figure, the best fitness value in a generation increases with the generation number, except for the last generation. The final weights with the best fitness value result in a win-percentage that is an improvement of 0.3% over the baseline, calculated over 50000 games for the second player.

We also experimented with other features, such as the length of the rack leave and features describing how the board configuration changes after playing a move. For example, when a player plays a move on the board, new openings (positions that can be used to play a valid move) are created for the opponent, and some of these openings can also utilize premium squares. We experimented with one feature counting such openings for each kind of premium square. However, these new features deteriorated the maximum fitness value obtained using CMA-ES and were therefore dropped from further experiments. The fact that these features did not lead to any improvement is surprising, and we believe this was due to the limited representation ability of the linear evaluation function.

Using CMA-ES with the ranking-based fitness function described in section 3.1.2 also did not lead to an improvement with the above-mentioned features. (Surprisingly, this experiment resulted in a win-rate of 41.8% when evaluated using 50000 games.) The baseline value of the ranking-based fitness indicates that the best move generated by the “Championship Player” is expected to lie within the top 5 moves predicted by the current static evaluation function.

3.2.2 Non-Linear Evaluation Function

Since a linear model has limited representation ability, we also experimented with evaluation functions represented by a neural network with 1-2 hidden layers and non-linear activation functions such as tanh and ReLU [nair2010rectified]. The neural network is used only to introduce non-linearity in a structured manner. State-only features pertaining to the board, such as the difference between vowels and consonants on the board and the number of blanks on the board, were also used as inputs to the network, in addition to the state-action features mentioned in section 3.2.1. The results from this experiment were not encouraging, and this class of non-linear functions was not able to improve upon the linear baseline.
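The precise architecture of these small networks is not given above; the sketch below shows one plausible shape in PyTorch (layer sizes are illustrative), whose weights would be set by the black-box optimizers discussed in this chapter rather than by gradient descent.

```python
import torch
import torch.nn as nn

class MLPEvaluation(nn.Module):
    """Small non-linear evaluation function over the handcrafted features."""

    def __init__(self, n_features, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),                      # tanh activations were also tried
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),           # scalar move value
        )

    def forward(self, features):
        return self.net(features).squeeze(-1)
```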

3.3 Bayesian Optimization

Bayesian optimization [brochu2010tutorial] is a method for optimizing expensive cost functions without calculating derivatives. It employs the Bayesian technique of placing a prior over the objective function and combining it with evidence to obtain a posterior. This allows a utility-based selection of the next observation using an acquisition function [snoek2012practical], which determines the next query point. The acquisition function balances sampling from areas of high uncertainty against sampling from areas likely to improve upon the current best observation.

Since our fitness functions are noisy (only a finite number of games are used to compute them), applying Bayesian optimization to find the optimal static evaluation function looked promising. However, this experiment also did not result in a significant improvement over the win-rate obtained by the current Quackle evaluation function.
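A minimal sketch of this experiment, assuming the scikit-optimize package (the text does not name the Bayesian optimization library that was actually used); `play_self_play_games` is the same hypothetical win-rate evaluator as before, and the search bounds are illustrative.

```python
from skopt import gp_minimize

def play_self_play_games(weights, n_games):
    """Placeholder for the noisy win-rate evaluation via self-play games."""
    raise NotImplementedError

def objective(weights):
    return -play_self_play_games(weights, n_games=5000)   # minimize negated win-rate

search_space = [(-2.0, 2.0)] * 5      # one bounded interval per feature weight
result = gp_minimize(objective, search_space, n_calls=60, noise=0.02)
best_weights, best_win_rate = result.x, -result.fun
```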

3.4 Multiple evaluation functions

Instead of using the same evaluation function for all stages of the game (refer to section 2.1 for detailed information about the different phases), it seemed plausible to us that a combination of evaluation functions, one per game stage, could work better for Scrabble. This idea has previously been employed successfully in robot soccer [macalpine2012ut] and has already been proposed in the context of Scrabble [romero2009] as well. Keeping the evaluation-function weights fixed for the early-game (more than 80 tiles left in the bag) and endgame phases, we evolved only the weights pertaining to the mid-game, defined as the phase of the game when the number of tiles in the bag is between 20 and 80. This experiment also proved futile and did not result in any improvement over the baseline evaluation function.
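A sketch of how such a phase-dependent evaluation can be wired up; the thresholds below follow the phase definitions stated above, and the weight vectors are whatever linear-evaluation weights are in use.

```python
def weights_for_phase(tiles_in_bag, early_weights, mid_weights, end_weights):
    """Select the evaluation-function weights for the current game phase.

    Only `mid_weights` were evolved in our experiment; the early-game and
    endgame weights were kept fixed.
    """
    if tiles_in_bag > 80:
        return early_weights          # early game
    if tiles_in_bag >= 20:
        return mid_weights            # mid game: 20 to 80 tiles in the bag
    return end_weights                # pre-endgame / endgame
```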

4.1 Learning to Rank

We initially experimented with a neural-network approximator as the static evaluation function, whose input is a move represented by the various state-action features (including the state-only features) described in section 3.2.2. We trained this neural network on a fixed dataset of state-action pairs covering 70000 board configurations. For each board configuration, the Quackle “Five Minute Championship Player” provides a sorted list of at most 10 moves chosen from all possible moves at that configuration. The dataset is further split into training and validation datasets with a ratio of 97:3.

Figure 4.1 presents the RankNet [burges2005learning] architecture, which invokes a shared move-scoring network once for each move of a pair. For each board configuration in the dataset, the network is given pairs of moves drawn from the Championship Player’s sorted list such that the first move of each pair is ranked above the second; the output label is therefore 1 for all such pairs. The binary cross-entropy loss used for training is L = -[y log p + (1 - y) log(1 - p)], where p is the network’s predicted probability that the first move should be ranked above the second and y is the label.

Figure 4.1: Architecture of the ranking network. Given two moves for a given board configuration (state), the network takes the pair of moves as input and outputs the probability that the first move should be ranked above the second. The network is trained using a binary cross-entropy loss, where the label for a pair of moves is 1 if the first move was ranked above the second by the Quackle agent and 0 otherwise.
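A PyTorch sketch of the pairwise training step: the same scoring network evaluates both moves of a pair, and binary cross-entropy is applied to the sigmoid of the score difference, following RankNet. The scoring network itself can be any of the move-evaluation networks discussed in this chapter.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def pairwise_ranking_loss(score_net, feats_i, feats_j, labels):
    """labels[k] = 1 if move i of pair k was ranked above move j by Quackle.

    `score_net` maps a batch of move feature tensors to scalar scores; the
    score difference is treated as the logit of P(move i ranked above move j).
    """
    score_diff = score_net(feats_i) - score_net(feats_j)
    return bce(score_diff, labels.float())
```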

However, results from this experiment were not very encouraging. The function approximator was only able to achieve a slightly better win-rate than the baseline Speedy Player. Further, the validation accuracy achieved using the ranking loss was close to the baseline accuracy produced by the existing static evaluation function. We believe that the limited representation ability of the handcrafted features used for this experiment was the limiting factor.

4.2 Raw Board Representation

The problem of designing good features for the game of Scrabble is quite important. The goodness of a board configuration depends both on spatial arrangement and on the presence or absence of sets of letters and ultimately needs consultation with the Scrabble dictionary as well. However, we think that combining these dimensions into a small number of discriminative signals is not an easy task. In order to tackle this problem, we decided to use the raw board representation of the Scrabble board as input to our evaluation function instead of using only a fixed number of hand-coded features.

AlphaGo Zero [silver2017mastering] used the raw board representation as input to neural networks trained using self-play and mastered the game of Go without using any human knowledge. Unlike Go, which only has white and black stones, Scrabble has 27 distinct tile types (the 26 letters plus the blank), each with a different value, and combinations of letters can have a value different from the sum of their constituents. This makes the Scrabble board a much harder proposition for a convolutional neural network (CNN) [lecun2015deep] to model well. However, given the enormous success of AlphaGo Zero, we decided to further investigate the idea of using raw board representations for Scrabble.

A Scrabble board is encoded as a three-dimensional feature tensor of shape 15x15xd, where the third dimension indexes the different types of features used in our encoding. For a particular feature, each board position is given a value. We used the following features for our encoding:

  1. Whether a position is blank or not.

  2. Whether a position contains a particular letter, leading to a total of 26 features for the letters A-Z.

  3. The score of the tile placed on a position. All positions not containing any tile are given a score of zero, except premium squares, which are given a score of -1.
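A sketch of this encoding is given below. The feature depth of 28 (one blank/empty plane, 26 letter planes, and one tile-score plane) follows from the list above; interpreting feature 1 as marking empty squares and the exact ordering of the planes are assumptions.

```python
import numpy as np

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_board(letters, tile_scores, premium_mask):
    """Encode a Scrabble board as a 15 x 15 x 28 feature tensor.

    letters: 15x15 grid of single characters, '' for empty squares;
    tile_scores: 15x15 grid of tile scores (ignored for empty squares);
    premium_mask: 15x15 boolean grid marking premium squares.
    """
    x = np.zeros((15, 15, 28), dtype=np.float32)
    for r in range(15):
        for c in range(15):
            if letters[r][c] == "":
                x[r, c, 0] = 1.0                                   # empty-square plane
                x[r, c, 27] = -1.0 if premium_mask[r][c] else 0.0  # premium marker
            else:
                x[r, c, 1 + LETTERS.index(letters[r][c])] = 1.0    # letter one-hot
                x[r, c, 27] = tile_scores[r][c]                    # tile-score plane
    return x
```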

A move at a given board configuration is provided as input via the concatenation of the feature tensors of the board before and after the move is played (call this input1). In addition to input1, the two features used by the current Quackle evaluation, the move score and the leave value (denoted input2), are also provided as input to our evaluation function. The neural network consists of residual blocks [he2016deep] of convolutional layers with batch normalization [ioffe2015batch] and ReLU activations, and is trained with the pairwise ranking setup described in Figure 4.1.

Specifically, the network consists of a single convolutional block followed by 2 residual blocks and a final block. The convolutional block applies the following modules sequentially to input1:

  1. A convolution of 16 filters with stride 1

  2. Batch normalization

  3. A rectifier nonlinearity

Each residual block applies the following modules sequentially to its input:

  1. A convolution of 16 filters with stride 1

  2. Batch normalization

  3. A rectifier nonlinearity

  4. A convolution of 16 filters with stride 1

  5. Batch normalization

  6. A skip connection that adds the input to the block

  7. A rectifier nonlinearity

The final block applies the following modules sequentially to its input:

  1. A convolution of 1 filter with stride 1

  2. Batch normalization

  3. A rectifier nonlinearity

  4. A fully connected linear layer to a hidden layer of size 32

  5. A rectifier nonlinearity

  6. A fully connected linear layer to a hidden layer of size 4

  7. A rectifier nonlinearity

  8. A concatenation layer that combines the output of the previous layer with input2

  9. A fully connected linear layer to a scalar output

The number of weight parameters in the network was approximately 1/20 of the number of training example pairs used to train it.
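A PyTorch sketch of this architecture is shown below. The 3x3 kernel size, the padding, and the assumption that input1 stacks two 28-plane board encodings (56 input channels) are not stated in the text; the filter counts, block structure, hidden sizes, and the concatenation with input2 follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)              # skip connection, then rectifier

class BoardEvaluation(nn.Module):
    def __init__(self, in_planes=2 * 28, board_size=15):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2d(in_planes, 16, 3, stride=1, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
        )
        self.res_blocks = nn.Sequential(ResidualBlock(16), ResidualBlock(16))
        self.head_conv = nn.Sequential(
            nn.Conv2d(16, 1, 3, stride=1, padding=1),
            nn.BatchNorm2d(1),
            nn.ReLU(),
        )
        self.fc1 = nn.Linear(board_size * board_size, 32)
        self.fc2 = nn.Linear(32, 4)
        self.fc_out = nn.Linear(4 + 2, 1)   # concatenate with the 2 input2 features

    def forward(self, input1, input2):
        # input1: (batch, 56, 15, 15) stacked before/after board encodings
        # input2: (batch, 2) move score and leave value
        x = self.conv_block(input1)
        x = self.res_blocks(x)
        x = self.head_conv(x).flatten(1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.cat([x, input2], dim=1)
        return self.fc_out(x).squeeze(-1)   # scalar move value
```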

Figure 4.2: Training and validation loss against the number of training epochs. The training loss decreases continually with the number of epochs. The validation loss follows the same trend but stabilizes at a much higher value, indicating overfitting.

Figure 4.2 shows the learning curves for the network. This approach achieved an encouraging validation accuracy on the held-out validation set, which was not used for training. Given these initial results, the approach looks very promising, and future work may involve exploring it further.

Bibliography