Self-Play Learning Without a Reward Metric

12/16/2019
by   Dan Schmidt, et al.

The AlphaZero algorithm learns strategy games via self-play and has achieved superhuman ability in Go, chess, and shogi. It relies on a quantitative reward function for game outcomes, requiring users of the algorithm to explicitly balance different components of the reward, such as the game winner and the margin of victory, against one another. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need for any quantitative balancing of reward components. We demonstrate that this system learns optimal play on a sample game in an amount of time comparable to AlphaZero.
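The core idea of replacing a scalar reward with a total ordering can be illustrated with a small sketch. The outcome representation below (a win/draw/loss flag plus a margin, compared lexicographically) is an assumption for illustration, not the paper's actual implementation; the point is that outcomes become comparable without ever choosing a numeric weight that trades the win/loss result against the margin.

```python
# Minimal sketch of a total ordering over game outcomes (assumed
# representation, not the paper's implementation). Python's NamedTuple
# comparison is lexicographic, so win/loss status always dominates margin
# and no scalar balancing coefficient is needed.

from typing import NamedTuple


class Outcome(NamedTuple):
    won: int     # 1 = win, 0 = draw, -1 = loss, from the player's view
    margin: int  # margin of victory; only consulted to break ties in `won`


def better(a: Outcome, b: Outcome) -> bool:
    """True if outcome a is preferred to outcome b under the ordering."""
    return a > b  # NamedTuple compares fields left to right


# Any win is preferred to any draw, regardless of margin:
print(better(Outcome(1, 1), Outcome(0, 50)))   # True
# Among wins, a larger margin is preferred:
print(better(Outcome(1, 10), Outcome(1, 2)))   # True
```

By contrast, a scalar reward such as `won + 0.01 * margin` would force the user to pick the `0.01` by hand, which is exactly the quantitative balancing the abstract describes avoiding.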


