
Off-Policy Exploitability Evaluation and Equilibrium Learning in Two-Player Zero-Sum Markov Games
Off-policy evaluation (OPE) is the problem of evaluating new policies us...

Provably Efficient Policy Gradient Methods for Two-Player Zero-Sum Markov Games
Policy gradient methods are widely used in solving two-player zero-sum g...

Near-Optimal Reinforcement Learning with Self-Play
This paper considers the problem of designing optimal algorithms for rei...

Stackelberg Punishment and Bully-Proofing Autonomous Vehicles
Mutually beneficial behavior in repeated games can be enforced via the t...

Controlling a Random Population is EXPTIME-hard
Bertrand et al. [1] (LMCS 2019) describe two-player zero-sum games in wh...

Equilibrium solutions of three-player Kuhn poker with N>3 cards: A new numerical method using regularization and arc-length continuation
We study the equilibrium solutions of three-player Kuhn poker with N>3 c...

A Short Solution to the Many-Player Silent Duel with Arbitrary Consolation Prize
The classical constant-sum 'silent duel' game had two antagonistic marks...
Identity Concealment Games: How I Learned to Stop Revealing and Love the Coincidences
In an adversarial environment, a hostile player performing a task may behave like a non-hostile one in order not to reveal its identity to an opponent. To model such a scenario, we define identity concealment games: zero-sum stochastic reachability games with a zero-sum objective of identity concealment. To measure the identity concealment of the player, we introduce the notion of an average player. The average player's policy represents the expected behavior of a non-hostile player. We show that there exists an equilibrium policy pair for every identity concealment game and give the optimality equations to synthesize an equilibrium policy pair. If the player's opponent follows a non-equilibrium policy, the player can hide its identity better. For this reason, we study how the hostile player may learn the opponent's policy. Since learning via exploration policies would quickly reveal the hostile player's identity to the opponent, we consider the problem of learning a near-optimal policy for the hostile player using the game runs collected under the average player's policy. Consequently, we propose an algorithm that provably learns a near-optimal policy and give an upper bound on the number of sample runs to be collected.
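The optimality equations the abstract refers to reduce, at each state of a zero-sum stochastic game, to solving a zero-sum matrix game. As a minimal illustration of that building block (not the paper's algorithm), the sketch below solves a 2x2 zero-sum matrix game in closed form, returning the row player's optimal mixed strategy and the game value; the function name and the closed-form approach are illustrative assumptions, not taken from the paper.

```python
def solve_2x2_zero_sum(A):
    """Solve a 2x2 zero-sum matrix game (illustrative sketch).

    A[i][j] is the payoff to the row (maximizing) player when the
    row player picks action i and the column player picks action j.
    Returns (p, v): the row player's mixed strategy (p0, p1) and
    the value of the game.
    """
    (a, b), (c, d) = A
    # Check for a pure-strategy saddle point: maximin == minimax.
    row_mins = [min(a, b), min(c, d)]
    col_maxs = [max(a, c), max(b, d)]
    v_lower = max(row_mins)  # maximin (row player's guarantee)
    v_upper = min(col_maxs)  # minimax (column player's guarantee)
    if v_lower == v_upper:
        # Pure equilibrium: play the maximin row deterministically.
        i = row_mins.index(v_lower)
        p0 = 1.0 if i == 0 else 0.0
        return (p0, 1.0 - p0), float(v_lower)
    # Otherwise the equilibrium is fully mixed (standard 2x2 formula).
    denom = a - b - c + d
    p0 = (d - c) / denom          # probability of row action 0
    v = (a * d - b * c) / denom   # game value
    return (p0, 1.0 - p0), v
```

For example, on matching pennies, `solve_2x2_zero_sum([[1, -1], [-1, 1]])` yields the uniform strategy `(0.5, 0.5)` with value `0`, the classic fully mixed equilibrium. Shapley-style value iteration for a stochastic game would apply such a stage-game solver at every state, backing up the resulting values through the transition dynamics.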