Incorporating Behavioral Constraints in Online AI Systems

09/15/2018
by Avinash Balakrishnan, et al.

AI systems that learn through reward feedback about the actions they take are increasingly deployed in domains that have a significant impact on our daily lives. However, in many cases the online rewards should not be the only guiding criterion, as there are additional constraints and/or priorities imposed by regulations, values, preferences, or ethical principles. We detail a novel online agent that learns a set of behavioral constraints by observation and uses these learned constraints as a guide when making decisions in an online setting, while still being reactive to reward feedback. To define this agent, we propose a novel extension of the classical contextual multi-armed bandit setting and provide a new algorithm, Behavior Constrained Thompson Sampling (BCTS), that allows for online learning while obeying exogenous constraints. Our agent learns a constrained policy that implements the behavioral constraints demonstrated by a teacher agent, and then uses this constrained policy to guide reward-based online exploration and exploitation. We characterize the upper bound on the expected regret of the contextual bandit algorithm that underlies our agent and provide a case study with real-world data in two application domains. Our experiments show that the designed agent is able to act within the set of behavioral constraints without significantly degrading its overall reward performance.
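
The two-phase idea in the abstract (first learn constraints from a teacher, then let them guide reward-driven Thompson sampling) can be illustrated with a small sketch. This is not the paper's BCTS algorithm, whose details are not given in the abstract. It assumes discrete contexts, Bernoulli rewards with Beta posteriors, and a hypothetical mixing weight `lam` that blends a sample from the constraint model with a sample from the reward model. The names `BetaModel`, `learn_constraints`, and `ConstrainedAgent` are illustrative only.

```python
import random


class BetaModel:
    """Beta-Bernoulli posterior kept separately for every (context, arm) pair."""

    def __init__(self):
        # (context, arm) -> [alpha, beta]; every pair starts at Beta(1, 1).
        self.params = {}

    def _get(self, context, arm):
        return self.params.setdefault((context, arm), [1.0, 1.0])

    def update(self, context, arm, success):
        ab = self._get(context, arm)
        ab[0 if success else 1] += 1.0

    def sample(self, context, arm, rng):
        alpha, beta = self._get(context, arm)
        return rng.betavariate(alpha, beta)

    def mean(self, context, arm):
        alpha, beta = self._get(context, arm)
        return alpha / (alpha + beta)


def learn_constraints(demonstrations, n_arms):
    """Fit a constraint model from teacher (context, chosen_arm) pairs.

    Assumption (not from the paper): the arm the teacher picked counts as a
    success for that context, and every other arm counts as a failure.
    """
    model = BetaModel()
    for context, chosen in demonstrations:
        for arm in range(n_arms):
            model.update(context, arm, arm == chosen)
    return model


class ConstrainedAgent:
    """Thompson sampling whose arm scores blend constraint and reward samples."""

    def __init__(self, constraint_model, n_arms, lam, seed=0):
        self.constraint = constraint_model
        self.reward = BetaModel()
        self.n_arms = n_arms
        self.lam = lam  # hypothetical weight: 1.0 follows the teacher, 0.0 ignores it
        self.rng = random.Random(seed)

    def act(self, context):
        def score(arm):
            c = self.constraint.sample(context, arm, self.rng)
            r = self.reward.sample(context, arm, self.rng)
            return self.lam * c + (1.0 - self.lam) * r

        return max(range(self.n_arms), key=score)

    def observe(self, context, arm, reward):
        # Only the reward model is updated online; the constraint model stays fixed.
        self.reward.update(context, arm, bool(reward))
```

In this sketch, setting `lam` near 1 makes the agent imitate the teacher, while values near 0 reduce it to ordinary per-context Thompson sampling. Intermediate values trade constraint adherence against reward, which is the tension the abstract describes.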


research
09/21/2018

Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration

Autonomous cyber-physical agents and systems play an increasingly large ...
research
02/25/2022

Towards neoRL networks; the emergence of purposive graphs

The neoRL framework for purposive AI implements latent learning by emula...
research
04/08/2020

A Dynamic Observation Strategy for Multi-agent Multi-armed Bandit Problem

We define and analyze a multi-agent multi-armed bandit problem in which ...
research
07/18/2022

Online Learning with Off-Policy Feedback

We study the problem of online learning in adversarial bandit problems u...
research
05/26/2017

Combinatorial Multi-Armed Bandits with Filtered Feedback

Motivated by problems in search and detection we present a solution to a...
research
04/14/2023

Repeated Principal-Agent Games with Unobserved Agent Rewards and Perfect-Knowledge Agents

Motivated by a number of real-world applications from domains like healt...
research
08/13/2023

Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards

In practice, incentive providers (i.e., principals) often cannot observe...
