Vizarel: A System to Help Better Understand RL Agents

07/10/2020 ∙ by Shuby Deshpande, et al. ∙ Carnegie Mellon University

Visualization tools for supervised learning have allowed users to interpret, introspect, and gain intuition for the successes and failures of their models. While reinforcement learning practitioners ask many of the same questions, existing tools are not applicable to the RL setting. In this work, we describe our initial attempt at constructing a prototype visualization system for RL, and identify features that such a system should encapsulate. Our design is motivated by envisioning the system as a platform on which to experiment with interpretable reinforcement learning.




1 Introduction

Machine learning systems have made impressive advances due to their ability to learn high dimensional models from large amounts of data LeCun_Bengio_Hinton_2015. However, high dimensional models are hard to understand and trust doshivelez2017rigorous. Visualization systems are important for overcoming these challenges.

Many tools exist for addressing these challenges in the supervised learning setting. They are used for tracking metrics satyanarayan_vegalite, generating graphs of model internals wongsuphasawat_tfgraphs, and visualizing embeddings maaten_tsne. However, there is no corresponding set of tools for the reinforcement learning setting. At first glance, it may seem that we can repurpose existing libraries or packages for this task. However, we quickly run into limitations, which arise from the purposes for which those tools were originally designed.

Reinforcement learning is fundamentally an interactive science Neftci_Averbeck_2019 in that there is a stronger feedback loop between the researcher and model (i.e. agent), compared to supervised learning. We need tools that reflect this dynamic instead of limiting us to the constraints imposed by the supervised learning framework.

Visualization systems at their core consist of two components: representation and interaction. Though these may appear to be disparate, it is hard to discount the influence that each has on the other. The tools we use for representation affect how we interact with the system, and our interaction affects the representations that we create Yi_understanding_interaction. Visualization interfaces should adhere to the human action cycle norman_2013, which provides a useful model for deciding which features our systems should encapsulate.

Three dimensions along which to evaluate interaction in visualization systems, as proposed by Beaudouin-Lafon_2004, and adapted here for relevance, are:

  • descriptive power: the ability to describe a significant range of existing interfaces

  • evaluative power: the ability to help assess multiple alternatives

  • generative power: the ability to help create new designs

Existing tools primarily focus on descriptive power. Using them, we can plot common descriptive metrics such as cumulative reward, TD-error, and action values, to name a few. However, these systems either lack or are deficient in evaluative and generative power. Ideally, the systems we use should help us answer questions such as:

  • What sequence of dynamics causes my agent to behave the way it does?

  • What actions should I take to induce the intended agent behavior instead?

  • What effects does experiencing noteworthy states have on the resulting policy? Are there other states which lead to similar outcomes?

This is far from an exhaustive list of the questions that a researcher may pose while training agent policies, but they are chosen to illustrate the gap that current interfaces face with regard to evaluative and generative power. This paper describes our initial attempt at constructing a visualization system that can answer these questions. Concretely, we make the following contributions:

  • identifying features that an interactive system for interpretable reinforcement learning should encapsulate

  • building a prototype of these ideas, which instantiates a working system with these features

  • enumerating upcoming features which extend the system’s existing capabilities

Figure 1: Reward + State Space Viewport: Visualizing autogenerated reward and state space viewports for the Pong environment. This representation should give the user better intuition about the correspondence between rewards and states, especially for environments with denser rewards.

2 Preliminaries

We use the standard reinforcement learning setup sutton. An agent interacts with an environment $E$ at discrete timesteps $t$, receiving a scalar reward $r_t$. The agent's behavior is defined by a policy $\pi$, which maps states $s \in \mathcal{S}$ to a probability distribution over actions, $\pi : \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$. The environment $E$ can be stochastic, and is modeled by a Markov decision process with a state space $\mathcal{S}$, action space $\mathcal{A}$, an initial state distribution $p(s_1)$, a transition function $p(s_{t+1} \mid s_t, a_t)$, and a reward function $r(s_t, a_t)$. The return from a state is defined as $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(s_i, a_i)$, with a discount factor $\gamma \in [0, 1]$. We use a replay buffer mnih2013playing to store the agent's experiences in a buffer $\mathcal{D}$.
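As a worked illustration of the return, the sketch below computes the discounted return at every timestep of one finite episode. It assumes the usual finite-horizon form $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r_i$; the function name `discounted_returns` is hypothetical and is not part of the system described here.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return R_t for every timestep t of one finite episode."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so that R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

The backward sweep keeps the cost linear in the episode length, which matters if such a quantity is recomputed for every episode a user browses.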

3 System Description

Our tool has two components: a frontend, comprising the dashboard and control panel, and a backend, comprising the storage server and logging unit.

3.1 Frontend

The frontend enables the construction of multiple viewports, which serve as the base class for further visualization extensions. Each viewport as an abstract entity can be backed by different specs, based on the underlying data stream. For example, one could use:

  1. image buffers: to visualize image-based observation spaces (or non-image-based spaces if rendering is enabled)

  2. line plots: to visualize non-image-based spaces, action values, and rewards

  3. scatter plots: to visualize embedding spaces

This naturally leads to the idea of an ecosystem of plugins that can be integrated into the core system to support different visualization schemes and algorithms. Though these are the currently available viewports, in a later section we describe upcoming viewport designs that are being integrated, to support additional visualizations. The following subsections detail different views that the frontend interface currently supports.
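The mapping from data streams to specs described above could be sketched as follows. This is only an illustration under stated assumptions: `ViewportSpec`, `choose_spec`, the stream names, and the shape-based rule are all hypothetical, and the actual system may select specs differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ViewportSpec:
    kind: str   # "image", "line", or "scatter"
    title: str  # label shown on the viewport


def choose_spec(stream: str, sample_shape: Tuple[int, ...]) -> ViewportSpec:
    """Pick a spec for a data stream from the shape of one sample."""
    if stream == "embedding":
        # Embedding spaces are shown as scatter plots.
        return ViewportSpec("scatter", stream)
    if stream == "observation" and len(sample_shape) in (2, 3):
        # (H, W) or (H, W, C) samples are treated as image buffers.
        return ViewportSpec("image", stream)
    # Vector observations, action values, and rewards fall back to line plots.
    return ViewportSpec("line", stream)
```

Keeping the rule in one function is one way a plugin could register an additional spec without touching existing viewports.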

3.1.1 State Spaces

Referring to the state-space formulation from §2, states can primarily be classified as either image-based or non-image-based. The type of observation space influences the corresponding spec through which the viewport is generated. We provide two examples that illustrate how these differing specs can result in different viewports. Consider a non-image-based observation space, such as that for the inverted pendulum task. Here, the state vector $s$ is defined in terms of $\theta$, the angle that the pendulum makes with the vertical.

We can visualize the state vector components individually, which gives us a sense of how states vary across episode timesteps (Figure 2). Since an image representation is easier for humans to interpret, it seems reasonable to also generate an additional viewport which tracks the corresponding changes in image space. Having this simultaneous visualization is useful since this now enables us to jump back and forth between the state representation which the agent receives, and the corresponding element in image space, by simply clicking on the desired timestep in the state viewport.

Figure 2: State Space Viewport: Visualizing autogenerated state space viewports for the inverted pendulum task. This representation, with a rendered image overlay (see Figure 3), provides the user with better intuition about the correspondence between state dimensions and images, which humans find easier to interpret.

For environments that have higher-dimensional non-image states, such as that of a robotic arm with multiple degrees of freedom, we could visualize individual state components. However, since this may not be intuitive, we can also generate an additional viewport as an overlay to display an image rendering of the environment, similar to that shown in Figure


3.1.2 Action Spaces

As per the action space formulation from §2, at each timestep the agent chooses an action $a_t$, which is either discrete or continuous depending on the type of action space. We can visualize how the action varies across the episode by creating a viewport backed by a line plot spec. For agents where we have access to a distribution over actions instead, we can generate a viewport backed by a histogram spec, and visualize how the action distribution changes over time. A similar visualization can be generated for agents that use action-value functions sutton for action selection.
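One way to obtain histogram data when only sampled discrete actions are logged is to estimate the action distribution empirically over consecutive windows of an episode. The sketch below does this; it is a stand-in for reading the policy's own distribution, and the name `action_histograms` and the windowing scheme are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def action_histograms(actions: List[int], n_actions: int, window: int) -> List[List[float]]:
    """Empirical action frequencies over consecutive windows of an episode."""
    if window <= 0:
        raise ValueError("window must be positive")
    histograms = []
    for start in range(0, len(actions), window):
        chunk = actions[start:start + window]
        counts = Counter(chunk)
        # Normalize by the chunk length so the last, shorter window still sums to 1.
        histograms.append([counts[a] / len(chunk) for a in range(n_actions)])
    return histograms
```

Each inner list can back one frame of a histogram viewport, so stepping through the list shows how the distribution shifts over the episode.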

Figure 3: Action Spaces: Visualizing autogenerated action space viewports for the Pong environment. This representation, with an image overlay, gives the user better intuition about the current agent policy. Together with a slider to control and query episode-level logs (see §3.1.5), it can help the user debug agent policies.

3.1.3 Rewards

As per the reward formulation from §2, at each timestep $t$, the agent receives a reward $r_t$ conditioned on the previous state and action. The reward is typically a scalar quantity, so it is natural to generate a viewport backed by a line plot spec.

For most agent environments, the reward function comprises different components weighted by different coefficients. These individual components are often easier to interpret, since they are usually backed by a physically motivated quantity tied to specific behaviors that we wish to either reward or penalize. In situations where we have access to these, we can autogenerate multiple viewports, each of which visualizes a different component of this reward function vector.
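A minimal sketch of this decomposition follows, assuming each timestep's reward is logged as a dictionary of named components. The component names, the weights, and both function names are hypothetical; they only illustrate how a weighted sum can be split back into one series per viewport.

```python
from typing import Dict, List


def weighted_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Scalar reward as the weighted sum of its named components."""
    return sum(weights[name] * value for name, value in components.items())


def component_series(episode: List[Dict[str, float]]) -> Dict[str, List[float]]:
    """Regroup per-timestep component dicts into one time series per component."""
    series: Dict[str, List[float]] = {}
    for step in episode:
        for name, value in step.items():
            series.setdefault(name, []).append(value)
    return series
```

Each list returned by `component_series` could then back its own line plot viewport, alongside the scalar reward.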

3.1.4 Replay Buffer

As formulated in §2, the replay buffer stores the agent's experiences. For off-policy algorithms, the replay buffer is of crucial importance, since it in effect serves as a proxy dataset during agent policy updates. For supervised learning datasets, tools exist that give the user an intuition for the underlying data distribution. It makes sense to visualize the state of the replay buffer in the same way.

Since each element of the replay buffer is a vector with at least four dimensions, we cannot generate viewports backed by specs in the original space. We can instead generate a lower-dimensional projection of the replay buffer distribution, which provides a notion of the replay buffer diversity.

Figure 4: Replay Buffer Projection: Projecting the contents of the replay buffer into a 2D space for easier navigation. This provides a proxy to the replay buffer diversity, and can help in subsequent debugging.

This is supported by the current system, which computes a lower-dimensional projection of the replay buffer maaten_tsne and then allows the user to visualize the distribution. Hovering over a projected point displays the original 4-tuple from which it was computed.
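To make the idea of projecting 4-tuples concrete, the sketch below flattens each transition and maps it to 2D with a fixed random linear projection. This is deliberately not the projection method the system cites (maaten_tsne); it is a much simpler substitute chosen so the example needs only the standard library. The tuple layout `(state, action, reward, next_state)` and all names are assumptions.

```python
import random
from typing import List, Sequence, Tuple

# Assumed layout of one experience: (state, action, reward, next_state).
Transition = Tuple[Sequence[float], Sequence[float], float, Sequence[float]]


def flatten(transition: Transition) -> List[float]:
    """Concatenate the four parts of a transition into one flat vector."""
    state, action, reward, next_state = transition
    return [*state, *action, reward, *next_state]


def project_2d(buffer: List[Transition], seed: int = 0) -> List[Tuple[float, float]]:
    """Map each transition to a 2D point with a fixed random linear projection."""
    if not buffer:
        return []
    vectors = [flatten(t) for t in buffer]
    dim = len(vectors[0])
    rng = random.Random(seed)
    # Two random Gaussian directions; a stand-in for t-SNE, not equivalent to it.
    axes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
    return [
        (sum(a * v for a, v in zip(axes[0], vec)),
         sum(a * v for a, v in zip(axes[1], vec)))
        for vec in vectors
    ]
```

Because the output order matches the buffer order, the index of a projected point can be used to look up the original 4-tuple for a hover display.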

3.1.5 Control Panel

Figure 5: Control Panel

The control panel (Figure 5) is a common component across all views which supports functionality to:

  1. display high-level descriptive metrics such as average return, average time per episode, and the number of episodes.

  2. retrieve logs for arbitrary episode IDs.

  3. control the currently active frame, which is reflected in the corresponding state, action, and reward viewports.

Possible extensions to this are to provide real-time suggestions to the user, to help navigation through a large collection of episode logs.
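The descriptive metrics listed for the control panel can be computed from per-episode reward logs, as in the sketch below. It assumes logs are keyed by episode ID and uses episode length in steps as a proxy for time per episode; the function name and the log layout are hypothetical.

```python
from typing import Dict, List


def summarize(episodes: Dict[int, List[float]]) -> Dict[str, float]:
    """Summarize logs that map an episode ID to its per-step rewards."""
    n = len(episodes)
    if n == 0:
        return {"num_episodes": 0, "average_return": 0.0, "average_length": 0.0}
    total_return = sum(sum(rewards) for rewards in episodes.values())
    total_steps = sum(len(rewards) for rewards in episodes.values())
    return {
        "num_episodes": n,
        "average_return": total_return / n,   # undiscounted return per episode
        "average_length": total_steps / n,    # steps, used as a proxy for time
    }
```

Retrieving the log for an arbitrary episode ID is then a dictionary lookup on the same structure.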

3.2 Backend

The backend system is responsible for storage, logging, and communication with one or multiple frontend clients attempting to interface with the agent.

Figure 6: Backend Architecture Overview

At a high level, it consists of three sub-components: the serving thread, the communication thread, and the logging thread. The serving thread interfaces with frontend clients, which request data streams for visualization. The communication thread acts as the arbiter between the serving and logging threads, performs the logical mapping from the server request to the data store, and communicates with the logging thread to notify it of validated commands received from the frontend.

The logging thread is responsible for caching tensors to the data store. It does so by pushing data onto a task queue, which another thread asynchronously commits to disk after running storage optimizations. This design reduces the computation overhead within the main agent training loop.
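The queue-based pattern described here can be sketched with Python's standard `queue` and `threading` modules. This is a simplified illustration, not the system's implementation: the class name `AsyncLogger` is hypothetical, an in-memory list stands in for the on-disk data store, and no storage optimizations are performed.

```python
import queue
import threading
from typing import Any, List


class AsyncLogger:
    """Training loop enqueues records; a worker thread commits them."""

    def __init__(self) -> None:
        self._tasks: "queue.Queue[Any]" = queue.Queue()
        self.committed: List[Any] = []  # stand-in for the on-disk data store
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, record: Any) -> None:
        """Called from the training loop; returns without waiting for storage."""
        self._tasks.put(record)

    def _drain(self) -> None:
        while True:
            record = self._tasks.get()
            if record is None:  # sentinel: None is reserved to stop the worker
                break
            # A real system would optimize and write the record to disk here.
            self.committed.append(record)

    def close(self) -> None:
        """Flush everything queued so far, then stop the worker."""
        self._tasks.put(None)
        self._worker.join()
```

With a single worker and a FIFO queue, records are committed in the order they were logged, and the training loop only pays the cost of a `put`.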

The overhead of integrating the overall system is minimal: it can be enabled with just two lines of code, one for initializing the system and another for caching tensors within the agent training loop, as shown in Figure 7.

    from vizarel.container import VizarelState

    # Initialize the logger once, before the agent training loop.
    logger = VizarelState(steps, obs_dim, obs_type, action_dim, action_type, reward_dim, reward_type)

    # Cache tensors from within the agent training loop.
    logger.log_state(n_samples, obses, actions, rewards, dones)

Figure 7: Sample code to enable logging

4 Future Work

This paper describes the preliminary version of the system we have prototyped as a testbed for ideas. There are multiple features under development that contribute towards both the core interface and the plugin ecosystem which was alluded to earlier. We enumerate some below as a representative sample:

  • Dynamically switching logging on or off, conditional on the occurrence of noteworthy experiences during agent training.

  • Integration of additional data streams such as saliency maps greydanus2017visualizing for image-based state spaces, which can be enabled through the plugin ecosystem.

  • Data processing before rendering, for example, chaining different action dimensions, or clustering rewards across time to diagnose similar states.

These are features that we think would be useful to have, but we expect that the best features yet to be built will emerge through feedback from the broader RL and ML interpretability communities.


We thank Benjamin Eysenbach for valuable discussions and feedback on the initial drafts of this work. This work is supported by the CMU Argo AI Center. Any opinions, recommendations, and conclusions expressed in this material are those of the author(s) and do not reflect the views of any funding agencies.