Interaction-Grounded Learning with Action-inclusive Feedback

Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to interact optimally with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to optimize a policy with respect to a latent reward function. Previously analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL's applicability in many potential scenarios such as brain-computer interface (BCI) or human-computer interface (HCI) applications. We address this by creating an algorithm and analysis that allow IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.


1 Introduction

Most real-world learning problems, such as BCI and HCI problems, are not tagged with rewards. Consequently, (biological and artificial) learners must infer rewards based on interactions with the environment, which reacts to the learner’s actions by generating feedback, but does not provide any explicit reward signal. This paradigm has been previously studied by researchers (e.g., Grizou et al., 2014; Nguyen et al., 2021), including a recent formalization (Xie et al., 2021b) that proposed the term Interaction-Grounded Learning (IGL).

In IGL, the learning algorithm discovers a grounding for the feedback, which implicitly discovers a reward function. An information-theoretic impossibility argument indicates that additional assumptions are necessary to succeed. Xie et al. (2021b) proceed by assuming the action is conditionally independent of the feedback given the reward. However, this is unnatural in many settings such as neurofeedback in BCI (Katyal et al., 2014; Mishra and Gazzaley, 2015; Debettencourt et al., 2015; Muñoz-Moldes and Cleeremans, 2020; Akinola et al., 2020; Poole and Lee, 2021) and multimodal interactive feedback in HCI (Pantic and Rothkrantz, 2003; Vitense et al., 2003; Freeman et al., 2017; Mott et al., 2017; Duchowski, 2018; Noda, 2018; Saran et al., 2018, 2020; Cui et al., 2021; Gao et al., 2021), where the action precedes and thus influences the feedback. Applying this prior approach in such settings fails catastrophically, because the conditional independence requirement is essential to its operation. This motivates the question:

Is it possible to do interaction-grounded learning when the feedback has the full information of the action embedded in it?

We propose a new approach to solve IGL, which we call action-inclusive IGL (AI-IGL), that allows the action to be incorporated into the feedback in arbitrary ways. We consider the latent reward as playing the role of latent states, which can be further separated via a contrastive learning method (Section 3.1). Different from typical latent state discovery in rich-observation reinforcement learning (e.g., Dann et al., 2018; Du et al., 2019; Misra et al., 2020), the IGL setting also requires identifying the semantic meaning of the latent reward states, which is addressed by a symmetry-breaking procedure (Section 3.2). We analyze the theoretical properties of the proposed approach, and we prove that it is guaranteed to learn a near-optimal policy as long as the feedback satisfies a weaker context conditional independence assumption. We also evaluate the proposed AI-IGL approach using large-scale experiments on OpenML's supervised classification datasets (Bischl et al., 2021) and demonstrate the effectiveness of the proposed approach (Section 5). Thus, our findings broaden the scope of applicability for IGL.

The paper proceeds as follows. In Section 2, we present the mathematical formulation for IGL. In Section 3, we present a contrastive learning perspective for grounding latent reward which helps to expand the applicability of IGL. In Section 4, we state the resulting algorithm AI-IGL. We provide experimental support for the technique in Section 5 using a diverse set of datasets. We conclude with discussion in Section 6.

2 Background

Interaction-Grounded Learning

This paper studies the Interaction-Grounded Learning (IGL) setting (Xie et al., 2021b), where the learner optimizes for a latent reward by interacting with the environment and associating ("grounding") the observed feedback with the latent reward. At each time step, the learner receives an i.i.d. context x drawn from a context set with a fixed distribution. The learner then selects an action a from a finite action set A. The environment generates a latent binary reward r (which can be either deterministic or stochastic) and a feedback vector y conditional on (x, a), but only the feedback vector is revealed to the learner. In this paper, we use R(x, a) to denote the expected (latent) reward after executing action a on context x. The spaces of contexts and feedback vectors can be arbitrarily large.
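To make the interaction protocol concrete, the following minimal Python sketch simulates IGL rounds; the environment class, the reward function, the feedback generator, and all numeric choices are hypothetical stand-ins rather than anything specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class IGLEnvironment:
    """Toy IGL environment: the latent reward is never revealed to the learner."""
    def __init__(self, latent_reward, feedback_fn, context_sampler):
        self.latent_reward = latent_reward      # (x, a) -> Bernoulli mean of r
        self.feedback_fn = feedback_fn          # (x, a, r) -> feedback vector y
        self.context_sampler = context_sampler  # () -> context x

    def step(self, policy):
        x = self.context_sampler()
        a = policy(x)                                   # learner's action
        r = rng.binomial(1, self.latent_reward(x, a))   # latent, never observed
        y = self.feedback_fn(x, a, r)                   # only (x, a, y) is observed
        return x, a, y

# Example instantiation with 5 contexts and 3 actions (all choices hypothetical).
n_actions = 3
env = IGLEnvironment(
    latent_reward=lambda x, a: 1.0 if a == x % n_actions else 0.0,
    feedback_fn=lambda x, a, r: np.array([a, r]) + 0.1 * rng.normal(size=2),
    context_sampler=lambda: rng.integers(5),
)
uniform_policy = lambda x: rng.integers(n_actions)
batch = [env.step(uniform_policy) for _ in range(10)]  # batch of (x, a, y) tuples
```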

Throughout this paper, we use π to denote a (stochastic) policy. The expected return of a policy π is defined by the expectation of R(x, a) when x is drawn from the context distribution and a ~ π(·|x). The learning goal of IGL is to find the optimal policy in the policy class, only from observations of context-action-feedback tuples (x, a, y). This paper mainly considers the batch setting, and we use μ to denote the behavior policy. We also introduce value function and decoder classes: we assume the learner has access to a value function class whose members map a context-action pair to a predicted reward in [0, 1], and a reward decoder class whose members map a feedback-action pair to a decoded reward in [0, 1]. We defer the assumptions we make on these classes to Section 3 for clarity.

We may hope to solve IGL without any additional assumptions. However, this is information-theoretically impossible, even if the latent reward is decodable from the feedback, as demonstrated by the following example. [Hardness of assumption-free IGL] Suppose the feedback is y = (x, a) and the reward is deterministic in (x, a). In this case, the latent reward can be perfectly decoded from y. However, the learner receives no more information from the (x, a, y) tuple than from (x, a) alone. Thus, if the policy class contains at least 2 policies, for any environment where a given IGL algorithm succeeds, we can construct another environment with the same observable statistics where that algorithm must fail.

IGL with full conditional independence

Example 2 demonstrates the need for further assumptions to succeed at IGL. Xie et al. (2021b) proposed an algorithm that leverages the following conditional independence assumption to facilitate grounding the feedback in the latent reward. [Full conditional independence] For an arbitrary tuple (x, a, r, y), where r and y are generated conditional on the context x and action a, we assume the feedback vector y is conditionally independent of the context x and the action a given the latent reward r. Xie et al. (2021b) introduce a reward decoder class for estimating the latent reward from the feedback, which leads to a decoded return for every policy and decoder. They proved that it is possible to learn the best policy and decoder jointly by optimizing the following proxy learning objective:

(1)

where the comparator is a policy known to have low expected return. Throughout this paper, we use IGL (full CI) to denote the algorithm proposed in Xie et al. (2021b).
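As a rough illustration of the decoded-return idea (not the exact objective (1), whose precise form is given in Xie et al. (2021b)), one can estimate the decoded return of any policy from behavior data by importance weighting and compare it against a known-bad policy. The function names below are hypothetical; this is a sketch, not the paper's implementation.

```python
import numpy as np

def decoded_return(batch, policy_prob, behavior_prob, decoder):
    """IPS estimate of a decoded return: E[ pi(a|x)/mu(a|x) * psi(y) ].

    batch: list of (x, a, y); policy_prob(x, a) and behavior_prob(x, a) give
    action probabilities; decoder(y) returns a decoded reward in [0, 1].
    """
    vals = [policy_prob(x, a) / behavior_prob(x, a) * decoder(y) for x, a, y in batch]
    return float(np.mean(vals))

def full_ci_proxy(batch, policy_prob, bad_policy_prob, behavior_prob, decoder):
    # Proxy score in the spirit of (1): a decoder/policy pair looks good when the
    # decoded return of pi is large relative to that of a known-bad policy.
    return (decoded_return(batch, policy_prob, behavior_prob, decoder)
            - decoded_return(batch, bad_policy_prob, behavior_prob, decoder))
```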

3 A Contrastive-learning Perspective for Grounding Latent Reward

The existing work on interaction-grounded learning leverages the assumption of full conditional independence (Assumption 2), under which the feedback vector only contains information from the latent reward. The goal of this paper is to relax this constraining assumption and broaden the scope of applicability for IGL. This paper focuses on the scenario where the action information is possibly embedded in the feedback vector, which is formalized by the following assumption. [Context Conditional Independence] For an arbitrary tuple (x, a, r, y), where r and y are generated conditional on the context x and action a, we assume the feedback vector y is conditionally independent of the context x given the latent reward r and the action a. That is, y ⊥ x | (r, a).

Assumption 3 allows the feedback vector to be generated from both the latent reward r and the action a, differing from Assumption 2, which constrains the feedback vector to be generated based only on the latent reward r. We discuss the implications at the end of this section.
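The contrast between the two assumptions can be made concrete with two hypothetical feedback generators: under full conditional independence the feedback depends only on r, while under context conditional independence it may depend on (r, a) but carries no extra information about x given (r, a). The encodings below are illustrative only and are not the ones used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def feedback_full_ci(x, a, r):
    # Assumption 2 (full CI): y depends on the latent reward r only.
    return np.array([r], dtype=float) + 0.1 * rng.normal(size=1)

def feedback_context_ci(x, a, r):
    # Assumption 3 (context CI): y may encode both r and a, but given (r, a)
    # it carries no additional information about the context x.
    return np.array([a, r], dtype=float) + 0.1 * rng.normal(size=2)
```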

3.1 Grounding Latent Reward via Contrastive Learning

In this section, we propose a contrastive learning objective for interaction-grounded learning, which further guides the design of our algorithm. We perform derivations with exact expectations for clarity and intuition, and provide finite-sample guarantees in Section 4.

[Separability] For each a ∈ A, there exists a predictor-decoder pair (f_a, ψ_a) in the function classes such that: 1) the decoder ψ_a recovers the realized latent reward for action a; 2) the expected predicted reward under f_a, conditioned on the latent-reward value and on the action being a, is separated across the two latent-reward values. We also assume a known positive lower bound on the separation in 2).

Assumption 3.1 consists of two components. Condition 1) is a realizability assumption that ensures a perfect reward decoder is included in the function classes. Although this superficially appears unreasonably strong, note that y is generated based upon (x, a) and the realization of r; therefore this is compatible with stochastic rewards. Condition 2) ensures that the expected predicted reward, conditioned on the latent reward having value r and the action being a, is separable. When the decoder is perfect, the separation can be lower bounded in terms of the reward variability of the constant policy that always selects action a; a detailed proof of this argument can be found in Appendix A. One sufficient condition is that the expected latent reward for action a has enough variance across contexts. The assumption can also be satisfied easily via standard realizable classes: if the true reward predictor and decoder belong to some standard classes, then augmenting those classes with their complements (adding 1 − f for each f and 1 − ψ for each ψ) satisfies Assumption 3.1. Note that this construction only amplifies the sizes of the classes by a factor of 2.

Reward Prediction via Contrastive Learning

We now formulate the following contrastive-learning objective for solving IGL. Suppose μ is the behavior policy. We slightly abuse notation and also use μ to denote the induced data distribution over (x, a, y), and write μ_a, a ∈ A, for its marginal restricted to action a. For each a ∈ A, we construct an augmented data distribution D_a by sampling (x, a) and y independently from μ_a.
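A minimal sketch of how the augmented distribution D_a can be realized from batch data, assuming tuples are stored as (x, a, y): positives keep each (x, a) with its own y, while the augmented samples re-pair (x, a) with a feedback vector drawn (approximately) independently from the same action-a marginal via a random permutation. The helper name is hypothetical.

```python
import numpy as np

def augmented_samples(batch, action, rng):
    """Return positive pairs from mu_a and re-paired samples approximating D_a.

    batch: list of (x, a, y) tuples collected under the behavior policy.
    """
    sub = [(x, y) for (x, a, y) in batch if a == action]   # restrict to action a
    xs = [x for x, _ in sub]
    ys = [y for _, y in sub]
    positives = list(zip(xs, ys))                          # (x, y) drawn jointly from mu_a
    shuffled = rng.permutation(len(ys))
    negatives = [(xs[i], ys[j]) for i, j in enumerate(shuffled)]  # y re-paired independently
    return positives, negatives

# usage: pos, neg = augmented_samples(batch, action=0, rng=np.random.default_rng(0))
```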

Conceptually, for each action a, we consider tuples drawn jointly from μ_a and tuples drawn from the augmented distribution D_a. From Assumption 3.1, conditioned on the latent reward and the action, the optimal feedback decoder has mean equal to the optimal reward predictor. Therefore, we might seek a predictor-decoder pair which minimizes any consistent loss function, e.g., squared loss. However, this is trivially achievable, e.g., by having both always predict 0. Therefore, we formulate a contrastive-learning objective, where we maximize the loss between the predictor and decoder on the augmented data distribution. For each a ∈ A, we solve the following objective

(2)

In the notation of equation (2), the subscript indicates the pair (f_a, ψ_a) is optimal for action a. Note they are always evaluated at that fixed action a, which we retain as an input for compatibility with the original function classes.

Note that this objective is also similar to many popular contrastive losses (e.g., Wu et al., 2018; Chen et al., 2020), especially the spectral contrastive loss (HaoChen et al., 2021).

(3)
(4)
(spectral contrastive loss)
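For reference, the spectral contrastive loss of HaoChen et al. (2021) has the form below; the per-action objective (2)–(4) can be read as an analogue in which the reward predictor f_a(x, a) and the decoder ψ_a(y, a) play the roles of the two views. This is our rendering for orientation, not a verbatim restatement of the paper's equations.

```latex
% Spectral contrastive loss (HaoChen et al., 2021) for a representation f,
% with (z, z^+) a positive pair of augmented views and z, z^- drawn independently.
\mathcal{L}_{\mathrm{spectral}}(f)
  \;=\; -2\,\mathbb{E}_{(z,\,z^{+})}\!\left[\, f(z)^{\top} f(z^{+}) \,\right]
  \;+\; \mathbb{E}_{z,\,z^{-}}\!\left[\, \big( f(z)^{\top} f(z^{-}) \big)^{2} \,\right].
```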

Below we show that minimizing this objective decodes the latent reward under Assumptions 3 and 3.1, up to a sign ambiguity. For simplicity, for any a ∈ A we introduce the following notation:

(5)

These quantities are the expected predicted reward and the expected decoded reward (under the behavior policy, restricted to action a), conditioned on the latent reward having value r and the action being a.

For any action a ∈ A, suppose the behavior policy selects action a with positive probability for every context and Assumptions 3 and 3.1 hold, and let (f_a, ψ_a) be the solution of objective (2). Then the learned predictor and decoder both separate the two latent-reward values (up to the sign ambiguity discussed below).

Proof Sketch.

For any policy, we introduce notation for the visitation occupancy of action a under that policy and for the average reward received when executing action a. Then, by the context conditional independence assumption (Assumption 3), we know

(6)
(7)
(8)

Therefore, separately maximizing and maximizes . ∎

3.2 Symmetry Breaking

In the last section, we demonstrated that the latent reward could be decoded in a contrastive-learning manner up to a sign ambiguity. The following example demonstrates the ambiguity.

[Why do we need extra information to identify the latent reward?] In the optimal solution of objective (2), both the learned pair and its complement (obtained by replacing the predictor and decoder with one minus themselves) yield the same objective value. It is information-theoretically difficult to distinguish which one of them is the correct solution without extra information. That is because, for any environment ENV1, there always exists a "symmetric" environment ENV2, where: 1) the expected latent reward in ENV1 is exactly the complement of that in ENV2 for every context-action pair; 2) the conditional distribution of the feedback given (r, a) in ENV1 is identical to the conditional distribution of the feedback given (1 − r, a) in ENV2. In this example, ENV1 and ENV2 will always generate an identical distribution of feedback vectors after any context-action pair. However, ENV1 and ENV2 have exactly opposite latent reward information.

As we demonstrate in Example 3.2, the decoder learned from (2) could correspond to either of a symmetric pair of semantic meanings, and identifying the correct one without extra information is information-theoretically impossible. This symmetry-breaking step is one of the key challenges of interaction-grounded learning. To achieve symmetry breaking, we make the following assumption to ensure the identifiability of the latent reward. [Baseline Policies] For each a ∈ A, there exists a baseline policy such that:

  (a) the baseline policy satisfies a minimum-visitation condition on action a;

  (b) the reward the baseline policy obtains from action a is known to be either sufficiently low or sufficiently high, with the direction known.

To instantiate Assumption 3.2 in practice, we provide the following simple example of a baseline policy that satisfies it. Suppose the baseline policy is the uniformly random policy for every action, and "all constant policies are bad", i.e., the expected reward of the constant policy that always plays a is low for every a ∈ A. Then it is easy to verify that conditions (a) and (b) of Assumption 3.2 hold for all a.

Note that the baseline policy can be different across actions. Intuitively, Assumption 3.2(a) says that the total probability of selecting action a (over all contexts) is bounded below. This condition ensures that the baseline policy has enough visitation of action a and makes symmetry breaking possible. Assumption 3.2(b) states that if we only consider the reward obtained from taking action a, the baseline policy is known to be either "sufficiently bad" or "sufficiently good". Note that the direction of this extremeness must be known; e.g., a policy which has an unknown reward of either 0 or 1 is not usable. This condition follows a similar intuition as the identifiability assumption of Xie et al. (2021b, Assumption 2) and breaks the symmetry. For example, consider the ENV1 and ENV2 introduced in Example 3.2: the action-a reward of any policy in ENV1 is the complement of its reward in ENV2. To separate ENV1 and ENV2 using some baseline policy, its action-a rewards in the two environments are required to have a non-zero gap, which leads to Assumption 3.2(b).

The effectiveness of symmetry breaking under Assumption 3.2 can be summarized as follows: using the learned decoder for action a, we estimate the decoded reward that the baseline policy obtains from action a. If the decoder can efficiently decode the latent reward, then this estimate converges to either the baseline's true action-a reward or its complement. Therefore, applying Assumption 3.2(b) breaks the symmetry.
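A minimal sketch of the symmetry-breaking check for a single action, assuming a baseline policy known to be "sufficiently bad" on that action; the threshold of 1/2, the importance-weighting details, and the helper names are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def baseline_decoded_reward(batch, action, decoder, baseline_prob, behavior_prob):
    """Importance-weighted estimate of the decoded reward the baseline policy
    collects on `action`, using only tuples where `action` was taken."""
    num, den = 0.0, 0.0
    for x, a, y in batch:
        if a != action:
            continue
        w = baseline_prob(x, action) / behavior_prob(x, action)
        num += w * decoder(y, action)
        den += w
    return num / max(den, 1e-12)

def break_symmetry(batch, action, decoder, baseline_prob, behavior_prob):
    """If a known-bad baseline appears to collect high decoded reward, the
    decoder's sign is flipped (psi -> 1 - psi)."""
    est = baseline_decoded_reward(batch, action, decoder, baseline_prob, behavior_prob)
    if est <= 0.5:                               # consistent with "baseline is bad": keep decoder
        return decoder
    return lambda y, a: 1.0 - decoder(y, a)      # otherwise flip the sign
```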

3.3 Comparison to Full CI

When we only have context conditional independence, it is easy to verify that optimizing the original IGL objective (1) can fail, as the following example shows.

[Failure of the original IGL objective under Assumption 3] Consider a feedback vector generated directly from the latent reward and the selected action (we use a shorthand for this encoding in what follows). We also assume a mild regularity condition on the expected latent reward for any context and action. Then, for any reward decoder (the approach proposed by Xie et al. (2021b) assumes the decoder takes only the feedback vector as input),

On the other hand, consider a constant policy (which plays the same action for all contexts) and a constant decoder (which outputs the same value for all feedback vectors); then,

This implies that maximizing the original IGL objective (1) need not recover the optimal policy when we only have context conditional independence.

This example indicates that optimizing a combined contrastive objective with a single symmetry-breaking policy is insufficient to succeed in our case. Our approach instead corresponds to optimizing a contrastive objective and breaking symmetry for each action separately rather than simultaneously.

3.4 Viewing Latent Reward as a Latent State

(a) Latent State in Rich-Observation RL (Misra et al., 2020)
(b) Latent Reward in IGL under Assumption 3 (this paper)
(c) Latent Reward in IGL under Assumption 2 (Xie et al., 2021b)
Figure 1: Causal graphs of interaction-grounded learning under different assumptions as well as rich-observation reinforcement learning.

Our approach is motivated by latent state discovery in Rich-Observation RL (Misra et al., 2020). Figure 1 compares the causal graphs of Rich-Observation RL, IGL with context conditional independence, and IGL with full conditional independence. In Rich-Observation RL, a contrastive learning objective is used to discover latent states; whereas in IGL a contrastive learning objective is used to discover latent rewards. In this manner we view latent rewards analogously to latent states.

Identifying latent states up to a permutation is completely acceptable in Rich-Observation RL, as the resulting imputed MDP orders policies identically to the true MDP. However, in IGL the latent states have scalar values associated with them (rewards), and identification up to a permutation is not sufficient to order policies correctly. Thus, we require the additional step of symmetry breaking.

4 Main Algorithm

This section instantiates the algorithm for IGL with action-inclusive feedback, following the concepts introduced in Section 3. For simplicity, we select the uniformly random policy as the behavior policy in this section; this choice can be further relaxed using appropriate importance weights. We first introduce the empirical estimate of the per-action objective (defined in (2)). For any a ∈ A, we define the following empirical estimate of the spectral contrastive loss:

(9)

Using this definition, Algorithm 1 instantiates a version of the IGL algorithm with action-inclusive feedback. Without loss of generality, we also assume in this section that the known direction in Assumption 3.2(b) is the same for all a ∈ A. The case where the direction differs for some action a can be addressed by modifying the symmetry-breaking step in Algorithm 1 accordingly.
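A hedged sketch of how the empirical loss in (9) can be estimated from batch data in the spectral-contrastive style discussed in Section 3.1; the exact estimator in the paper may differ, and `f` and `psi` below are plain callables standing in for members of the function classes.

```python
import numpy as np

def empirical_contrastive_loss(batch, action, f, psi, rng):
    """Spectral-contrastive-style empirical loss for one action.

    The positive term uses (x, y) pairs observed together under `action`;
    the negative term re-pairs x with an independently permuted y, which
    approximates the augmented distribution D_a.
    """
    sub = [(x, y) for (x, a, y) in batch if a == action]
    if not sub:
        return 0.0
    fx = np.array([f(x, action) for x, _ in sub])
    py = np.array([psi(y, action) for _, y in sub])
    pos = -2.0 * np.mean(fx * py)                    # attract observed pairs
    perm = rng.permutation(len(py))
    neg = np.mean((fx * py[perm]) ** 2)              # repel independently re-paired samples
    return float(pos + neg)

# The per-action pair (f_a, psi_a) would then be obtained by minimizing this
# loss over the function classes, e.g., by gradient descent on parametric f, psi.
```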

Input: Batch data generated by the behavior policy; a baseline policy for each action.

1: Initialize the policy as the uniform policy.
2: for each a ∈ A do
3:     Obtain (f_a, ψ_a) by Latent State (Reward) Discovery:
(10)
       where the empirical loss is defined in (9).
4:     Compute the symmetry-breaking estimate using ψ_a and the baseline policy for a.
5:     if the estimate is consistent with the known direction in Assumption 3.2(b) then keep ψ_a.
6:     else replace ψ_a with its complement 1 − ψ_a.
7:     end if
8: end for
9: Generate the decoded contextual bandits dataset using the learned decoders.
10: Output the policy returned by an offline contextual bandit oracle on the decoded dataset.
Algorithm 1: Action-inclusive IGL (AI-IGL)

At a high level, Algorithm 1 has two separate components for each action in A: latent state (reward) discovery (line 3) and symmetry breaking (lines 4–7).
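Lines 9–10 of Algorithm 1 reduce the problem to ordinary offline contextual bandit learning. The sketch below shows one way to build the decoded dataset from the (possibly sign-corrected) decoders and to instantiate a simple offline CB oracle via per-action ridge regression on decoded rewards; this greedy regression oracle is a stand-in for illustration, not the specific oracle assumed in Definition 4.

```python
import numpy as np

def decode_dataset(batch, decoders):
    """Replace each feedback vector with its decoded reward: (x, a, y) -> (x, a, r_hat)."""
    return [(x, a, decoders[a](y, a)) for (x, a, y) in batch]

def offline_cb_oracle(decoded, n_actions, reg=1e-3):
    """A simple offline CB 'oracle': fit one ridge-regression reward model per
    action on the decoded rewards and act greedily on the fitted models."""
    weights = {}
    for a in range(n_actions):
        xs = np.array([np.ravel(np.asarray(x, dtype=float)) for (x, aa, r) in decoded if aa == a])
        rs = np.array([r for (x, aa, r) in decoded if aa == a], dtype=float)
        if len(rs) == 0:
            weights[a] = None          # action never observed in the batch
            continue
        d = xs.shape[1]
        weights[a] = np.linalg.solve(xs.T @ xs + reg * np.eye(d), xs.T @ rs)

    def policy(x):
        feats = np.ravel(np.asarray(x, dtype=float))
        scores = [feats @ weights[a] if weights[a] is not None else -np.inf
                  for a in range(n_actions)]
        return int(np.argmax(scores))

    return policy
```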

Theoretical guarantees

In Algorithm 1, the output policy is obtained by calling an offline contextual bandits oracle. We now formally define this oracle and its expected property. [Offline contextual bandits oracle] An algorithm is called an offline contextual bandit oracle if, for any dataset of context-action-reward triples (with the rewards determined by a decoder) and any policy class, the policy it produces satisfies

The notion in Definition 4 corresponds to standard policy learning approaches in the contextual bandits literature (e.g., Langford and Zhang, 2007; Dudik et al., 2011; Agarwal et al., 2014), which typically yield an error that shrinks with the dataset size. We now provide the theoretical analysis of Algorithm 1. In this paper, we measure the joint statistical complexity of the value function and decoder classes; for example, if the function classes are finite, this complexity scales logarithmically with their cardinalities and with the inverse of the failure probability. Infinite function classes can be handled by standard tools such as covering numbers or Rademacher complexity (see, e.g., Mohri et al., 2018). The following theorem provides the performance guarantee of the output policy of Algorithm 1. Suppose Assumptions 3, 3.1, and 3.2 hold, and let the output policy of Algorithm 1 be given. If the dataset is sufficiently large, then, with high probability,

(11)

Similar to the guarantee of Xie et al. (2021b), the learned policy is guaranteed to converge in the right direction only after we have sufficient data for symmetry breaking. The dependence on the function-class complexity in Theorem 4 can be improved because each action has a separate learning procedure. For example, consider function classes that decompose into independent per-action components (a common setup for linearly approximated reinforcement learning with a discrete action space). If these per-action components are identical copies of separate classes, the per-action complexities shrink accordingly, which leads to an improvement.

We now provide a proof sketch of Theorem 4 and defer the detailed proof to Appendix A.

Proof Sketch.

The proof of Theorem 4 consists of two components—discovering the latent reward and breaking the symmetry—which are formalized by the following lemmas. [Discovering latent reward] Suppose Assumptions 3 and 3.1 hold, and let (f_a, ψ_a) be obtained by (10). Then, with high probability, the learned pair separates the two latent-reward values. Lemma 4 ensures that the decoder learned from (10) correctly separates the latent reward. In particular, since the guarantee holds for every action a ∈ A, Lemma 4 controls the decoding error under the behavior policy. Thus, if we can break symmetry, we can use the learned decoders to generate a reward signal and reduce the problem to ordinary contextual bandit learning. The following lemma guarantees the correctness of the symmetry-breaking step. [Breaking symmetry] Suppose Assumption 3.2 holds. For any a ∈ A, if the dataset is sufficiently large, then, with high probability, the symmetry-breaking step selects the correct sign for ψ_a. Combining the two lemmas above with the oracle guarantee establishes Theorem 4; the detailed proof can be found in Appendix A. ∎

5 Empirical Evaluations

In this section, we provide empirical evaluations in simulated environments created using supervised classification datasets.

(a) Contextual Bandits (CB)
(b) IGL (full CI) (Xie et al., 2021b)
(c) AI-IGL
Figure 2: Different learning approaches based on the MNIST dataset. The gray number/image denotes the unobserved reward/feedback vector. Figure 2(a): In the contextual bandits setting, the exact reward for the selected action is observed. Figure 2(b): In IGL (full CI), the feedback vector is generated based only on the latent reward. Figure 2(c): In AI-IGL, the feedback vector can be generated based on both the latent reward and the selected action.

The task is depicted in Figure 2. We evaluate our approach by comparing: (1) CB: contextual bandits with exact rewards; (2) IGL (full CI): the method proposed by Xie et al. (2021b), which assumes the feedback vector contains no information about the context or the action given the reward; and (3) AI-IGL: the proposed method, which assumes that the feedback vector may contain information about the action but is conditionally independent of the context given the reward and the action. Note that contextual bandits (CB) is a skyline for both IGL (full CI) and AI-IGL, since it need not disambiguate the latent reward. All methods use logistic regression with a linear representation. At test time, each method takes the argmax of the policy. We provide details of the experimental setup in Appendix B.

We evaluate our approach on different environments with action-inclusive feedback generated using supervised classification datasets. We generate high-dimensional feedback vectors for the MNIST classification dataset (Section 5.1) and low-dimensional feedback vectors for more than 200 OpenML CC-18 datasets (Section 5.2). Following the practical instantiation of Assumption 3.2 (see Section 3.2), if a dataset has a balanced action distribution (no action accounts for more than 50% of the samples), then selecting the uniformly random policy as the baseline policy for every action satisfies Assumption 3.2. Therefore, our experiments in this section are all based on datasets with a balanced action distribution, and we select the uniformly random policy as the baseline policy for every action.

5.1 Experiments on the MNIST dataset

The environment for this experiment is generated from the supervised MNIST classification dataset (LeCun et al., 1998), which is licensed under the Attribution-ShareAlike 3.0 license (https://creativecommons.org/licenses/by-sa/3.0/). At each time step, the context is an image drawn uniformly at random from the dataset. The learner then selects an action as the predicted label of the image. The binary reward is the correctness of the predicted label. The high-dimensional feedback vector is an image of a digit determined by the selected action and the latent reward. An example is shown in Figure 2. Our results are averaged over 20 trials.

Table 1: Results in the MNIST environment with high-dimensional action-inclusive and action-exclusive feedback. Rows: CB, IGL (full CI), and AI-IGL; columns: policy accuracy (%) under action-inclusive feedback and under action-exclusive feedback. Average and standard error reported over 20 trials for each algorithm.

To highlight that the proposed algorithm still performs well when the feedback vector does not include action information, we also run experiments under the setting introduced by Xie et al. (2021b) on the MNIST environment. This setting is similar to the one described in Section 5.1, except that the feedback digit is determined only by the latent reward rather than by both the reward and the action. We find that under this action-exclusive feedback setting, the newly proposed AI-IGL works as well as IGL (full CI). This shows that accounting for possible action information in the feedback vector does not hurt performance when such information is absent.

5.2 Large-scale Experiments with OpenML CC-18 Datasets

To verify that our proposed algorithm scales to a variety of tasks, we evaluate performance on datasets from the publicly available OpenML Curated Classification Benchmarking Suite (Vanschoren et al., 2015; Casalicchio et al., 2019; Feurer et al., 2021; Bischl et al., 2021). OpenML datasets are licensed under the CC-BY license (https://creativecommons.org/licenses/by/4.0/), and the platform and library are licensed under the BSD 3-Clause license (https://opensource.org/licenses/BSD-3-Clause). Similar to Section 5.1, at each time step the context is drawn uniformly at random from the dataset. Again, the learner selects an action as the predicted label of the context, where the action set is the label set of the environment. The binary reward is the correctness of the predicted label. The feedback is a two-dimensional vector encoding the action-inclusive signal. Each dataset has a different sample size and a different set of available actions. We sample datasets with 3 or more actions and with a balanced action distribution (no action accounts for more than 50% of the samples) to satisfy Assumption 3.2. We use part of the data for training and the remainder for evaluation. The results are averaged over 20 trials and shown in Table 2.
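The conversion from a supervised classification dataset to a simulated IGL environment can be sketched as follows; the concrete two-dimensional feedback encoding below (the chosen action plus a noisy reward channel) is a hypothetical stand-in, since the exact encoding used in the experiments is not reproduced in this rendering.

```python
import numpy as np

def make_igl_environment(features, labels, rng):
    """Wrap a supervised dataset (features, labels) as an IGL simulator.

    Contexts are sampled uniformly; the latent reward is 1 iff the chosen action
    matches the label; the feedback is a 2-D vector derived from (action, reward).
    """
    n = len(labels)

    def step(policy):
        i = rng.integers(n)
        x = features[i]
        a = policy(x)
        r = int(a == labels[i])                       # latent, never shown to the learner
        y = np.array([a, r + 0.05 * rng.normal()])    # hypothetical action-inclusive feedback
        return x, a, y

    return step

# usage (with hypothetical arrays X_train, y_train and 10 actions):
# rng = np.random.default_rng(0)
# step = make_igl_environment(X_train, y_train, rng)
# batch = [step(lambda x: rng.integers(10)) for _ in range(10_000)]
```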

Table 2: Results in the OpenML environments with two-dimensional action-inclusive feedback. Columns: dataset criteria, dataset count, constant-action accuracy, CB policy accuracy (%), IGL (full CI) policy accuracy (%) and performance w.r.t. CB, and AI-IGL policy accuracy (%) and performance w.r.t. CB. Average and standard error reported over 20 trials for each algorithm. 'Performance w.r.t. CB' reports the ratio of an IGL method's policy accuracy over CB policy accuracy.

Analysis of dataset properties in relation to AI-IGL’s performance

To further understand under which conditions AI-IGL succeeds, we analyzed its performance against 12 different features of each OpenML dataset (based on findings from Torra et al. (2008); Abdelmessih et al. (2010); Reif et al. (2014); Lorena et al. (2019); more details in Appendix B.3). We also consider 3 additional features based on the following criteria: compared to the typical CB guarantee, AI-IGL requires an extra factor in its theoretical guarantee (Theorem 4), and this factor can possibly be improved under specific choices of function class (see the discussion in Section 4). We therefore select the corresponding dataset quantities as additional features for predicting the relative performance of AI-IGL.

We use a binary random forest classifier to predict whether AI-IGL performs well relative to CB: if the relative performance exceeds a threshold, we label the dataset as a success. Using all datasets that meet our sample-size criterion, we perform 10-fold cross-validation with 100 trees and measure the average F1-score. We find a sample-size-related feature to be the most predictive of AI-IGL's relative performance (Figure 3(a)); it alone can predict the relative performance with an average F1-score of 0.79 under the same experimental setup. However, for datasets where this feature is small, there is high variability in relative performance. On this subset of datasets, we find the maximum Fisher discriminant (Lorena et al., 2019), a measure of classification complexity that quantifies the informativeness of a given sample, to be the most predictive of relative performance (Figure 3(b)). Such datasets are representative of realistic interaction datasets with small sample sizes. More details on the experimental analysis are in Appendix B.3 (Figure 5 and Figure 6). This finding makes it possible to predict, for a given dataset, whether AI-IGL can match CB performance. It can also help researchers improve the design of novel applications of IGL, e.g., in HCI and BCI, by ensuring the resulting dataset's features are amenable to high performance.
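A sketch of this meta-analysis, assuming a per-dataset feature matrix and a binary success label defined by a relative-performance cutoff; the cutoff value and the feature names are placeholders, since the exact values are not reproduced in this rendering.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def predict_aiigl_success(meta_features, relative_perf, cutoff=0.9, seed=0):
    """10-fold cross-validated F1 of a 100-tree random forest that predicts
    whether AI-IGL reaches `cutoff` times CB's accuracy on a dataset.

    meta_features: (n_datasets, n_features) array of dataset properties.
    relative_perf: (n_datasets,) array of AI-IGL accuracy / CB accuracy.
    """
    labels = (np.asarray(relative_perf) >= cutoff).astype(int)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    scores = cross_val_score(clf, meta_features, labels, cv=10, scoring="f1")
    importances = clf.fit(meta_features, labels).feature_importances_
    return scores.mean(), importances
```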

6 Discussion

We have presented a new approach to solving Interaction-Grounded Learning, in which an agent learns to interact with the environment in the absence of any explicit reward signals. Compared to a prior solution (Xie et al., 2021b), the proposed AI-IGL approach removes the assumption of conditional independence of actions and the feedback vector by treating the latent reward as a latent state. It thereby provably solves IGL for action-inclusive feedback vectors.

By viewing the feedback as containing an action-(latent) reward pair which is an unobserved latent space, we propose latent reward learning using a contrastive learning approach. This solution concept naturally connects to latent state discovery in rich-observation reinforcement learning (e.g., Dann et al., 2018; Du et al., 2019; Misra et al., 2020). On the other hand, different from rich-observation RL, the problem of IGL also contains a unique challenge in identifying the semantic meaning of the decoded class, which is addressed by a symmetry-breaking procedure. In this work, we focus on binary latent rewards, for which symmetry breaking is possible using one policy (per action). Breaking symmetry in more general latent reward spaces is a topic for future work. A possible negative societal impact of this work can be performance instability, especially with inappropriate use of the techniques in risk-sensitive applications.

Barring intentional misuse, we envision several potential benefits of the proposed approach. The proposed algorithm broadens the scope of IGL’s feasibility for real-world applications (where action signals are included in the feedback). Imagine an agent being trained to interpret the brain signals of a user to control a prosthetic arm. The brain’s response (feedback vector) to an action is a neural signal that may contain information about the action itself. This is so prevalent in neuroimaging that fMRI studies routinely use specialized techniques to orthogonalize different information (e.g. action and reward) within the same signal (Momennejad and Haynes, 2012, 2013; Belghazi et al., 2018; Shah and Peters, 2020). Another example is a continually self-calibrating eye tracker, used by people with motor disabilities such as ALS (Hansen et al., 2004; Liu et al., 2010; Mott et al., 2017; Gao et al., 2021). A learning agent adapting to the ability of such users can encounter feedback directly influenced by the calibration correction action. The proposed approach is a stepping stone on the path to solving IGL in such complex settings, overcoming the need for explicit rewards as well as explicit separation of action and feedback information.

Acknowledgment

The authors would like to thank Mark Rucker for sharing methods and code for computing and comparing diagnostic features of OpenML datasets. NJ acknowledges funding support from ARL Cooperative Agreement W911NF-17-2-0196, NSF IIS-2112471, NSF CAREER award, and Adobe Data Science Research Award.

References

  • S. D. Abdelmessih, F. Shafait, M. Reif, and M. Goldstein (2010) Landmarking for meta-learning using RapidMiner. In RapidMiner Community Meeting and Conference.
  • A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire (2014) Taming the monster: a fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646.
  • I. Akinola, Z. Wang, J. Shi, X. He, P. Lapborisuth, J. Xu, D. Watkins-Valls, P. Sajda, and P. Allen (2020) Accelerated robot learning via human brain signals. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3799–3805.
  • M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, R. D. Hjelm, and A. C. Courville (2018) Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Proceedings of Machine Learning Research, Vol. 80, pp. 530–539.
  • B. Bischl, G. Casalicchio, M. Feurer, P. Gijsbers, F. Hutter, M. Lang, R. G. Mantovani, J. N. van Rijn, and J. Vanschoren (2021) OpenML benchmarking suites. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks).
  • G. Casalicchio, J. Bossek, M. Lang, D. Kirchhoff, P. Kerschke, B. Hofner, H. Seibold, J. Vanschoren, and B. Bischl (2019) OpenML: an R package to connect to the machine learning platform OpenML. Computational Statistics 34 (3), pp. 977–991.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607.
  • Y. Cui, Q. Zhang, B. Knox, A. Allievi, P. Stone, and S. Niekum (2021) The EMPATHIC framework for task learning from implicit human feedback. In Conference on Robot Learning, pp. 604–626.
  • C. Dann, N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire (2018) On oracle-efficient PAC RL with rich observations. Advances in Neural Information Processing Systems 31.
  • M. T. Debettencourt, J. D. Cohen, R. F. Lee, K. A. Norman, and N. B. Turk-Browne (2015) Closed-loop training of attention with real-time brain imaging. Nature Neuroscience 18 (3), pp. 470–475.
  • S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford (2019) Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pp. 1665–1674.
  • A. T. Duchowski (2018) Gaze-based interaction: a 30 year retrospective. Computers & Graphics 73, pp. 59–69.
  • M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang (2011) Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pp. 169–178.
  • M. Feurer, J. N. van Rijn, A. Kadra, P. Gijsbers, N. Mallik, S. Ravi, A. Müller, J. Vanschoren, and F. Hutter (2021) OpenML-Python: an extensible Python API for OpenML. Journal of Machine Learning Research 22, pp. 1–5.
  • E. Freeman, G. Wilson, D. Vo, A. Ng, I. Politis, and S. Brewster (2017) Multimodal feedback in HCI: haptics, non-speech audio, and their applications. In The Handbook of Multimodal-Multisensor Interfaces: Foundations, User Modeling, and Common Modality Combinations—Volume 1, pp. 277–317.
  • J. Gao, S. Reddy, G. Berseth, N. Hardy, N. Natraj, K. Ganguly, A. D. Dragan, and S. Levine (2021) X2T: training an X-to-text typing interface with online learning from user feedback. In 9th International Conference on Learning Representations (ICLR).
  • J. Grizou, I. Iturrate, L. Montesano, P. Oudeyer, and M. Lopes (2014) Calibration-free BCI based control. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 28.
  • J. P. Hansen, K. Tørning, A. S. Johansen, K. Itoh, and H. Aoki (2004) Gaze typing compared with input by head and hand. In Proceedings of the 2004 Symposium on Eye Tracking Research & Applications, pp. 131–138.
  • J. Z. HaoChen, C. Wei, A. Gaidon, and T. Ma (2021) Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems 34.
  • K. D. Katyal, M. S. Johannes, S. Kellis, T. Aflalo, C. Klaes, T. G. McGee, M. P. Para, Y. Shi, B. Lee, K. Pejsa, et al. (2014) A collaborative BCI approach to autonomous control of a prosthetic limb system. In 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1479–1482.
  • J. Langford and T. Zhang (2007) The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems 20.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • S. S. Liu, A. Rawicz, T. Ma, C. Zhang, K. Lin, S. Rezaei, and E. Wu (2010) An eye-gaze tracking and human computer interface system for people with ALS and other locked-in diseases. CMBES Proceedings 33.
  • A. C. Lorena, L. P. Garcia, J. Lehmann, M. C. Souto, and T. K. Ho (2019) How complex is your classification problem? A survey on measuring classification complexity. ACM Computing Surveys (CSUR) 52 (5), pp. 1–34.
  • J. Mishra and A. Gazzaley (2015) Closed-loop cognition: the next frontier arrives. Trends in Cognitive Sciences 19 (5), pp. 242–243.
  • D. Misra, M. Henaff, A. Krishnamurthy, and J. Langford (2020) Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International Conference on Machine Learning, pp. 6961–6971.
  • M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of Machine Learning. MIT Press.
  • I. Momennejad and J. Haynes (2012) Human anterior prefrontal cortex encodes the 'what' and 'when' of future intentions. NeuroImage 61 (1), pp. 139–148.
  • I. Momennejad and J. Haynes (2013) Encoding of prospective tasks in the human prefrontal cortex under varying task loads. Journal of Neuroscience 33 (44), pp. 17342–17349.
  • M. E. Mott, S. Williams, J. O. Wobbrock, and M. R. Morris (2017) Improving dwell-based gaze typing with dynamic, cascading dwell times. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 2558–2570.
  • S. Muñoz-Moldes and A. Cleeremans (2020) Delineating implicit and explicit processes in neurofeedback learning. Neuroscience & Biobehavioral Reviews 118, pp. 681–688.
  • K. X. Nguyen, D. Misra, R. Schapire, M. Dudík, and P. Shafto (2021) Interactive learning from activity description. In International Conference on Machine Learning, pp. 8096–8108.
  • K. Noda (2018) Google Home: smart speaker as environmental control unit. Disability and Rehabilitation: Assistive Technology 13 (7), pp. 674–675.
  • M. Pantic and L. J. Rothkrantz (2003) Toward an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE 91 (9), pp. 1370–1390.
  • B. Poole and M. Lee (2021) Towards intrinsic interactive reinforcement learning. arXiv preprint arXiv:2112.01575.
  • M. Reif, F. Shafait, M. Goldstein, T. Breuel, and A. Dengel (2014) Automatic classifier selection for non-experts. Pattern Analysis and Applications 17 (1), pp. 83–96.
  • A. Saran, S. Majumdar, E. S. Short, A. Thomaz, and S. Niekum (2018) Human gaze following for human-robot interaction. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8615–8621.
  • A. Saran, E. S. Short, A. Thomaz, and S. Niekum (2020) Understanding teacher gaze patterns for robot learning. In Conference on Robot Learning, pp. 1247–1258.
  • R. D. Shah and J. Peters (2020) The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics 48 (3), pp. 1514–1538.
  • V. Torra, Y. Narukawa, and T. Gakuen (2008) Modeling decisions for artificial intelligence. International Journal of Intelligent Systems 23 (2), pp. 113.
  • J. Vanschoren, J. N. van Rijn, B. Bischl, G. Casalicchio, M. Lang, and M. Feurer (2015) OpenML: a networked science platform for machine learning. In ICML 2015 MLOSS Workshop, Vol. 3.
  • H. S. Vitense, J. A. Jacko, and V. K. Emery (2003) Multimodal feedback: an assessment of performance and mental workload. Ergonomics 46 (1-3), pp. 68–87.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742.
  • T. Xie, N. Jiang, H. Wang, C. Xiong, and Y. Bai (2021a) Policy finetuning: bridging sample-efficient offline and online reinforcement learning. Advances in Neural Information Processing Systems 34.
  • T. Xie, J. Langford, P. Mineiro, and I. Momennejad (2021b) Interaction-grounded learning. In International Conference on Machine Learning, pp. 11414–11423.

Appendix A Detailed Proofs

Proof of Proposition 3.1.

Throughout this proof, we follow the definition in (5) for each a ∈ A. For any a ∈ A, we then have

(12)
(by )
(13)

and

(14)
(15)
(16)

To see why (a) and (b) hold, we use the following argument. By definition of and , we have

(17)
(18)
(19)
(20)

Then, we have

(21)
(22)
(23)
(24)

A similar procedure yields the remaining terms of (a) and (b).

Therefore, combining the two equalities above, we obtain

(25)
(26)
(27)
(28)

This completes the proof. ∎

Lower bound of when .

For any , let