1 Introduction
Most real-world learning problems, such as BCI and HCI problems, are not tagged with rewards. Consequently, (biological and artificial) learners must infer rewards based on interactions with the environment, which reacts to the learner's actions by generating feedback but does not provide any explicit reward signal. This paradigm has been previously studied (e.g., Grizou et al., 2014; Nguyen et al., 2021), including a recent formalization (Xie et al., 2021b) that proposed the term Interaction-Grounded Learning (IGL).
In IGL, the learning algorithm discovers a grounding for the feedback, which implicitly discovers a reward function. An information-theoretic impossibility argument indicates that additional assumptions are necessary to succeed. Xie et al. (2021b) proceed by assuming the action is conditionally independent of the feedback given the reward. However, this is unnatural in many settings, such as neurofeedback in BCI (Katyal et al., 2014; Mishra and Gazzaley, 2015; Debettencourt et al., 2015; Muñoz-Moldes and Cleeremans, 2020; Akinola et al., 2020; Poole and Lee, 2021) and multimodal interactive feedback in HCI (Pantic and Rothkrantz, 2003; Vitense et al., 2003; Freeman et al., 2017; Mott et al., 2017; Duchowski, 2018; Noda, 2018; Saran et al., 2018, 2020; Cui et al., 2021; Gao et al., 2021), where the action precedes and thus influences the feedback. Applying the prior approach in these settings fails, because the requirement of conditional independence is essential to its operation. This motivates the question:
Is it possible to do interaction-grounded learning when the feedback has the full information of the action embedded in it?
We propose a new approach to solve IGL, which we call action-inclusive IGL (AI-IGL), that allows the action to be incorporated into the feedback in arbitrary ways. We consider the latent reward as playing the role of a latent state, which can be separated via a contrastive learning method (Section 3.1). Different from typical latent state discovery in rich-observation reinforcement learning (e.g., Dann et al., 2018; Du et al., 2019; Misra et al., 2020), the IGL setting also requires identifying the semantic meaning of the latent reward states, which is addressed by a symmetry-breaking procedure (Section 3.2). We analyze the theoretical properties of the proposed approach and prove that it is guaranteed to learn a near-optimal policy as long as the feedback satisfies a weaker context conditional independence assumption. We also evaluate the proposed AI-IGL approach using large-scale experiments on OpenML's supervised classification datasets (Bischl et al., 2021) and demonstrate its effectiveness (Section 5). Thus, our findings broaden the scope of applicability for IGL.

The paper proceeds as follows. In Section 2, we present the mathematical formulation of IGL. In Section 3, we present a contrastive learning perspective for grounding the latent reward, which helps to expand the applicability of IGL. In Section 4, we state the resulting algorithm, AI-IGL. We provide experimental support for the technique in Section 5 using a diverse set of datasets. We conclude with discussion in Section 6.
2 Background
Interaction-Grounded Learning
This paper studies the Interaction-Grounded Learning (IGL) setting (Xie et al., 2021b), where the learner optimizes for a latent reward by interacting with the environment and associating ("grounding") the observed feedback with the latent reward. At each time step, the learner receives an i.i.d. context $x \in \mathcal{X}$ drawn from a fixed distribution. The learner then selects an action $a$ from a finite action set $\mathcal{A}$. The environment generates a latent binary reward $r \in \{0, 1\}$ (which can be either deterministic or stochastic) and a feedback vector $y \in \mathcal{Y}$, both conditional on $(x, a)$, but only the feedback vector is revealed to the learner. In this paper, we use $r(x, a)$ to denote the expected (latent) reward after executing action $a$ on context $x$. The spaces of contexts and feedback vectors can be arbitrarily large.
Throughout this paper, we use $\pi : \mathcal{X} \to \Delta(\mathcal{A})$ to denote a (stochastic) policy. The expected return of a policy $\pi$ is defined by $V(\pi) := \mathbb{E}_{x, a \sim \pi}[r(x, a)]$. The learning goal of IGL is to find the optimal policy $\pi^\star = \arg\max_{\pi \in \Pi} V(\pi)$ in the policy class $\Pi$, only from observations of context-action-feedback tuples $(x, a, y)$. This paper mainly considers the batch setting, and we use $\mu$ to denote the behavior policy. We also introduce value function and decoder classes: we assume the learner has access to a value function class $\mathcal{F}$ and a reward decoder class $\Psi$. We defer the assumptions we make on these classes to Section 3 for clarity.
We may hope to solve IGL without any additional assumptions. However, this is information-theoretically impossible, even if the latent reward is decodable from the feedback, as demonstrated by the following example. [Hardness of assumption-free IGL] Suppose $y = r$ and the reward is deterministic given $(x, a)$. In this case, the latent reward can be perfectly decoded from $y$. However, the learner receives no more information than $(x, a, y)$ from the tuple. Thus, if $\Pi$ contains at least 2 policies, for any environment where a given IGL algorithm succeeds, we can construct another environment with the same observable statistics where that algorithm must fail.
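To make the interaction protocol concrete, the following minimal sketch simulates one IGL round. The context space, policy, and feedback channel here are hypothetical stand-ins (not the paper's constructions); the key point is that the latent reward is computed internally but never exposed to the learner.

```python
import random

random.seed(0)

def igl_step(policy, n_contexts=10, n_actions=3):
    """One IGL round: the learner observes (x, a, y) but never the latent reward r."""
    x = random.randrange(n_contexts)        # i.i.d. context
    a = policy(x)                           # learner's action
    r = int(a == x % n_actions)             # latent binary reward (hidden)
    y = (r ^ (a % 2), a)                    # hypothetical feedback encoding both r and a
    return x, a, y                          # note: r is NOT returned

x, a, y = igl_step(lambda x: x % 3)
```

Any learner must ground `y` back to the hidden `r`; the feedback tuple above deliberately mixes reward and action information, which is exactly the regime this paper targets.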
IGL with full conditional independence
Example 2 demonstrates the need for further assumptions to succeed at IGL. Xie et al. (2021b) proposed an algorithm that leverages the following conditional independence assumption to facilitate grounding the feedback in the latent reward. [Full conditional independence] For arbitrary tuples $(x, a, r, y)$, where $r$ and $y$ are generated conditional on the context $x$ and action $a$, we assume the feedback vector $y$ is conditionally independent of the context $x$ and action $a$ given the latent reward $r$, i.e., $y \perp (x, a) \mid r$. Xie et al. (2021b) introduce a reward decoder class $\Psi$ for estimating the latent reward from the feedback, which leads to the decoded return $V_\psi(\pi) := \mathbb{E}_{x, a \sim \pi}[\psi(y)]$. They proved that it is possible to learn the best $\pi$ and $\psi$ jointly by optimizing the following proxy learning objective:

$\max_{\pi \in \Pi, \psi \in \Psi} \; V_\psi(\pi) - V_\psi(\pi_{\mathrm{bad}})$,   (1)

where $\pi_{\mathrm{bad}}$ is a policy known to have low expected return. Throughout this paper, we use IGL (full CI) to denote the algorithm proposed by Xie et al. (2021b).
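The proxy objective above can be estimated from logged tuples with standard importance weighting. The sketch below is an illustration under assumed names, not the authors' implementation: it assumes a uniform logging policy with known propensity, a placeholder decoder `psi`, and placeholder policies returning action probabilities.

```python
def decoded_return(data, pi, psi, mu_prob):
    """IPS estimate of the decoded return: E[ psi(y) * pi(a|x) / mu(a|x) ]."""
    return sum(psi(y) * pi(x, a) / mu_prob for x, a, y in data) / len(data)

def proxy_objective(data, pi, psi, pi_bad, mu_prob):
    # decoded return of the candidate policy minus that of a known-bad policy
    return decoded_return(data, pi, psi, mu_prob) - decoded_return(data, pi_bad, psi, mu_prob)
```

Maximizing `proxy_objective` jointly over policies and decoders is the full-CI strategy; the next section shows why this can fail once the feedback also encodes the action.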
3 A Contrastive-Learning Perspective for Grounding Latent Reward
The existing work on interaction-grounded learning leverages the assumption of full conditional independence (Assumption 2), under which the feedback vector only contains information from the latent reward. The goal of this paper is to relax this constraining assumption and broaden the scope of applicability for IGL. This paper focuses on the scenario where action information is possibly embedded in the feedback vector, which is formalized by the following assumption. [Context Conditional Independence] For arbitrary tuples $(x, a, r, y)$, where $r$ and $y$ are generated conditional on the context $x$ and action $a$, we assume the feedback vector $y$ is conditionally independent of the context $x$ given the latent reward $r$ and action $a$. That is, $y \perp x \mid (r, a)$.
Assumption 3 allows the feedback vector to be generated from both the latent reward $r$ and the action $a$, differing from Assumption 2, which constrains the feedback vector to be generated based only on the latent reward $r$. We discuss the implications at the end of this section.
3.1 Grounding Latent Reward via Contrastive Learning
In this section, we propose a contrastive learning objective for interaction-grounded learning, which further guides the design of our algorithm. We perform derivations with exact expectations for clarity and intuition, and provide finite-sample guarantees in Section 4.
[Separability] For each $a \in \mathcal{A}$, there exists a pair $(f^\star_a, \psi^\star_a) \in \mathcal{F} \times \Psi$, such that: 1) $f^\star_a$ and $\psi^\star_a$ perfectly predict the latent reward, i.e., $f^\star_a(x, a) = r(x, a)$ and $\psi^\star_a(y, a) = \mathbb{E}[r \mid y, a]$; 2) the expected predicted reward, conditioned on the latent reward taking value 0 versus 1 and the action being $a$, is separated by a positive margin.

Assumption 3.1 consists of two components. Condition 1) is a realizability assumption that ensures a perfect reward decoder is included in the function classes. Although this superficially appears unreasonably strong, note that $y$ is generated based upon $a$ and the realization of $r$; therefore this is compatible with stochastic rewards. Condition 2) ensures that the expected predicted reward, conditioned on the latent reward having a given value and the action being $a$, is separable. The separation margin can be lower bounded in terms of the value of the constant policy $\pi_a$ (the policy that always selects action $a$); a detailed proof of this argument can be found in Appendix A. One sufficient condition is that the expected latent reward has enough variance. The realizability condition can be constructed easily via standard realizable classes: if a standard realizable class is available, augmenting it with the complements of its members satisfies Assumption 3.1. Note that this construction only amplifies the size of the class by a factor of 2.

Reward Prediction via Contrastive Learning
We now formulate the following contrastive-learning objective for solving IGL. Suppose $\mu$ is the behavior policy. We also abuse notation and use $D$ to denote the data distribution over $(x, a, y)$, and $D_a$ to denote its marginal distribution under action $a$. We construct an augmented data distribution $\tilde{D}_a$ for each $a \in \mathcal{A}$ by sampling $x$ and $y$ independently from the corresponding marginals of $D_a$.
Conceptually, for each action $a$, we consider tuples $(x, y) \sim D_a$ and $(x, y) \sim \tilde{D}_a$. From Assumption 3.1, conditioned on the feedback, the optimal feedback decoder has mean equal to the optimal reward predictor. Therefore, we might seek an $(f, \psi)$ pair which minimizes any consistent loss function, e.g., squared loss. However, this is trivially achievable, e.g., by having both always predict 0. Therefore, we formulate a contrastive-learning objective, where we additionally maximize the loss between the predictor and decoder on the augmented data distribution. For each $a \in \mathcal{A}$, we solve the following objective:

$(f_a, \psi_a) = \arg\min_{f \in \mathcal{F}, \psi \in \Psi} \mathcal{L}_a(f, \psi)$, where $\mathcal{L}_a(f, \psi) := \mathbb{E}_{(x, y) \sim D_a}[(f(x, a) - \psi(y, a))^2] - \mathbb{E}_{(x, y) \sim \tilde{D}_a}[(f(x, a) - \psi(y, a))^2]$.   (2)

In the notation of equation (2), the subscript $a$ indicates the pair $(f_a, \psi_a)$ is optimal for action $a$. Note they are always evaluated at action $a$, which we retain as an input for compatibility with the original function classes.
Note that the objective $\mathcal{L}_a$ of eq. (2) is also similar to many popular contrastive losses (e.g., Wu et al., 2018; Chen et al., 2020), especially the spectral contrastive loss (HaoChen et al., 2021). Expanding the squares, the marginal terms $\mathbb{E}[f(x, a)^2]$ and $\mathbb{E}[\psi(y, a)^2]$ coincide under $D_a$ and $\tilde{D}_a$, so

$\mathcal{L}_a(f, \psi) = \mathbb{E}_{(x, y) \sim D_a}[(f(x, a) - \psi(y, a))^2] - \mathbb{E}_{(x, y) \sim \tilde{D}_a}[(f(x, a) - \psi(y, a))^2]$   (3)

$\quad = -2\,\mathbb{E}_{(x, y) \sim D_a}[f(x, a)\,\psi(y, a)] + 2\,\mathbb{E}_{(x, y) \sim \tilde{D}_a}[f(x, a)\,\psi(y, a)]$,   (4)

which parallels the spectral contrastive loss $-2\,\mathbb{E}_{(x, x^+)}[f(x)^\top f(x^+)] + \mathbb{E}_{x, x^-}[(f(x)^\top f(x^-))^2]$, with the paired distribution $D_a$ playing the role of positive pairs and the augmented distribution $\tilde{D}_a$ the role of negative pairs.
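The recipe described above — minimize a consistent loss (squared loss, for concreteness) on paired $(x, y)$ samples while maximizing it on the augmented, independently resampled distribution — can be sketched for a single fixed action as follows. The function names are illustrative, and the augmented distribution is approximated by averaging over all cross pairs of the batch.

```python
def contrastive_loss(xs, ys, f, psi):
    """Batch estimate of the per-action contrastive objective:
    squared error on paired (x, y) from the same round, minus
    squared error on the augmented distribution where x and y are
    drawn independently (approximated by all n*n cross pairs)."""
    n = len(xs)
    paired = sum((f(x) - psi(y)) ** 2 for x, y in zip(xs, ys)) / n
    augmented = sum((f(x) - psi(y)) ** 2 for x in xs for y in ys) / (n * n)
    return paired - augmented
```

When `f(x)` and `psi(y)` agree on paired samples but not on independently shuffled ones, the loss is strongly negative; a pair that always predicts a constant scores exactly zero, so the trivial solution is no longer optimal.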
Below we show that minimizing the objective of eq. (2), denoted $\mathcal{L}_a(f, \psi)$, decodes the latent reward under Assumptions 3 and 3.1 up to a sign ambiguity. For simplicity, we introduce the notation $\bar{f}_a(r')$ and $\bar{\psi}_a(r')$ for any $r' \in \{0, 1\}$:

$\bar{f}_a(r') := \mathbb{E}[f(x, a) \mid r = r', a]$, $\quad \bar{\psi}_a(r') := \mathbb{E}[\psi(y, a) \mid r = r', a]$,   (5)

where $\bar{f}_a$ and $\bar{\psi}_a$ are the expected predicted reward of $f$ and decoded reward of $\psi$ (under the behavior policy $\mu$) conditioned on the latent reward having value $r'$ and the action being $a$.
For any action $a \in \mathcal{A}$, suppose $\mu(a \mid x) > 0$ for all $x \in \mathcal{X}$ and Assumptions 3 and 3.1 hold, and let $(f_a, \psi_a)$ be the solution of eq. (2). Then $(f_a, \psi_a)$ maximizes the separation of the conditional means, i.e., $|\bar{f}_a(1) - \bar{f}_a(0)|$ and $|\bar{\psi}_a(1) - \bar{\psi}_a(0)|$ are both maximal over the respective classes.
Proof Sketch.
For any policy $\pi$, we use $d_\pi(a)$ to denote the visitation occupancy of action $a$ under policy $\pi$, and $\bar{r}_\pi(a)$ to denote the average reward received when executing action $a$ under $\pi$. Write $p_a(r') := \mathbb{P}(r = r' \mid a)$ under the behavior policy, so that $p_a(1) = \bar{r}_\mu(a)$. Then, by the context conditional independence assumption (Assumption 3), we know

$\mathbb{E}_{(x, y) \sim D_a}[f(x, a)\,\psi(y, a)] = \sum_{r' \in \{0, 1\}} p_a(r')\,\bar{f}_a(r')\,\bar{\psi}_a(r')$,   (6)

$\mathbb{E}_{(x, y) \sim \tilde{D}_a}[f(x, a)\,\psi(y, a)] = \big(\sum_{r'} p_a(r')\,\bar{f}_a(r')\big)\big(\sum_{r'} p_a(r')\,\bar{\psi}_a(r')\big)$,   (7)

$\mathcal{L}_a(f, \psi) = -2\,p_a(0)\,p_a(1)\,\big(\bar{f}_a(1) - \bar{f}_a(0)\big)\big(\bar{\psi}_a(1) - \bar{\psi}_a(0)\big)$,   (8)

where $\mathcal{L}_a$ denotes the objective of eq. (2). Therefore, separately maximizing $|\bar{f}_a(1) - \bar{f}_a(0)|$ and $|\bar{\psi}_a(1) - \bar{\psi}_a(0)|$ (with matching signs) minimizes $\mathcal{L}_a(f, \psi)$. ∎
3.2 Symmetry Breaking
In the last section, we demonstrated that the latent reward can be decoded in a contrastive-learning manner up to a sign ambiguity. The following example demonstrates the ambiguity.
[Why do we need extra information to identify the latent reward?] In the optimal solution of objective (2), both $\psi$ and $1 - \psi$ yield the same objective value. It is information-theoretically difficult to distinguish which one of them is the correct solution without extra information. That is because, for any environment ENV1, there always exists a "symmetric" environment ENV2, where: 1) the marginal distribution of the feedback given $(x, a)$ in ENV1 is identical to that in ENV2; 2) the conditional distribution of $y$ given $(r, a)$ in ENV1 is identical to the conditional distribution of $y$ given $(1 - r, a)$ in ENV2. In this example, ENV1 and ENV2 will always generate an identical distribution of feedback vectors after any $(x, a)$. However, ENV1 and ENV2 have exactly opposite latent reward information.
As we demonstrated in Example 3.2, the decoder learned from eq. (2) could correspond to either of a symmetric pair of semantic meanings, and identifying the correct one without extra information is information-theoretically impossible. The symmetry-breaking procedure is one of the key challenges of interaction-grounded learning. To achieve symmetry breaking, we make the following assumption to ensure the identifiability of the latent reward.

[Baseline Policies] For each $a \in \mathcal{A}$, there exists a baseline policy $\pi^a_{\mathrm{base}}$, such that:

(a) the visitation occupancy of action $a$ under $\pi^a_{\mathrm{base}}$ is bounded below by a known positive constant;

(b) the expected reward that $\pi^a_{\mathrm{base}}$ obtains from taking action $a$ is known to be either sufficiently low or sufficiently high, with the direction known.
To instantiate Assumption 3.2 in practice, we provide the following simple example of baseline policies that satisfy Assumption 3.2. Suppose $\pi^a_{\mathrm{base}}$ is the uniformly random policy for every $a$, and "all constant policies are bad", i.e., the expected return of the policy that always selects action $a$ is small for all $a \in \mathcal{A}$. Then it is easy to verify that the uniform policy visits each action with probability $1/|\mathcal{A}|$ and obtains correspondingly low reward from each action, so both conditions of Assumption 3.2 hold for all $a$.
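As a quick numerical illustration of this instantiation, the following sketch uses a hypothetical expected-reward table (all values invented for the example): when every constant policy has small value, the uniform policy visits each action with probability exactly $1/K$ and its per-action reward contribution is correspondingly small.

```python
# hypothetical expected-reward table r[x][a] over 4 contexts and K = 3 actions
r = [[0.1, 0.0, 0.2],
     [0.0, 0.1, 0.1],
     [0.2, 0.0, 0.0],
     [0.1, 0.1, 0.0]]
K = 3
n_ctx = len(r)

# value of the constant policy for each action: V(pi_a) = E_x[r(x, a)]
const_values = [sum(row[a] for row in r) / n_ctx for a in range(K)]

# the uniform policy visits each action with probability 1/K, so the reward
# it collects from action a is exactly const_values[a] / K
uniform_per_action = [v / K for v in const_values]
```

Here every constant policy has value at most 0.2, so the uniform policy's reward from each action is at most 0.2/K: it is "sufficiently bad" on every action, with known direction, as Assumption 3.2(b) requires.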
Note that $\pi^a_{\mathrm{base}}$ can differ across actions. Intuitively, Assumption 3.2(a) says that the total probability of selecting action $a$ (over all contexts $x$) is bounded away from zero. This condition ensures that $\pi^a_{\mathrm{base}}$ has enough visitation to action $a$ and makes symmetry breaking possible. Assumption 3.2(b) states that, if we only consider the reward obtained from taking action $a$, then $\pi^a_{\mathrm{base}}$ is known to be either "sufficiently bad" or "sufficiently good". Note that the directionality of the extremeness must be known; e.g., a policy which has an unknown reward of either 0 or 1 is not usable. This condition follows a similar intuition as the identifiability assumption of Xie et al. (2021b, Assumption 2) and breaks the symmetry. For example, consider the ENV1 and ENV2 introduced in Example 3.2: to separate ENV1 and ENV2 using some policy, its expected returns in the two environments are required to have a nonzero gap, which leads to Assumption 3.2(b).

3.3 Comparison to Full CI
When we only have context conditional independence, it is easy to verify the failure of optimizing the original IGL objective (1), as the following example shows.
[Failure of the original IGL objective under Assumption 3] Let $|\mathcal{A}| = 2$ and suppose the feedback vector embeds the selected action. The approach proposed by Xie et al. (2021b) assumes the reward decoder takes only the feedback vector as input; under action-inclusive feedback, such a decoder can attain a large decoded return simply by detecting the action rather than the latent reward. In particular, a constant policy paired with a decoder that fires on that action's signature in the feedback achieves an objective value at least as large as that of the optimal policy. This implies that maximizing the original IGL objective (1) need not converge to $\pi^\star$ when we only have context conditional independence.
This example indicates that optimizing a combined contrastive objective with a single symmetry-breaking policy is insufficient in our case. Our approach instead optimizes a contrastive objective and breaks symmetry for each action separately rather than simultaneously.
3.4 Viewing Latent Reward as a Latent State
Our approach is motivated by latent state discovery in Rich-Observation RL (Misra et al., 2020). Figure 1 compares the causal graphs of Rich-Observation RL, IGL with context conditional independence, and IGL with full conditional independence. In Rich-Observation RL, a contrastive learning objective is used to discover latent states, whereas in IGL a contrastive learning objective is used to discover latent rewards. In this manner, we view latent rewards analogously to latent states.
Identifying latent states up to a permutation is completely acceptable in Rich-Observation RL, as the resulting imputed MDP orders policies identically to the true MDP. However, in IGL the latent states have scalar values associated with them (rewards), and identifying them up to a permutation is not sufficient to order policies correctly. Thus, we require the additional step of symmetry breaking.
4 Main Algorithm
This section instantiates the algorithm for IGL with action-inclusive feedback, following the concepts we introduced in Section 3. For simplicity, we select the uniformly random policy as the behavior policy $\mu$ in this section. This choice of $\mu$ can be further relaxed using appropriate importance weights. We first introduce the empirical estimate of the objective $\mathcal{L}_a$ defined in eq. (2) as follows. For any $(f, \psi) \in \mathcal{F} \times \Psi$, we define the following empirical estimate of the contrastive loss:
$\widehat{\mathcal{L}}_a(f, \psi) := \frac{1}{n_a} \sum_{i : a_i = a} \big(f(x_i, a) - \psi(y_i, a)\big)^2 - \frac{1}{n_a^2} \sum_{i, j : a_i = a_j = a} \big(f(x_i, a) - \psi(y_j, a)\big)^2$,   (9)

where $n_a$ is the number of logged rounds in which action $a$ was taken.
Using this definition, Algorithm 1 instantiates a version of the IGL algorithm with action-inclusive feedback. Without loss of generality, we also assume in this section that the baseline policy of Assumption 3.2(b) is known to be "sufficiently bad" for all $a$. The case where the baseline is known to be "sufficiently good" for some action can be addressed by modifying the symmetry-breaking step appropriately in Algorithm 1.
Input: Batch data $\{(x_i, a_i, y_i)\}_{i=1}^{n}$ generated by $\mu$; baseline policy $\pi^a_{\mathrm{base}}$ for each $a \in \mathcal{A}$.

$(f_a, \psi_a) \leftarrow \arg\min_{f \in \mathcal{F}, \psi \in \Psi} \widehat{\mathcal{L}}_a(f, \psi)$ for each $a \in \mathcal{A}$.   (10)
At a high level, Algorithm 1 has two separate components for each action in $\mathcal{A}$: latent state (reward) discovery (line 3) and symmetry breaking (lines 4-7).
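The symmetry-breaking component for one action can be sketched as follows: estimate the average decoded reward on rounds where the known-bad baseline took that action, and flip the decoder if it rates the baseline highly. The function and the 0.5 threshold are illustrative placeholders, not the exact procedure of Algorithm 1.

```python
def break_symmetry(decoded_on_baseline, baseline_is_bad=True, threshold=0.5):
    """Resolve the sign ambiguity of a learned decoder for one action.
    `decoded_on_baseline` holds the decoder's outputs on rounds where the
    baseline policy took the action in question. If the baseline is known
    to be bad but the decoder rates it highly, the decoder's labels are
    flipped; return a correction to apply to every decoded reward."""
    avg = sum(decoded_on_baseline) / len(decoded_on_baseline)
    flipped = (avg > threshold) if baseline_is_bad else (avg < threshold)
    return (lambda z: 1.0 - z) if flipped else (lambda z: z)
```

The returned correction is then composed with the decoder before it is used as a reward signal for policy learning.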
Theoretical guarantees
In Algorithm 1, the output policy is obtained by calling an offline contextual bandits oracle. We now formally define this oracle and its expected property. [Offline contextual bandits oracle] An algorithm $\mathbb{A}$ is called an offline contextual bandit oracle if, for any dataset $\{(x_i, a_i, \hat{r}_i)\}_{i=1}^{n}$ (where $\hat{r}_i$ is the reward determined by the learned decoder) and any policy class $\Pi$, the policy $\hat{\pi}$ produced by $\mathbb{A}$ satisfies, with high probability, $V(\hat{\pi}) \geq \max_{\pi \in \Pi} V(\pi) - \varepsilon_{\mathsf{CB}}(n)$, where $\varepsilon_{\mathsf{CB}}(n) \to 0$ as $n \to \infty$.

The notion in Definition 4 corresponds to the standard policy learning approaches in the contextual bandits literature (e.g., Langford and Zhang, 2007; Dudik et al., 2011; Agarwal et al., 2014). We now provide the theoretical analysis of Algorithm 1. In this paper, we use $\mathcal{C}_{\mathcal{F}, \Psi}$ to denote the joint statistical complexity of the classes $\mathcal{F}$ and $\Psi$; for example, if the function classes are finite, we have $\mathcal{C}_{\mathcal{F}, \Psi} = \log(|\mathcal{F}||\Psi|/\delta)$, where $\delta$ is the failure probability. Infinite function classes can be addressed by standard tools such as covering numbers or Rademacher complexity (see, e.g., Mohri et al., 2018). The following theorem provides the performance guarantee on the output policy of Algorithm 1. Suppose Assumptions 3, 3.1, and 3.2 hold, and let $\hat{\pi}$ be the output policy of Algorithm 1. If the sample size $n$ is sufficiently large, then, with high probability,
$V(\hat{\pi}) \geq \max_{\pi \in \Pi} V(\pi) - O\big(|\mathcal{A}| \sqrt{\mathcal{C}_{\mathcal{F}, \Psi} / n}\big)$.   (11)
Similar to the guarantee of Xie et al. (2021b), the learned $\hat{\pi}$ is guaranteed to converge in the right direction only after we have sufficient data for the symmetry breaking. The dependence on $|\mathcal{A}|$ in Theorem 4 can be improved, since each action has a separate learning procedure. For example, consider function classes that factor across actions, $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_{|\mathcal{A}|}$ and $\Psi = \Psi_1 \times \cdots \times \Psi_{|\mathcal{A}|}$, where $\mathcal{F}_a$ and $\Psi_a$ are independent components corresponding only to action $a$ (a common setup for linearly approximated reinforcement learning with discrete action spaces). If the per-action components are identical copies of single classes, the complexity relevant to each action is a $1/|\mathcal{A}|$ fraction of the joint complexity, which leads to a $\sqrt{|\mathcal{A}|}$ improvement.
Proof Sketch.
The proof of Theorem 4 consists of two components, discovering the latent reward and breaking the symmetry, which are formalized by the following lemmas. [Discovering latent reward] Suppose Assumptions 3 and 3.1 hold, and let $(\hat{f}_a, \hat{\psi}_a)$ be obtained from the empirical contrastive objective. Then, with high probability, the conditional means $\bar{\psi}_a(0)$ and $\bar{\psi}_a(1)$ of the learned decoder remain well separated. Lemma 4 ensures that the decoder learned from the empirical contrastive objective correctly separates the latent reward. In particular, since the decoder's output ranges over $[0, 1]$, Lemma 4 ensures the decoded reward tracks the latent reward under the behavior policy. Thus, if we can break symmetry, we can use $\hat{\psi}_a$ to generate a reward signal and reduce to ordinary contextual bandit learning. The following lemma guarantees the correctness of the symmetry-breaking step. [Breaking symmetry] Suppose Assumption 3.2 holds. For any $a \in \mathcal{A}$, given sufficiently many samples, with high probability the symmetry-breaking step identifies the correct sign of $\hat{\psi}_a$. Combining the two lemmas above with the oracle guarantee establishes the proof of Theorem 4; the detailed proof can be found in Appendix A. ∎
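To ground the final reduction step, here is a tabular stand-in for the oracle of Definition 4, using inverse-propensity-weighted estimates of the decoded rewards followed by a per-context argmax. This is a minimal sketch; real instantiations would use the cited policy-optimization methods, and all names are illustrative.

```python
def offline_cb_oracle(data, n_contexts, n_actions, mu_prob):
    """Greedy policy from IPS reward estimates on logged (x, a, r_hat) data,
    where r_hat is the decoded (symmetry-corrected) reward and mu_prob is
    the logging propensity of the uniform behavior policy."""
    q = [[0.0] * n_actions for _ in range(n_contexts)]
    counts = [0] * n_contexts
    for x, a, r_hat in data:
        q[x][a] += r_hat / mu_prob          # inverse-propensity-weighted reward
        counts[x] += 1
    policy = []
    for x in range(n_contexts):
        row = [v / max(counts[x], 1) for v in q[x]]
        policy.append(max(range(n_actions), key=row.__getitem__))
    return policy                            # deterministic argmax action per context
```

Once the decoder is learned and its sign fixed, feeding its outputs through such an oracle completes the reduction from IGL to ordinary contextual bandit learning.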
5 Empirical Evaluations
In this section, we provide empirical evaluations in simulated environments created using supervised classification datasets.
Different learning approaches based on the MNIST dataset. The gray number/image denotes the unobserved reward/feedback vector. Figure 1(a): in the contextual bandits setting, the exact reward for the selected action is observed. Figure 1(b): in IGL (full CI), the feedback vector is generated based only on the latent reward. Figure 1(c): in AI-IGL, the feedback vector can be generated based on both the latent reward and the selected action.

The task is depicted in Figure 2. We evaluate our approach by comparing: (1) CB: contextual bandits with exact rewards; (2) IGL (full CI): the method proposed by Xie et al. (2021b), which assumes the feedback vector contains no information about the context or the action given the reward; and (3) AI-IGL: the proposed method, which assumes that the feedback vector may contain information about the action but is conditionally independent of the context given the reward and action. Note that contextual bandits (CB) is a skyline relative to both IGL (full CI) and AI-IGL, since it need not disambiguate the latent reward. All methods use logistic regression with a linear representation. At test time, each method takes the argmax of the policy. We provide details of the experimental setup in Appendix B.

We evaluate our approach on different environments with action-inclusive feedback generated using supervised classification datasets. We generate high-dimensional feedback vectors for the MNIST classification dataset (Section 5.1) and low-dimensional feedback vectors for more than 200 OpenML CC-18 datasets (Section 5.2). Following the practical instantiation of Assumption 3.2 (see Section 3.2), we know that if the dataset has a balanced action distribution (no action belongs to more than 50% of the samples), selecting the uniformly random policy as the baseline for every action satisfies Assumption 3.2. Therefore, in this section, our experiments are all based on datasets with a balanced action distribution, and we select the uniformly random policy as $\pi^a_{\mathrm{base}}$ for all $a$.
5.1 Experiments on the MNIST dataset
The environment for this experiment is generated from the supervised MNIST classification dataset (LeCun et al., 1998), which is licensed under the Attribution-ShareAlike 3.0 license (https://creativecommons.org/licenses/by-sa/3.0/). At each time step, the context $x$ is an image drawn uniformly at random from the dataset. The learner then selects an action $a$ as the predicted label of $x$. The binary latent reward is the correctness of the predicted label. The high-dimensional feedback vector is an image of a digit determined jointly by the selected action and the latent reward. An example is shown in Figure 2. Our results are averaged over 20 trials.
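The environment construction can be sketched dataset-agnostically as below. The feedback channel shown, an image of the digit (a + r) mod 10, is an illustrative action-inclusive choice rather than the paper's exact construction, and all names are placeholders.

```python
import random

random.seed(1)

def env_step(images, labels, policy, n_actions=10):
    """One round of an action-inclusive MNIST-style environment.
    The feedback is an image whose digit depends on BOTH the selected
    action and the latent reward; (a + r) % 10 is an illustrative choice."""
    i = random.randrange(len(labels))
    x = images[i]
    a = policy(x)
    r = int(a == labels[i])                  # latent reward: prediction correctness
    digit = (a + r) % n_actions
    pool = [j for j, lab in enumerate(labels) if lab == digit]
    y = images[random.choice(pool)]          # feedback: an image of that digit
    return x, a, y
```

Because the feedback digit shifts with the action, a decoder that ignores the action cannot ground the reward here, which is exactly the regime where AI-IGL is needed.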
Table 1: Results in the MNIST environment with high-dimensional action-inclusive and action-exclusive feedback, comparing CB, IGL (full CI), and AI-IGL. Average and standard error reported over 20 trials for each algorithm.
To highlight that the proposed algorithm still operates well when the feedback vector does not include action information, we also perform experiments under the setting introduced by Xie et al. (2021b) on the MNIST environment. This setting is similar to the one described in Section 5.1, except the feedback vector is an image of a digit determined by the latent reward alone rather than by both the reward and the action. We find that under this action-exclusive feedback setting, the newly proposed algorithm AI-IGL works as well as IGL (full CI). This signifies that our algorithm, which accommodates the presence of action information in the feedback vector, does not hurt performance in cases where the action information is missing from the feedback vector.
5.2 Large-scale Experiments with OpenML CC-18 Datasets
To verify that our proposed algorithm scales to a variety of tasks, we evaluate performance on datasets from the publicly available OpenML Curated Classification Benchmarking Suite (Vanschoren et al., 2015; Casalicchio et al., 2019; Feurer et al., 2021; Bischl et al., 2021). OpenML datasets are licensed under the CC-BY license (https://creativecommons.org/licenses/by/4.0/), and the platform and library are licensed under the BSD 3-Clause license (https://opensource.org/licenses/BSD-3-Clause). Similar to Section 5.1, at each time step the context $x$ is drawn uniformly at random. Again, the learner selects an action $a$ as the predicted label of $x$. The binary latent reward is the correctness of the predicted label. The feedback is a two-dimensional vector that depends on both the selected action and the latent reward. Each dataset has a different sample size and a different set of available actions. We sample datasets with 3 or more actions and with a balanced action distribution (no action belongs to more than 50% of the samples) to satisfy Assumption 3.2. We use part of the data for training and the remaining data for evaluation. The results are averaged over multiple trials and shown in Table 2.
Table 2: Results on the OpenML CC-18 datasets.
Analysis of dataset properties in relation to AI-IGL's performance
To further understand under which conditions AI-IGL succeeds, we analyzed its performance against 12 different features (based on findings from Torra et al. (2008); Abdelmessih et al. (2010); Reif et al. (2014); Lorena et al. (2019)) for each OpenML dataset (more details in Appendix B.3). We consider 3 additional features based on the following criteria: compared to the typical CB guarantee, AI-IGL needs one more factor in its theoretical guarantee (Theorem 4), and the factors can possibly be improved under specific choices of function class (see the discussion in Section 4). We then select the corresponding dataset quantities as additional features for predicting the relative performance of AI-IGL.
We use a binary random forest classifier to predict the success of AI-IGL's performance relative to CB. If the relative performance exceeds a threshold, we label the dataset as a success. Across all datasets, 10-fold cross-validation using 100 trees gives a strong average F1-score. The most predictive feature of AI-IGL's relative performance can alone predict its performance with an average F1-score of 0.79 under the same experimental setup (Figure 3(a)). However, for datasets with small sample sizes, there is high variability in relative performance. On such a subset of datasets, we find the maximum Fisher discriminant (Lorena et al., 2019), a measure of classification complexity that quantifies the informativeness of a given sample, to be the most predictive of relative performance (Figure 3(b)). Such datasets are representative of realistic interaction datasets with small sample sizes. More details on the experimental analysis are in Appendix B.3 (Figure 5 and Figure 6). This finding makes it possible to predict, for a given dataset, whether AI-IGL can match CB performance. It can also help researchers improve the design of novel applications of IGL, e.g., in HCI and BCI, by ensuring the resulting dataset's features are amenable to high performance.

6 Discussion
We have presented a new approach to solving Interaction-Grounded Learning, in which an agent learns to interact with the environment in the absence of any explicit reward signal. Compared to the prior solution (Xie et al., 2021b), the proposed AI-IGL approach removes the assumption of conditional independence between the actions and the feedback vector by treating the latent reward as a latent state. It thereby provably solves IGL for action-inclusive feedback vectors.
By viewing the feedback as containing an action and (latent) reward pair forming an unobserved latent space, we propose latent reward learning using a contrastive learning approach. This solution concept naturally connects to latent state discovery in rich-observation reinforcement learning (e.g., Dann et al., 2018; Du et al., 2019; Misra et al., 2020). On the other hand, different from rich-observation RL, IGL also contains a unique challenge in identifying the semantic meaning of the decoded class, which is addressed by a symmetry-breaking procedure. In this work, we focus on binary latent rewards, for which symmetry breaking is possible using one policy (per action). Breaking symmetry in more general latent reward spaces is a topic for future work. A possible negative societal impact of this work is performance instability, especially with inappropriate use of the techniques in risk-sensitive applications.
Barring intentional misuse, we envision several potential benefits of the proposed approach. The proposed algorithm broadens the scope of IGL's feasibility for real-world applications (where action signals are included in the feedback). Imagine an agent being trained to interpret the brain signals of a user to control a prosthetic arm. The brain's response (feedback vector) to an action is a neural signal that may contain information about the action itself. This is so prevalent in neuroimaging that fMRI studies routinely use specialized techniques to orthogonalize different information (e.g., action and reward) within the same signal (Momennejad and Haynes, 2012, 2013; Belghazi et al., 2018; Shah and Peters, 2020). Another example is a continually self-calibrating eye tracker, used by people with motor disabilities such as ALS (Hansen et al., 2004; Liu et al., 2010; Mott et al., 2017; Gao et al., 2021). A learning agent adapting to the abilities of such users can encounter feedback directly influenced by the calibration correction action. The proposed approach is a stepping stone on the path to solving IGL in such complex settings, overcoming the need for explicit rewards as well as explicit separation of action and feedback information.
Acknowledgment
The authors would like to thank Mark Rucker for sharing methods and code for computing and comparing diagnostic features of OpenML datasets. NJ acknowledges funding support from ARL Cooperative Agreement W911NF-17-2-0196, NSF IIS-2112471, an NSF CAREER award, and an Adobe Data Science Research Award.
References
Landmarking for meta-learning using RapidMiner. In RapidMiner Community Meeting and Conference.
Taming the monster: a fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646.
Accelerated robot learning via human brain signals. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3799–3805.
Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Proceedings of Machine Learning Research, Vol. 80, pp. 530–539.
OpenML benchmarking suites. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks).
OpenML: an R package to connect to the machine learning platform OpenML. Computational Statistics 34(3), pp. 977–991.
A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607.
The EMPATHIC framework for task learning from implicit human feedback. In Conference on Robot Learning, pp. 604–626.
On oracle-efficient PAC RL with rich observations. Advances in Neural Information Processing Systems 31.
Closed-loop training of attention with real-time brain imaging. Nature Neuroscience 18(3), pp. 470–475.
Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pp. 1665–1674.
Gaze-based interaction: a 30 year retrospective. Computers & Graphics 73, pp. 59–69.
Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pp. 169–178.
OpenML-Python: an extensible Python API for OpenML. Journal of Machine Learning Research 22, pp. 1–5.
Multimodal feedback in HCI: haptics, non-speech audio, and their applications. In The Handbook of Multimodal-Multisensor Interfaces: Foundations, User Modeling, and Common Modality Combinations, Volume 1, pp. 277–317.
X2T: training an X-to-text typing interface with online learning from user feedback. In 9th International Conference on Learning Representations (ICLR).
Calibration-free BCI based control. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 28.
Gaze typing compared with input by head and hand. In Proceedings of the 2004 Symposium on Eye Tracking Research & Applications, pp. 131–138.
Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems 34.
A collaborative BCI approach to autonomous control of a prosthetic limb system. In 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1479–1482.
The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems 20.
Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
 An eyegaze tracking and human computer interface system for people with als and other lockedin diseases. CMBES Proceedings 33. Cited by: §6.
 How complex is your classification problem? a survey on measuring classification complexity. ACM Computing Surveys (CSUR) 52 (5), pp. 1–34. Cited by: 11st item, 12nd item, 3rd item, 4th item, 6th item, 7th item, §B.3, §B.3, §5.2, §5.2.
 Closedloop cognition: the next frontier arrives. Trends in cognitive sciences 19 (5), pp. 242–243. Cited by: §1.
 Kinematic state abstraction and provably efficient richobservation reinforcement learning. In International conference on machine learning, pp. 6961–6971. Cited by: §1, 0(a), §3.4, §6.
 Foundations of machine learning. MIT press. Cited by: §4.
 Human anterior prefrontal cortex encodes the ‘what’and ‘when’of future intentions. Neuroimage 61 (1), pp. 139–148. Cited by: §6.
 Encoding of prospective tasks in the human prefrontal cortex under varying task loads. Journal of Neuroscience 33 (44), pp. 17342–17349. Cited by: §6.
 Improving dwellbased gaze typing with dynamic, cascading dwell times. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 2558–2570. Cited by: §1, §6.
 Delineating implicit and explicit processes in neurofeedback learning. Neuroscience & Biobehavioral Reviews 118, pp. 681–688. Cited by: §1.
 Interactive learning from activity description. In International Conference on Machine Learning, pp. 8096–8108. Cited by: §1.
 Google home: smart speaker as environmental control unit. Disability and rehabilitation: assistive technology 13 (7), pp. 674–675. Cited by: §1.
 Toward an affectsensitive multimodal humancomputer interaction. Proceedings of the IEEE 91 (9), pp. 1370–1390. Cited by: §1.
 Towards intrinsic interactive reinforcement learning. arXiv preprint arXiv:2112.01575. Cited by: §1.
 Automatic classifier selection for nonexperts. Pattern Analysis and Applications 17 (1), pp. 83–96. Cited by: 1st item, 2nd item, 8th item, 9th item, §B.3, §5.2.
 Human gaze following for humanrobot interaction. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8615–8621. Cited by: §1.
 Understanding teacher gaze patterns for robot learning. In Conference on Robot Learning, pp. 1247–1258. Cited by: §1.
 The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics 48 (3), pp. 1514–1538. Cited by: §6.
 Modeling decisions for artificial intelligence. International Journal of Intelligent Systems 23 (2), pp. 113. Cited by: 10th item, 8th item, §B.3, §5.2.
 OpenML: a networked science platform for machine learning. In ICML 2015 MLOSS Workshop, Vol. 3. Cited by: §5.2.
 Multimodal feedback: an assessment of performance and mental workload. Ergonomics 46 (13), pp. 68–87. Cited by: §1.

Unsupervised feature learning via nonparametric instance discrimination.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pp. 3733–3742. Cited by: §3.1.  Policy finetuning: bridging sampleefficient offline and online reinforcement learning. Advances in neural information processing systems 34. Cited by: by [Xie et al., 2021a, Lemma A.1].
 Interactiongrounded learning. In International Conference on Machine Learning, pp. 11414–11423. Cited by: §B.1, §1, §1, §2, §2, 0(c), §3.2, §3.3, §4, 1(b), §5.1, §5, §6.
Appendix A Detailed Proofs
Proposition 3.1.
Throughout this proof, we follow the definition of eq:def_far_psiar for each . For any , let and . Then
(12)  
(by )  
(13) 
and
(14)  
(15)  
(16) 
To see why (a) and (b) hold, we use the following argument. By definition of and , we have
(17)  
(18)  
(19)  
(20) 
Then, we have
(21)  
(22)  
(23)  
(24) 
A similar procedure yields the remaining terms of (a) and (b).
Therefore, combining the two equalities above, we obtain
(25)  
(26)  
(27)  
(28) 
This completes the proof. ∎
Lower bound of when .
For any , let