1 Introduction
The stochastic contextual multi-armed bandit (MAB) (Dani et al., 2008; Lattimore and Szepesvári, 2020) has been studied for decades in the machine learning community as a model of sequential decision making, with applications in online advertising (Li et al., 2010), personalized recommendation (Wu et al., 2016; Ban and He, 2021b), etc. In the standard contextual bandit setting, a set of arms is presented to a learner in each round, where each arm is represented by a context vector. Then, following a certain strategy, the learner selects and plays one arm, receiving a reward. The goal of this problem is to maximize the cumulative reward over all rounds.
MAB algorithms have principled approaches to address the trade-off between Exploitation and Exploration (EE): the data collected in past rounds should be exploited to get good rewards, but under-explored arms also need to be explored in the hope of getting even better rewards. The most widely used approaches for the EE trade-off can be classified into three main techniques: Epsilon-greedy (Langford and Zhang, 2008), Thompson Sampling (TS) (Thompson, 1933), and Upper Confidence Bound (UCB) (Auer, 2002).
Linear contextual bandits (Dani et al., 2008; Li et al., 2010; Abbasi-Yadkori et al., 2011), where the reward is assumed to be a linear function of the arm vector, have been well studied and have succeeded both empirically and theoretically. Given an arm, ridge regression is usually adopted to estimate its reward based on the data collected in past rounds. UCB-based algorithms (Li et al., 2010; Chu et al., 2011; Wu et al., 2016; Ban and He, 2021b) calculate an upper bound for the confidence ellipsoid of the estimated reward and select the arm according to the sum of the estimated reward and the UCB. TS-based algorithms (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) formulate each arm as a posterior distribution whose mean is the estimated reward and choose the arm with the maximal sampled reward. However, the linear assumption regarding the reward may not hold in real-world applications (Valko et al., 2013b).
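To make the linear UCB recipe above concrete, the following is a minimal Python sketch of ridge regression plus a confidence bonus. It assumes a single shared linear model and a bonus of the form alpha * sqrt(x^T A^{-1} x); the class name `LinUCBSketch`, the parameters `alpha` and `lam`, and the Sherman-Morrison update are illustrative choices, not details taken from the cited works.

```python
import math


def mat_vec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class LinUCBSketch:
    """Hypothetical minimal linear UCB learner (illustration only)."""

    def __init__(self, dim, alpha=1.0, lam=1.0):
        self.alpha = alpha  # assumed exploration constant
        # Inverse of the ridge Gram matrix A = lam * I + sum of x x^T.
        self.A_inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # sum of reward-weighted arm vectors

    def score(self, x):
        theta = mat_vec(self.A_inv, self.b)          # ridge estimate
        bonus = self.alpha * math.sqrt(dot(x, mat_vec(self.A_inv, x)))
        return dot(theta, x) + bonus                 # estimate + UCB

    def select(self, arms):
        scores = [self.score(x) for x in arms]
        return max(range(len(arms)), key=scores.__getitem__)

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} after adding x x^T.
        Ax = mat_vec(self.A_inv, x)
        denom = 1.0 + dot(x, Ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]
```

With `alpha=0` the learner is purely greedy on the ridge estimate; a larger `alpha` shifts selection toward arms whose direction has been observed less often.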
To learn non-linear reward functions, recent works have utilized deep neural networks, thanks to their powerful representation ability. Considering the past selected arms and received rewards as training samples, a neural network is built for exploitation. Zhou et al. (2020) compute a gradient-based upper confidence bound with respect to this network and use the UCB strategy to select arms. Zhang et al. (2020) formulate each arm as a normal distribution whose mean is the network's reward estimate and whose deviation is calculated from the gradient of the network, and then use the TS strategy to choose arms. Both Zhou et al. (2020) and Zhang et al. (2020) achieve near-optimal regret bounds.
In this paper, we propose a neural-based bandit algorithm with a novel exploration strategy, named "EENet". Similar to other neural bandits, EENet has an exploitation network to estimate the reward of each arm. The crucial difference from existing works is that EENet has an exploration network to predict the potential gain of each arm compared to the current reward estimate. The input to the exploration network is the gradient of the exploitation network, and the ground truth is the residual between the received reward and the estimated reward. The strategy is inspired by recent advances in neural-based UCB (Ban et al., 2021; Zhou et al., 2020). Finally, a decision-maker is constructed to select arms. The decision-maker has two modes: linear or non-linear. In the linear mode, it is a linear combination of the exploitation and exploration scores, inspired by the UCB strategy. In the non-linear mode, it is formulated as a neural network that takes the exploitation and exploration scores as input, and the goal is to learn, for each arm, the probability of being an optimal arm. Table 1 summarizes how the selection criterion of EENet differs from those of other neural bandit algorithms. The contributions of this paper can be summarized as follows:
We propose a novel exploration strategy, EENet, where a neural network is assigned to learn the potential gain compared to the current estimation.

Under standard assumptions of over-parameterized neural networks, we prove a regret upper bound for EENet that is tighter than those of existing state-of-the-art bandit algorithms.

We conduct extensive experiments on four real-world datasets, showing that EENet outperforms baselines spanning greedy, TS, and UCB strategies and becomes the new state-of-the-art exploration policy.
Next, we will show the standard problem definition and elaborate the proposed EENet, before we present our theoretical analysis. In the end, we provide the empirical evaluation and conclusion.
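Table 1: Selection criterion of EENet compared with other neural bandit algorithms.
Methods  Selection Criterion

Neural Epsilon-greedy  With a preset probability, select the arm with the highest estimated reward; otherwise, select randomly.
NeuralTS (Zhang et al., 2020)  For each arm, draw a reward from a normal distribution whose mean is the estimated reward and whose deviation is computed from the gradient. Then, select the arm with the maximal sampled reward.
NeuralUCB (Zhou et al., 2020)  Select the arm with the maximal sum of the estimated reward and the gradient-based UCB.
EENet (Our approach)  For each arm, compute the exploitation score and the exploration score (Exploration Net). Then, select the arm with the highest decision-maker score.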
2 Related Work
Constrained Contextual bandits. The common constraint placed on the reward function is the linear assumption, with the reward usually estimated by ridge regression (Dani et al., 2008; Li et al., 2010; Abbasi-Yadkori et al., 2011; Valko et al., 2013a). The linear UCB-based bandit algorithms (Abbasi-Yadkori et al., 2011; Li et al., 2016) and the linear Thompson Sampling (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) achieve strong empirical performance and near-optimal regret bounds. To break the linear assumption, Filippi et al. (2010) generalize the reward function to a composition of linear and non-linear functions and then adopt a UCB-based algorithm to deal with it; Bubeck et al. (2011) impose the Lipschitz property on the reward metric space and construct a hierarchical optimistic optimization to make selections; Valko et al. (2013b) embed the reward function into a Reproducing Kernel Hilbert Space and propose kernelized TS/UCB bandit algorithms.
Neural Bandits. To learn non-linear reward functions, deep neural networks have been adapted to bandits in various forms. Riquelme et al. (2018) and Lu and Van Roy (2017) build L-layer DNNs to learn the arm embeddings and apply Thompson sampling on the last layer for exploration. Zhou et al. (2020) first introduce a provable neural-based contextual bandit algorithm with a UCB exploration strategy, and Zhang et al. (2020) then extend the neural network to the Thompson sampling framework. Their regret analysis is built on recent advances in the convergence theory of over-parameterized neural networks (Du et al., 2019; Allen-Zhu et al., 2019) and utilizes the Neural Tangent Kernel (Jacot et al., 2018; Arora et al., 2019) to construct connections with linear contextual bandits (Abbasi-Yadkori et al., 2011). Ban and He (2021a) further adopt convolutional neural networks with UCB exploration, aiming at visual-aware applications. Xu et al. (2020) perform UCB-based exploration on the last layer of neural networks to reduce the computational cost brought by gradient-based UCB. Different from the above existing works, EENet keeps the powerful representation ability of neural networks to learn the reward function and is the first to assign another neural network to determine exploration.
3 Problem definition
We consider the standard contextual multiarmed bandit with the known number of rounds (Zhou et al., 2020; Zhang et al., 2020). In each round , where the sequence , the learner is presented with arms, in which each arm is represented by a feature vector for each . After playing one arm , its reward is assumed to be generated by the function:
(1) 
where the unknown reward function can be either linear or nonlinear and the noise is drawn from certain distribution with expectation . Following many existing works (Zhou et al., 2020; Ban et al., 2021; Zhang et al., 2020), we consider bounded rewards, . For the brevity, we denote the selected arm in round by and the reward received in by . The standard regret of rounds is defined as:
(2) 
where . The goal of this problem is to minimize by certain selection strategy.
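As an illustration of how regret accumulates over rounds, the sketch below sums, per round, the gap between the best available expected reward and the expected reward of the played arm. This pseudo-regret form and the function name `cumulative_regret` are assumptions made for the example; the definition in Eq. (2) takes precedence.

```python
def cumulative_regret(expected_rewards, chosen):
    """Return the running regret after each round.

    expected_rewards[t][i] is the (assumed known, for evaluation only)
    expected reward of arm i in round t; chosen[t] is the index played.
    """
    total = 0.0
    curve = []
    for rewards_t, arm_t in zip(expected_rewards, chosen):
        # Per-round regret: best expected reward minus the one obtained.
        total += max(rewards_t) - rewards_t[arm_t]
        curve.append(total)
    return curve
```

A learner that always plays the best arm yields a flat curve at zero; the experiments in this paper compare methods through curves of this kind.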
Notation. We denote by the sequence . We use or to denote the Euclidean norm for a vector , and and to denote the spectral and Frobenius norm for a matrix . We use to denote the standard inner product between two vectors or two matrices.
4 Proposed Method: EENet
EENet is composed of three independent components whose inputs and outputs are closely correlated. The first component is the exploitation network, which learns the unknown reward function based on the data collected in past rounds. The second component is the exploration network, which measures how much exploration effort we should make in the present round. The third component is the decision-maker, which further trades off between exploitation and exploration and makes the selection.
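The interplay of the three components within one round can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: the function `eenet_round`, its callable arguments (`f1`, `grad_f1`, `f2`, `decide`, `pull`) and the `buffers` dictionary are hypothetical names, and the actual networks and gradient-descent updates are left out.

```python
def eenet_round(arms, f1, grad_f1, f2, decide, pull, buffers):
    """Run one schematic round and store the new training samples."""
    # Exploitation scores from the exploitation network.
    s1 = [f1(x) for x in arms]
    # The exploration network takes the exploitation network's gradient as input.
    grads = [grad_f1(x) for x in arms]
    s2 = [f2(g) for g in grads]
    # The decision-maker trades off both scores and picks an arm index.
    chosen = decide(s1, s2)
    reward = pull(chosen)
    # Training samples: (arm, reward) for exploitation,
    # (gradient, residual) for exploration, (scores, reward) for the decision-maker.
    buffers["f1"].append((arms[chosen], reward))
    buffers["f2"].append((grads[chosen], reward - s1[chosen]))
    buffers["f3"].append(((s1[chosen], s2[chosen]), reward))
    return chosen, reward
```

After each call, the three buffers hold the samples on which the respective networks would be updated by gradient descent.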
1) Exploitation Net. A neural network model is provided to learn the mapping from arms to rewards. In round , denote the network by , where the superscript of is the index of network and the subscript represents the round where the parameters of finished the last update. Given an arm , is considered the "exploitation score" for . By some criterion, after playing arm , we receive a reward . Therefore, we can conduct gradient descent to update based on the collected training samples and denote the updated parameters by .
Input  Network  Label 

(Exploitation)  
(Exploration)  or or  
(Decisionmaker with nonlinear function) 
2) Exploration Net. Our exploration strategy is inspired by existing UCBbased neural bandits (Zhou et al., 2020; Ban et al., 2021). Based on the Lemma 5.2 in (Ban et al., 2021), given an arm , with probability at least , we have the following form:
where is defined in Eq. (1) and is the upper confidence bound represented by a function with respect to the gradient (See more details and discussions in Appendix D). Then we have the following definition.
Definition 4.1.
Given an arm , we define as the "expected potential gain" for and as the "potential gain" for .
Let . When , the arm has positive potential gain compared to the estimated reward . A large positive makes the arm more suitable for exploration, whereas a small (or negative) makes the arm unsuitable for exploration. Recall that traditional approaches such as UCB effectively compute such potential gain using standard tools, e.g., Markov inequality, Hoeffding bounds, etc. from large deviation bounds.
Instead of calculating a largedeviation based statistical form for , we use a neural network to learn , where the input is and the ground truth is . Adopting gradient as the input also is due to the fact that it incorporates two aspects of information: the feature of arm and the discriminative information of .
To sum up, we consider as the "exploration score" of , because it indicates the potential gain of compared to our current exploitation score . Given the selected arm , let . Therefore, in round , we can use gradient descent to update based on collected training samples
. We also provide two other heuristic forms for the ground truth: the absolute value of the residual and the ReLU of the residual. We compare them in an ablation study in Appendix A.
3) Decision-maker. In each round, given an arm with its computed exploitation score and exploration score, we use a decision function to trade off between exploitation and exploration and compute the final score for the arm. The selection criterion is defined as
Note that the decision function can be either linear or non-linear. We provide the following two forms.
(1) Linear function. The decision function can be formulated as a weighted sum of the exploitation score and the exploration score, where the two weights are preset by the learner. When both weights equal 1, it can be thought of as a UCB-type policy, where the estimated reward and the potential gain are simply added together. We report its empirical performance in the ablation study (Appendix A).
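The linear form amounts to a weighted sum followed by an argmax over arms. A minimal sketch, in which the function name `linear_decision` and the parameter names `w1` and `w2` are illustrative:

```python
def linear_decision(exploit_scores, explore_scores, w1=1.0, w2=1.0):
    """Combine the two scores linearly and return (best index, final scores)."""
    final = [w1 * s1 + w2 * s2
             for s1, s2 in zip(exploit_scores, explore_scores)]
    best = max(range(len(final)), key=final.__getitem__)
    return best, final
```

Setting `w2=0` recovers a purely greedy choice on the exploitation score, while equal weights give the UCB-type sum of estimated reward and potential gain.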
(2) Non-linear function. The decision function can also be formulated as a neural network that learns the mapping from the exploitation and exploration scores to the optimal arm. We transform the bandit problem into a binary classification problem: given an arm, we define the probability of it being the optimal arm in the current round. For brevity, the probability for the selected arm is indexed by the round only. Depending on the reward distribution, we have different approaches to determine this probability.

Binary reward. Suppose the reward is a binary variable; it is then straightforward to set the probability to 1 if the received reward is 1, and to 0 otherwise.

Continuous reward. Suppose the reward is a continuous variable over a bounded range; we provide two ways to determine the probability. (1) It can be set directly from the received reward. (2) The learner can set a threshold; the probability is then 1 if the received reward exceeds the threshold, and 0 otherwise.
Therefore, with the training samples collected up to the current round, we can conduct gradient descent to update the parameters of the decision-maker. Table 2 details the working structure of EENet and Algorithm 1 depicts the workflow of EENet.
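The labelling rules for the non-linear decision-maker can be sketched as a small helper. The function name `optimal_arm_label`, the mode strings, the strict comparison against the threshold, and the assumption that a directly used continuous reward already lies in [0, 1] are all illustrative choices rather than details confirmed by the text.

```python
def optimal_arm_label(reward, mode="binary", threshold=None):
    """Turn a received reward into a training label for the decision-maker."""
    if mode == "binary":
        # Binary reward: label 1 exactly when the reward is 1.
        return 1.0 if reward == 1 else 0.0
    if mode == "direct":
        # Continuous reward, option (1): use the reward itself
        # (assumes it is already scaled to [0, 1]).
        return float(reward)
    if mode == "threshold":
        # Continuous reward, option (2): compare with a preset threshold.
        if threshold is None:
            raise ValueError("threshold mode requires a threshold")
        return 1.0 if reward > threshold else 0.0
    raise ValueError("unknown mode: " + mode)
```

Each label is paired with the exploitation and exploration scores of the selected arm to form one training sample for the decision-maker.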
Remark 4.1.
The networks can be different structures according to different applications. For example, in the vision tasks, can be set up as convolutional layers (LeCun et al., 1995).∎
Remark 4.2.
For the exploration network , the input may have exploding dimensions when the exploitation network becomes wide and deep, which may cause huge computation cost for . To address this challenge, we can apply dimensionality reduction techniques (Roweis and Saul, 2000; Van Der Maaten et al., 2009) to obtain lowdimensional vectors of . In the experiments, we use Roweis and Saul (2000) to acquire a dimensional vector for and achieve the best performance among all baselines.∎
5 Regret Analysis
In this section, we provide the regret analysis of EENet when is set as the linear function , which can be thought of as the UCBtype tradeoff between exploitation and exploration. For the sake of simplicity, we conduct the regret analysis on some unknown but fixed data distribution . In each round , samples are drawn from , where is the representation of arm satisfying and is the corresponding reward satisfying , which are standard assumptions in neural bandits (Zhou et al., 2020; Zhang et al., 2020).
The analysis will focus on overparameterized neural networks (Jacot et al., 2018; Du et al., 2019; AllenZhu et al., 2019). Given an input , without loss of generality, we define the fullyconnected network with depth and width :
(3) 
where
is the ReLU activation function,
, , for , , and . In round , given the collected data, the loss function is defined as:
(4) 
Initialization. For any , each entry of is drawn from the normal distribution . Note that EENet at most has three networks . We define them following the definition of for brevity, although they may have different depth or width. Then, we have the following theorem for EENet. Recall that are the learning rates for ; is the number of iterations of gradient descent for in each round; and is the number of iterations for .
Theorem 1.
Comparison with NeuralUCB/TS. Under the same assumptions in overparameterized neural networks, the regret bounds complexity of NeuralUCB (Zhou et al., 2020) and NeuralTS (Zhang et al., 2020) both are
where
and is the neural tangent kernel matrix (NTK) (Jacot et al., 2018; Arora et al., 2019) and is a regularization parameter.
Remark 5.1.
It is easy to observe that the regret bound of EENet is tighter than NeuralUCB/TS, which roughly improves by a multiplicative factor of , because our proof of EENet is directly built on recent advances in convergence theory (AllenZhu et al., 2019) and generalization bound (Cao and Gu, 2019) of overparameterized neural networks. Instead, the analysis for NeuralUCB/TS follows the proof flow of linear contextual bandits (AbbasiYadkori et al., 2011) to calculate the distance among network function, NTK, and ridge regression.∎
Remark 5.2.
The regret bound of EENet does not have the effective dimension which is a considerable multiplicative factor when the input dimension is extremely large. The effective dimension is first introduced by Valko et al. (2013b) to measure the underlying dimensions of observed context. Although can be upper bounded to some dimensional subspace of reproducing kernel hilbert space (RKHS) by NTK (Zhang et al., 2020), their regret bound still has the multiplicative factor , but EENet does not have this factor.∎
6 Experiments
In this section, we evaluate EENet on four real-world datasets, comparing with strong state-of-the-art baselines. We first present the setup of the experiments, then show the regret comparison and report the ablation study. For reproducibility, all the code has been released anonymously (https://anonymous.4open.science/r/iclr_2022B085/README.md).
MNIST dataset. MNIST is a well-known image dataset (LeCun et al., 1998) for the 10-class classification problem. Following the evaluation setting of existing works (Valko et al., 2013b; Zhou et al., 2020; Zhang et al., 2020), we transform this classification problem into a bandit problem. Given an image, we aim to classify it into one of 10 classes. In each round, the image is transformed into 10 arms, represented by 10 vectors in sequence, and presented to the learner. The reward is 1 if the index of the selected arm matches the index of the image's ground-truth class; otherwise, the reward is 0.
Yelp (https://www.yelp.com/dataset) and MovieLens (Harper and Konstan, 2015) datasets. Yelp is a dataset released in the Yelp dataset challenge, which consists of 4.7 million rating entries for restaurants by users. MovieLens is a dataset consisting of ratings between users and movies. We build the rating matrix by choosing the top users and top restaurants (movies) and use singular-value decomposition (SVD) to extract a feature vector for each user and restaurant (movie). In these two datasets, the bandit algorithm is to choose the restaurants (movies) with bad ratings. We generate the reward from the stars that users give to a restaurant (movie). In each rating record, if the user scores a restaurant (movie) less than 2 stars (out of 5), its reward is 1; otherwise, its reward is 0. In each round, we set up the arms as follows: we randomly choose one restaurant (movie) with reward 1 and randomly pick the other restaurants (movies) with reward 0; then, the representation of each arm is the concatenation of the corresponding user feature vector and restaurant (movie) feature vector.
Disin (Ahmed et al., 2018) dataset. Disin is a fake news dataset on Kaggle (https://www.kaggle.com/clmentbisaillon/fakeandrealnewsdataset) including 12600 fake news articles and 12600 truthful news articles, where each article is represented by its text. To transform the text into vectors, we use the approach of Fu and He (2021) to represent each article by a 300-dimension vector. Similarly, we form a 10-arm pool in each round, where 9 real news articles and 1 fake news article are randomly selected. If the fake news article is selected, the reward is 1; otherwise, the reward is 0.
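The two ways of building arms described for these datasets can be sketched as below. The block-placement layout in `image_to_arms` (the image copied into the k-th block of a zero vector) is an assumption borrowed from common practice in the cited evaluation settings, and the rewards 1 and 0 as well as all function names are illustrative.

```python
import random


def image_to_arms(x, num_classes=10):
    """Turn one feature vector into num_classes arms (assumed block layout)."""
    d = len(x)
    arms = []
    for k in range(num_classes):
        arm = [0.0] * (d * num_classes)
        arm[k * d:(k + 1) * d] = x  # place the image in the k-th block
        arms.append(arm)
    return arms


def classification_reward(selected_index, true_class):
    """Reward 1 when the chosen arm index matches the true class, else 0."""
    return 1.0 if selected_index == true_class else 0.0


def sample_arm_pool(positives, negatives, pool_size=10, rng=random):
    """Build a pool with one reward-1 item and pool_size - 1 reward-0 items."""
    pool = [(rng.choice(positives), 1.0)]
    pool += [(item, 0.0) for item in rng.sample(negatives, pool_size - 1)]
    rng.shuffle(pool)  # hide the position of the rewarding arm
    return pool
```

For the rating and news datasets, `positives` would hold the items with reward 1 (badly rated items or fake articles) and `negatives` the remaining ones.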
Baselines. To comprehensively evaluate EENet, we choose neural-based bandit algorithms as well as one linear and one kernelized bandit algorithm.

LinUCB (Li et al., 2010) explicitly assumes the reward is a linear function of the arm vector and an unknown user parameter, and then applies ridge regression and an upper confidence bound to determine the selected arm.

KernelUCB (Valko et al., 2013a) adopts a predefined kernel matrix on the reward space combined with a UCB-based exploration strategy.

NeuralNoExplore only uses the exploitation network and selects an arm by the greedy strategy.

NeuralEpsilon adapts the epsilon-greedy exploration strategy to the exploitation network, i.e., with a preset probability the arm is chosen randomly, and otherwise the arm with the highest exploitation score is selected.

NeuralUCB (Zhou et al., 2020) uses the exploitation network to learn the reward function, coupled with a UCB-based exploration strategy.

NeuralTS (Zhang et al., 2020) adopts the exploitation network to learn the reward function, coupled with a Thompson Sampling exploration strategy.
Note that we do not report results of LinTS and KernelTS in the experiments because of the limited space in figures; LinTS and KernelTS have been significantly outperformed by NeuralTS (Zhang et al., 2020).
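The two simplest neural baselines differ only in whether a random choice is ever made. A minimal sketch of the epsilon-greedy rule, where the function name `epsilon_greedy_select` and the injectable `rng` argument are illustrative:

```python
import random


def epsilon_greedy_select(exploit_scores, epsilon, rng=random):
    """Pick a random arm with probability epsilon, else the greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(exploit_scores))
    return max(range(len(exploit_scores)), key=exploit_scores.__getitem__)
```

With `epsilon=0` this reduces to the greedy strategy of NeuralNoExplore; the exploration probability stays fixed regardless of how uncertain the estimates are.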
Setup for EENet. To compare fairly, for all the neural-based methods including EENet, the exploitation network is a 2-layer fully-connected network with width 100. For the exploration network, we use a 2-layer fully-connected network with width 100 as well. For the decision maker, by comprehensively evaluating both linear and non-linear functions, we found that the most effective approach is to combine them, which we call the "hybrid decision maker". In detail, in the early rounds the decision maker is set as the linear function, and afterwards it is set as a neural network with two 20-width fully-connected layers. The reason for this setting is that the linear decision maker maintains stable performance in each run (robustness), whereas the non-linear decision maker lacks stability but can further improve the performance (see details in Appendix A). The hybrid decision maker combines these two advantages. For all the neural networks, we use the Adam optimizer (Kingma and Ba, 2014) with a preset learning rate.
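The hybrid decision maker can be sketched as a switch on the round index. The switching round is not recoverable from the text here, so `switch_round` is a hypothetical parameter, and `nonlinear_fn` stands in for the trained 2-layer network.

```python
def hybrid_decision(t, exploit_scores, explore_scores, nonlinear_fn, switch_round):
    """Use the linear rule up to switch_round, the non-linear one afterwards."""
    pairs = list(zip(exploit_scores, explore_scores))
    if t <= switch_round:
        # Linear mode: exploitation score plus exploration score.
        final = [s1 + s2 for s1, s2 in pairs]
    else:
        # Non-linear mode: a learned function of both scores
        # (passed in as a callable in this sketch).
        final = [nonlinear_fn(s1, s2) for s1, s2 in pairs]
    return max(range(len(final)), key=final.__getitem__)
```

The design intent, as described above, is to rely on the stable linear rule while few samples are available and to hand over to the learned rule once it has enough data.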
6.1 Regret Comparison
Configurations. For LinUCB, following Li et al. (2010), we do a grid search for the exploration constant, which tunes the scale of the UCB. For KernelUCB (Valko et al., 2013a), we use the radial basis function kernel and stop adding contexts after 1000 rounds, following Valko et al. (2013b) and Zhou et al. (2020). For the regularization parameter and exploration parameter in KernelUCB, we do a grid search as well. For NeuralUCB and NeuralTS, following the settings of Zhou et al. (2020) and Zhang et al. (2020), we use the exploitation network and conduct a grid search for the exploration parameter and the regularization parameter. For NeuralEpsilon, we use the same neural network and do a grid search for the exploration probability. NeuralNoExplore uses the same neural network as well. For NeuralUCB/TS, following their setting, since storing and computing the whole gradient matrix is expensive, we use a diagonal matrix as an approximation. For all grid-searched parameters, we choose the best value for the comparison and report the results averaged over multiple runs for all methods.
Results. Figure 1 and Figure 2 show the regret comparison on these four datasets. EENet consistently outperforms all baselines across all datasets. For LinUCB and KernelUCB, the simple linear reward function or predefined kernel cannot properly model the ground-truth reward function in real-world datasets. In particular, on the MNIST and Disin datasets, the correlations between rewards and arm feature vectors are not linear or some simple mappings. Thus, LinUCB and KernelUCB can barely exploit the past collected data samples and fail to select the correct arms. For neural-based bandit algorithms, as NeuralNoExplore has no exploration component, its rates of collecting new samples and learning new knowledge are unstable and usually delayed. Therefore, NeuralNoExplore is usually inferior to the methods with exploration. The exploration probability of NeuralEpsilon is fixed and difficult to adjust, so it is usually hard for it to explore effectively.
To explore, NeuralUCB statistically calculates a gradient-based upper confidence bound, and NeuralTS draws each arm's predicted reward from a normal distribution whose standard deviation is computed from the gradient. However, the confidence bound or standard deviation they calculate only considers the worst case and thus may not be able to represent the actual potential of each arm. Instead, EENet uses a neural network to learn each arm's potential, leveraging the network's powerful representation ability. Therefore, EENet can outperform these two state-of-the-art bandit algorithms. Note that NeuralUCB/TS needs two parameters to tune UCB/TS according to different scenarios, while EENet only needs to set up a neural network and learns it automatically.
Ablation Study. In Appendix A, we conduct an ablation study regarding the label function of the exploration network and the different settings of the decision maker. To sum up, the residual label usually outperforms its absolute-value and ReLU variants empirically, and the proposed hybrid setting of the decision maker often achieves the best performance compared to the purely linear or non-linear functions.
7 Conclusion
In this paper, we propose a novel exploration strategy, EENet. In addition to a neural network that exploits the data collected in past rounds, EENet has another neural network that learns the potential gain compared to the current estimation for exploration. Then, a decision maker is built to make selections, further trading off between exploitation and exploration. We demonstrate that EENet outperforms NeuralUCB and NeuralTS both theoretically and empirically, becoming the new state-of-the-art exploration policy.
References
Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
Linear Thompson sampling revisited. In Artificial Intelligence and Statistics, pp. 176–184.
Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135.
Detecting opinion spams and fake news using text classification. Security and Privacy 1 (1), pp. e9.
A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252.
On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pp. 8141–8150.
Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 (Nov), pp. 397–422.
Multi-facet contextual bandits: a neural network perspective. In The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pp. 35–45.
Convolutional neural bandit: provable algorithm for visual-aware advertising. arXiv preprint arXiv:2107.07438.
Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pp. 2335–2346.
X-armed bandits. Journal of Machine Learning Research 12 (5).
Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems 32, pp. 10836–10846.
An approximate formula for a partial sum of the divergent p-series. Applied Mathematics Letters 22 (5), pp. 732–737.
Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214.
Stochastic linear optimization under bandit feedback.
Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685.
Parametric bandits: the generalized linear case. In Advances in Neural Information Processing Systems, pp. 586–594.
SDG: A simplified and dynamic graph neural network. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pp. 2273–2277.
The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (4), pp. 1–19.
Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817–824.
Bandit algorithms. Cambridge University Press.
Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995.
Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670.
Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 539–548.
Ensemble sampling. arXiv preprint arXiv:1705.07347.
Deep Bayesian bandits showdown: an empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127.
Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326.
Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, pp. 285–295.
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294.
Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869.
Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869.
Dimensionality reduction: a comparative. J Mach Learn Res 10 (66-71), pp. 13.
Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 529–538.
Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780.
Neural Thompson sampling. arXiv preprint arXiv:2010.00827.
Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pp. 11492–11502.
Appendix A Ablation Study
In this section, we conduct an ablation study regarding the label function for the exploration network and the setting of the decision maker on two representative datasets, MovieLens and MNIST.
Label function. In this paper, we use the residual between the received reward and the exploitation score to measure the potential gain of an arm, as the label of the exploration network. Moreover, we provide two other intuitive forms: the absolute value of the residual and the ReLU of the residual. Figure 3 shows the regret with the different label functions, where "EENet" denotes our method with the default residual label, "EENetabs" represents the one with the absolute-value label, and "EENetReLU" the one with the ReLU label. On the MovieLens and MNIST datasets, EENet slightly outperforms EENetabs and EENetReLU. In fact, the residual label can effectively represent both positive and negative potential gain, such that the exploration network tends to score an arm with positive gain higher and an arm with negative gain lower. However, the absolute-value label treats positive and negative potential gain evenly, weakening the discriminative ability. The ReLU label can recognize the positive gain while neglecting the differences in negative gain. Therefore, the residual label is usually the most effective one in terms of empirical performance.
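The three label functions compared in this ablation can be written side by side. The function name `exploration_label` and the `form` strings are illustrative; the residual is taken as the received reward minus the exploitation score, as described for the exploration network.

```python
def exploration_label(reward, exploit_score, form="residual"):
    """Compute the training label of the exploration network."""
    residual = reward - exploit_score
    if form == "residual":   # default: keeps the sign of the potential gain
        return residual
    if form == "abs":        # treats positive and negative gain evenly
        return abs(residual)
    if form == "relu":       # keeps positive gain, zeroes out negative gain
        return max(0.0, residual)
    raise ValueError("unknown form: " + form)
```

Only the default form distinguishes an over-estimated arm (negative label) from an under-estimated one (positive label), which matches the explanation given for its empirical advantage.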
Setting of the decision maker. The decision maker can be set as a linear function or a non-linear function. In the experiment, we test the simple linear function, denoted by "EENetLin", and a non-linear function represented by a 2-layer 20-width fully-connected neural network, denoted by "EENetNoLin". For the default hybrid setting, denoted by "EENet", the linear function is used in the early rounds; afterwards, the decision maker is the neural network. Figure 4 reports the regret with these three modes. EENet achieves the best performance with a small standard deviation. In contrast, EENetNoLin obtains the worst performance and the largest standard deviation. However, notice that EENetNoLin can achieve the best performance in certain runs (the green shadow), but it is erratic. This is because, in the beginning phase, without enough training samples, EENetNoLin strongly relies on the quality of the collected samples. With appropriate training samples, gradient descent can lead to the global optimum. On the other hand, with misleading training samples, gradient descent can deviate from the global optimum. Therefore, EENetNoLin shows very unstable performance. In contrast, EENetLin is inspired by the UCB strategy, i.e., the exploitation plus the exploration, and exhibits stable performance. To combine their advantages, we propose the hybrid approach, EENet, which achieves the best performance with strong stability.
Appendix B Proof of Theorem 1
In this section, we provide the proof of Theorem 1 and related lemmas.
Theorem 1. Let follow the setting of (Eq. (3) ) with width respectively and same depth . Let be loss functions defined in Algorithm 1. Set as . Given two constants , , suppose
(6)  
then with probability at least , the expected cumulative regret of EENet in rounds satisfies
where
When , we have
Proof.
In round , given the arms , let denote their expected rewards. For brevity, for the selected arm in round , let be its expected reward and be the optimal arm in round .
Then, the expected regret of round is computed by