
# EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits

Contextual multi-armed bandits have been studied for decades and adapted to various applications such as online advertising and personalized recommendation. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combine it with TS or UCB strategies for exploration. However, this line of works explicitly assumes the reward is based on a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural-based bandit algorithms have been proposed, where a neural network is assigned to learn the underlying reward function and TS or UCB are adapted for exploration. In this paper, we propose "EE-Net", a neural-based bandit approach with a novel exploration strategy. In addition to utilizing a neural network (Exploitation network) to learn the reward function, EE-Net adopts another neural network (Exploration network) to adaptively learn potential gains compared to currently estimated reward. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net achieves 𝒪(√(Tlog T)) regret, which is tighter than existing state-of-the-art neural bandit algorithms (𝒪(√(T)log T) for both UCB-based and TS-based). Through extensive experiments on four real-world datasets, we show that EE-Net outperforms existing linear and neural bandit approaches.


## 1 Introduction

The stochastic contextual multi-armed bandit (MAB) (Dani et al., 2008; Lattimore and Szepesvári, 2020) has been studied for decades in the machine learning community to solve sequential decision making, with applications in online advertising (Li et al., 2010), personalized recommendation (Wu et al., 2016; Ban and He, 2021b), etc. In the standard contextual bandit setting, a set of arms is presented to a learner in each round, where each arm is represented by a context vector. By some strategy, the learner then selects and plays one arm and receives a reward. The goal is to maximize the cumulative reward over all rounds.

MAB algorithms provide principled approaches to address the trade-off between Exploitation and Exploration (EE): the data collected in past rounds should be exploited to obtain good rewards, but under-explored arms also need to be explored in the hope of getting even better rewards. The most widely used approaches for the EE trade-off fall into three main techniques: Epsilon-greedy (Langford and Zhang, 2008), Thompson Sampling (TS) (Thompson, 1933), and Upper Confidence Bound (UCB) (Auer, 2002).

Linear contextual bandits (Dani et al., 2008; Li et al., 2010; Abbasi-Yadkori et al., 2011), where the reward is assumed to be a linear function of the arm vectors, have been well studied and have succeeded both empirically and theoretically. Given an arm, ridge regression is usually adopted to estimate its reward based on the data collected in past rounds. UCB-based algorithms (Li et al., 2010; Chu et al., 2011; Wu et al., 2016; Ban and He, 2021b) calculate an upper bound for the confidence ellipsoid of the estimated reward and choose the arm according to the sum of the estimated reward and the UCB. TS-based algorithms (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) formulate each arm as a posterior distribution whose mean is the estimated reward and choose the arm with the maximal sampled reward. However, the linear assumption regarding the reward may not hold in real-world applications (Valko et al., 2013b).

To learn non-linear reward functions, recent works have utilized deep neural networks, thanks to their powerful representation ability. Considering the past selected arms and received rewards as training samples, a neural network f is built for exploitation. Zhou et al. (2020) computes a gradient-based upper confidence bound with respect to f and uses the UCB strategy to select arms. Zhang et al. (2020) formulates each arm as a normal distribution whose mean is f and whose deviation is calculated based on the gradient of f, and then uses the TS strategy to choose arms. Both Zhou et al. (2020) and Zhang et al. (2020) achieve the near-optimal regret bound of 𝒪(√(T)log T).

In this paper, we propose a neural-based bandit algorithm with a novel exploration strategy, named "EE-Net". Similar to other neural bandits, EE-Net has an exploitation network f1 to estimate the reward for each arm. The crucial difference from existing works is that EE-Net has an exploration network f2 to predict the potential gain for each arm compared to the current reward estimate. The input to the exploration network is the gradient of f1 and the ground truth is the residual between the received reward and the estimated reward. This strategy is inspired by recent advances in neural-based UCB (Ban et al., 2021; Zhou et al., 2020). Finally, a decision-maker f3 is constructed to select arms. f3 has two modes: linear and nonlinear. In the linear mode, f3 is a linear combination of f1 and f2, inspired by the UCB strategy. In the nonlinear mode, f3 is formulated as a neural network with input (f1, f2), and the goal is to learn the probability of each arm being the optimal arm. Table 2 summarizes how the selection criterion of EE-Net differs from those of other neural bandit algorithms. To sum up, the contributions of this paper can be summarized as follows:

1. We propose a novel exploration strategy, EE-Net, where a neural network is assigned to learn the potential gain compared to the current estimation.

2. Under standard assumptions of over-parameterized neural networks, we prove that EE-Net can achieve the regret upper bound of 𝒪(√(Tlog T)), which is tighter than existing state-of-the-art bandit algorithms.

3. We conduct extensive experiments on four real-world datasets, showing that EE-Net outperforms baselines across ε-greedy, TS, and UCB, and becomes the new state-of-the-art exploration policy.

Next, we present the standard problem definition and elaborate on the proposed EE-Net, before presenting our theoretical analysis. In the end, we provide the empirical evaluation and the conclusion.

## 2 Related Work

Constrained contextual bandits. The most common constraint placed on the reward function is the linear assumption, usually handled by ridge regression (Dani et al., 2008; Li et al., 2010; Abbasi-Yadkori et al., 2011; Valko et al., 2013a). Linear UCB-based bandit algorithms (Abbasi-Yadkori et al., 2011; Li et al., 2016) and linear Thompson Sampling (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) achieve successful performance and a near-optimal regret bound. To break the linear assumption, Filippi et al. (2010) generalizes the reward function to a composition of linear and non-linear functions and then adopts a UCB-based algorithm to deal with it; Bubeck et al. (2011) imposes the Lipschitz property on the reward metric space and constructs a hierarchical optimistic optimization to make selections; Valko et al. (2013b) embeds the reward function into a Reproducing Kernel Hilbert Space and proposes kernelized TS/UCB bandit algorithms.

Neural bandits. To learn non-linear reward functions, deep neural networks have been adapted to bandits in various variants. Riquelme et al. (2018); Lu and Van Roy (2017) build L-layer DNNs to learn the arm embeddings and apply Thompson Sampling on the last layer for exploration. Zhou et al. (2020) first introduces a provable neural-based contextual bandit algorithm with a UCB exploration strategy, and Zhang et al. (2020) then extends the neural network to the Thompson Sampling framework. Their regret analysis is built on recent advances in the convergence theory of over-parameterized neural networks (Du et al., 2019; Allen-Zhu et al., 2019) and utilizes the Neural Tangent Kernel (Jacot et al., 2018; Arora et al., 2019) to construct connections with linear contextual bandits (Abbasi-Yadkori et al., 2011). Ban and He (2021a) further adopts convolutional neural networks with UCB exploration for visual-aware applications. Xu et al. (2020) performs UCB-based exploration on the last layer of the neural network to reduce the computational cost brought by gradient-based UCB. Different from the above existing works, EE-Net keeps the powerful representation ability of neural networks to learn the reward function and is the first to assign another neural network to determine exploration.

## 3 Problem definition

We consider the standard contextual multi-armed bandit with a known number of rounds T (Zhou et al., 2020; Zhang et al., 2020). In each round t ∈ [T], the learner is presented with n arms, in which each arm i ∈ [n] is represented by a feature vector x_{t,i} ∈ ℝ^d. After playing one arm x_{t,i}, its reward is assumed to be generated by the function:

 r_{t,i} = h(x_{t,i}) + η_{t,i},  (1)

where the unknown reward function h can be either linear or non-linear and the noise η_{t,i} is drawn from a certain distribution with expectation 𝔼[η_{t,i}] = 0. Following many existing works (Zhou et al., 2020; Ban et al., 2021; Zhang et al., 2020), we consider bounded rewards, r_{t,i} ∈ [0, 1]. For brevity, we denote the arm selected in round t by x_t and the reward received in round t by r_t. The standard regret of T rounds is defined as:

 R_T = 𝔼[∑_{t=1}^T (r_t^* − r_t)] = ∑_{t=1}^T (h(x_t^*) − h(x_t)),  (2)

where x_t^* = argmax_{i ∈ [n]} h(x_{t,i}) is the optimal arm in round t and r_t^* is its reward. The goal of this problem is to minimize R_T by a certain selection strategy.

Notation. We denote by [k] the set {1, 2, …, k}. We use ‖v‖_2 to denote the Euclidean norm of a vector v, and ‖W‖_2 and ‖W‖_F to denote the spectral and Frobenius norms of a matrix W. We use ⟨·, ·⟩ to denote the standard inner product between two vectors or two matrices.
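The interaction protocol of Eqs. (1)–(2) can be sketched in a few lines of Python; the reward function `h`, the noise scale, and the uniform-random selection rule below are hypothetical stand-ins (the point of a bandit algorithm is precisely to replace that selection rule):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T = 8, 10, 500          # feature dim, arms per round, number of rounds

def h(x):
    # hypothetical non-linear reward function, bounded in [0, 1]
    return float(np.cos(3 * x).mean() ** 2)

regret = 0.0
for t in range(T):
    arms = rng.normal(size=(n, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)   # ||x_{t,i}||_2 = 1
    means = np.array([h(x) for x in arms])
    i = int(rng.integers(n))                 # placeholder strategy (random play)
    r = means[i] + rng.normal(scale=0.01)    # r_{t,i} = h(x_{t,i}) + eta_{t,i}
    regret += means.max() - means[i]         # one summand of R_T in Eq. (2)
```

Minimizing `regret` is the learning objective; the strategies discussed in this paper differ only in how they pick `i` in each round.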

## 4 Proposed Method: EE-Net

EE-Net is composed of three independent components whose inputs and outputs are closely correlated. The first component is the exploitation network, f1, which learns the unknown reward function based on the data collected in past rounds. The second component is the exploration network, f2, which measures the exploration effort we should make in the present round. The third component is the decision-maker, f3, which further trades off between exploitation and exploration and makes the selection.

1) Exploitation net. A neural network f1 is used to learn the mapping from arms to rewards. In round t, we denote the network by f1(·; θ^1_{t−1}), where the superscript of θ^1 is the index of the network and the subscript t−1 represents the round in which the parameters of f1 finished their last update. Given an arm x_{t,i}, f1(x_{t,i}; θ^1_{t−1}) is considered the "exploitation score" for x_{t,i}. After playing an arm x_t chosen by some criterion, we receive a reward r_t. Therefore, we can conduct gradient descent to update θ^1 based on the collected training samples {(x_τ, r_τ)}_{τ=1}^t and denote the updated parameters by θ^1_t.
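As a concrete illustration (a minimal sketch, not the authors' implementation), a two-layer exploitation network f1(x) = W2 σ(W1 x) with its squared-loss SGD update and its flattened gradient — the quantity later fed to the exploration network — can be written in plain NumPy:

```python
import numpy as np

class ExploitNet:
    """Minimal f1: f(x) = W2 @ relu(W1 @ x), trained with SGD on squared loss."""

    def __init__(self, d, m, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=np.sqrt(2 / m), size=(m, d))
        self.W2 = rng.normal(scale=np.sqrt(2 / m), size=(1, m))

    def forward(self, x):
        self.x = x
        self.h = np.maximum(self.W1 @ x, 0.0)       # hidden ReLU activations
        return float(self.W2 @ self.h)

    def grad(self, x):
        """Flattened gradient of f w.r.t. (W1, W2): the input to f2."""
        h = np.maximum(self.W1 @ x, 0.0)
        gW1 = (self.W2.ravel() * (h > 0)).reshape(-1, 1) @ x.reshape(1, -1)
        return np.concatenate([gW1.ravel(), h.ravel()])

    def sgd_step(self, x, r, lr=0.05):
        """One gradient-descent step on the loss (f(x) - r)^2 / 2."""
        err = self.forward(x) - r
        gW1 = err * (self.W2.ravel() * (self.h > 0)).reshape(-1, 1) @ self.x.reshape(1, -1)
        gW2 = err * self.h.reshape(1, -1)
        self.W1 -= lr * gW1
        self.W2 -= lr * gW2

net = ExploitNet(d=4, m=16)
x = np.ones(4) / 2.0                 # a unit-norm toy arm
for _ in range(200):
    net.sgd_step(x, r=1.0)           # fit the observed reward
```

After these updates, `net.forward(x)` approaches the observed reward, and `net.grad(x)` yields the gradient vector of dimension m·d + m that the exploration network consumes.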

2) Exploration net. Our exploration strategy is inspired by existing UCB-based neural bandits (Zhou et al., 2020; Ban et al., 2021). Based on Lemma 5.2 in (Ban et al., 2021), given an arm x_{t,i}, with probability at least 1 − δ, we have the following form:

 |h(x_{t,i}) − f1(x_{t,i}; θ^1_t)| ≤ Ψ(∇_{θ^1} f1(x_{t,i}; θ^1_t)),

where h is defined in Eq. (1) and Ψ is the upper confidence bound, represented by a function with respect to the gradient ∇_{θ^1} f1 (see more details and discussions in Appendix D). Then we have the following definition.

###### Definition 4.1.

Given an arm x_{t,i}, we define h(x_{t,i}) − f1(x_{t,i}; θ^1_t) as the "expected potential gain" for x_{t,i} and r_{t,i} − f1(x_{t,i}; θ^1_t) as the "potential gain" for x_{t,i}.

Let y_{t,i} = r_{t,i} − f1(x_{t,i}; θ^1_t). When y_{t,i} > 0, the arm x_{t,i} has positive potential gain compared to the estimated reward f1(x_{t,i}; θ^1_t). A large positive y_{t,i} makes the arm more suitable for exploration, whereas a small (or negative) y_{t,i} makes the arm less suitable for exploration. Recall that traditional approaches such as UCB effectively compute such a potential gain using standard large-deviation tools, e.g., the Markov inequality, Hoeffding bounds, etc.

Instead of calculating a large-deviation-based statistical form for Ψ, we use a neural network f2 to learn y_{t,i}, where the input is ∇_{θ^1} f1(x_{t,i}; θ^1_t) and the ground truth is r_{t,i} − f1(x_{t,i}; θ^1_t). Adopting the gradient as the input is also motivated by the fact that it incorporates two aspects of information: the features of the arm and the discriminative information of f1.

To sum up, we consider f2(∇_{θ^1} f1(x_{t,i}; θ^1_t); θ^2_{t−1}) as the "exploration score" of x_{t,i}, because it indicates the potential gain of x_{t,i} compared to our current exploitation score f1(x_{t,i}; θ^1_t). Given the selected arm x_t, let y_t = r_t − f1(x_t; θ^1_{t−1}). Therefore, in round t, we can use gradient descent to update θ^2 based on the collected training samples {(∇_{θ^1} f1(x_τ; θ^1_{τ−1}), y_τ)}_{τ=1}^t. We also provide two other heuristic forms for the ground truth: |r_τ − f1| and ReLU(r_τ − f1). We compare them in an ablation study in Appendix A.

3) Decision-maker. In round t, given an arm x_{t,i} with the computed exploitation score f1(x_{t,i}; θ^1_{t−1}) and exploration score f2(∇_{θ^1} f1(x_{t,i}; θ^1_{t−1}); θ^2_{t−1}), we use a function f3 to trade off between exploitation and exploration and compute the final score for x_{t,i}. The selection criterion is defined as

 x_t = argmax_{i ∈ [n]} f3(f1(x_{t,i}; θ^1_{t−1}), f2(∇_{θ^1} f1(x_{t,i}; θ^1_{t−1}); θ^2_{t−1}); θ^3_{t−1}).

Note that f3 can be either a linear or a non-linear function. We provide the following two forms.

(1) Linear function. f3 can be formulated as a linear function with respect to f1 and f2:

 f3(f1, f2; θ^3) = w1 f1(x_{t,i}; θ^1) + w2 f2(∇_{θ^1} f1; θ^2),

where w1, w2 are two weights preset by the learner. When w1 = w2 = 1, f3 can be thought of as a UCB-type policy, where the estimated reward and the potential gain are simply added together. We report its empirical performance in the ablation study (Appendix A).
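The linear decision-maker reduces to an arg-max over a weighted sum of the two scores; a sketch with illustrative numbers:

```python
import numpy as np

def select_arm(exploit_scores, explore_scores, w1=1.0, w2=1.0):
    """Linear f3 = w1*f1 + w2*f2; w1 = w2 = 1 recovers the UCB-style sum."""
    scores = w1 * np.asarray(exploit_scores) + w2 * np.asarray(explore_scores)
    return int(np.argmax(scores))

# Arm 0 has the best estimated reward, but arm 1's predicted potential
# gain lifts its combined score above arm 0's, so arm 1 is explored.
chosen = select_arm([0.9, 0.7, 0.5], [0.0, 0.3, 0.1])
```

Setting `w2=0.0` recovers the purely greedy (exploitation-only) policy, which would pick arm 0 here.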

(2) Non-linear function. f3 can also be formulated as a neural network to learn the mapping from (f1, f2) to the optimal arm. We transform the bandit problem into a binary classification problem. Given an arm x_{t,i}, we define p_{t,i} as the probability of x_{t,i} being the optimal arm in round t. For brevity, we denote by p_t the probability of the selected arm x_t being the optimal arm in round t. According to different reward distributions, we have different approaches to determine p_t.

1. Binary reward. Suppose r_t is a binary variable over {0, 1}; it is straightforward to set p_t = 1 if r_t = 1, and p_t = 0 otherwise.

2. Continuous reward. Suppose r_t is a continuous variable over the range [0, 1]; we provide two ways to determine p_t. (1) p_t can be directly set as r_t. (2) The learner can set a threshold γ; then p_t = 1 if r_t > γ, and p_t = 0 otherwise.

Therefore, with the training samples collected up to round t, we can conduct gradient descent to update the parameters θ^3 of f3. Table 2 details the working structure of EE-Net and Algorithm 1 depicts the workflow of EE-Net.
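Under this classification view, a minimal stand-in for the nonlinear f3 is a logistic model over the pair (f1, f2), trained with the log-loss gradient on hypothetical (score, score, label) triples — a sketch, not the network used in the paper:

```python
import numpy as np

w = np.zeros(3)                 # two score weights plus a bias

def f3(s1, s2):
    """Probability that an arm with scores (f1, f2) = (s1, s2) is optimal."""
    return 1.0 / (1.0 + np.exp(-(w[0] * s1 + w[1] * s2 + w[2])))

def train_f3(samples, lr=0.5, epochs=50):
    """SGD on the binary cross-entropy between f3(s1, s2) and the label p."""
    global w
    for _ in range(epochs):
        for s1, s2, p in samples:
            g = f3(s1, s2) - p            # gradient of the log-loss
            w -= lr * g * np.array([s1, s2, 1.0])

# hypothetical history: arms with high combined scores turned out optimal (p = 1)
train_f3([(0.9, 0.3, 1), (0.8, 0.4, 1), (0.2, 0.1, 0), (0.3, 0.0, 0)])
```

At selection time, the arm maximizing `f3(s1, s2)` is played, so f3 itself learns how to weigh exploitation against exploration rather than using preset weights.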

###### Remark 4.1.

The networks f1, f2, f3 can have different structures according to different applications. For example, in vision tasks, f1 can be set up with convolutional layers (LeCun et al., 1995).∎

###### Remark 4.2.

For the exploration network f2, the input ∇_{θ^1} f1 may have an exploding number of dimensions when the exploitation network f1 becomes wide and deep, which may cause a huge computation cost for f2. To address this challenge, we can apply dimensionality reduction techniques (Roweis and Saul, 2000; Van Der Maaten et al., 2009) to obtain low-dimensional vectors of ∇_{θ^1} f1. In the experiments, we use the technique of Roweis and Saul (2000) to acquire a low-dimensional vector for ∇_{θ^1} f1 and achieve the best performance among all baselines.∎

## 5 Regret Analysis

In this section, we provide the regret analysis of EE-Net when f3 is set as the linear function f3 = f1 + f2, which can be thought of as a UCB-type trade-off between exploitation and exploration. For the sake of simplicity, we conduct the regret analysis on some unknown but fixed data distribution 𝒟. In each round t, n samples {(x_{t,i}, r_{t,i})}_{i=1}^n are drawn from 𝒟, where x_{t,i} is the representation of an arm satisfying ‖x_{t,i}‖_2 = 1 and r_{t,i} is the corresponding reward satisfying r_{t,i} ∈ [0, 1]; these are standard assumptions in neural bandits (Zhou et al., 2020; Zhang et al., 2020).

The analysis will focus on over-parameterized neural networks (Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019). Given an input x ∈ ℝ^d, without loss of generality, we define the fully-connected network f with depth L and width m:

 f(x; θ) = W_L σ(W_{L−1} σ(W_{L−2} … σ(W_1 x))),  (3)

where σ is the ReLU activation function, W_1 ∈ ℝ^{m×d}, W_l ∈ ℝ^{m×m} for 2 ≤ l ≤ L−1, W_L ∈ ℝ^{1×m}, and θ = [vec(W_1)^⊤, …, vec(W_L)^⊤]^⊤. In round t, given the collected data {(x_i, r_i)}_{i=1}^t, the loss function is defined as:

 L(θ) = (1/2) ∑_{i=1}^t (f(x_i; θ) − r_i)^2.  (4)

Initialization. For any l ∈ [L], each entry of W_l is drawn from the normal distribution N(0, 2/m). Note that EE-Net has at most three networks, f1, f2, f3. For brevity, we define them following the definition of f in Eq. (3), although they may have different depths or widths. Then we have the following theorem for EE-Net. Recall that η_1, η_2 are the learning rates for f1, f2; K_1 is the number of iterations of gradient descent for f1 in each round; and K_2 is the number of iterations for f2.

###### Theorem 1.

Let f1, f2 follow the setting of f (Eq. (3)) with widths m, m′ respectively and the same depth L. Let L_1, L_2 be the loss functions defined in Algorithm 1. Set f3 as f3 = f1 + f2. Given two constants ϵ_1, ϵ_2 > 0, assume

 m ≥ poly(T, n, L, log(1/δ)) · d · e^{√(log(1/δ))},  m′ ≥ Ω(m²L),  (5)
 η_1 = Θ(dδ / (poly(T, n, L) · m)),  η_2 = Θ(O(m²L)δ / (poly(T, n, L) · m′)),
 K_1 = Θ((poly(T, n, L)/δ²) · log((ϵ_1/2)^{−1})),  K_2 = Θ((poly(T, n, L)/δ²) · log(ϵ_2^{−1})),

then with probability at least 1 − δ, the expected cumulative regret of EE-Net in T rounds satisfies

 R_T ≤ O((2√T − 1)√2 ϵ_2) + O((ξ_2 + ϵ_1)(2√T − 1)√(2 log(O(Tn)/δ))),

where

 ξ_2 = O(T⁴nL √(O(m²L) log(m′/δ)) / √m′) + O(T⁵nL² √(O(m²L)) log^{11/6}(m′/δ) / m′^{1/6}) < 1.

When ϵ_2 ≤ O(1/T), we have

 R_T ≤ O(1) + O((ξ_2 + ϵ_1)(2√T − 1)√(2 log(O(Tn)/δ))).

Comparison with NeuralUCB/TS. Under the same assumptions on over-parameterized neural networks, the regret bounds of NeuralUCB (Zhou et al., 2020) and NeuralTS (Zhang et al., 2020) are both 𝒪(√(d̃T) log T), where

 d̃ = log det(I + H/λ) / log(1 + Tn/λ),

H is the neural tangent kernel (NTK) matrix (Jacot et al., 2018; Arora et al., 2019), and λ is a regularization parameter.

###### Remark 5.1.

It is easy to observe that the regret bound of EE-Net is tighter than that of NeuralUCB/TS, roughly improving by a multiplicative factor of √(log T), because our proof for EE-Net is directly built on recent advances in the convergence theory (Allen-Zhu et al., 2019) and generalization bounds (Cao and Gu, 2019) of over-parameterized neural networks. Instead, the analysis for NeuralUCB/TS follows the proof flow of linear contextual bandits (Abbasi-Yadkori et al., 2011) to calculate the distance among the network function, the NTK, and ridge regression.∎

###### Remark 5.2.

The regret bound of EE-Net does not contain the effective dimension d̃, which can be a considerable multiplicative factor when the input dimension is extremely large. The effective dimension was first introduced by Valko et al. (2013b) to measure the underlying dimension of the observed contexts. Although d̃ can be upper bounded via some d̃-dimensional subspace of a reproducing kernel Hilbert space (RKHS) induced by the NTK (Zhang et al., 2020), their regret bound still has the multiplicative factor √d̃, which EE-Net does not have.∎

The proof of Theorem 1 is in Appendix B. Moreover, we provide the regret analysis of the greedy approach that uses only the exploitation network f1, i.e., x_t = argmax_{i ∈ [n]} f1(x_{t,i}; θ^1_{t−1}), showing that EE-Net theoretically outperforms the greedy approach (see details in Appendix C).

## 6 Experiments

In this section, we evaluate EE-Net on four real-world datasets, comparing it with strong state-of-the-art baselines. We first present the setup of the experiments, then show the regret comparison, and finally report the ablation study. For reproducibility, all the code has been released anonymously.

MNIST dataset. MNIST is a well-known image dataset (LeCun et al., 1998) for the 10-class classification problem. Following the evaluation setting of existing works (Valko et al., 2013b; Zhou et al., 2020; Zhang et al., 2020), we transform this classification problem into a bandit problem. Given an image x, we aim to classify it among the 10 classes. In each round, the image x is transformed into 10 arms and presented to the learner, represented by 10 context vectors. The reward is defined as 1 if the index of the selected arm matches the index of x's ground-truth class; otherwise, the reward is 0.
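One common way to realize this transformation in prior neural-bandit evaluations (e.g., the construction used by Zhou et al. (2020); whether it matches this paper's exact preprocessing is not spelled out here) places the image vector in one of k blocks of a longer zero vector, so arm i encodes the hypothesis "this image belongs to class i":

```python
import numpy as np

def image_to_arms(x, k=10):
    """Build k context vectors from one feature vector x: arm i carries x in
    the i-th block of a length k*d zero vector, zeros elsewhere."""
    d = x.shape[0]
    arms = np.zeros((k, k * d))
    for i in range(k):
        arms[i, i * d:(i + 1) * d] = x
    return arms

def reward(chosen_arm, true_class):
    """Reward 1 iff the selected arm's index matches the ground-truth class."""
    return 1.0 if chosen_arm == true_class else 0.0

x = np.arange(4, dtype=float)       # stand-in for a flattened image
arms = image_to_arms(x, k=3)
```

With this encoding, a single reward model over the long context vector can score all k class hypotheses for the same image.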

Yelp and MovieLens (Harper and Konstan, 2015) datasets. Yelp is a dataset released in the Yelp dataset challenge, which consists of 4.7 million rating entries for restaurants by Yelp users. MovieLens is a dataset consisting of millions of ratings between users and movies. We build the rating matrix by choosing the top users and top restaurants (movies) and use singular-value decomposition (SVD) to extract a feature vector for each user and each restaurant (movie). In these two datasets, the bandit algorithm is asked to choose the restaurants (movies) with bad ratings. We generate the reward from the stars a restaurant (movie) received from users: in each rating record, if the user scores a restaurant (movie) with less than 2 stars (out of 5), its reward is 1; otherwise, its reward is 0. In each round, we build the arm pool as follows: we randomly choose one arm with reward 1 and randomly pick other restaurants (movies) with reward 0; the representation of each arm is the concatenation of the corresponding user feature vector and restaurant (movie) feature vector.

Disin (Ahmed et al., 2018) dataset. Disin is a fake-news dataset on Kaggle including 12,600 fake news articles and 12,600 truthful news articles, where each article is represented by its text. To transform the text into vectors, we use the approach of (Fu and He, 2021) to represent each article by a 300-dimensional vector. Similarly, we form a 10-arm pool in each round, where 9 real news articles and 1 fake news article are randomly selected. If the fake news article is selected, the reward is 1; otherwise, the reward is 0.

Baselines. To comprehensively evaluate EE-Net, we choose four neural-based bandit algorithms, one linear bandit algorithm, and one kernelized bandit algorithm.

1. LinUCB (Li et al., 2010) explicitly assumes the reward is a linear function of the arm vector and an unknown user parameter, and then applies ridge regression and an upper confidence bound to determine the selected arm.

2. KernelUCB (Valko et al., 2013a) adopts a predefined kernel matrix on the reward space combined with a UCB-based exploration strategy.

3. Neural-NoExplore only uses the exploitation network f1 and selects an arm by the greedy strategy x_t = argmax_{i ∈ [n]} f1(x_{t,i}; θ^1).

4. Neural-Epsilon adapts the epsilon-greedy exploration strategy to the exploitation network f1. I.e., with probability 1 − ϵ, the arm is selected by f1, and with probability ϵ, the arm is chosen randomly.

5. NeuralUCB (Zhou et al., 2020) uses the exploitation network to learn the reward function, combined with a UCB-based exploration strategy.

6. NeuralTS (Zhang et al., 2020) adopts the exploitation network to learn the reward function, combined with a Thompson Sampling exploration strategy.

Note that we do not report results of LinTS and KernelTS in the experiments because of the limited space in the figures, but LinTS and KernelTS have been significantly outperformed by NeuralTS (Zhang et al., 2020).

Setup for EE-Net. To compare fairly, for all neural-based methods including EE-Net, the exploitation network f1 is a 2-layer fully-connected network of width 100. For the exploration network f2, we use a 2-layer fully-connected network of width 100 as well. For the decision-maker f3, after comprehensively evaluating both linear and nonlinear functions, we found that the most effective approach is to combine them, which we call the "hybrid decision-maker". In detail, for the early rounds, f3 is set as the linear function f1 + f2, and for the remaining rounds, f3 is set as a neural network with two 20-width fully-connected layers. We set f3 in this way because the linear decision-maker maintains stable performance in each run (robustness), while the non-linear decision-maker lacks stability but can further improve performance (see details in Appendix A). The hybrid decision-maker combines these two advantages. For all neural networks, we use the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate.

### 6.1 Regret Comparison

Configurations. For LinUCB, following (Li et al., 2010), we do a grid search over the exploration constant, which tunes the scale of the UCB. For KernelUCB (Valko et al., 2013a), we use the radial basis function kernel and stop adding contexts after 1000 rounds, following (Valko et al., 2013b; Zhou et al., 2020); we also grid-search KernelUCB's regularization parameter and exploration parameter. For NeuralUCB and NeuralTS, following the settings of (Zhou et al., 2020; Zhang et al., 2020), we use the same exploitation network and conduct a grid search over the exploration parameter and the regularization parameter. For Neural-Epsilon, we use the same neural network and grid-search the exploration probability ϵ. Neural-NoExplore uses the same neural network as well. For NeuralUCB/TS, following their settings, since storing and computing the whole gradient matrix incurs an expensive computation cost, we use a diagonal matrix for approximation. For all grid-searched parameters, we choose the best value for the comparison and report the averaged results over multiple runs for all methods.

Results. Figure 1 and Figure 2 show the regret comparison on the four datasets. EE-Net consistently outperforms all baselines across all datasets. For LinUCB and KernelUCB, the simple linear reward function or predefined kernel cannot properly capture the ground-truth reward functions of real-world datasets. In particular, on the MNIST and Disin datasets, the correlations between rewards and arm feature vectors are neither linear nor simple mappings; thus, LinUCB and KernelUCB barely exploit the past collected data samples and fail to select correct arms. Among the neural-based bandit algorithms, Neural-NoExplore has no exploration component, so its rate of collecting new samples and learning new knowledge is unstable and usually delayed; therefore, Neural-NoExplore is usually inferior to the methods with exploration. The exploration probability of Neural-Epsilon is fixed and cannot adapt over time, so it is usually hard for it to explore effectively. For exploration, NeuralUCB statistically calculates a gradient-based upper confidence bound, and NeuralTS draws each arm's predicted reward from a normal distribution whose standard deviation is computed from the gradient. However, the confidence bound or standard deviation they calculate only considers the worst case and thus may not represent the actual potential of each arm. Instead, EE-Net uses a neural network to learn each arm's potential, exploiting the network's powerful representation ability. Therefore, EE-Net outperforms these two state-of-the-art bandit algorithms. Note that NeuralUCB/TS need two parameters to tune UCB/TS for different scenarios, while EE-Net only needs to set up a neural network and learns the exploration automatically.

Ablation study. In Appendix A, we conduct an ablation study regarding the label function of f2 and the different settings of f3. To sum up, the label r − f1 usually outperforms |r − f1| and ReLU(r − f1) empirically, and the proposed hybrid setting of f3 often achieves the best performance compared to purely linear or non-linear functions.

## 7 Conclusion

In this paper, we propose a novel exploration strategy, EE-Net. In addition to a neural network that exploits the data collected in past rounds, EE-Net has another neural network that learns the potential gain compared to the current estimate, for exploration. A decision-maker is then built on top to further trade off between exploitation and exploration and make the selection. We demonstrate that EE-Net outperforms NeuralUCB and NeuralTS both theoretically and empirically, becoming the new state-of-the-art exploration policy.

## References

• Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2011) Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320. Cited by: §1, §2, §2, Remark 5.1.
• M. Abeille and A. Lazaric (2017) Linear thompson sampling revisited. In Artificial Intelligence and Statistics, pp. 176–184. Cited by: §1, §2.
• S. Agrawal and N. Goyal (2013) Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135. Cited by: §1, §2.
• H. Ahmed, I. Traore, and S. Saad (2018) Detecting opinion spams and fake news using text classification. Security and Privacy 1 (1), pp. e9. Cited by: §6.
• Z. Allen-Zhu, Y. Li, and Z. Song (2019) A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. Cited by: item a, item b, Appendix B, Appendix B, §2, Remark 5.1, §5.
• S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang (2019) On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pp. 8141–8150. Cited by: §2, §5.
• P. Auer (2002) Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 (Nov), pp. 397–422. Cited by: §1.
• Y. Ban, J. He, and C. B. Cook (2021) Multi-facet contextual bandits: a neural network perspective. In The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pp. 35–45. Cited by: Lemma D.1, Appendix D, Appendix D, §1, §3, §4.
• Y. Ban and J. He (2021a) Convolutional neural bandit: provable algorithm for visual-aware advertising. arXiv preprint arXiv:2107.07438. Cited by: §2.
• Y. Ban and J. He (2021b) Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pp. 2335–2346. Cited by: §1, §1.
• S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári (2011) X-armed bandits.. Journal of Machine Learning Research 12 (5). Cited by: §2.
• Y. Cao and Q. Gu (2019) Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems 32, pp. 10836–10846. Cited by: Lemma B.4, Remark 5.1.
• E. Chlebus (2009) An approximate formula for a partial sum of the divergent p-series. Applied Mathematics Letters 22 (5), pp. 732–737. Cited by: Appendix B, Appendix C.
• W. Chu, L. Li, L. Reyzin, and R. Schapire (2011) Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214. Cited by: §1.
• V. Dani, T. P. Hayes, and S. M. Kakade (2008) Stochastic linear optimization under bandit feedback. Cited by: §1, §1, §2.
• S. Du, J. Lee, H. Li, L. Wang, and X. Zhai (2019) Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685. Cited by: §2, §5.
• S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári (2010) Parametric bandits: the generalized linear case. In Advances in Neural Information Processing Systems, pp. 586–594. Cited by: §2.
• D. Fu and J. He (2021) SDG: a simplified and dynamic graph neural network. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pp. 2273–2277. Cited by: §6.
• F. M. Harper and J. A. Konstan (2015) The movielens datasets: history and context. Acm transactions on interactive intelligent systems (tiis) 5 (4), pp. 1–19. Cited by: §6.
• A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580. Cited by: §2, §5, §5.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.
• J. Langford and T. Zhang (2008) The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pp. 817–824. Cited by: §1.
• T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: §1.
• Y. LeCun, Y. Bengio, et al. (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995. Cited by: Remark 4.1.
• Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §6.
• L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. Cited by: §1, §1, §2, item 1, §6.1.
• S. Li, A. Karatzoglou, and C. Gentile (2016) Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 539–548. Cited by: §2.
• X. Lu and B. Van Roy (2017) Ensemble sampling. arXiv preprint arXiv:1705.07347. Cited by: §2.
• C. Riquelme, G. Tucker, and J. Snoek (2018) Deep bayesian bandits showdown: an empirical comparison of bayesian deep networks for thompson sampling. arXiv preprint arXiv:1802.09127. Cited by: §2.
• S. T. Roweis and L. K. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326. Cited by: Remark 4.2.
• B. Sarwar, G. Karypis, J. Konstan, and J. Riedl (2001) Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pp. 285–295. Cited by: Appendix C.
• W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §1.
• M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini (2013a) Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869. Cited by: §2, item 2, §6.1.
• M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini (2013b) Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869. Cited by: §1, §2, Remark 5.2, §6.1, §6.
• L. Van Der Maaten, E. Postma, J. Van den Herik, et al. (2009) Dimensionality reduction: a comparative review. Journal of Machine Learning Research 10 (66-71), pp. 13. Cited by: Remark 4.2.
• Q. Wu, H. Wang, Q. Gu, and H. Wang (2016) Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 529–538. Cited by: §1, §1.
• P. Xu, Z. Wen, H. Zhao, and Q. Gu (2020) Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780. Cited by: §2.
• W. Zhang, D. Zhou, L. Li, and Q. Gu (2020) Neural thompson sampling. arXiv preprint arXiv:2010.00827. Cited by: Table 1, §1, §2, §3, Remark 5.2, §5, §5, item 6, §6.1, §6, §6.
• D. Zhou, L. Li, and Q. Gu (2020) Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492–11502. Cited by: Appendix D, Appendix D, Table 1, §1, §1, §2, §3, §4, §5, §5, item 5, §6.1, §6.

## Appendix A Ablation Study

In this section, we conduct an ablation study on two representative datasets, MovieLens and MNIST, regarding (1) the label function for the exploration network f₂ and (2) the setting of the decision maker f₃.

Label function. In this paper, we use r_t − f₁(x_t; θ¹) to measure the potential gain of an arm and use it as the label of f₂. We also evaluate two other intuitive forms, |r_t − f₁| and ReLU(r_t − f₁). Figure 3 shows the regret under each label function, where "EE-Net" denotes our method with the default r_t − f₁, "EE-Net-abs" the variant with |r_t − f₁|, and "EE-Net-ReLU" the variant with ReLU(r_t − f₁). On the MovieLens and MNIST datasets, EE-Net slightly outperforms EE-Net-abs and EE-Net-ReLU. In fact, r_t − f₁ can represent both positive and negative potential gain, so f₂ tends to score arms with positive gain higher and arms with negative gain lower. In contrast, |r_t − f₁| treats positive and negative potential gain evenly, weakening the discriminative ability, while ReLU(r_t − f₁) recognizes positive gain but neglects the differences among negative gains. Therefore, r_t − f₁ is usually the most effective choice in practice.
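The three candidate label functions can be sketched as plain scalar maps of the observed reward r and the exploitation network's estimate f₁ (the function names below are illustrative, not the authors' implementation):

```python
def label_default(r, f1):
    """EE-Net: signed potential gain r - f1 (keeps sign information)."""
    return r - f1

def label_abs(r, f1):
    """EE-Net-abs: magnitude of the gain, |r - f1| (sign discarded)."""
    return abs(r - f1)

def label_relu(r, f1):
    """EE-Net-ReLU: positive part of the gain, max(r - f1, 0)."""
    return max(r - f1, 0.0)
```

The comparison in the ablation comes down to sign handling: only the default label lets f₂ learn to push down arms whose reward is overestimated by f₁.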

Setting of f₃. The decision maker f₃ can be set as a linear or a non-linear function. In the experiments, we test the simple linear function f₃ = f₁ + f₂, denoted by "EE-Net-Lin", and a non-linear f₃ represented by a 2-layer, 20-width fully-connected neural network, denoted by "EE-Net-NoLin". In the default hybrid setting, denoted by "EE-Net", f₃ = f₁ + f₂ in the early rounds, and f₃ is the neural network thereafter. Figure 4 reports the regret of these three modes. EE-Net achieves the best performance with a small standard deviation. In contrast, EE-Net-NoLin obtains the worst performance and the largest standard deviation. Notice, however, that EE-Net-NoLin can achieve the best performance in certain runs (the green shadow) but is erratic: in the beginning phase, without enough training samples, EE-Net-NoLin relies heavily on the quality of the collected samples. With appropriate training samples, gradient descent can reach a global optimum; with misleading training samples, it can deviate from one. This is why EE-Net-NoLin shows very unstable performance. EE-Net-Lin, inspired by the UCB strategy (exploitation plus exploration), exhibits stable performance instead. To combine their advantages, we propose the hybrid approach, EE-Net, which achieves the best performance with strong stability.
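The hybrid decision maker described above can be sketched as follows; the `warmup` threshold and the `g` combiner are illustrative assumptions standing in for the paper's exact configuration:

```python
def f3_hybrid(f1_score, f2_score, t, g=None, warmup=500):
    """Hybrid decision maker f3 (a sketch, not the authors' code).

    Early rounds use the additive UCB-style form f1 + f2
    ("EE-Net-Lin"); once enough samples have been collected,
    decisions are delegated to a learned non-linear combiner g,
    e.g. a small fully-connected network taking (f1, f2) as input
    ("EE-Net-NoLin").
    """
    if g is None or t <= warmup:
        # Stable additive combination while training data is scarce.
        return f1_score + f2_score
    # Learned non-linear combination once g is trustworthy.
    return g(f1_score, f2_score)
```

The design rationale is that the additive form is safe when the non-linear combiner would still be undertrained, while the learned combiner can capture richer interactions between exploitation and exploration scores later on.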

## Appendix B Proof of Theorem 1

In this section, we provide the proof of Theorem 1 and related lemmas.

Theorem 1. Let f₁ and f₂ follow the setting of f (Eq. (3)) with widths m and m′, respectively, and the same depth L. Let L₁ and L₂ be the loss functions defined in Algorithm 1. Set f₃ as f₃ = f₁ + f₂. Given two constants ϵ₁, ϵ₂ ∈ (0, 1), suppose

 m ≥ poly(T, n, L, log(1/δ)) · d · e^{√(log(1/δ))},   m′ ≥ Ω(m²L), (6)
 η₁ = Θ( dδ / (poly(T, n, L) · m) ),   η₂ = Θ( O(m²L)δ / (poly(T, n, L) · m′) ),
 K₁ = Θ( (poly(T, n, L)/δ²) · log((ϵ₁/2)⁻¹) ),   K₂ = Θ( (poly(T, n, L)/δ²) · log(ϵ₂⁻¹) ),

then, with probability at least 1 − δ, the expected cumulative regret of EE-Net in T rounds satisfies

 R_T ≤ O( (2√T − 1)√(2ϵ₂) ) + O( (ξ₂ + ϵ₁)(2√T − 1)√(2 log(O(Tn)/δ)) ),

where

 ξ₂ = O( T⁴nL·√(O(m²L) log(m′/δ)) / √m′ ) + O( T⁵nL²·√(O(m²L)) · log^{11/6}(m′/δ) / m′^{1/6} ) < 1.

When ϵ₂ ≤ O(1/T), the first term is O(1) because (2√T − 1)√(2ϵ₂) ≤ 2√(2Tϵ₂), and we have

 R_T ≤ O(1) + O( (ξ₂ + ϵ₁)(2√T − 1)√(2 log(O(Tn)/δ)) ).

###### Proof.

In round t, given the arms {x_{t,1}, …, x_{t,n}}, let {h(x_{t,i})}_{i∈[n]} denote their expected rewards. For brevity, for the arm x_t selected in round t, let h(x_t) be its expected reward, and let x*_t = argmax_{i∈[n]} h(x_{t,i}) be the optimal arm in round t.

Then, the expected regret of round t is bounded by

 R_t = h(x*_t) − h(x_t)
    = h(x*_t) − f₃(x*_t) + [ f₃(x*_t) − f₃(x_t) ]_{I₁} + f₃(x_t) − h(x_t)
    ≤ h(x*_t) − f₃(x*_t) + f₃(x_t) − h(x_t)    (I₁ ≤ 0 since x_t = argmax_{i∈[n]} f₃(x_{t,i}))
    ≤ max_{i∈[n]} 𝔼[ (r_{t,i} − f₃(x_{t,i})) | x_{t,i} ] + 𝔼[ (f₃(x_t) − r_t) | x_t ]
    = max_{i∈[n]} 𝔼[ (r_{t,i} − (f₁(x_{t,i}; θ¹_{t−1}) + f₂(▽_{θ¹}f₁ / (c√(mL)); θ²_{t−1}))) | x_{t,i} ]
      + 𝔼[ (f₁(x_t; θ¹_{t−1}) +