I Introduction
The future Internet of Things (IoT) technology interconnects numerous sensing devices with communication capability for a wide range of applications, e.g., remote monitoring, automatic control, diagnosis, and maintenance [1]. Recently, a new communication paradigm named ambient backscatter (AB) communication has been widely studied as an energy-efficient method applicable to IoT systems [2]. In particular, a tag transmitter in an AB communication system communicates with its receiver by backscattering ambient radio frequency (RF) signals. Specifically, the tag transmits ‘0’ or ‘1’ by switching its antenna to non-reflecting or reflecting mode, respectively. Compared to the conventional backscatter communication scheme in radio frequency identification (RFID) systems, AB communication does not require a dedicated energy-emitting reader and relies solely on external energy sources in the ambient environment, such as WiFi, public radio, and cellular transmissions. As such, AB communication can effectively reduce the deployment cost of large-scale IoT networks, such as smart homes, smart cities, and environment monitoring [3], [4], [5].
There has been tremendous research interest recently in ambient backscatter communications [6], [7]. For instance, [8] analyzed the bit error rate (BER) of an AB communication link when the receiver uses an energy detector to detect the 1-bit information transmitted per channel use. The authors of [9] integrate AB communication with the conventional harvest-then-transmit (HTT) protocol in radio-frequency-powered cognitive radio networks, where the backscatter tag can choose either to backscatter the ambient RF signal to the receiver or to harvest energy for later active transmissions. To achieve the optimal throughput performance, the authors assume a fixed channel model and optimize the time allocation among backscatter communication, energy harvesting, and active information transmission. AB communication has also been integrated into wireless powered communication networks, where the wireless devices’ information transmissions are powered by means of wireless power transfer [10], [11]. For instance, [12], [13] consider using backscatter communication to reuse wireless power transfer for simultaneous energy harvesting and information exchange between two cooperating users. They show that passive backscatter-assisted cooperation can significantly improve system throughput compared to conventional active information transmission.
The above studies mostly focus on optimizing system performance under a given channel state within a time slot, while channel fading across consecutive time slots is not considered. In practice, wireless channel fading causes the ambient signal strength to vary over time, which directly results in time-varying communication performance. In general, the choice of the current operating mode, i.e., backscatter communication or energy harvesting, that maximizes the data rate depends on several factors, such as the current channel conditions, the battery energy, and the circuit energy consumption. However, the dynamic operating mode selection problem in a fading channel environment has not been well addressed so far.
In this paper, we concentrate on maximizing the long-term average throughput of an AB communication system by optimizing the real-time operating mode strategy of a transmitter tag in a fading channel. In particular, we model the problem as an infinite multi-stage decision problem and formulate it as an infinite-horizon Markov Decision Process (MDP). When the channel distribution is known, we apply the value iteration method to obtain the optimal decision strategy. In practice, however, such channel knowledge is often hard to obtain. Accordingly, we propose a Q-learning (QL) based reinforcement learning method that obtains a suboptimal strategy without knowing the channel distribution. Simulation results show that the QL method produces close-to-optimal throughput performance and significantly outperforms the other representative benchmark methods.
II System Model
II-A Channel Model
As shown in Fig. 1, we consider an AB communication system consisting of one RF source and a pair of AB transmitter and receiver, all of which are equipped with a single antenna. All the channels are assumed to follow quasi-static flat fading, such that all the channel coefficients remain constant during each block transmission time , but can vary across blocks. The channel coefficients between the RF source and the tag, between the RF source and the receiver, and between the tag and the receiver are denoted by , , and , respectively. Correspondingly, we use , , and to denote their channel gains, where represents the 2-norm operator.
We consider consecutive decision epochs in Fig. 1, where two adjacent epochs are separated by equal duration . At the th transmission block, the received signal at the tag can be expressed as
(1) 
where is the RF signal transmitted from the ambient RF source, denotes the channel power gain between the RF source and the backscatter tag, and denotes the additive white Gaussian noise (AWGN) between the RF source and the tag. At the beginning of each epoch, the tag decides to operate in either signal-backscattering mode or energy-harvesting mode. The circuit block diagram of the tag is illustrated in Fig. 2. The tag can switch the connection point to change its operating mode in real time.
When the switch , the tag operates in energy harvesting mode. The energy harvesting circuit converts RF signal into direct current (DC) power to charge the battery. The collected energy is used for data transmission or replenishing circuit consumption. The harvested energy can be expressed as
(2) 
where is the battery energy harvesting efficiency and is the fixed power of the RF source. denotes the channel power gain in the th time slot. For simplicity of illustration, we consider a truncated channel gain (e.g., cumulative distribution) and quantize it into () levels . Therefore, the possible harvested energy can be divided into () uniform levels, such that
(3) 
where denotes the unit energy considered for quantization. Notice that the tag may harvest zero energy when the received signal is too weak.
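The harvesting and quantization step described above can be illustrated with a minimal Python sketch. Since the bodies of (2) and (3) are not reproduced here, the sketch assumes that the harvested energy is the product of the harvesting efficiency, the source power, the channel power gain, and the block duration, and that quantization rounds down to a whole number of energy units. The function and parameter names are hypothetical.

```python
import math


def harvested_energy(efficiency, source_power, channel_gain, block_time=1.0):
    """Energy harvested in one block (assumed form: efficiency * power * gain * time)."""
    return efficiency * source_power * channel_gain * block_time


def quantize_energy(energy, unit_energy, max_units):
    """Map a continuous energy amount to an integer number of energy units.

    Assumption: quantization rounds down (floor) and is capped at max_units,
    so a very weak received signal yields zero harvested units.
    """
    if energy < 0 or unit_energy <= 0:
        raise ValueError("energy must be non-negative and unit_energy positive")
    return min(math.floor(energy / unit_energy), max_units)
```

Rounding down is a conservative choice: the tag is never credited with energy it has not actually collected.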
When and , the tag switches to signal-backscattering mode. In this case, the energy collected by the tag is approximately zero. The received signal at the receiver, which combines the signal transmitted by the RF source and the signal backscattered by the tag, is
(4) 
where is the reflection coefficient at the tag, is the channel coefficient from the tag to the receiver, which remains fixed in the considered period, and denotes the decision of the backscatter tag in the th time slot. Generally, the distance from the RF source to the tag and the distance from the RF source to the receiver are much larger than the distance between the tag and the receiver. We therefore assume that the received signal strengths at the tag and the receiver are the same, i.e., .
We assume the tag transmits with a fixed data rate bits per second and the sampling rate of the receiver is , such that the receiver takes samples for every transmitted bit. In the following, we derive the BER of the receiver when an optimal energy detector is used to decode the received information.
Lemma 1.
Let and represent the variances of the additive noise introduced by the receiver RF circuit and the decoding circuit, respectively. Using an optimal energy detector, the BER at the receiver can be expressed as
(5) 
Proof.
Please refer to Appendix A. ∎
We denote the BER in the th time slot as . Then, the capacity of the binary symmetric channel is
(6) 
Therefore, the data rate of the backscatter communication in the current time slot is
(7) 
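To make the capacity step concrete, the following Python sketch evaluates the standard capacity of a binary symmetric channel with crossover probability equal to the BER, i.e., one minus the binary entropy of the BER. The helper `backscatter_rate` assumes that the data rate in (7) is the fixed bit rate scaled by this capacity. That scaling and all names are illustrative assumptions.

```python
import math


def bsc_capacity(ber):
    """Capacity (bits per channel use) of a binary symmetric channel.

    Uses the standard formula C = 1 + p*log2(p) + (1 - p)*log2(1 - p),
    where p is the crossover probability (here, the BER).
    """
    if not 0.0 <= ber <= 1.0:
        raise ValueError("ber must lie in [0, 1]")
    if ber in (0.0, 1.0):
        return 1.0  # the limit of p*log2(p) as p -> 0 is 0
    return 1.0 + ber * math.log2(ber) + (1.0 - ber) * math.log2(1.0 - ber)


def backscatter_rate(bit_rate, ber):
    """Assumed form of the achievable rate: fixed bit rate times BSC capacity."""
    return bit_rate * bsc_capacity(ber)
```

A BER of 0.5 gives zero capacity, which matches the intuition that a receiver guessing at random conveys no information.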
II-B Battery Model
We quantize the battery capacity by into units, where is assumed without loss of generality to be an integer. The tag consumes units of energy to maintain basic circuit operation when operating in energy-harvesting mode and units of energy in signal-backscattering mode, where . At the beginning of epoch , the tag can operate in signal-backscattering mode only when the energy state . Otherwise, it must first harvest enough energy by operating in energy-harvesting mode. We let denote the operating mode selection, where indicates energy-harvesting mode and otherwise. Accordingly, the dynamics of the battery energy can be expressed as
(8) 
for , where represents the initial status of the tag battery.
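The battery dynamics can be sketched in Python as follows. Because the body of (8) is not reproduced here, the update rule below is only one plausible reading of the description: harvesting adds the quantized harvested energy minus the harvesting-mode circuit cost, clipped to the range from zero to the battery capacity, while backscattering subtracts the backscattering-mode cost. The action encoding (0 for harvesting, 1 for backscattering) follows the text, and all names are hypothetical.

```python
HARVEST, BACKSCATTER = 0, 1  # action encoding used in the text


def next_battery(level, action, harvested_units, capacity,
                 harvest_cost, backscatter_cost):
    """One-step battery update in energy units (assumed form of the dynamics)."""
    if action == BACKSCATTER:
        if level < backscatter_cost:
            raise ValueError("not enough energy to backscatter")
        return level - backscatter_cost
    # Harvesting: add collected energy, pay the circuit cost, clip to [0, capacity].
    return max(0, min(capacity, level + harvested_units - harvest_cost))
```

Clipping at the capacity models the loss of harvested energy when the battery is already full.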
II-C Problem Formulation
As shown in Fig. 1, we intend to maximize the long-term throughput of a tag over a very large number of time slots. Here, we use to represent a static decision strategy for choosing the operating mode, and and denote the achievable data rate and battery energy state resulting from the strategy at the th time slot. The objective is to find an optimal policy that maximizes the average throughput. Mathematically, the problem can be formulated as
(9)  
where is the discount factor.
III Reinforcement Learning Approach
III-A Markov Decision Process
Depending on whether the distribution of the ambient RF signal strength is known, we propose in this section to solve (9) using either an optimal model-based Markov decision process method or a model-free reinforcement learning method. When the ambient RF signal strength follows a Markov process with a known distribution, the discrete time-slot decision problem in (9) can be described as an MDP. In the following, we define the five major elements of an MDP for solving (9): states (), actions (), transition probability (), immediate reward (), and discount factor (). First of all, we define a state by the unit(s) of current battery energy and the channel gain in a decision epoch. That is, . Because there are in total discrete energy states, the cardinality of the state space is . The tag takes an action by choosing either energy-harvesting mode or signal-backscattering mode at every decision epoch. The remaining elements are described as follows.
Actions : a backscatter tag adaptively switches between energy-harvesting mode and signal-backscattering mode based on its current state. Let represent the action set, where and denote energy-harvesting and signal-backscattering mode, respectively.

Rewards : we define the immediate reward received by the tag as the amount of information successfully transmitted to the receiver. Here and denote the current state and the state of the next decision epoch. With a slight abuse of notation, we denote as the channel capacity when the system is at state . Then, the reward is
(10) 
Notice that a tag may receive an immediate reward only when operating in signal-backscattering mode (). Operating in energy-harvesting mode (a=0) has no immediate reward, but the energy collected in the current slot can be used to support data transmission in later slots.

Transition Probabilities : the channel state transition probability is assumed to be static throughout all the time slots. We define the transition probability matrix with its elements , as the probability of transiting to when taking an action at state . With random energy arrival , the battery state evolves as given in (8). For each state-action pair , it satisfies
(11)
We aim to find an optimal policy for every state , which maximizes the average throughput reward over a long time. Based on the knowledge of the transition probability, we can obtain the globally optimal policy with the value iteration algorithm, which is a widely used algorithm for solving discounted MDP problems [14] and is detailed as follows.
The value iteration algorithm aims to estimate the expected reward received at each state , denoted by , for all . In particular, the iteration starts by setting , for all , and chooses the next state by taking a locally optimal action that maximizes the expected reward in the current stage. The two actions in each state generally yield different immediate rewards and discounted future rewards during an iteration. At the end of each iteration, we select the best action for each state and update the value function of each state. Overall, the value function is updated by
(12) 
The iterations proceed until the maximum difference among all the states between two consecutive iterations is less than a certain threshold , i.e.,
(13) 
where the superscript denotes the th iteration. The convergence of the algorithm is guaranteed when a sufficient number of iterations is performed [15]. We denote the value function after convergence as . The optimal strategy obtained by the value iteration algorithm is then
(14) 
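The value iteration procedure described above can be sketched in Python using the textbook Bellman update for a discounted MDP, with the stopping rule based on the largest change in the value function between consecutive iterations. This is a generic illustration, not the exact model of this paper: the data layout, the use of expected immediate rewards per state-action pair, and the two-state toy battery (empty or charged) are all assumptions made for the example.

```python
def value_iteration(transition, reward, gamma=0.9, tol=1e-6, max_iter=10_000):
    """Return (values, policy) for a finite discounted MDP.

    transition[s][a] is a list of (next_state, probability) pairs and
    reward[s][a] is the expected immediate reward of action a in state s.
    """
    n_states = len(transition)
    values = [0.0] * n_states  # start from an all-zero value function

    def action_value(s, a, v):
        # Immediate reward plus discounted expected value of the next state.
        return reward[s][a] + gamma * sum(p * v[s2] for s2, p in transition[s][a])

    for _ in range(max_iter):
        new_values = [
            max(action_value(s, a, values) for a in range(len(transition[s])))
            for s in range(n_states)
        ]
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < tol:  # stop when the largest change falls below the threshold
            break

    policy = [
        max(range(len(transition[s])), key=lambda a, s=s: action_value(s, a, values))
        for s in range(n_states)
    ]
    return values, policy


# Toy example (hypothetical): state 0 = empty battery, state 1 = charged.
# Action 0 = harvest, action 1 = backscatter (earns reward 1 only when charged).
TOY_TRANSITION = [
    [[(1, 1.0)], [(0, 1.0)]],  # from state 0
    [[(1, 1.0)], [(0, 1.0)]],  # from state 1
]
TOY_REWARD = [[0.0, 0.0], [0.0, 1.0]]
```

Calling `value_iteration(TOY_TRANSITION, TOY_REWARD)` on this toy model yields a policy that harvests when the battery is empty and backscatters when it is charged.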
Through extensive experiments, we observe that, when the battery energy is low but the channel condition is good, the optimal policy tends to harvest energy for transmitting data in later slots. Conversely, when the battery energy is high, it is inclined to transmit data and consume energy, so that the harvested energy does not overcharge the battery. When the battery energy of the tag is moderate, it chooses to harvest energy when the channel condition is poor and to transmit data when the channel condition is relatively good. The detailed performance of the value iteration method will be shown in simulations.
III-B QL Algorithm
When the distribution of the ambient RF signal strength is not known, we consider using a QL-based online algorithm to find a suboptimal mode selection strategy. In each decision epoch, the tag chooses an action based on the Q-value in a state-action value table, which is constantly updated through iterative interactions with the environment. The table is initialized by setting , for all states and actions . The iterations start by picking a random state . To update the Q-table, the ε-greedy method is used to balance exploration and exploitation. With ε-greedy, the tag selects a random action with probability ε, where ε decreases over time. Then, with probability (1 − ε), the entry corresponding to in the Q-table is updated by
(15) 
where is a small learning rate. The tag takes the action that maximizes (15) and receives immediate reward in (10) if . After taking the action, the tag observes the next state following (8) and the unknown channel transition probability.
The tag makes better mode selections over time, and after a sufficiently long learning period the values in the Q-table stabilize. We use to represent the mode selection strategy of the QL algorithm after the Q-table stabilizes. The details of the QL algorithm are shown in Algorithm 1.
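As an illustration of the learning loop, the Python sketch below implements standard tabular Q-learning with ε-greedy exploration. It is not a transcription of Algorithm 1: the update is the textbook rule, the exploration probability is kept constant for brevity instead of being reduced over time, and the two-state toy environment (`toy_step`) and all names are hypothetical.

```python
import random


def q_learning(n_states, n_actions, step, n_steps=50_000,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; step(state, action, rng) returns (next_state, reward)."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]  # all-zero Q-table
    state = rng.randrange(n_states)                   # random initial state
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)         # explore
        else:
            action = max(range(n_actions), key=lambda a: q[state][a])  # exploit
        next_state, reward = step(state, action, rng)
        # Move the entry toward the reward plus discounted best next Q-value.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def toy_step(state, action, rng):
    """Toy battery: state 0 = empty, 1 = charged; action 0 = harvest, 1 = backscatter."""
    if action == 0:
        return 1, 0.0      # harvesting charges the battery, no immediate reward
    if state == 1:
        return 0, 1.0      # backscattering with energy earns one unit of reward
    return 0, 0.0          # backscattering without energy achieves nothing
```

The learner never sees the transition model. It only observes the next state and the reward returned by `step`, which mirrors the setting where the channel distribution is unknown.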
IV Simulation Results
In this section, we evaluate the performance of the proposed algorithms. In all simulations, we assume the energy harvesting efficiency and the exploration probability . Without loss of generality, the duration of each block transmission time is set to 1. In addition, the noise power is assumed to be . We set and the channel levels , , , , are set as , , , , , respectively. Unless otherwise stated, is fixed to be a constant during all time slots, i.e., . equals Kbits per second. We set the unit energy as . Therefore, the harvested energy after quantization is units of energy when , where , and if . Without loss of generality, we assume battery capacity . Besides, the tag consumes and units of circuit energy when operating in energy-harvesting and signal-backscattering mode, respectively. The channel transition probability from to in consecutive time slots is denoted as , and is shown in Table I.
Table I: Channel transition probabilities (row i: current level, column j: next level)
i\j   G0    G1    G2    G3    G4
G0    0.40  0.30  0.15  0.10  0.05
G1    0.05  0.40  0.30  0.15  0.10
G2    0.10  0.05  0.40  0.30  0.15
G3    0.15  0.10  0.05  0.40  0.30
G4    0.30  0.15  0.10  0.05  0.40
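As a quick sanity check of the table, the following Python sketch verifies that every row sums to one and estimates the long-run distribution of the channel levels by repeatedly applying the transition matrix. It assumes that rows index the current level and columns the next level. Since every column of the table also sums to one, the long-run distribution is uniform over the five levels.

```python
# Transition probabilities copied from Table I (rows: current level, columns: next level).
P = [
    [0.40, 0.30, 0.15, 0.10, 0.05],
    [0.05, 0.40, 0.30, 0.15, 0.10],
    [0.10, 0.05, 0.40, 0.30, 0.15],
    [0.15, 0.10, 0.05, 0.40, 0.30],
    [0.30, 0.15, 0.10, 0.05, 0.40],
]


def is_row_stochastic(matrix, tol=1e-9):
    """True if all entries are non-negative and every row sums to one."""
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in matrix)


def stationary_distribution(matrix, n_iter=1000):
    """Power iteration on the row vector, starting from level 0 with certainty."""
    n = len(matrix)
    dist = [1.0] + [0.0] * (n - 1)
    for _ in range(n_iter):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist
```

In the long run each channel level is therefore visited about one fifth of the time, regardless of the initial level.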
As a benchmark method for performance comparison, we consider a greedy policy. Specifically, the tag chooses signal-backscattering mode if it has sufficient energy to transmit information, and energy-harvesting mode otherwise. That is,
(16) 
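The greedy benchmark reduces to a single threshold test, as in the following Python sketch. It assumes that "sufficient energy" means the battery holds at least the signal-backscattering circuit cost. The names are hypothetical, and the action encoding (0 for harvesting, 1 for backscattering) follows the text.

```python
def greedy_action(battery_units, backscatter_cost):
    """Backscatter (1) whenever the battery can pay for it, otherwise harvest (0)."""
    return 1 if battery_units >= backscatter_cost else 0
```

Unlike the MDP-based strategies, this rule ignores the channel state entirely, so it spends energy even in slots where the channel is poor.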
We first show in Fig. 3 the average throughput achieved by the QL algorithm versus the number of iterations. Each point in the figure is a rolling average of the past time slots. The four curves from bottom to top represent the cases where the power of the RF source increases from to . As expected, a higher source power leads to a higher average throughput of the energy harvesting device. Besides, we see that the average throughput under different transmit powers gradually increases as the iterations proceed, and saturates at around iterations. In other words, the tag makes better mode selection decisions over time, and the Q-table eventually becomes stable after a sufficiently long interaction with the environment.
In Fig. 4, we compare the average throughput performance of three methods, namely the value iteration, QL, and greedy algorithms, when the source transmit power varies from to Watts. In particular, for the value iteration and QL methods, we use the mode selection strategies obtained after both methods converge. For a fair comparison, we evaluate the three methods over time slots, where the channel realizations follow those in Fig. 3. Each point in the figure is the average throughput achieved within the time slots. It is evident in Fig. 4 that the average throughput increases with . The QL algorithm achieves throughput very close to that of the optimal value iteration algorithm and significantly outperforms the greedy method. Specifically, the performance loss is less than 0.56% when the transmit power . On average, the QL method achieves 98.17% of the optimal throughput and the greedy method achieves 90.33%, while the performance gap of the greedy method gradually widens as becomes larger. The greedy method performs worse than the other two because it only maximizes the current reward while neglecting the significant future reward achievable by operating in energy-harvesting mode. The QL method, although it has no knowledge of the channel distribution, achieves close-to-optimal performance when the transmit power is large.
In Fig. 5, we simulate the performance of the three mode selection methods in time slots, and plot the probability distribution of the battery energy levels of the tag during the entire simulation. Here, we consider two different tag-to-receiver channel conditions and . In both cases, the greedy method results in low battery energy states: more than 80% of the time the tag has less than units of energy left in the battery, and it never reaches units of energy throughout the simulation. This is due to its greedy nature of exhausting any available energy, so that transmission outage happens frequently when a favorable transmission opportunity occurs, resulting in a significant loss of data rate. Conversely, the optimal value iteration algorithm results in a much more balanced battery energy distribution across energy states, such that it leaves a sufficient “energy buffer” for transmitting information when a favorable slot occurs. The QL algorithm closely follows the energy distribution of the value iteration algorithm, which shows its ability to jointly consider both current and future data transmission opportunities.
V Conclusion
In this paper, we studied the optimal operating mode selection problem in the AB communication system, where the backscattering tag dynamically chooses between energy-harvesting and information-backscattering modes to maximize the average throughput. We formulated the problem as an infinite-horizon MDP. When the distribution of the ambient RF signal strength is known, we applied the value iteration algorithm to find the optimal decision strategy. When the signal strength distribution is not known, we proposed to employ the QL reinforcement learning algorithm to maximize the long-term average throughput. Finally, our simulations showed that the proposed QL method achieves close-to-optimal throughput performance and significantly outperforms the benchmark greedy method in the AB communication system.
Appendix A
Proof of Lemma 1
Let denote the information bit transmitted in the current time slot. The received signal at the receiver in the backscatter communication system can be expressed as
(17) 
where denotes the binary information bit and . The signal at the information decoder is
(18) 
where . The average power harvested in the corresponding symbol is
(19) 
It can be clearly shown that the following equalities hold
(20) 
When is sufficiently large, by the central limit theorem, the test statistic for both cases can be expressed as
(21)  
By defining , we have
(22)  
We assume that ‘0’ and ‘1’ are transmitted with equal probability. Thus, the bit error probability (BER) can be obtained as
(23)  
where is the Gaussian function, which is defined as
(24) 
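For numerical evaluation of the BER, the Gaussian tail function can be computed from the complementary error function. The Python sketch below assumes that the "Gaussian function" in (23) and (24) is the standard Gaussian Q-function, i.e., the tail probability of a zero-mean, unit-variance normal random variable. The function name is hypothetical.

```python
import math


def gaussian_q(x):
    """Standard Gaussian Q-function: P(Z > x) for Z ~ N(0, 1).

    Uses the identity Q(x) = 0.5 * erfc(x / sqrt(2)).
    """
    return 0.5 * math.erfc(x / math.sqrt(2.0))
```

Using `math.erfc` avoids explicit numerical integration and remains accurate for the large arguments that arise at high signal-to-noise ratio.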
References
 [1] L. Tan and N. Wang, “Future internet: the internet of things,” in Proc. IEEE ICACTE’10, pp. 376-380, Aug. 2010.
 [2] V. Liu, A. Parks, V. Talla, S. Gollakota, D. Wetherall, and J. R. Smith, “Ambient backscatter: wireless communication out of thin air,” in Proc. of the 2013 ACM Conference on Special Interest Group on Data Communication (SIGCOMM), Hong Kong, China, Aug. 2013.
 [3] M. Khan, M. F. Haque, S. Rahman, and A. Siddiqa, “Smart home automation system based on environmental monitoring system,” in National Conference of Electronics and ICT, Apr. 2017.
 [4] L. Hui, Z. Meng, and S. Cui, “A wireless sensor network prototype for environmental monitoring in greenhouses,” in International Conference on Wireless Communications, Networking and Mobile Computing, pp. 2344-2347, Sep. 2007.
 [5] B. Alsinglawi, M. Elkhodr, Q. V. Nguyen, et al., “RFID localisation for internet of things smart homes: a survey,” International Journal of Computer Networks & Communications, vol. 9, no. 1, pp. 81-99, Jan. 2017.
 [6] A. Parks, A. Sample, Y. Zhao, and J. R. Smith, “A wireless sensing platform utilizing ambient RF energy,” in Proc. IEEE BioWireleSS’13, Austin, TX, pp. 154-156, Jan. 2013.
 [7] M. Pinuela, P. D. Mitcheson, and S. Lucyszyn, “Ambient RF energy harvesting in urban and semi-urban environments,” IEEE Trans. Microw. Theory Techn., vol. 61, no. 7, pp. 2715-2726, Jul. 2013.
 [8] K. Lu, G. Wang, F. Qu, and Z. Zhong, “Signal detection and BER analysis for RF-powered devices utilizing ambient backscatter communication systems,” in International Conference on Wireless Communications & Signal Processing, pp. 1-6, 2015.
 [9] D. T. Hoang, D. Niyato, P. Wang, D. I. Kim, and Z. Han, “Ambient backscatter: a new approach to improve network performance for RF-powered cognitive radio networks,” IEEE Transactions on Communications, vol. 65, no. 9, pp. 3659-3674, Sep. 2017.
 [10] S. Bi, Y. Zeng, and R. Zhang, “Wireless powered communication networks: an overview,” IEEE Wireless Communications, vol. 23, no. 4, pp. 10-18, Apr. 2016.
 [11] S. Bi and Y. J. Zhang, “Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading,” IEEE Transactions on Wireless Communications, vol. 17, no. 6, pp. 4177-4190, Jun. 2018.
 [12] Y. Zheng, S. Bi, and X. Lin, “Backscatter-assisted relaying in wireless powered communications network,” in the International Conference on Machine Learning and Intelligent Communications (MLICOM), Hangzhou, China, Jul. 2018.
 [13] W. Xu, S. Bi, X. Lin, and J. Wang, “Reusing wireless power transfer for backscatter-assisted cooperation in WPCN,” in the International Conference on Machine Learning and Intelligent Communications (MLICOM), Hangzhou, China, Jul. 2018.
 [14] R.E. Bellman, Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.
 [15] N. L. Zhang and W. Zhang, “Speeding up the convergence of value iteration in partially observable Markov decision processes,” Journal of Artificial Intelligence Research, vol. 14, pp. 29-51, Feb. 2001.