Due to the tremendous increase in the number of battery-powered wireless communication devices over the past decade, replenishing the batteries of these devices by harvesting energy from natural resources has become an important research area. Transmitters may harvest energy via wind turbines, photovoltaic cells, thermoelectric generators, or from mechanical vibrations through piezoelectric or electromagnetic technology. Regardless of which type of energy harvesting (EH) device and natural energy source is employed, a main concern is the stochastic nature of the EH process driving the wireless communications. The associated battery recharging process can be modeled either as a continuous or as a discrete [3, 4] stochastic process.
We consider a wireless point-to-point link with a transmitter equipped with a finite-capacity battery fed by an EH device. At each time slot, a unit of energy is harvested by the transmitter according to a binary random process that is independent over time.¹ We assume that the transmitter can accurately observe the current energy level of its battery and knows the statistics of the EH process. The wireless channel is time-varying and has memory across time. The channel memory is modeled with a finite-state Markov chain, where the next channel state depends only on the current state. A convenient and often-employed simplification of the Markov model is a two-state Markov chain, known as the Gilbert-Elliot channel. This model assumes that the channel can be either in a good or in a bad state. We assume that in the bad state the transmitter cannot transmit any information reliably, while in the good state it can transmit $R$ bits per time slot by spending exactly one unit of energy from its battery.

¹ Typically, the EH process is neither memoryless nor discrete, and the energy is accumulated continuously over time. However, in order to develop the analytical model underlying this paper, we follow the common assumption in the literature and assume that the continuously arriving energy is accumulated in an intermediate energy storage device to form quanta.
In this work, unlike most of the literature on EH systems, we take into account the energy cost of acquiring channel state information (CSI). At the beginning of each time slot, without knowing the current CSI, the EH transmitter has three possible actions: i) deferring the transmission to save its energy for future use, ii) transmitting at a rate of $R$ bits per time slot, and iii) sensing the channel to reveal the channel state by consuming a portion of its energy and transmission time, followed by transmission at a reduced rate that consumes the remainder of the energy unit, if the channel is in the good state. If the channel is in the bad state, the transmitter remains silent for the rest of the time slot, saving its energy for the future. If the battery holds less than one unit of energy at the beginning of a time slot, no transmission is possible. Our objective is to maximize the total expected discounted number of bits transmitted over an infinite time horizon.
Markov decision process (MDP) tools have been extensively utilized in the recent literature to solve communication problems involving EH devices. In , the authors propose a simple single-threshold policy for a solar-powered sensor operating over a fading wireless channel. The optimality of a single-threshold policy is proven in  for an EH transmitter sending packets with associated importance values. The problem of energy allocation for gathering and transmitting data in an EH communication system is studied in  and . The scheduling of EH transmitters with time-correlated energy arrivals to optimize the long-term sum throughput is investigated in . The allocation of energy over a finite horizon to optimize the throughput is considered in , where it is assumed that either the current or the future energy and channel states are provided to the transmitter. In , a discrete power allocation problem is studied for a Markov EH process and a static channel to maximize the throughput. In , the throughput is optimized over a multiple access channel with collisions, considering spatially correlated energy arrivals at the transmitters.
In a closely related work , the scheduling of an EH transmitter over a Gilbert-Elliot channel is considered. However, unlike in our work, the transmitter there always has perfect CSI, obtained by sensing at every time slot, and decides whether to defer or to transmit based on the current CSI and battery state. Similarly, without considering the channel sensing capability, the authors of  address the problem of optimal power management for an EH sensor over a multi-state wireless channel with memory, using the ACK/NACK channel feedback to track the channel state. In our work, instead, we take into account the energy cost of channel sensing, which can be significant for EH transmitters. Therefore, the EH transmitter does not necessarily have perfect CSI, but keeps an updated belief about the channel state based on its past observations. Hence, the transmitter may occasionally take a third action (in addition to deferring and transmitting): sensing the current channel state to improve its belief. Channel sensing is an essential part of opportunistic and cognitive spectrum access. In , the authors investigate the problem of optimal access to a Gilbert-Elliot channel, wherein an energy-unlimited transmitter senses the channel at every time slot. In , channel sensing is done only occasionally: the transmitter can decide to transmit at a high or a low rate without sensing the channel, or it can first sense the channel and then transmit at a reduced rate due to the time spent on sensing. The energy cost of sensing is ignored in .
In Section II, we explain the channel and EH process models under consideration and elaborate on the transmission protocol. In Section III, we formulate the problem as a partially observable MDP (POMDP) over the two-state channel, which is then converted into a continuous-state MDP by introducing a belief state. In Section IV, we show that the optimal policy is of threshold type, with optimal threshold values that depend on the state of the battery. In Section V, we present simulation results that numerically obtain the optimal threshold values and the optimal performance. In Section VI, we conclude the paper and present future research directions.
II System Model
II-A Channel and energy harvesting models
Consider the communication system illustrated in Fig. 1, in which an EH transmitter communicates over a slotted Gilbert-Elliot channel. Let $G_t$ denote the state of the channel at time slot $t$, which is modeled as a one-dimensional Markov chain with two states: a good state, denoted by $G_t = 1$, and a bad state, denoted by $G_t = 0$. Channel transitions occur at the beginning of each time slot. The transition probabilities are given by $\Pr[G_t = 1 \mid G_{t-1} = 1] = \lambda_1$ and $\Pr[G_t = 1 \mid G_{t-1} = 0] = \lambda_0$. The transmitter can transmit $R$ bits per time slot if $G_t = 1$, and zero bits if $G_t = 0$.
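To make the channel model concrete, the following minimal Python sketch samples a Gilbert-Elliot channel path and compares the empirical fraction of good slots with the stationary probability $\lambda_0/(1-\lambda_1+\lambda_0)$, where $\lambda_1$ and $\lambda_0$ are the probabilities of moving to the good state from the good and from the bad state, respectively. The function names and the parameter values are hypothetical, chosen only for illustration.

```python
import random


def simulate_gilbert_elliot(lam1: float, lam0: float, n_slots: int,
                            g0: int = 1, seed: int = 0) -> list:
    """Sample channel states G_0, ..., G_{n_slots-1} (1 = good, 0 = bad).

    lam1 = P[G_t = 1 | G_{t-1} = 1] and lam0 = P[G_t = 1 | G_{t-1} = 0].
    """
    rng = random.Random(seed)
    states = [g0]
    for _ in range(n_slots - 1):
        p_good = lam1 if states[-1] == 1 else lam0  # depends only on the current state
        states.append(1 if rng.random() < p_good else 0)
    return states


def stationary_good_prob(lam1: float, lam0: float) -> float:
    """Long-run fraction of good slots, solving pi = pi*lam1 + (1 - pi)*lam0.

    Requires lam1 < 1 or lam0 > 0 (otherwise the denominator is zero).
    """
    return lam0 / (1.0 - lam1 + lam0)


# Hypothetical parameters, for illustration only.
path = simulate_gilbert_elliot(lam1=0.9, lam0=0.2, n_slots=100_000)
print(f"empirical {sum(path) / len(path):.3f} vs stationary {stationary_good_prob(0.9, 0.2):.3f}")
```

With $\lambda_1 > \lambda_0$ the channel is positively correlated, so good and bad slots come in bursts; this memory is what makes tracking the channel worthwhile.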
A unit of energy arrives at the end of time slot $t$ according to an independent and identically distributed (i.i.d.) Bernoulli process, denoted by $\{E_t\}$, with probability $q$, i.e., $\Pr[E_t = 1] = q$ for all $t$. The transmitter stores the energy packets in a battery with a storage capacity of $B_{\max}$ units of energy. We denote the state of the battery, i.e., the energy available in the battery at the beginning of time slot $t$, by $B_t$. One energy unit is consumed in a slot if the transmitter decides to transmit in that slot. This unit includes the energy cost of sensing (if the transmitter decides to sense the channel), the transmission of the message, and the reception of the ACK or NACK from the receiver. We assume that the transmitter has an infinitely backlogged data queue, and thus it always has a packet to transmit.
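The battery dynamics implied by this model can be sketched as follows, assuming the update $B_{t+1} = \min(B_t - c_t + E_t, B_{\max})$, where $c_t$ is the energy spent in slot $t$ (zero, a sensing fraction, or one full unit), $E_t$ is the Bernoulli($q$) arrival at the end of the slot, and $B_{\max}$ is the capacity. The helper names `next_battery` and `harvest` are hypothetical.

```python
import random


def next_battery(b: float, consumed: float, harvested: int, b_max: float) -> float:
    """Battery level at the start of the next slot: min(b - consumed + harvested, b_max).

    consumed is 0 (defer), a sensing fraction, or 1 (a full unit);
    harvested is the energy arrival E_t in {0, 1} at the end of the slot.
    """
    if consumed > b:
        raise ValueError("cannot consume more energy than is stored")
    return min(b - consumed + harvested, b_max)


def harvest(q: float, rng: random.Random) -> int:
    """Draw one energy arrival E_t ~ Bernoulli(q)."""
    return 1 if rng.random() < q else 0


# Hypothetical example: capacity 3, EH probability 0.3, spending a unit whenever possible.
rng = random.Random(1)
b = 0.0
for t in range(5):
    spend = 1.0 if b >= 1 else 0.0
    b = next_battery(b, spend, harvest(0.3, rng), b_max=3.0)
    print(t, b)
```

The `min` with `b_max` models overflow: energy harvested into a full battery is lost, which is one reason why deferring too long can be wasteful.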
II-B Transmission protocol
At the beginning of each time slot, the transmitter may choose among three possible actions: deferring the transmission, channel sensing and transmitting opportunistically, and transmitting without sensing.
Deferring the transmission: This action (denoted by D) corresponds to the case in which the transmitter either believes that the channel is in the bad state or observes that its battery has low energy. If this action is chosen, there is no message exchange between the transmitter and the receiver. Hence, the receiver does not send any feedback, and the transmitter cannot obtain any knowledge about the current channel state. The scenario in which the transmitter is informed about the current channel state even when it does not transmit any data packet is equivalent to the system model investigated in .
Channel sensing and transmitting opportunistically: This action (denoted by S) corresponds to the case in which the transmitter decides to sense the channel at the beginning of the time slot. Sensing is carried out by the transmitter first sending a control/probing packet, to which the receiver responds with a packet indicating the channel state. We assume that sensing occupies a fraction $\tau$ of the time slot, with $0 < \tau < 1$, and that over the sensing period the transmitter consumes on average the same power as during data transmission. Since one energy unit sustains transmission over a full slot, sensing therefore consumes the same fraction $\tau$ of an energy unit. In the remaining $(1-\tau)$ fraction of the slot, the transmitter may transmit data at the same rate it would without channel sensing, which means that by the end of the time slot it transmits $(1-\tau)R$ bits.
If the channel is revealed to be in the bad state, the transmitter defers its transmission and saves the rest of the energy unit (i.e., $1-\tau$ units). Note that, thanks to the channel sensing capability, the transmitter wastes only a $\tau$ portion of an energy unit when the channel is in the bad state, and saves the remaining energy by deferring its transmission. As we show later in this paper, this is an important advantage in EH networks with scarce energy sources.
Transmitting without sensing: This action (denoted by T) corresponds to the case in which the transmitter attempts to transmit $R$ bits in the current time slot without sensing the channel. If the channel is in the good state, the transmission is successful and the receiver sends an ACK. Otherwise, the transmission fails and the receiver sends a NACK. Note that, in either case, the transmitter has perfect knowledge of the current channel state at the end of the slot.
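The three actions can be summarized as a per-slot rule that gives the bits delivered, the energy drawn from the battery, and what the transmitter learns about the channel. The following Python sketch encodes that rule under the assumptions of this section, writing `tau` for the sensing fraction and `rate` for the $R$ bits per slot; `play_slot` and `SlotOutcome` are hypothetical names. It also covers sensing with less than one energy unit, where the transmitter learns the channel state but cannot transmit.

```python
from typing import NamedTuple, Optional


class SlotOutcome(NamedTuple):
    bits: float              # bits delivered in this slot
    energy_used: float       # energy drawn from the battery
    observed: Optional[int]  # channel state learned (1 good, 0 bad), None if nothing learned


def play_slot(action: str, battery: float, channel: int,
              rate: float, tau: float) -> SlotOutcome:
    """Outcome of one slot under the defer (D) / sense (S) / transmit (T) protocol."""
    if action == "D":
        return SlotOutcome(0.0, 0.0, None)  # no exchange, hence no feedback
    if action == "S":
        if battery < tau:
            raise ValueError("sensing needs at least tau units of energy")
        if channel == 1 and battery >= 1:
            # Good channel: spend the remaining 1 - tau units transmitting.
            return SlotOutcome((1 - tau) * rate, 1.0, 1)
        # Bad channel, or too little energy to transmit: stay silent after sensing.
        return SlotOutcome(0.0, tau, channel)
    if action == "T":
        if battery < 1:
            raise ValueError("transmission needs one full unit of energy")
        # ACK if the channel is good, NACK otherwise; either way the state is revealed.
        return SlotOutcome(rate if channel == 1 else 0.0, 1.0, channel)
    raise ValueError(f"unknown action {action!r}")


# Hypothetical example: sensing a bad channel costs only tau.
print(play_slot("S", battery=2.0, channel=0, rate=1.0, tau=0.25))
```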
III Partially Observable Markov Decision Process (POMDP) formulation
At the beginning of each time slot, the transmitter chooses among the three possible actions based on the state of its battery and its belief about the channel state, so as to maximize a long-term discounted reward to be defined shortly. Although the transmitter is perfectly aware of its battery state, it cannot directly observe the current channel state. Hence, the problem at hand is a partially observable Markov decision process (POMDP).
Let the state of the system at time $t$ be denoted by $X_t = (B_t, p_t)$. We define the belief of the transmitter at time slot $t$, denoted by $p_t$, as the conditional probability that the channel is in the good state at the beginning of the current slot, i.e., $p_t = \Pr[G_t = 1 \mid \mathcal{H}_t]$, where the history $\mathcal{H}_t$ represents all the past actions and observations of the transmitter up to slot $t$. The transmitter's belief constitutes a sufficient statistic to characterize its optimal actions. Note that with this definition of the state, the POMDP is converted into an MDP with an uncountable state space.²

² Since sensing without transmission is possible, i.e., consuming only a fraction $\tau$ of an energy unit, the battery state can also take fractional values.
A transmission policy $\pi$ describes a set of rules that dictates which action to take depending on the history. Let $V^{\pi}(b, p)$ be the expected infinite-horizon discounted reward with initial state $X_0 = (b, p)$ under policy $\pi$, with discount factor $\beta \in [0, 1)$. The use of the expected discounted reward allows us to obtain a tractable solution, and one can gain insights into the optimal policy for the average reward when $\beta$ is close to 1. It has also been noted in the literature that $\beta$ can be interpreted as the probability that a particular user is allowed to use the channel, or as the probability that the transmitter remains active in each time slot. For an initial state $(b, p)$, the expected discounted reward has the following expression:

$$V^{\pi}(b, p) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \beta^{t}\, r(X_t, A_t) \,\middle|\, X_0 = (b, p)\right], \tag{1}$$
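where $t$ is the time index, $A_t \in \{D, S, T\}$ is the action chosen at time $t$, and $r(X_t, A_t)$ is the expected reward acquired when action $A_t$ is taken at state $X_t$. The expectation in (1) is over the state sequence distribution induced by the given transmission policy $\pi$. The expected reward when action $A$ is chosen at state $X = (b, p)$ is given as follows:

$$r\big((b, p), A\big) = \begin{cases} pR, & \text{if } A = T \text{ and } b \geq 1,\\ p(1-\tau)R, & \text{if } A = S \text{ and } b \geq 1,\\ 0, & \text{otherwise}. \end{cases} \tag{2}$$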
Since at least one energy unit is required for transmission, the reward is zero if the battery state is less than one unit. Hence, in explaining the expected reward function in (2), we consider actions when the battery state is greater than or equal to one. If the action of transmitting without sensing is chosen, $R$ bits are transmitted successfully if the channel is in the good state, and zero bits if the channel is in the bad state. Since the belief $p$ represents the probability of the channel being in the good state, the expected reward is $pR$. If the action of transmitting opportunistically is chosen, a fraction $\tau$ of the energy unit is spent sensing the channel, and the remaining energy is used for transmission if the channel is sensed to be in the good state. In this case, $(1-\tau)R$ bits are transmitted successfully. If the channel is sensed to be in the bad state, the transmitter remains silent for the rest of the time slot. The expected reward in this case is $p(1-\tau)R$. Finally, if the action of deferring the transmission is taken, the transmitter neither senses the channel nor transmits, so the reward is zero.
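A direct transcription of this expected-reward rule is shown below as a minimal sketch; `expected_reward` is a hypothetical helper, with `b` the battery state, `p` the belief, `tau` the sensing fraction, and `rate` the number of bits $R$ per slot.

```python
def expected_reward(b: float, p: float, action: str, rate: float, tau: float) -> float:
    """Immediate expected bits for battery b, belief p and action "D", "S" or "T"."""
    if b < 1 or action == "D":
        return 0.0                     # no transmission possible, or transmitter stays idle
    if action == "T":
        return p * rate                # R bits with probability p
    if action == "S":
        return p * (1 - tau) * rate    # (1 - tau) R bits, only if the channel is sensed good
    raise ValueError(f"unknown action {action!r}")


# Hypothetical values: battery 2, belief 0.6, R = 1, tau = 0.25.
for a in ("D", "S", "T"):
    print(a, expected_reward(b=2.0, p=0.6, action=a, rate=1.0, tau=0.25))
```

Note that for $p > 0$ the immediate reward of T always exceeds that of S; sensing pays off only through the future value, since a bad channel then costs $\tau$ rather than a full energy unit.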
Define the value function as

$$V(b, p) = \max_{\pi} V^{\pi}(b, p). \tag{3}$$
It is well known that the optimal value of the infinite-horizon expected reward can be achieved by a stationary policy, i.e., there exists a stationary policy $\pi^{*}$ such that $V(b, p) = V^{\pi^{*}}(b, p)$. The value function satisfies the Bellman equation

$$V(b, p) = \max_{A \in \{D, S, T\}} V_{A}(b, p), \tag{4}$$
where $V_A(b, p)$ is the action-value function, defined as the expected infinite-horizon discounted reward acquired by taking action $A$ when the state is $(b, p)$, and is given by

$$V_{A}(b, p) = r\big((b, p), A\big) + \beta\, \mathbb{E}\left[V(X') \,\middle|\, X = (b, p), A\right], \tag{5}$$

where $X'$ denotes the next state when action $A$ is chosen at state $(b, p)$. The expectation in (5) is over the distribution of possible next states. In the following, we define and explain the action-value function $V_A(b, p)$ for each action, and how the system state evolves under it.
Deferring the transmission: If this action is taken, there is no transmission, and hence no ACK or NACK from the receiver; thus, the transmitter does not learn the state of the channel. Therefore, the next belief is the probability of finding the channel in the good state given the current belief. If the transmitter had belief $p_t$ at time slot $t$, after taking action D its belief at the beginning of the next slot is updated as

$$p_{t+1} = \mathcal{T}(p_t) \triangleq \lambda_1 p_t + \lambda_0 (1 - p_t). \tag{6}$$
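In every time slot, a unit of energy is harvested with probability $q$. Thus, after taking action D, the action-value function evolves as follows:

$$V_{D}(b, p) = \beta \left[ q\, V\big(\min(b+1, B_{\max}), \mathcal{T}(p)\big) + (1-q)\, V\big(b, \mathcal{T}(p)\big) \right]. \tag{7}$$

Note that the term $\min(b+1, B_{\max})$ ensures that the battery state does not exceed the battery capacity $B_{\max}$.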
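The belief update covers three cases: after D the belief is propagated one step through the Markov chain, while after S or T the observed state (the sensing outcome or the ACK/NACK) resets it to $\lambda_1$ or $\lambda_0$, the probabilities of a good channel in the next slot given a good or a bad current slot. A minimal sketch, with `next_belief` a hypothetical helper:

```python
from typing import Optional


def next_belief(p: float, observed: Optional[int], lam1: float, lam0: float) -> float:
    """Probability that the channel is good in the next slot.

    observed: channel state revealed in the current slot by sensing or by ACK/NACK
    (1 = good, 0 = bad), or None if the transmitter deferred and learned nothing.
    """
    if observed is None:
        return lam1 * p + lam0 * (1 - p)  # one-step prediction through the Markov chain
    return lam1 if observed == 1 else lam0


# Hypothetical parameters: repeated deferrals drift toward the stationary probability.
p = 1.0
for _ in range(5):
    p = next_belief(p, None, lam1=0.9, lam0=0.2)
    print(round(p, 4))
```

Since $\mathcal{T}(p) - \pi_G = (\lambda_1 - \lambda_0)(p - \pi_G)$ for the stationary probability $\pi_G$, a transmitter that keeps deferring loses its channel knowledge geometrically fast whenever $|\lambda_1 - \lambda_0| < 1$.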
Channel sensing and transmitting opportunistically: For this action, two scenarios are possible. If $b \geq 1$ and the EH transmitter decides to transmit opportunistically, it first consumes a fraction $\tau$ of an energy unit to sense the channel and obtain the current channel state. If the channel is found to be in the good state, the remaining $1-\tau$ units of energy are used to transmit $(1-\tau)R$ bits, and the belief state is updated to $\lambda_1$ for the next time slot.

On the other hand, if channel sensing reveals the channel to be in the bad state, the transmitter defers its transmission and saves $1-\tau$ units of energy for possible future transmissions. In this case, the channel belief is updated to $\lambda_0$ for the next time slot. Based on this discussion, for $b \geq 1$ the action-value function evolves as:

$$\begin{aligned} V_{S}(b, p) = {} & p \Big[ (1-\tau)R + \beta \big( q\, V(b, \lambda_1) + (1-q)\, V(b-1, \lambda_1) \big) \Big] \\ & + (1-p)\, \beta \Big[ q\, V\big(\min(b-\tau+1, B_{\max}), \lambda_0\big) + (1-q)\, V(b-\tau, \lambda_0) \Big]. \end{aligned} \tag{8}$$
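If $\tau \leq b < 1$, transmission is not possible, since it requires at least one unit of energy. However, it is still possible to sense the channel, since sensing requires only a fraction $\tau$ of an energy unit. The transmitter may choose to do so when it believes that learning the channel state will help its future decisions. Thus, for $\tau \leq b < 1$, the action-value function evolves as:

$$\begin{aligned} V_{S}(b, p) = \beta \Big[ & p \big( q\, V(\min(b-\tau+1, B_{\max}), \lambda_1) + (1-q)\, V(b-\tau, \lambda_1) \big) \\ & + (1-p) \big( q\, V(\min(b-\tau+1, B_{\max}), \lambda_0) + (1-q)\, V(b-\tau, \lambda_0) \big) \Big]. \end{aligned} \tag{9}$$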
Transmitting without sensing: This action can only be chosen if the battery state is greater than or equal to one, i.e., $b \geq 1$.³ Under this action, the transmitter transmits regardless of the actual state of the channel, at a cost of one unit of energy. If the channel is in the good state, $R$ bits are successfully delivered to the receiver, which sends back an ACK. Otherwise, the channel is in the bad state, the transmission fails, and the receiver sends back a NACK. The channel is in the good state with probability $p$, i.e., the current belief, in which case the belief in the next time slot is $\lambda_1$; it is in the bad state with probability $1-p$, in which case the belief in the next time slot is $\lambda_0$. Hence, the action-value function evolves as:

$$\begin{aligned} V_{T}(b, p) = {} & pR + \beta \Big[ p \big( q\, V(b, \lambda_1) + (1-q)\, V(b-1, \lambda_1) \big) \\ & + (1-p) \big( q\, V(b, \lambda_0) + (1-q)\, V(b-1, \lambda_0) \big) \Big]. \end{aligned} \tag{10}$$

³ We note that in the generic MDP formulation, the same set of actions should be available in every state. This can be accommodated by redefining the reward function to assign a reward of $-\infty$ to the actions that cannot be taken in a given state. For ease of comprehension, we chose to present the formulation in this manner.
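The per-action evolution of the value function described in this section can be transcribed into code that evaluates $V_D$, $V_S$, and $V_T$ for any candidate value function `V(b, p)`. This is a sketch under the model assumptions above (Bernoulli($q$) arrivals at the end of the slot, capacity $B_{\max}$, sensing fraction $\tau$, discount $\beta$, transition probabilities $\lambda_1$ and $\lambda_0$); the `Params` container, its field names, and the function names are our own.

```python
from dataclasses import dataclass
from typing import Callable

Value = Callable[[float, float], float]  # candidate value function V(b, p)


@dataclass(frozen=True)
class Params:
    q: float      # EH probability
    lam1: float   # P[good next | good now]
    lam0: float   # P[good next | bad now]
    b_max: float  # battery capacity
    beta: float   # discount factor
    rate: float   # bits per slot in the good state (R)
    tau: float    # sensing cost, as a fraction of an energy unit


def _after_harvest(V: Value, b: float, p_next: float, s: Params) -> float:
    """q V(min(b + 1, b_max), p') + (1 - q) V(b, p'): average over the end-of-slot arrival."""
    return s.q * V(min(b + 1, s.b_max), p_next) + (1 - s.q) * V(b, p_next)


def v_defer(V: Value, b: float, p: float, s: Params) -> float:
    predicted = s.lam1 * p + s.lam0 * (1 - p)  # no observation: propagate the belief
    return s.beta * _after_harvest(V, b, predicted, s)


def v_sense(V: Value, b: float, p: float, s: Params) -> float:
    if b < s.tau:
        raise ValueError("sensing needs at least tau units of energy")
    if b >= 1:  # good channel: sense, then transmit with the remaining 1 - tau units
        good = (1 - s.tau) * s.rate + s.beta * _after_harvest(V, b - 1, s.lam1, s)
    else:       # too little energy to transmit: only the sensing cost is paid
        good = s.beta * _after_harvest(V, b - s.tau, s.lam1, s)
    bad = s.beta * _after_harvest(V, b - s.tau, s.lam0, s)  # stay silent, keep 1 - tau
    return p * good + (1 - p) * bad


def v_transmit(V: Value, b: float, p: float, s: Params) -> float:
    if b < 1:
        raise ValueError("transmission needs one full unit of energy")
    good = s.rate + s.beta * _after_harvest(V, b - 1, s.lam1, s)  # ACK
    bad = s.beta * _after_harvest(V, b - 1, s.lam0, s)            # NACK
    return p * good + (1 - p) * bad
```

Plugging in `V = lambda b, p: 0.0` recovers the immediate expected rewards, which is a quick sanity check of the transcription.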
IV Structure of the Optimal Policy
In this section, we prove that the optimal policy has a threshold-type structure with respect to the belief state. First, we establish some properties of the value function, beginning with the convexity of the optimal value function with respect to the belief state.
Lemma 1. For any given $b$, $V(b, p)$ is convex in $p$.
The proof is given in . ∎
In the following lemma, we show that the value function is a non-decreasing function of the battery state $b$. This lemma provides the intuition for why deferring or sensing actions are advantageous in some states: by deferring, the transmitter lets its battery, and hence its value function, grow without consuming any energy, and by sensing it avoids spending a full energy unit when the channel is bad, consuming only a fraction $\tau$. Moreover, the lemma states that the value function is also non-decreasing in the belief state $p$.
Lemma 2. Given any belief $p$, $V(b_1, p) \geq V(b_2, p)$ when $b_1 \geq b_2$. Moreover, for a fixed battery state $b$, if $p_1 \geq p_2$, then $V(b, p_1) \geq V(b, p_2)$.
The proof is given in . ∎
Finally, Theorem 1 below shows that the optimal solution of the problem is a threshold policy with two or three thresholds, depending on the system parameters. The threshold values depend on the state of the battery.
Theorem 1. For any and , there exist thresholds , all of which are functions of the battery state , such that, for
and for ,
The detailed proof of the theorem is given in . ∎
Theorem 1 proves that at any battery state $b \geq 1$, at most three threshold values are sufficient to characterize the optimal policy, whereas two thresholds suffice for $b < 1$. However, the optimal policy can be even simpler for some battery states and some instances of the problem, since some of the thresholds may coincide.
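Once optimal actions have been computed on an increasing grid of belief values for a fixed battery state, the thresholds of Theorem 1 can be approximated by locating the grid points where the action changes. A minimal sketch, with `switch_points` a hypothetical helper and the example policy row made up for illustration:

```python
def switch_points(beliefs, actions):
    """Approximate thresholds along an increasing belief grid for one battery state.

    Returns (threshold, action_below, action_above) triples, placing each threshold
    midway between the two consecutive grid beliefs where the action changes.
    """
    points = []
    for k in range(1, len(beliefs)):
        if actions[k] != actions[k - 1]:
            points.append(((beliefs[k - 1] + beliefs[k]) / 2, actions[k - 1], actions[k]))
    return points


# Hypothetical policy row: defer, then sense, then transmit (two switch points).
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
row = ["D", "D", "S", "S", "T"]
print(switch_points(grid, row))
```

With a finite belief grid, each threshold is only located to within half the grid spacing, so a finer grid is needed when thresholds lie close together.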
V Numerical Results
In this section we use numerical techniques to characterize the optimal policy, and evaluate its performance. We utilize the value iteration algorithm to calculate the optimal value function. We numerically identify the thresholds for the optimal policy for different scenarios. We also evaluate the performance of the optimal policy, and compare it with some alternative policies in terms of throughput.
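The value iteration step can be sketched as follows. This is a minimal Python/NumPy sketch, not the authors' implementation: it assumes that $1/\tau$ is an integer (`n_sense`) so that battery levels lie on a grid of multiples of $\tau$, discretizes the belief on a uniform grid, and linearly interpolates the value function at off-grid beliefs such as $\lambda_1 p + \lambda_0 (1-p)$. The function name and its arguments are hypothetical.

```python
import numpy as np


def value_iteration(q, lam1, lam0, b_max, beta, rate, n_sense,
                    n_belief=201, tol=1e-8, max_iter=10_000):
    """Approximate V(b, p) and the optimal action on a (battery, belief) grid.

    Assumes tau = 1 / n_sense, so battery levels are multiples of tau and
    b_max * n_sense should be an integer; off-grid beliefs are linearly interpolated.
    Returns (battery_levels, beliefs, V, policy) with policy entries "D", "S", "T".
    """
    tau = 1.0 / n_sense
    unit = n_sense                                     # one energy unit, in grid steps
    n_b = int(round(b_max * n_sense)) + 1              # battery indices 0 .. b_max / tau
    beliefs = np.linspace(0.0, 1.0, n_belief)
    predicted = lam1 * beliefs + lam0 * (1 - beliefs)  # belief after deferring
    V = np.zeros((n_b, n_belief))

    def after_harvest(i, p_next):
        """q V(min(i + unit, top), p') + (1 - q) V(i, p'), interpolated in p'."""
        up = min(i + unit, n_b - 1)
        return (q * np.interp(p_next, beliefs, V[up])
                + (1 - q) * np.interp(p_next, beliefs, V[i]))

    for _ in range(max_iter):
        Q = np.full((3, n_b, n_belief), -np.inf)       # action values for D, S, T
        for i in range(n_b):
            Q[0, i] = beta * after_harvest(i, predicted)
            if i >= 1:                                 # sensing needs tau (one grid step)
                bad = beta * after_harvest(i - 1, lam0)
                if i >= unit:
                    good = (1 - tau) * rate + beta * after_harvest(i - unit, lam1)
                else:
                    good = beta * after_harvest(i - 1, lam1)
                Q[1, i] = beliefs * good + (1 - beliefs) * bad
            if i >= unit:                              # transmitting needs a full unit
                good = rate + beta * after_harvest(i - unit, lam1)
                bad = beta * after_harvest(i - unit, lam0)
                Q[2, i] = beliefs * good + (1 - beliefs) * bad
        V_new = Q.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    policy = np.array(["D", "S", "T"])[Q.argmax(axis=0)]
    return np.arange(n_b) * tau, beliefs, V, policy


# Hypothetical parameters, not those used for the figures in this section.
levels, beliefs, V, policy = value_iteration(q=0.3, lam1=0.9, lam0=0.2, b_max=3,
                                             beta=0.9, rate=1.0, n_sense=4, n_belief=51)
for level, row in zip(levels, policy):
    print(f"b = {level:.2f}:", "".join(row))
```

Because $\beta < 1$ and linear interpolation is a convex combination of grid values, each update is a $\beta$-contraction in the sup norm, so the iteration converges geometrically; the belief-grid resolution limits how precisely the thresholds can be read off.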
V-A Optimal policy evaluation
In the following, we assume that , , , , , and . The optimal policy is evaluated using the value iteration algorithm. In Fig. 2, each state is illustrated with a different color corresponding to the optimal action at that state: blue areas correspond to the states at which deferring the transmission is optimal, green areas to the states at which transmitting opportunistically is optimal, and yellow areas to the states at which transmitting without sensing is optimal. As seen in Fig. 2, any of the three policy types (one-, two-, or three-threshold) may be optimal, depending on the battery state. For example, when the battery state is , a one-threshold policy is optimal: the transmitter defers its transmission up to a belief state of and transmits without sensing beyond this value, never sensing the channel for any value of the belief state. On the other hand, when the battery state is , a two-threshold policy is optimal, and when the battery state is , a three-threshold policy is optimal. Considering the low probability of energy arrivals () and the relatively high cost of sensing (), it is interesting to notice that the transmitter senses the channel even when its battery state is below the transmission threshold, i.e., .
Next, we investigate the effect of the sensing cost $\tau$ on the optimal policy. To illustrate this effect, we choose the system parameters as before, but increase the sensing cost from to . The optimal action regions for this setup are shown in Fig. 3. Comparing Fig. 2 and Fig. 3, it is evident that a higher sensing cost results in less incentive to sense the channel. We observe in Fig. 3 that the green area has shrunk almost to nothing, i.e., the transmitter is more likely to take a risk and transmit without sensing, or to defer its transmission, when sensing consumes more energy.
V-B Throughput performance
In this section, we compare the performance of the optimal policy with two alternative policies, namely a greedy policy and a single-threshold policy. In the greedy policy, the transmitter transmits whenever its battery holds at least one unit of energy. In the single-threshold policy, there are only two actions: defer (D) or transmit (T). We optimize the threshold corresponding to each battery state for the single-threshold policy using the value iteration algorithm. Choosing the parameters , , , , , , we plot the throughput achieved by these three policies in Fig. 4 as a function of the EH rate $q$.
As expected, the greedy policy performs the worst, since it does not exploit the transmitter's knowledge about the state of the channel. The single-threshold policy, which simply exploits the ACK/NACK feedback from the receiver, achieves a higher throughput than the greedy policy for all values of the EH rate. Further introducing the channel sensing action increases the throughput of the system substantially. The improvement is particularly large for mid-range values of $q$, for which the transmitter benefits more from the flexibility offered by three actions.
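Throughput comparisons of this kind can be estimated by simulating a policy over a long sample path. The sketch below is a hypothetical Monte Carlo helper that reports the average number of bits per slot (not the discounted reward); a policy is any function mapping (battery, belief) to "D", "S", or "T". The greedy policy is shown as a usage example, and the 0.5 threshold in the second example is an arbitrary placeholder, not an optimized value.

```python
import random
from typing import Callable

Policy = Callable[[float, float], str]  # (battery b, belief p) -> "D", "S" or "T"


def simulate_throughput(policy: Policy, *, q: float, lam1: float, lam0: float,
                        b_max: float, rate: float, tau: float, n_slots: int = 100_000,
                        b0: float = 0.0, p0: float = 0.5, seed: int = 0) -> float:
    """Average bits per slot delivered by `policy` along one simulated sample path."""
    rng = random.Random(seed)
    b, p = b0, p0
    g = 1 if rng.random() < p0 else 0                 # initial channel drawn from the belief
    total = 0.0
    for _ in range(n_slots):
        a = policy(b, p)
        if a == "T":
            if b < 1:
                raise ValueError("policy chose T with less than one energy unit")
            bits, used, observed = (rate if g else 0.0), 1.0, g  # ACK/NACK reveals g
        elif a == "S":
            if b < tau:
                raise ValueError("policy chose S with less than tau units")
            if g == 1 and b >= 1:
                bits, used = (1 - tau) * rate, 1.0    # sense, then transmit
            else:
                bits, used = 0.0, tau                 # sense only, stay silent
            observed = g
        else:
            bits, used, observed = 0.0, 0.0, None     # defer: nothing learned
        total += bits
        arrival = 1 if rng.random() < q else 0
        b = round(min(b - used + arrival, b_max), 9)  # rounding removes float drift
        if observed is None:
            p = lam1 * p + lam0 * (1 - p)
        else:
            p = lam1 if observed == 1 else lam0
        g = 1 if rng.random() < (lam1 if g == 1 else lam0) else 0
    return total / n_slots


def greedy(b: float, p: float) -> str:
    """Transmit whenever at least one energy unit is available."""
    return "T" if b >= 1 else "D"


# Hypothetical parameters, for illustration only.
common = dict(q=0.3, lam1=0.9, lam0=0.2, b_max=3.0, rate=1.0, tau=0.25)
print("greedy:", simulate_throughput(greedy, **common))
print("threshold 0.5:", simulate_throughput(lambda b, p: "T" if b >= 1 and p >= 0.5 else "D", **common))
```

A policy obtained by value iteration can be plugged in the same way by looking up its action at the nearest grid point of the current battery and belief.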
VI Conclusions and Future Work
In this work, we considered an EH transmitter equipped with a finite-capacity battery, operating over a time-varying wireless channel with memory, modeled as a Gilbert-Elliot channel. The transmitter receives ACK/NACK feedback after each transmission, which can be used to track the channel. We further considered channel sensing, which the transmitter can use to learn the current channel state at a certain energy and time cost. Therefore, at the beginning of each time slot, the transmitter has three possible actions to maximize the total expected discounted number of bits transmitted over an infinite time horizon: i) deferring the transmission to save its energy for future use, ii) transmitting at a rate of $R$ bits per time slot, and iii) sensing the channel to reveal the current channel state by consuming a portion of its energy and time, followed by transmission at a reduced rate that consumes the remainder of the energy unit, only if the channel is in the good state. We formulated the problem as a POMDP, which was then converted into an MDP with a continuous state space by introducing a belief parameter for the channel state. We then proved that the optimal policy is a threshold policy, where the threshold values on the belief parameter depend on the battery state. We found the optimal threshold values numerically using the value iteration algorithm. In terms of throughput, we compared the optimal policy with two alternative policies: the greedy policy and a single-threshold policy that does not have channel sensing capability. We have shown through simulations that the channel sensing capability improves the performance significantly, thanks to the increased adaptability to channel conditions that it provides. For future studies, we will consider the case in which sensing is imperfect. Another interesting problem is the case in which the EH transmitter can choose the duration of sensing, which determines its accuracy.
-  J. A. Paradiso and T. Starner. Energy scavenging for mobile and wireless electronics. IEEE Pervasive Computing, 4(1):18–27, Jan. 2005.
-  G. Park, T. Rosing, M. D. Todd, C. R. Farrar, and W. Hodgkiss. Energy harvesting for structural health monitoring sensor networks. Journal of Infrastructure Systems, 14(1):64–79, Mar. 2008.
-  D. Gunduz, K. Stamatiou, N. Michelusi, and M. Zorzi. Designing intelligent energy harvesting communication systems. IEEE Communications Magazine, 52(1):210–216, Jan. 2014.
-  H. Li, C. Huang, P. Zhang, S. Cui, and J. Zhang. Distributed opportunistic scheduling for energy harvesting based wireless networks: A two-stage probing approach. IEEE/ACM Transactions on Networking, 24(3):1618–1631, Jun. 2016.
-  Q. Zhang and S. A. Kassam. Finite-state Markov model for Rayleigh fading channels. IEEE Transactions on Communications, 47(11):1688–1692, Nov. 1999.
-  E. N. Gilbert. Capacity of a burst-noise channel. The Bell System Technical Journal, 39(5):1253–1265, Sep. 1960.
-  M. L. Ku, Y. Chen, and K. J. R. Liu. Data-driven stochastic models and policies for energy harvesting sensor communications. IEEE Journal on Selected Areas in Communications, 33(8):1505–1520, Aug. 2015.
-  N. Michelusi, K. Stamatiou, and M. Zorzi. Transmission policies for energy harvesting sensors with time-correlated energy supply. IEEE Transactions on Communications, 61(7):2988–3001, Jul. 2013.
-  A. Hentati, F. Abdelkefi, and W. Ajib. Energy allocation for sensing and transmission in WSNs with energy harvesting Tx/Rx. In IEEE Vehicular Technology Conference (VTC Fall), pages 1–5, Sep. 2015.
-  S. Mao, M. H. Cheung, and V. W. S. Wong. Joint energy allocation for sensing and transmission in rechargeable wireless sensor networks. IEEE Transactions on Vehicular Technology, 63(6):2862–2875, Jul. 2014.
-  P. Blasco and D. Gunduz. Multi-access communications with energy harvesting: A multi-armed bandit model and the optimality of the myopic policy. IEEE Journal on Selected Areas in Communications, 33(3):585–597, Mar. 2015.
-  C. K. Ho and R. Zhang. Optimal energy allocation for wireless communications with energy harvesting constraints. IEEE Transactions on Signal Processing, 60(9):4808–4818, Sep. 2012.
-  B. T. Bacinoglu and E. Uysal-Biyikoglu. Finite-horizon online transmission scheduling on an energy harvesting communication link with a discrete set of rates. Journal of Communications and Networks, 16(3):293–300, Jun. 2014.
-  M. S. H. Abad, D. Gunduz, and O. Ercetin. Energy harvesting wireless networks with correlated energy sources. In 2016 IEEE Wireless Communications and Networking Conference, pages 1–6, Apr. 2016.
-  M. Kashef and A. Ephremides. Optimal packet scheduling for energy harvesting sources on time varying wireless channels. Journal of Communications and Networks, 14(2):121–129, Apr. 2012.
-  A. Aprem, C. R. Murthy, and N. B. Mehta. Transmit power control policies for energy harvesting sensors with retransmissions. IEEE Journal of Selected Topics in Signal Processing, 7(5):895–906, Oct. 2013.
-  Q. Zhao, L. Tong, A. Swami, and Y. Chen. Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework. IEEE Journal on Selected Areas in Communications, 25(3):589–600, Apr. 2007.
-  A. Laourine and L. Tong. Betting on Gilbert-Elliot channels. IEEE Transactions on Wireless Communications, 9(2):723–733, Feb. 2010.
-  W. S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1):47–65, Dec. 1991.
-  P. Blasco, D. Gunduz, and M. Dohler. A learning theoretic approach to energy harvesting communication system optimization. IEEE Transactions on Wireless Communications, 12(4):1872–1882, Apr. 2013.
-  M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.
-  M.S.H. Abad, D. Gunduz, and O. Ercetin. Channel sensing and communication over a time-correlated channel with an energy harvesting transmitter. arXiv:1703.10519 [cs.IT], Mar. 2017.