The intermittence of renewable energy sources poses a challenge to present electric energy generation systems, electric utilities, and electricity markets, which are built on the notion that supply follows demand (Denholm et al., 2015). Given the lack of scalable storage solutions (de Sisternes et al., 2016), the integration of these sources into present energy markets can be most efficiently addressed by demand response (Bird et al., 2013). Demand response denotes changes in electricity usage by consumers from their normal consumption patterns in response to changes in electricity prices or to other incentives (Agency et al., 2003). Through demand response with real-time pricing (Agency et al., 2003), utilities can shift customer load demand to periods of excess renewable generation, i.e., periods when renewable generation is high while load demand is low. Shifting customer load demand in this way can reduce the greenhouse gas emissions of the natural gas and coal plants used at times of peak demand or for backup generation (Bird et al., 2013).
2 Related work
Previous research on using reinforcement learning for dynamic pricing in hierarchical markets has shown promising results, both in balancing supply and demand and in cutting costs for customers and electric utilities alike (Lu et al., 2018). When a reinforcement learning model for customer load scheduling is also included, costs are reduced on both sides compared to myopic optimization (Kim et al., 2016). However, these methods have not yet been applied in a physical environment, owing to the lack of a simulation environment that can provide a reliable estimate of the safety and cost of the proposed method. To this end, we propose a simulation environment, and we add weather data as an input to our proposed agent.
3 Reinforcement Learning Approach
To model this problem as a reinforcement learning problem (Sutton and Barto, 2018), we choose the electric utility as the pricing agent. The state s_t is represented by the momentary and future renewable electricity supply to the electric utility, the momentary customer load demand, and the momentary and future real-world features used for electricity demand forecasting (Gajowniczek and Zabkowski, 2017; Paterakis et al., 2017; Mocanu et al., 2016). At each timestep t, the electric utility, acting as the agent, selects an action a_t, represented by the momentary and future electricity prices. This action is transmitted to the customers, who, as part of the environment, respond with a load demand. This load demand is then used to calculate the reward r_t. The pricing agent is now in a new state s_{t+1}, and the whole process is repeated for the duration of the previously chosen simulation period. The environment and the agent are shown in Fig. 1.
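The interaction loop above can be sketched as follows. `UtilityEnv` and its demand model are illustrative stand-ins, not CityLearn or any cited system, and the fixed price is a placeholder for the learned pricing policy:

```python
import numpy as np

class UtilityEnv:
    """Toy pricing environment: at each timestep the agent signals a price
    and the (simulated) customers respond with a load demand."""

    def __init__(self, horizon=24, seed=0):
        self.rng = np.random.default_rng(seed)
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.state = self._sample_state()
        return self.state

    def _sample_state(self):
        supply = self.rng.uniform(0.5, 1.5)   # momentary renewable supply
        demand = self.rng.uniform(0.5, 1.5)   # momentary load demand
        return np.array([supply, demand, float(self.t % 24)])  # + temporal feature

    def step(self, price):
        supply, base_demand, _ = self.state
        demand = base_demand / (1.0 + price)  # toy price responsiveness
        reward = -(supply - demand) ** 2      # penalise supply/demand mismatch
        self.t += 1
        self.state = self._sample_state()
        return self.state, reward, self.t >= self.horizon

env = UtilityEnv()
state, done, total = env.reset(), False, 0.0
while not done:
    price = 0.5                               # placeholder pricing policy
    state, reward, done = env.step(price)
    total += reward
```

The `reset`/`step` structure mirrors the OpenAI Gym convention used by the simulation environment proposed below.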
Since renewable electricity sources are often not in the vicinity of customers, we provide the future supply as an input. The momentary demand, the momentary and future renewable energy, its price, the weather data, and the temporal data are vectors of a fixed dimension, while the energy selling price, i.e. the action, is of dimension 1. This way, the pricing agent formulates the price using its expectation of future states as input. The size of the timestep and the length of the look-ahead window are treated as hyperparameters of the learning problem.
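Assembling the state vector might then look as follows; the look-ahead length `K` and all feature names are our own illustrative choices, since the exact dimensions are left as hyperparameters:

```python
import numpy as np

K = 4  # assumed look-ahead window (a hyperparameter in the paper)

def build_state(demand_now, supply, purchase_price, weather, temporal):
    """Flatten the momentary demand plus the K-step forecast features
    into a single state vector. All names are illustrative."""
    assert all(len(x) == K for x in (supply, purchase_price, weather, temporal))
    return np.concatenate([[demand_now], supply, purchase_price, weather, temporal])

s_t = build_state(
    demand_now=1.2,
    supply=[0.8, 0.9, 1.1, 1.0],           # momentary + future renewable supply
    purchase_price=[0.3, 0.3, 0.25, 0.2],  # price the utility pays for that supply
    weather=[18.0, 18.5, 19.0, 19.5],      # e.g. temperature forecast
    temporal=[8, 9, 10, 11],               # e.g. hour of day
)
```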
3.1 Reward function
The two objectives of the proposed pricing agent are to decrease the difference between the supply of renewable energy and the demand, and to keep the energy utility profitable. We propose a global reward function that is a linear combination of two sub-rewards, shown in Eq. (1), following the multi-objective analysis of return proposed by Dulac-Arnold et al. (2019).
The coefficients in (1) are hyperparameters of the learning problem and are initially set to 1. The sub-reward function in (2) calculates the profit of the pricing agent as the difference between the selling price and the purchase price. The sub-reward function in (3) is the negative square of the difference between the available renewable energy and the demand. The negative sign and the squared form of (3) reflect the objective of reducing both the positive and the negative difference between renewable supply and demand.
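A minimal sketch of this reward structure, assuming the linear combination in (1) uses weights w1 and w2 (both initially 1) and that the profit term scales with the energy sold; variable names are ours:

```python
def profit_reward(selling_price, purchase_price, demand):
    # Sub-reward (2): utility profit, (selling - purchase) price per unit
    # of energy, times the energy sold (an assumption on our part)
    return (selling_price - purchase_price) * demand

def balance_reward(supply, demand):
    # Sub-reward (3): negative squared supply/demand gap, penalising
    # surplus and shortage symmetrically
    return -(supply - demand) ** 2

def reward(selling_price, purchase_price, supply, demand, w1=1.0, w2=1.0):
    # Global reward (1): linear combination of the two sub-rewards
    return (w1 * profit_reward(selling_price, purchase_price, demand)
            + w2 * balance_reward(supply, demand))
```

Note that the squared form makes the balance term indifferent to the sign of the gap, as stated above.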
4.1 Simulation environment
Since real-time testing of the proposed model is costly and no dataset is available for batch off-line training, we propose building a new simulation environment. In the proposed environment, customers are represented by a number of previously trained demand response agents. As a training ground for these agents, we propose CityLearn (Vázquez-Canteli et al., 2019), an OpenAI Gym environment (Brockman et al., 2016). CityLearn enables training and evaluation of autonomous and collaborative reinforcement learning models for demand response in residential buildings. In order to model buildings that are not fitted with energy storage, we remove the storage capabilities of some of the buildings in CityLearn. The cost function for the customer agents is chosen to minimize both the peak energy demand and the cumulative energy cost. To improve the generalization of the pricing agent to emerging smart cities and neighbourhoods, we propose training both independent and cooperative customer agents. An independent agent is not aware of the actions of other agents, while a cooperative agent values its actions in conjunction with the actions of other agents (Claus and Boutilier, 1998). After evaluating the performance of the customer agents, a number of different simulation environments are built by combining trained customer agents. The distribution of customer agents in each environment is set so that the pool of simulation environments properly models the momentary and future price responsiveness of the physical environment.
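Building the pool of simulation environments by mixing trained customer agents could be sketched as follows; the agent attributes, mixing ratios, and class names are illustrative assumptions, not part of CityLearn:

```python
import random

class CustomerAgent:
    """Stand-in for a trained demand response agent."""
    def __init__(self, kind, has_storage):
        self.kind = kind                # "independent" or "cooperative"
        self.has_storage = has_storage  # some buildings have no storage

def build_environment_pool(n_envs=5, n_customers=20, coop_fraction=0.5,
                           storage_fraction=0.7, seed=0):
    """Each environment is a distinct mix of customer agents, so that the
    pool spans a range of price responsiveness."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n_envs):
        agents = [
            CustomerAgent(
                kind="cooperative" if rng.random() < coop_fraction else "independent",
                has_storage=rng.random() < storage_fraction,
            )
            for _ in range(n_customers)
        ]
        pool.append(agents)
    return pool

pool = build_environment_pool()
```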
4.2 Training of the pricing agent
The training of the pricing agent is done in two phases. In the first phase, the pricing agent is trained using model-agnostic meta-learning (Finn et al., 2017) across all simulation environments. The second phase, in which the pricing agent is trained and run in the physical environment, starts after a certain performance threshold is reached in the simulation environments. The training in the first phase should increase the sample efficiency of the pricing agent in the second phase, and should reduce the costs and risks of training in the physical environment. Training on multiple environments with different distributions of customer agents should also increase the robustness of the pricing agent in the second phase (Dulac-Arnold et al., 2019).
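The first phase can be illustrated with a first-order variant of model-agnostic meta-learning on toy one-dimensional tasks, each task standing in for one simulation environment; the quadratic losses, task optima, and learning rates are our own assumptions:

```python
# Each task i has loss L_i(theta) = (theta - c_i)^2, standing in for the
# pricing agent's objective in one simulation environment.

def grad(theta, c):
    return 2.0 * (theta - c)  # derivative of (theta - c)^2

def fomaml(task_optima, theta=0.0, inner_lr=0.1, meta_lr=0.05, steps=200):
    """First-order MAML: adapt per task with one inner gradient step,
    then meta-update theta using the post-adaptation gradients."""
    for _ in range(steps):
        meta_grads = []
        for c in task_optima:
            adapted = theta - inner_lr * grad(theta, c)  # inner adaptation
            meta_grads.append(grad(adapted, c))          # gradient after adaptation
        theta -= meta_lr * sum(meta_grads) / len(meta_grads)
    return theta

theta_meta = fomaml([-1.0, 0.0, 3.0])  # initialization adaptable to every task
```

The resulting initialization sits close to all task optima, which is what should make fine-tuning in the physical environment sample-efficient.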
4.3 Safety and Explainability
To ensure the safety of the pricing agent operating in a physical environment, we propose constraints on the price it signals to the customers. The values of the constraints should be set by the electric utility, according to its pricing policy and to market regulations. To evaluate the safety of an algorithm with respect to the constraints, we propose using a summary of all constraint violations, as in Dalal et al. (2018). To evaluate the impact these constraints have on the performance of the pricing agent, we propose learning a policy as a function of the constraint level, as in Boutilier and Lu (2016) and Carrara et al. (2018). This should inform the human operators of the pricing agent about the trade-offs between the constraint level and the expected return (Dulac-Arnold et al., 2019). To further improve the explainability of the pricing agent, we propose tracking its performance on the two objectives of the reward function separately. This way, the human operators gain insight into the performance of the policy in use.
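A minimal sketch of such a price constraint, assuming simple box bounds and a running violation count; the bounds are illustrative and would in practice be set by the utility:

```python
class PriceConstraint:
    """Clip the agent's proposed price to an allowed band and keep a
    summary count of violations, for safety evaluation."""

    def __init__(self, p_min, p_max):
        self.p_min, self.p_max = p_min, p_max
        self.violations = 0

    def apply(self, proposed_price):
        if not (self.p_min <= proposed_price <= self.p_max):
            self.violations += 1          # log every out-of-band proposal
        return min(max(proposed_price, self.p_min), self.p_max)

guard = PriceConstraint(p_min=0.05, p_max=0.60)
safe_prices = [guard.apply(p) for p in [0.02, 0.30, 0.75]]
```

The violation count is the kind of summary statistic that could be reported to operators alongside the per-objective reward tracking described above.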
5 Conclusion and Further Work
In this paper, we propose a pricing agent and an appropriate simulation environment that can be used for training and evaluating the agent. We address the challenges of safety, robustness, and sample efficiency of the pricing agent, which can increase the cost of deployment in a physical environment. After implementing the proposed pricing agent and evaluating the results, we propose further training of the customer agents from the simulation environment. They would keep their original reward function, but would now be trained in an environment where the price signal is responsive to their actions. This could further improve their performance in terms of reducing peak energy demand. These additionally trained customer agents could then form an environment for evaluating the pricing agent when customers are fully responsive to price signals.
References
- Agency et al. (2003). The power to choose: demand response in liberalised electricity markets. Energy Market Reform, OECD/IEA.
- Bird et al. (2013). Integrating variable renewable energy: challenges and solutions. NREL Technical Report.
- Boutilier and Lu (2016). Budget allocation using weakly coupled, constrained Markov decision processes. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI-16), New York, NY, pp. 52–61.
- Brockman et al. (2016). OpenAI Gym.
- Carrara et al. (2018). A fitted-Q algorithm for budgeted MDPs. In European Workshop on Reinforcement Learning (EWRL).
- Claus and Boutilier (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, AAAI '98/IAAI '98, USA, pp. 746–752.
- Dalal et al. (2018). Safe exploration in continuous action spaces. CoRR abs/1801.08757.
- de Sisternes et al. (2016). The value of energy storage in decarbonizing the electricity sector. Applied Energy 175, pp. 368–379.
- Denholm et al. (2015). Overgeneration from solar energy in California: a field guide to the duck chart. NREL Technical Report.
- Dulac-Arnold et al. (2019). Challenges of real-world reinforcement learning. CoRR abs/1904.12901.
- Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. CoRR abs/1703.03400.
- Gajowniczek and Zabkowski (2017). Two-stage electricity demand modeling using machine learning algorithms. Energies 2017, pp. 1547–1571.
- Kim et al. (2016). Dynamic pricing and energy consumption scheduling with reinforcement learning. IEEE Transactions on Smart Grid 7 (5), pp. 2187–2198.
- Lu et al. (2018). A dynamic pricing demand response algorithm for smart grid: reinforcement learning approach. Applied Energy 220 (C), pp. 220–230.
- Mocanu et al. (2016). Demand forecasting at low aggregation levels using factored conditional restricted Boltzmann machine. In 2016 Power Systems Computation Conference (PSCC), pp. 1–7.
- Paterakis et al. (2017). Deep learning versus traditional machine learning methods for aggregated energy demand prediction. In 2017 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), pp. 1–6.
- Sutton and Barto (2018). Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning series, MIT Press.
- Vázquez-Canteli et al. (2019). CityLearn v1.0: an OpenAI Gym environment for demand response with deep reinforcement learning. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys '19), New York, NY, USA, pp. 356–357.