Can an AI agent hit a moving target?

10/06/2021
University of Warwick

As the economies we live in evolve over time, it is imperative that economic agents in models form expectations that can adjust to changes in the environment. This exercise offers a plausible expectation formation model that connects to computer science, psychology and neural science research on learning and decision-making, and applies it to an economy with a policy regime change. Employing the actor-critic model of reinforcement learning, an agent born into a fresh environment learns by first interacting with that environment: it takes exploratory actions and observes the corresponding stimulus signals. This interactive experience is then used to update its subjective belief about the world. I show, through several simulation experiments, that the agent adjusts its subjective belief when facing an increase in the inflation target. Moreover, the subjective belief evolves according to the agent's experience in the world.



1 Introduction

Agents’ expectation formation has long been a fundamental building block of many macroeconomic models. The economies we live in are constantly evolving in ways that are not perfectly understood by either private agents or policy makers. How an economic agent adjusts its expectations in the face of these non-stationarities is central to understanding where the economy is heading and to assessing relevant policies. This paper builds on the long line of research modelling economic agents’ expectation formation process, particularly work that studies deviations from the full information rational expectation assumption. It proposes a plausible expectation formation model in which agents face constraints in processing information and do not know how to form model-consistent beliefs. To test this expectation formation model, an experiment is conducted. Inspired by the early literature on the accelerationist controversy, it observes how the agent adjusts its subjective belief in an environment in which the government changes its inflation target.

Borrowing from the artificial intelligence (AI) literature, how an agent forms expectations is modelled through the actor-critic framework of reinforcement learning. AI algorithms have produced many successes outside of economics. For example, WaveNet powers speech synthesis for Google Assistant and recent Android devices (Wavenet), and DeepMind's AI reduced Google's data centre cooling bill by 40% (Googlecooling). The key motivations for its application in this paper are that the algorithm naturally incorporates bounded rationality, and that it is closely connected to psychology and neural science research on human decision making.

In this setting, the agent first needs to interact with an environment by taking random actions and receiving corresponding reward signals, which is inspired by learning through trial and error in the psychology of animal learning (SB2018). The randomness of the action is linked to how exploratory the agent is in trying different options in an action space, which can be viewed as a hardwired trait in the agent’s brain. The agent has to exploit what it has already experienced in order to obtain rewards, but it also has to explore in order to make better action selections in the future. Moreover, exploration is directly linked to the agent’s experience. With a high level of exploration, the agent can experience a wide range of possibilities. This ensures that the agent has a sufficient amount of experience to learn that the environment is undergoing a change (e.g., a monetary policy target shift). The experience collected is processed with the goal of finding a decision-making strategy that maximises expected future return. This expected future return is a subjective belief of the learning agent, and it evolves based on past experience. Lastly, both the decision-making strategy and the value function are learnt from randomly initialised neural networks. This implies that the agent is not learning about a particular (group of) model parameters, nor how to process a certain set of information, but learning in general how to make decisions based on its past experience, and how to adjust its subjective beliefs about the world.

To highlight its fitness for modelling economic agents’ expectation formation process, I employ an environment where a government changes its inflation target to ‘fool’ the agent in this economy. This is motivated by the early literature on the accelerationist controversy. An accelerationist or backward-looking Phillips curve with adaptive expectations permits a trade-off between inflation and unemployment. It was then argued that a government could exploit such an opportunity to maintain a low rate of unemployment by accelerating the money supply process. As the agent in this AI algorithm also learns from past experience, it holds a form of adaptive expectations. Would this AI agent also make systematic errors, and if not, how does it adjust to the new policy regime? In other words, can this AI agent learn to converge from one rational expectation equilibrium to another, and hit the moving target of monetary policy?

Related Literature

This paper builds on the vast existing literature on modelling agents’ expectation formation process. The importance of agents’ expectations can be traced back to Keynes and his idea of how expectations determine output and employment (Keynes:1936). Fast forward two decades, and Cagan1956 and Friedman1957 formalised the idea of adaptive expectations. In combination with the Phillips curve, it generated a large debate on whether and how a government could exploit a possible negative relationship between inflation and unemployment. However, adaptive expectations were criticised for assuming that an agent’s inflation forecast simply equals past period inflation. What came as an alternative, and soon revolutionised macroeconomics, was the rational expectation hypothesis (LUCAS1972103; LUCAS197619; Sargent1971; Sargent19733). Under this hypothesis, agents are assumed to form model-consistent beliefs and to know the ins and outs of the economy. In other words, the agents go from very naive (adaptive expectations) to very smart. The hypothesis has many advantages, one of which is its usefulness for thinking about policy experiments in a relatively stationary environment. It also has, as most methods do, its disadvantages, one of which is the difficulty of providing convincing dynamics of inflation in response to shocks. Many techniques have been proposed to model an agent that is neither as naive as an adaptive expectations agent, nor smart enough to utilise the full information available and form model-consistent beliefs. These methods aim at deviating from the full information rational expectation assumption, and can be broadly viewed as two groups of literature.

One group pursues the implications of information rigidities, including sticky information (MankiwReis2002; Balletal2005), noisy information (Woodford2001) and rational inattention (SIMS2003665). The main idea is that agents are constrained in obtaining or processing information, and thus use only a portion of the full information to make ‘optimal’ decisions, i.e., they still hold model-consistent beliefs but with less than full information. Similar to this literature, this paper argues that agents are constrained in the amount of information they can collect and process at any given time. What makes it different from this line of research is how the constraints are integrated and, moreover, that agents’ subjective beliefs about the world constantly evolve.

The other group that is closely linked to this work focuses on bounded rationality (Sargent1993) and adaptive learning (EvansHonkapohja1999). Schorfheide2005, OzdenWouter2021 and Airaudo2021 also combine adaptive learning with Markov switching specifications to model learning agents facing policy regime changes. The main idea is that agents are believed to be as smart as econometricians, and thus learn about model parameters by running regressions with past data or applying Bayesian updating. Similarly, the AI learning agent updates its subjective belief based on past experience; hence AI learning is a form of adaptive learning. What it contributes to the existing research is to show, drawing inspiration from psychology and neural science research, how this past experience is gathered and what information is used for updating beliefs.

The methodology adopted is closely related to the fast-moving AI literature, and belongs to a class of algorithms called deep reinforcement learning (DRL) algorithms. The pioneering algorithm is the deep Q-network (DQN) algorithm (mnih-atari-2013), which is capable of human-level performance on many Atari video games using unprocessed pixels as input. However, it can only handle discrete action spaces. Economic decision-making processes often involve continuous action spaces, and thus this paper applies lillicrap2015drl’s algorithm, namely deep deterministic policy gradient (DDPG). The application of DRL algorithms in macroeconomic models represents a new branch of research. In a companion paper, Shi2021learning adopts a DRL algorithm in a stochastic growth model environment to highlight how an AI agent can learn with no information about its environment or its own preferences, and its ability to adapt to transitory and permanent income shocks. Shi2021deep apply a DRL algorithm in a model with different monetary and fiscal policy regimes, and show evidence that a DRL agent can locally learn and converge to neighbouring regions of all equilibria in the model. This paper adopts a similar methodology to both Shi2021learning and Shi2021deep, with the key difference that it accentuates the adaptability of AI agents when faced with a monetary policy regime change. Will this agent notice the shift in the policy target and hence adapt its decision rule accordingly?

In the following sections, I first introduce the economic model adopted in this exercise. This is followed by the methodology section, which details how an AI algorithm is implemented in the economic environment. Simulation experiments and results are then presented.

2 An Economic Model

In this section, I present an economic model with a representative household that follows the rational expectation assumption. In the following section, I illustrate how an AI learning agent is modelled, and what happens when it lives in the economic environment presented here.

2.1 A Representative Household

A representative household determines its consumption level and real money balance holding each period to maximise its expected lifetime utility,

$\max_{\{c_t, m_t, b_t\}} \; E_0 \sum_{t=0}^{\infty} \beta^t u(c_t, m_t) \qquad (2.1)$

subject to the nominal period budget constraint,

$P_t c_t + M_t + B_t = P_t y_t + M_{t-1} + R_{t-1} B_{t-1} + P_t \tau_t \qquad (2.2)$

where $P_t$ is the price level of period $t$, $c_t$ is the consumption level, $M_t$ is the nominal money balance, $B_{t-1}$ is the stock of nominal bonds that the household enters period $t$ with, which pay a gross nominal interest rate $R_{t-1}$, $y_t$ is the endowment or income of the agent, and $\tau_t$ is the government transfer at $t$. In real terms, Equation 2.2 is,

$c_t + m_t + b_t = y_t + \dfrac{m_{t-1}}{\pi_t} + \dfrac{R_{t-1} b_{t-1}}{\pi_t} + \tau_t \qquad (2.3)$

where $m_t = M_t / P_t$ refers to real money balances, and $b_t = B_t / P_t$ is real bond holding. Inflation is defined as $\pi_t = P_t / P_{t-1}$.

A Lagrangian for this household is:

$\mathcal{L} = E_0 \sum_{t=0}^{\infty} \beta^t \left\{ u(c_t, m_t) + \lambda_t \left[ y_t + \dfrac{m_{t-1} + R_{t-1} b_{t-1}}{\pi_t} + \tau_t - c_t - m_t - b_t \right] \right\} \qquad (2.4)$

The first-order conditions with respect to consumption, real money holding, and real bond holding are as follows. A utility function with a superscript refers to the derivative of the utility function with respect to the superscript variable; for example, $u^{c}(c_t, m_t)$ represents the derivative of the utility function with respect to consumption, $c_t$.

$u^{c}(c_t, m_t) = \lambda_t \qquad (2.5)$
$u^{m}(c_t, m_t) = \lambda_t - \beta E_t \left[ \dfrac{\lambda_{t+1}}{\pi_{t+1}} \right] \qquad (2.6)$
$\lambda_t = \beta E_t \left[ \dfrac{R_t \lambda_{t+1}}{\pi_{t+1}} \right] \qquad (2.7)$

Equations 2.5 and 2.7 give the consumption Euler equation,

$u^{c}(c_t, m_t) = \beta E_t \left[ u^{c}(c_{t+1}, m_{t+1}) \dfrac{R_t}{\pi_{t+1}} \right] \qquad (2.8)$

It states that utility lost from consumption today equals utility from consuming tomorrow adjusted for the (real) gain from keeping bonds.

Equations 2.5, 2.6, and 2.8 give the money demand equation of the agent, which equates the marginal rate of substitution between real money and consumption to their relative price,

$\dfrac{u^{m}(c_t, m_t)}{u^{c}(c_t, m_t)} = \dfrac{R_t - 1}{R_t} \qquad (2.9)$

2.2 Government: fiscal and monetary policy

The monetary authority follows an interest rate rule, specified as Equation 2.10,

$R_t = \dfrac{\pi^{*}}{\beta} \left( \dfrac{\pi_t}{\pi^{*}} \right)^{\phi} \qquad (2.10)$

where $\pi^{*}$ denotes the inflation target and $\phi$ is a parameter that governs how responsive the monetary authority is to a deviation from the inflation target. When $\phi < 1$, monetary policy is passive, whereas when $\phi > 1$, it is active. (The analyses presented in this paper are based on the specification of a passive monetary policy. However, the results equally hold for the case of an active monetary policy, which is available upon request.)

Bonds are in zero net supply. The government incurs no consumption, and its budget is balanced every period, with transfers financed by seigniorage, i.e.,

$\tau_t = m_t - \dfrac{m_{t-1}}{\pi_t} \qquad (2.11)$

3 An AI Learning Model

In this section, I introduce the algorithm adopted and how to apply it to the economic environment specified in the previous section. For a comprehensive review of reinforcement learning, please see SB2018.

3.1 AI Learning Framework: actor-critic model

The deep reinforcement learning algorithm adopted here was first introduced by lillicrap2015drl, namely deep deterministic policy gradient (DDPG). Its core follows the actor-critic model of reinforcement learning, and it uses the formal framework of a Markov decision process to define the interaction between a learning agent and its environment in terms of states, actions, and rewards (Figure 1).

Figure 1: The agent-environment interaction in a reinforcement learning setting (source: SB2018)

State $s_t$ is a random variable from a bounded and compact state space $S$, i.e., $s_t \in S$. (Recent research on reinforcement learning also investigates settings with unbounded state spaces, e.g., shah2020stable.) Taking an action $a_t$, which belongs to an action space $A$, $a_t \in A$, is how the agent interacts with the environment. The state evolves through time following a probability function, $p$, which is defined as,

$p(s_{t+1} \mid s_t, a_t) = \Pr\{ S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a_t \} \qquad (3.12)$

It gives the probability of the state taking a particular value at time $t+1$, given the preceding values of the state, $s_t$, and action, $a_t$.

Reward is a random variable and can be generated from a reward function, $r(s_t, a_t)$.

Return from a state is defined as the sum of discounted future rewards,

$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \qquad (3.13)$

where $\gamma \in (0, 1)$ is the discount factor.

In a standard setup of reinforcement learning, an agent’s behaviour is described by a policy (also known as an actor) that maps states to probabilities of selecting each possible action. In the actor-critic framework adopted here, a deterministic policy is used, that is, the policy is a function $\mu: S \rightarrow A$ which maps a state from the state space to an action from the action space.

A value function (known as the critic) gives the ‘expected’ return of taking an action in a state and thereafter following policy $\mu$. (In the reinforcement learning literature, two types of value functions are defined. What is defined here is normally referred to as an action value function. To not complicate the matter, ‘value function’ in this paper means the action value function of the reinforcement learning literature.) Expectation here is a subjective belief that depends on past experience. The value function is defined as,

$Q^{\mu}(s_t, a_t) = E^{\mu} \left[ G_t \mid s_t, a_t \right] \qquad (3.14)$

where the superscript $\mu$ means the action value function follows policy $\mu$, and $E^{\mu}$ reflects that it is a subjective belief that depends on a policy formed by past experience. (This form of notation, e.g., $Q^{\mu}$, largely follows the reinforcement learning handbook by SB2018.)

Many approaches in reinforcement learning make use of the recursive relationship known as the Bellman equation,

$Q^{\mu}(s_t, a_t) = E \left[ r_t + \gamma \, Q^{\mu}\big(s_{t+1}, \mu(s_{t+1})\big) \right] \qquad (3.15)$

where $r_t = r(s_t, a_t)$.

Reinforcement learning methods accentuate how the agent’s policy and value functions change as a result of its experience. The DDPG algorithm uses two neural networks to approximate the policy and value functions respectively: the actor network is denoted as $\mu(s \mid \theta^{\mu})$, where $\theta^{\mu}$ represents the parameters of the neural network; the critic network is denoted as $Q(s, a \mid \theta^{Q})$, and $\theta^{Q}$ is its parameters. $\theta^{\mu}$ and $\theta^{Q}$ are updated during learning, and can be viewed as the coefficients of the two functions and the probabilities involved in forming subjective expectations. The two neural networks are updated with respect to each other. In the following passages, I highlight key elements of how the actor and critic networks are updated; the full algorithm is presented in Section 3.3.
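To fix ideas, the following is a minimal sketch of the two function approximators, assuming a PyTorch implementation; the layer sizes and activation functions are illustrative choices, not an architecture reported in the paper.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu): maps a state to an action."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class Critic(nn.Module):
    """Action value function Q(s, a | theta_Q): maps a state-action pair to a value."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Concatenate state and action before passing them through the network.
        return self.net(torch.cat([state, action], dim=-1))
```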

The goal of this learning agent is to continuously update its subjective belief about the world based on experience, and to form a decision-making strategy (approximated by the actor network) that produces the highest discounted future return (approximated by the critic network). The actor network is updated with the goal of maximising the corresponding critic network. In other words, the actor network is updated based on what the agent believes, at that time, to be a strategy that produces high ‘expected’ returns. ‘At that time’ means that the critic network evolves: what the agent follows as the critic network at period $t$ is most likely different from what it is at $t+1$. Expectation here is the learning agent’s subjective belief formed from past experience.

The critic network, in a nutshell, is updated with the goal of minimising what is called a TD error (a temporal difference error). (The full algorithm is in the next section.) The TD error takes the form,

$\delta_t = y_t - Q(s_t, a_t \mid \theta^{Q}) \qquad (3.16)$

where $y_t$ is called the TD target, and it is the sum of the reward from a state-action pair and the discounted value of the next state and action, i.e.,

$y_t = r(s_t, a_t) + \gamma \, Q(s_{t+1}, a_{t+1} \mid \theta^{Q}) \qquad (3.17)$

and the next period action is assumed to follow the actor network at that time (it need not be the same as the true policy),

$a_{t+1} = \mu(s_{t+1} \mid \theta^{\mu}) \qquad (3.18)$

The TD target describes, if the agent followed the subjective belief formed at that time, what the best outcome would be given the chosen state-action pair.
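As a toy illustration of Equations 3.16 to 3.18, the sketch below computes the TD target and TD error; `actor` and `critic` are hypothetical stand-ins for the current networks, and the default discount factor mirrors the 0.8 in Table 3 (an assumption, since the paper does not report the RL discount separately).

```python
# Toy TD target and TD error, following Equations 3.16-3.18.
def td_target(reward, next_state, actor, critic, gamma=0.8):
    next_action = actor(next_state)                           # Equation 3.18
    return reward + gamma * critic(next_state, next_action)   # Equation 3.17


def td_error(target, state, action, critic):
    return target - critic(state, action)                     # Equation 3.16
```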

The learning agent’s value function is updated to minimise the TD error. Neural science research reveals that the dopamine neuron firing rate in the brain resembles the TD error sequence during learning (Botvinick2019). This has also inspired further research in neural science modelling decision-making in connection with reinforcement learning algorithms.

As highlighted in Section 1, exploration plays a crucial role in ensuring that the learning agent collects a wide range of information. To ensure that the agent explores its environment and tries out new actions, an exploratory policy is adopted, which takes the following form,

$a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t \qquad (3.19)$

This shows that the final action the agent takes depends on the actor network output $\mu(s_t \mid \theta^{\mu})$ and a random variable $\mathcal{N}_t$ sampled from a noise process $\mathcal{N}$. Following lillicrap2015drl, $\mathcal{N}_t$ is sampled from a discretised Ornstein-Uhlenbeck (OU) process. (There is a strand of the computer science literature focused solely on different exploration strategies that achieve the best performance for a given task. It is beyond the scope of this exercise and not discussed in detail.) This exploratory policy produces a random action. The randomness decreases over time (by design) but never disappears in this paper. The implication is that in a stationary environment, the policy network moves closer to the true underlying policy as it learns, but is never identical to it. In a non-stationary environment, however, it allows the policy network to adjust and remain flexible in the face of changes in the environment. In economic terms, this means that the policy will converge to a neighbourhood of the rational expectation solution (if one exists), but will not be identical to it. In an economic model that is subject to structural breaks or regime changes, this exploratory policy allows the learning agent to adjust its expectations and adapt its policy to a new regime.
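A minimal sketch of a discretised OU noise process for the exploratory policy in Equation 3.19 follows; `theta` and `dt` are illustrative values, while `sigma = 0.2` matches the baseline exploration level in Table 3.

```python
import numpy as np


class OUNoise:
    """Discretised Ornstein-Uhlenbeck process for exploration noise N_t."""

    def __init__(self, size=1, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.size = size
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.x = np.full(self.size, self.mu)

    def sample(self):
        # Euler-Maruyama discretisation: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*dW
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.size)
        self.x = self.x + dx
        return self.x


# Exploratory action, Equation 3.19: a_t = mu(s_t | theta_mu) + N_t
# action = actor(state) + noise.sample()
```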

As a solution method, reinforcement learning algorithms are connected to the dynamic programming that is widely employed in macroeconomics. Most reinforcement learning algorithms attempt to achieve similar results to dynamic programming, but with less computation and without assuming a perfect model of the environment (SB2018). This paper focuses on the learning process rather than adopting the framework as a solution method, and it allows the agent to never stop exploring (as we do in real life).

On the dimension of learning from past data, reinforcement learning algorithms are also similar and connected to the adaptive learning methods. Both methods update some desired parameters with past data. Reinforcement learning algorithms could provide a plausible and flexible expectation formation model that is connected to psychology and neural science. Moreover, it is computationally flexible given the use of neural networks.

3.2 Connecting to the Economic Model

To implement this expectation formation framework in an economic setting, I first translate the economic model into the aforementioned components within a Markov decision process, which are presented in Table 1.

Assume a logarithmic utility function of the form $u(c_t, m_t) = \ln c_t + \chi \ln m_t$, where $\chi$ is a preference parameter. For simplicity, assume a constant endowment, $y_t = y$ for all $t$, and zero bond holdings in equilibrium.

Terminology | Description | Representation in the economic environment
State, $s_t$ | A random variable from the state space $S$ | $s_t = (\pi_{t-1}, \hat{\pi}_t, m_{t-1})$: past inflation, the inflation belief, and real money holdings
Action, $a_t$ | A random variable from the action space $A$ | The inflation belief, $a_t = \hat{\pi}_{t+1}$
Reward, $r_t$ | A function of state and action | A function of the forecast error, $\pi_t - \hat{\pi}_t$
Policy function, $\mu$ | A mapping from state to action, $\mu: S \rightarrow A$ | Approximated by a neural network, i.e., the actor network, parameterised by $\theta^{\mu}$, updated during learning
Value function, $Q$ | The ‘expected’ (subjective belief) return of taking an action in a state | Approximated by a neural network, i.e., the critic network, parameterised by $\theta^{Q}$, updated during learning
Table 1: RL components and the economic environment

As Table 1 shows, the state of this economy contains past period inflation, the inflation belief, and real money holdings. The action of this AI agent is to form an inflation belief, denoted by $\hat{\pi}_{t+1} = E^{s}_t[\pi_{t+1}]$, where $E^{s}$ is the AI agent’s subjective expectation based on past experience. The reward of this agent is negatively related to its forecast errors. How this reward is generated is unknown to the AI agent. Policy and value functions are both approximated by neural networks. As the AI agent does not have any information on the environment or its own preferences, it must gather information by taking an action each period and observing the corresponding reward. The agent also does not know how the state transitions after it takes an action each period. These transitional dynamics involve Equations 2.8, 2.10, and 2.9. Given the agent’s inflation belief, Equation 2.8, $R_t = \hat{\pi}_{t+1} / \beta$, gives the interest rate that is consistent with the agent’s optimal intertemporal allocation. Given this interest rate, Equation 2.10, $\pi_t = \pi^{*} \left( \beta R_t / \pi^{*} \right)^{1/\phi}$, provides the actual inflation that leads to the central bank’s nominal interest rate decision. The job of this agent is to learn about its preferences and the aforementioned state transition dynamics as well as possible, so as to come up with a decision-making strategy, i.e., a policy function, that maximises the value function (according to the agent’s subjective belief) in the states of relevance.
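A sketch of one step of these transition dynamics is given below, under the reconstructed log-utility model; the parameter values follow Table 3, the money demand scale ($\chi y$) is normalised to 1 to match the steady state values in Table 2, and the quadratic form of the reward is an assumed functional form rather than one reported in the paper.

```python
import numpy as np

BETA, PHI, CHI_Y = 0.8, 0.5, 1.0  # Table 3 values; CHI_Y = 1 is an assumption


def env_step(pi_prev, pi_belief_prev, m_prev, pi_belief, pi_target):
    """Map the current state and the agent's new inflation belief to the next state."""
    R = pi_belief / BETA                                    # Euler-consistent nominal rate (Eq. 2.8)
    m = CHI_Y * R / (R - 1.0)                               # money demand (Eq. 2.9)
    pi = pi_target * (BETA * R / pi_target) ** (1.0 / PHI)  # inflation consistent with the rule (Eq. 2.10)
    reward = -(pi - pi_belief_prev) ** 2                    # decreasing in the forecast error (assumed form/timing)
    next_state = np.array([pi, pi_belief, m])               # (past inflation, inflation belief, real money)
    return next_state, reward
```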

3.3 Full Algorithm and Sequence of Events

The full algorithm consists of three main steps:

Step I: Initialisation

  • Set up two neural networks: an actor network that takes a state as its argument and outputs an action, and a critic network that takes a state-action pair as its argument and outputs a value.

  • $\theta^{\mu}$ and $\theta^{Q}$ represent the parameters of the two networks respectively. Both are initialised randomly, and both are updated during the learning process so that the networks move towards the true policy and value functions.

  • Define a replay buffer $D$ (called transitions in the DRL literature), which is a memory that stores the information collected by the DRL agent during the agent-environment interaction. A transition is characterised by the tuple $(s_t, a_t, r_t, s_{t+1})$; a minimal sketch of such a buffer follows this list.

  • Define a length $N$, which is the size of a mini-batch. A mini-batch refers to a sample drawn from the memory.

  • Define the total number of episodes and the number of simulation periods per episode. Each episode contains a fixed number of simulation periods; the more episodes, the longer the agent learns. (In the DRL literature, an AI agent is usually set to learn a particular task or an Atari game. An episode thus represents re-starting the game or task, and it ends with a terminal state, i.e., the end result of a game. In an economic environment, however, a clear terminal state can be difficult to specify. Therefore, the concept of an episode here only corresponds to how long an agent has been learning.)
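The replay buffer can be sketched as follows; the capacity is an illustrative choice, not a value reported in the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Memory D that stores transitions (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random mini-batch of transitions."""
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```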

For each episode, loop over step II and III.

Step II: The AI agent starts to interact with the environment.

  • The agent observes the current state $s_t$: the real money holding of the previous period, $m_{t-1}$, previous period realised inflation, $\pi_{t-1}$, and its last period inflation belief, $\hat{\pi}_t$. The agent then forms an inflation belief according to its actor network, i.e., $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$, which consists of the current policy and exploration noise $\mathcal{N}_t$.

  • Execute action $a_t$ and observe a reward $r_t$. The state transitions to the next state, $s_{t+1}$. The state transition dynamics are as follows.

    • Given the agent’s inflation expectation, an Euler equation (Equation 2.8) consistent interest rate can be obtained, $R_t = \hat{\pi}_{t+1} / \beta$.

    • Real money holdings follow the money demand equation (Equation 2.9), $m_t = \chi y \, R_t / (R_t - 1)$.

    • Lastly, given this interest rate, to be consistent with the central bank’s interest rate rule (Equation 2.10), realised inflation is $\pi_t = \pi^{*} \left( \beta R_t / \pi^{*} \right)^{1/\phi}$.

    These transitional dynamics illustrate how the private agent’s expectations and macroeconomic variables (e.g., inflation and the nominal interest rate) are linked.

  • Store the transition $(s_t, a_t, r_t, s_{t+1})$ in the memory $D$, as in the interaction sketch below.
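Putting Step II together, one interaction could look like the following sketch, which assumes the hypothetical `Actor`, `OUNoise`, `env_step` and `ReplayBuffer` pieces sketched earlier.

```python
import torch


def interact_once(state, actor, noise, buffer, pi_target):
    """One agent-environment interaction: act, observe, and store the transition."""
    state_t = torch.as_tensor(state, dtype=torch.float32)
    with torch.no_grad():
        # Exploratory action (Equation 3.19): actor output plus OU noise.
        pi_belief = actor(state_t).item() + noise.sample()[0]
    pi_prev, pi_belief_prev, m_prev = state
    next_state, reward = env_step(pi_prev, pi_belief_prev, m_prev, pi_belief, pi_target)
    buffer.store(state, pi_belief, reward, next_state)  # store the transition in D
    return next_state
```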

Step III: Training the AI agent (once the AI agent starts to learn) at period $t$.

  • Sample a random mini-batch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from the memory $D$.

  • Calculate a value $y_i$ for each transition following

    $y_i = r_i + \gamma \, Q\big(s_{i+1}, \mu(s_{i+1} \mid \theta^{\mu}) \mid \theta^{Q}\big) \qquad (3.20)$

    for all $i = 1, \dots, N$, where $Q\big(s_{i+1}, \mu(s_{i+1} \mid \theta^{\mu}) \mid \theta^{Q}\big)$ is a prediction made by the critic network with the state-action pair $\big(s_{i+1}, \mu(s_{i+1} \mid \theta^{\mu})\big)$, and $\mu(s_{i+1} \mid \theta^{\mu})$ is a prediction made by the actor network with input $s_{i+1}$.

  • Obtain $Q(s_i, a_i \mid \theta^{Q})$ from the critic network with the input state-action pair $(s_i, a_i)$.

  • Calculate the average loss for this sample of transitions,

    $L = \dfrac{1}{N} \sum_{i} \big( y_i - Q(s_i, a_i \mid \theta^{Q}) \big)^{2} \qquad (3.21)$
  • Update the critic network parameters $\theta^{Q}$ with the objective of minimising the loss function $L$. (This involves applying back propagation and gradient descent procedures.)

  • For the policy function, i.e., the actor network, the objective is to maximise its corresponding value function, that is, a value function that follows this particular policy. In other words, the input action of the function $Q$ comes from the policy $\mu$, $a_i = \mu(s_i \mid \theta^{\mu})$. Define the objective function as,

    $J = \dfrac{1}{N} \sum_{i} Q\big(s_i, \mu(s_i \mid \theta^{\mu}) \mid \theta^{Q}\big) \qquad (3.22)$
  • This objective can equivalently be stated as minimising $-J$. Update the actor network parameters $\theta^{\mu}$ with the objective of minimising $-J$; a sketch of this update follows. (Similar to the critic network, the specific steps of updating the ANN’s parameters by minimising an objective function involve back propagation and gradient descent.)
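A sketch of one training update (Step III), following Equations 3.20 to 3.22, is given below; the optimisers and batch size are illustrative, and no target networks are used here, matching the algorithm as described in the text (the original DDPG of lillicrap2015drl additionally uses slowly updated target networks).

```python
import numpy as np
import torch
import torch.nn.functional as F


def train_step(actor, critic, actor_opt, critic_opt, buffer, batch_size=64, gamma=0.8):
    """One critic and one actor update on a random mini-batch from the memory D."""
    batch = buffer.sample(batch_size)
    s, a, r, s_next = map(
        lambda x: torch.as_tensor(np.array(x), dtype=torch.float32), zip(*batch)
    )
    a = a.view(-1, 1)
    r = r.view(-1, 1)

    # Critic update: minimise the mean squared TD error (Equations 3.20-3.21).
    with torch.no_grad():
        y = r + gamma * critic(s_next, actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximise J, i.e. minimise -J (Equation 3.22).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```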

4 Experiments

Motivated by the early literature on the accelerationist controversy, I set up a price stationary environment, and then shift it to an inflation stationary one. This is to investigate how the AI agent reacts to the monetary authority’s decision to increase inflation (perhaps with the goal of exploiting a long-run trade-off between inflation and a real variable). More specifically, the AI agent in this exercise first lives in an environment where the (gross) inflation target is 1.0, i.e., it learns to form inflation expectations in this price stationary environment. The inflation target is then shifted to 1.1 (i.e., an inflation stationary environment). The agent does not know this change is occurring. The aim is to observe its behaviour in response to this unforeseen target change, and how the economy transitions. I do not dive into the reason for this change or the probability of its occurrence. The main focus is whether the AI agent can adapt its inflation belief to the change in the monetary policy regime. The steady state values under the two targets are summarised in Table 2.

Steady State Values | Target I | Target II
Inflation Target | 1.0 | 1.1
Inflation | 1.0 | 1.1
Interest Rate | 1.25 | 1.375
Real Money Holdings | 5 | 3.67
Table 2: Steady State Values under Two Policy Targets

Table 2 shows that in the first regime, with an inflation target of 1.0, the steady state interest rate is 1.25 and real money holdings are 5. (As the purpose here is to show whether and how the economy converges from one steady state to the other, these values are not designed to match reality.) In the inflation stationary environment, the nominal interest rate increases to 1.375, and real money holdings fall to 3.67. These steady state values are used as a benchmark to observe which steady state the economy populated by an AI agent (gradually) converges to.
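As a consistency check under the reconstructed model, the steady state values in Table 2 follow from $R = \pi^{*}/\beta$ and $m = \chi y \, R/(R-1)$ with $\beta = 0.8$ and the money demand scale $\chi y$ normalised to 1 (the normalisation is an assumption, not a value reported in the paper):

```latex
\begin{align*}
  \text{Target I } (\pi^{*} = 1.0):&\quad R = \frac{\pi^{*}}{\beta} = \frac{1.0}{0.8} = 1.25,
  \quad m = \frac{R}{R - 1} = \frac{1.25}{0.25} = 5, \\
  \text{Target II } (\pi^{*} = 1.1):&\quad R = \frac{1.1}{0.8} = 1.375,
  \quad m = \frac{1.375}{0.375} \approx 3.67.
\end{align*}
```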

The main parameters are presented in Table 3. In this setup, a passive monetary policy is adopted with $\phi = 0.5$. The case of an active monetary policy is also considered, and it does not affect the main findings of this paper. (Results generated from an active monetary policy are available upon request.) The discount factor is set to $\beta = 0.8$ for computational simplicity. (A higher, more realistic value was also considered, and it does not change the main findings of this paper.)

The exploration level of the baseline agent is 0.2; this means that the standard deviation of the noise added to the policy function (i.e., Equation 3.19) is 0.2.

Parameter | Baseline Agent
Policy rule parameter, $\phi$ | 0.5
Discount factor, $\beta$ | 0.8
Exploration level | 0.2
Table 3: Main Parameters
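The baseline calibration of Table 3 and the policy experiment of this section can be collected in a small configuration sketch; the switch period shown here is illustrative and the `inflation_target` helper is hypothetical, introduced only to make the regime change explicit.

```python
# Baseline calibration (Table 3) and the two policy regimes (Section 4).
CONFIG = {
    "phi": 0.5,                 # policy rule parameter (passive monetary policy)
    "beta": 0.8,                # discount factor
    "exploration_sigma": 0.2,   # std. dev. of the OU noise in Equation 3.19
    "pi_target_old": 1.0,       # Target I: price stationary regime
    "pi_target_new": 1.1,       # Target II: inflation stationary regime
}


def inflation_target(period, switch_period=0):
    """The target shifts from 1.0 to 1.1 at the (illustrative) switch period."""
    if period < switch_period:
        return CONFIG["pi_target_old"]
    return CONFIG["pi_target_new"]
```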

5 Results

This section highlights two main findings. First, an AI agent that is not expecting a change in the policy target has the ability to adapt to this change, and this is reflected in its inflation forecasts. Following its adaptive behaviour in forming inflation expectations, and through the state transition dynamics (described by Equations 2.8, 2.9, and 2.10), the corresponding inflation, nominal interest rate, and real money holdings move from one steady state to the other. This result depends crucially on the agent's ability to explore and continuously learn. Second, how well the economy converges to the new steady state, quantified by the distance between the actual steady state values and the simulated inflation and nominal interest rate, depends on the past experience of the AI agent. With more experience in an environment with a changing target, holding everything else constant, the agent makes inflation forecasts that track realised inflation more closely, and hence the economy converges more closely to the new steady state. The amount of past experience an AI agent has, in this setting, depends on how long it has been living in the environment.

This section presents simulation results during a policy regime change. Results showing how an AI agent learns from no information when the inflation target is 1.0 are available in Appendix A.

5.1 With and Without Exploration

Figures 2 and 3 plot simulation paths of an agent that cannot explore and learn when there is a shift in the policy target. During the transition following the policy target change, Figure 2 plots the simulation paths of the agent’s inflation belief and realised inflation when it does not explore its environment. The x-axis denotes simulation periods. The vertical black dashed line shows the timing of the target change. At period 0 in the figure, the inflation target changes from 1.0 (i.e., price stationary) to 1.1 (i.e., inflation stationary). However, given that this agent does not explore its environment and learn from it, its inflation belief barely changes in response to the new target. The corresponding nominal interest rate and real money holdings, as plotted in Figure 3, are both off their new steady state values of 1.375 and 3.67, respectively. Given the monetary policy rule (Equation 2.10), an almost constant nominal interest rate combined with an increased inflation target corresponds to a decrease in realised inflation, as shown in Figure 2.


Figure 2: Inflation and Expected Inflation, without exploration


Figure 3: Real Money Holdings and Nominal Interest Rate, without exploration

In contrast, Figures 4 to 6 show simulated behaviour when the AI agent has the ability to explore and learn from the experience gained through exploration. Its transitional behaviour differs significantly from the case with no ability to explore. Figure 4 plots the inflation belief and realised inflation over the simulation periods when the inflation target changes. As shown in the figure, at period 0 when the inflation target changes, the agent responds with a gradual increase in its inflation expectation. A delay in the change of the inflation belief can be observed right after period 0. This can be explained by how an agent learns in this algorithm: the agent only changes its behaviour or decision-making strategy once it has gained relevant experience. In this circumstance, once the inflation target changes, the only way for the agent to become aware of the change is by making an inflation forecast in this environment and observing how the reward and the next state differ from its past experience when making the same forecast. This delayed response is also consistent with the temporary dip of actual inflation in Figure 4, which gradually increases to the new steady state value of 1.1 at around period 8. As the agent changes its inflation expectation, the corresponding real money holdings, as plotted in Figure 5, also converge to a value that is close to the new steady state of 3.67. Figure 6 shows that the nominal interest rate, in response to realised inflation, moves close to the new steady state value of 1.375.


Figure 4: Inflation and Expected Inflation, with exploration


Figure 5: Real Money Holdings, with exploration


Figure 6: Nominal Interest Rate, with exploration

This result attests that, with exploration and constant learning, the AI agent adjusts its inflation belief with respect to the new regime (with a delay). Given the general equilibrium setup of the economic environment, its behaviour moves the economy from the neighbourhood of the price stationary rational expectation equilibrium to that of the inflation stationary one. One natural question is what determines how well the economy converges to the new steady state. If this agent were to experience and learn in this environment for a longer period, would the economy converge better to the new rational expectation equilibrium? This is illustrated in the following section.

5.2 More or Less Experience

In this subsection, I present results showing that the more an AI agent experiences and learns in a given environment, the better it is at making a decision that maximises its reward, which corresponds to the economy converging closer to the new rational expectation equilibrium.


Figure 7: Inflation expectation during an inflation target change


Figure 8: Inflation during an inflation target change

Figure 7 plots the simulated paths of inflation forecasts for four agents with different amounts of experience. All agents learnt separately. Their learning environments (i.e., the economic model and initial conditions) are identical. The x-axis plots simulation periods. The vertical black dashed line shows the timing of the inflation target change. The blue line labelled ep20 (i.e., episode 20) represents the agent who has been learning for the longest, whereas the red line (episode 5) represents the agent who has been learning for the shortest amount of time. It can be observed that all four agents, facing the inflation target change, shift their inflation beliefs (with a delay) to the neighbouring region of the new steady state under target II. The difference is that the blue-line agent, the one with the most experience in the changing environment, shifts its inflation expectation closest to the target of 1.1. The agent who has been learning for the shortest amount of time settles at a level close to 1.08, the furthest from the 1.1 target among the four agents. The corresponding inflation paths are plotted in Figure 8. It shows that the inflation path of the blue-line agent converges to the new target best among the four. This shows that the longer an agent interacts with an environment and the more experience it gains, the better it learns to make decisions that maximise its long-term rewards. Under the current setup, this translates into better convergence to the new rational expectation equilibrium.

This result agrees with the argument made by MalmendierNagel2016. They show that individuals of different ages disagree significantly in their inflation expectations, and that this can be explained by differences in their lifetime experiences of inflation. In the simulation experiments here, as an AI agent gains more experience in an environment with a change in the monetary policy target, it learns and forms a better policy, and makes inflation forecasts that are closer to realised inflation. On the contrary, if an agent has only been learning in this changing environment for a short amount of time (e.g., the red-line agent in Figures 7 and 8), it adapts its policy based on this limited experience and its inflation forecasts are thus less accurate.

6 Discussions

In addition to providing plausible transitional dynamics for an economy facing an inflation target change, this exercise offers an explanation of how expectations are formed, and how an agent adapts to and becomes aware of changes in its environment. An AI agent under the DDPG algorithm becomes aware of policy changes through interacting with the economic environment it lives in. This means making a decision given the current state, and observing the next state and the reward signal corresponding to its decision. When it observes that its decision-making strategy no longer generates as high a reward as it used to, the agent starts to change its policy so that it can obtain a higher reward in the long run; this is how the agent adapts its behaviour to a monetary policy target change. This offers a way to explain what Cavalloetal2017 observe in their experiments. They provide evidence that private agents are more likely to adjust their inflation expectations in response to supermarket price changes than to actual inflation statistics. Price changes in supermarkets are likely to have a more direct impact on consumers’ welfare than observed inflation statistics or a central bank announcement of a change in the inflation target.

One main criticism of DRL algorithms is the speed of learning. In this paper, it takes a significant number of simulation periods for the agent to learn the solution of the model. This can be accelerated by modifying several training parameters. However, it still takes a lengthy stretch of simulation periods (possibly a lifetime) before the agent converges to a steady state solution. (As explained by Botvinick2019, the slow learning is mainly driven by the incremental parameter adjustment and weak inductive bias within the algorithm.) As this is a fast-evolving literature, however, many new DRL algorithms have been proposed to mitigate this and speed up the learning process; for example, inspired by Gershman2017, DRL algorithms with episodic memory are being developed. This criticism matters more if the goal of applying the algorithm is for a learning agent to converge to a rational expectation solution. This touches on the debate about whether it is reasonable or sensible to assume that an economy operates at a steady state; the economy might also be on a learning path that is away from an equilibrium.

7 Summary

In this exercise, I present a plausible expectation formation model, and show how a learning agent adjusts its subjective belief in response to a change in the monetary policy inflation target. The agent’s expectation formation process is modelled with an innovative AI algorithm. More specifically, the agent is born into an unknown environment, and learns through interacting with that environment. This involves taking exploratory actions and observing the corresponding stimulus signals. The experience is then processed by artificial neural networks with the goal of forming a decision-making strategy that maximises the agent’s expected (subjectively believed) future return. This subjective belief also evolves based on the agent’s past experience.

With this algorithm, I highlight that the AI agent notices and adapts to a monetary policy regime shift. In a money-in-utility model with an interest rate rule, I observe this AI agent’s behaviour once the inflation target of the monetary authority changes from 1.0 (i.e., price stationary) to 1.1 (i.e., inflation stationary). The AI agent living in this world recognises this monetary policy target change, and adapts its decision-making strategy so as to achieve high long-term rewards in the new environment. I argue that this result depends crucially on the agent’s ability to explore its environment. More specifically, with the exploration property, the AI agent can collect a wide range of information, which may include information that helps it recognise a change in its environment. This result becomes apparent when comparing an exploring agent with a non-exploring one. Without exploration, the agent does not collect new information (i.e., does not gain new experience) and still behaves according to the policy formed in the old regime. With exploration, however, the agent collects and processes new information, which feeds into its subjective belief updates. This allows the agent to adjust its policy and move towards the new equilibrium. I also present simulation results showing that when an AI agent has more experience, it learns to make inflation forecasts that contribute to a better convergence of the economy to the new regime’s steady state.

This exercise offers a plausible view of how expectations are formed, and of how an AI agent adapts to and becomes aware of changes in its environment. An AI agent becomes aware of any environment change (including an inflation target change) through interacting with the economic environment it lives in. This means that given a state, it makes a decision, and observes the next state and the stimulus signal corresponding to its decision. When it notices that its decision-making strategy no longer generates as high a reward as it used to, the agent starts to change its policy so that it can obtain a higher reward in the long run; this is how the agent adapts its behaviour to a monetary policy target change. One implication is that this could undermine the effectiveness of forward guidance. If what is communicated to the public is not reflected in issues that directly impact their welfare (i.e., not reflected in their past experience), private agents may delay the actions desired by policy makers.

References

Appendix A Appendix: Additional Results

This section presents simulation results illustrating how an AI agent learns from no information on the underlying economic environment and its own preference under monetary policy target I (Table 2).


Figure 9: Loss of the policy neural network during training

Figure 9 plots the loss incurred by the policy network during learning. It shows that the loss decreases as the number of updates increases. (Each update refers to passing a training dataset to the ANN once. Here, not strictly following that definition, an increasing number of updates can be understood as a longer learning period.) As the agent learns, its policy network (i.e., decision-making strategy) incurs less loss and makes fewer mistakes.


Figure 10: Inflation belief

Figure 10 plots simulated paths of the inflation belief (dashed lines) and realised inflation (solid lines) for an agent at the beginning of a learning period and after it has been learning for an episode. At the beginning of a learning period (denoted by the green lines), with exploration and randomly initialised ANNs, the AI agent does not know how to make good decisions: it lacks knowledge of the economic structure and its own preferences, and does not know how to form an optimal inflation belief. Its inflation belief and the corresponding realised inflation are thus very volatile. After the agent has been learning for an episode (denoted by the blue lines), when placed in the same simulation environment, it forms an inflation belief that is more stable and close to the inflation target, and the realised inflation in this economy is likewise stable and close to the inflation target of 1.