1 Introduction
Recent advances in deep neural networks have enabled Reinforcement Learning (RL) to solve complex problems. Model-free RL algorithms Mnih et al. (2015); Oh et al. (2016); Gu et al. (2016b); Schulman et al. (2015a); Gu et al. (2016a); Lillicrap et al. (2015); Mnih et al. (2016); Schulman et al. (2015b) have succeeded in finding optimal controls when it is difficult to precisely characterize all key elements of the targeted problems. However, the use of RL in practical robotic applications is limited by its grey-box nature. Analytical understanding and interpretation of the learnt control networks (policies) remain unsatisfactory, particularly for those acting in continuous state, observation, and control spaces. Compared with classically analyzed controllers and controlled systems, the learnt control networks lack explainable control mechanisms, analytically proved stability, and guaranteed asymptotic performance. In addition, there is no satisfactory representation of the network's internal states for fault detection and human intervention or, more importantly, for knowledge distillation and transfer Barreto et al. (2017).
Much effort has been devoted to explainable neural networks Gunning (2017); Samek et al. (2017). Recently, finite-state representations of the learnt control networks for Atari games have been studied Koul et al. (2018), where each game can be viewed as a Partially Observable Markovian Decision Process (POMDP). It was found that the control policy for each game (a single POMDP) can be equivalently described by a Moore Machine Network (MMN) with states fewer than , where the mapping from states to actions is described by neural networks.
However, a control strategy should be able to adapt online to various environments (described by a set of multiple POMDPs). This philosophy has its roots in the existing literature on control design, such as robust control Skogestad and Postlethwaite (2007), adaptive control Åström and Wittenmark (2013); Lu and Liu (2017, 2018), sliding mode control Edwards and Spurgeon (1998), and H-infinity control Doyle et al. (1989). Understanding the interplay between the environments and the controlled systems is essential for theoretical analysis and further improvement in control design. For example, the integration of tracking errors in Proportional-Integral-Derivative (PID) controllers and the adaptation of the compensation terms in adaptive controllers are designed to estimate online some unmodelled drift, which varies across environments. However, the manual design of these estimations has limited the applicability of classical controllers to complex problems, where model-free RL has shown its success.
Therefore, this paper focuses on understanding a learnt recurrent control network that is able to solve a set of multiple POMDPs, where the transition function of each POMDP is determined by an environment. The recurrent control network has to be aware of the environment for effective control. In fact, such a control network closely relates to the fields of “context-aware” RL and “meta” RL Ranganathan and Campbell (2003). In particular, we are interested in partially understanding the dynamical interplay among environments, environment awareness, and control strategies (captured in the RL-learnt control network). This paper also seeks to link the extracted interplay mechanisms to the concepts of hybrid control, which may further offer some interpretation for fault detection and human intervention.
To this end, this paper studies the position regulation problems of a free-floating underwater platform with limited thrust capacities. The platform is further subject to various excessive external disturbances. As shown in Wang et al. (2019), the excessive disturbance forces at two consecutive time instants cannot be viewed as independent and identically distributed (i.i.d.), even when conditioned on the platform state. These disturbance forces are better described as unknown functions of time. Together with the platform dynamics (in static water), each disturbance defines a transition function of the disturbed platform and thus a POMDP. The disturbance and its role in determining the POMDP have to be estimated online for effective and active disturbance rejection. This paper studies a Disturbance OBserver network (DOBnet) that solves a set of such POMDPs. The DOBnet was proposed in our previous work Wang et al. (2019) and integrates a disturbance observer built upon Gated Recurrent Units (GRUs). The observer subnetwork is jointly trained with a controller subnetwork for optimal disturbance encoding and rejection Woolfrey et al. (2019).
This kind of position regulation problem arises in shallow water applications, e.g., inspecting bridge piles Woolfrey et al. (2016), and in deep water operations, e.g., steering a cap to a spewing well Read (2011), where the disturbances from turbulent water and oil flows may frequently exceed the thrust capability of the underwater platform. Such problems also exist in controlling quadrotors for surveillance and inspection in windy conditions Waslander and Wang (2009). With the platform stabilized within a range close to a targeted location, the onboard manipulators could compensate for the platform oscillation and execute tasks. However, these unknown excessive disturbances inevitably bring adverse effects and may even destabilize the platform Xie and Guo (2000); Gao (2014); Li et al. (2014).
Following “less is more” from a poem by Robert Browning, this paper proposes an Attention-based Abstraction (A) approach for extracting key memory states and state transitions that reflect the interplay between the DOBnet and the dynamical environments (i.e., disturbances). The proposed A aims to equivalently present the controlled platform as a finite-state automaton. The A extends the Quantized Bottleneck Network Insertion (QBNI) Koul et al. (2018) to control problems that are better described by multiple POMDPs. The QBNI studies control systems that are governed by a single POMDP with discrete observations and actions (pixels and keyboard actions). However, as in many control problems from practical applications, the position regulation problem of free-floating platforms is defined in continuous state-observation-control spaces. More critically, these problems are better described by a (possibly infinite) number of POMDPs. Therefore, A involves two critical improvements to the QBNI approach, as introduced below. Note that in the remainder of this paper, the terms “action” and “control” are used interchangeably.
Contributions: The proposed A first builds a discrete representation of the controlled platform by learning optimal continuous-discrete interfaces for observation and control, respectively. Since the DOBnet operates in continuous spaces, interfaces between the discrete and the continuous spaces are required for generating the KMMN. Instead of manually setting quantization levels, the A learns a more compact quantization that brings about minimum DOBnet performance loss. From the perspective of hybrid control, the continuous-discrete interfaces offer essential connections between continuous states and discrete modes. The switchings between discrete modes build an automaton that provides an interpretation of the switching mechanisms found later. From the perspective of machine learning, these continuous-discrete interfaces become autoencoders with a quantization layer as the encoding layer. In this paper, these autoencoders are first trained in a supervised manner Hinton and Salakhutdinov (2006). Then they are fine-tuned within the RL framework. As a result, the interfaces are optimized with attention on the subspace that the optimally controlled platform visits.
In addition, we propose a simple recursive loss function to train an autoencoder for quantizing hidden states, which are key in memorizing and distilling the history of the observations and controls. We found that the autoencoder trained with the recursive loss results in a more stable DOBnet than the one trained as in Koul et al. (2018). A Moore Machine Network (MMN) is obtained after a minimization process Paull and Unger (1959), referred to as the Partial Enumerative Solution (PES) for minimizing partially specified sequential switching functions. The MMN contains a large number of states and transitions, since the controlled platform inevitably undergoes multiple environments. Therefore, the MMN alone does not yet provide insight into the interplay between the environments and the networks.
The proposed A further selects the MMN states and transitions that attract sufficient attention from the DOBnet in solving multiple POMDPs, resulting in a Key MMN (KMMN). Since we are interested in the interplay between control strategies and environments (i.e., POMDPs), the attention that each state attracts is defined as the number of POMDPs that visit this state. Intuitively, a state visited by only one POMDP is unique to this POMDP and is not critical to other POMDPs. Thus, this state is ignored in the KMMN. The often-visited MMN states form the set of KMMN states. A transition between a pair of KMMN states is then created if a concatenated transition between them exists. The KMMN greatly reduces the number of states and transitions, illustrating the patterns of cyclic switchings.
Within the obtained KMMN, we found that about of tested episodes exhibit cyclic transitions between some KMMN states. Note that each episode involves one randomly generated environment (i.e., one disturbance pattern or one POMDP). We also found that each KMMN state corresponds to a saturated control. This finding is consistent with the fact that often-saturated systems can be described by switching-control-regulated models Zuo et al. (2010); Benzaouia et al. (2010); Yuan and Wu (2015); Dong et al. (2010). In addition, the learnt control network is able to activate a portion of the KMMN that corresponds to the disturbance pattern. We still cannot fully understand the DOBnet or analyze the stability of the controlled platform. However, the induced switching mechanism may offer some interpretation of the hidden states for analysis, debugging, and abnormality detection.
The remainder of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 introduces the problem formulation of the position regulation tasks, followed by the scope of this paper. Our previous work on DOBnets is summarized in Section 4. Section 5 provides a detailed description of the proposed A approach for obtaining the KMMN. Section 6 then presents the switching mechanisms found in the learnt DOBnet. The switchings are analyzed via an analogy to hybrid control in Section 7, followed by conclusions in the last section.
2 Related Work
2.1 Disturbance Rejection
Disturbance rejection control Skogestad and Postlethwaite (2007); Åström and Wittenmark (2013); Lu and Liu (2018); Edwards and Spurgeon (1998) often assumes that disturbances are bounded and relatively small compared with the control saturation Ghafarirad et al. (2014). One popular improvement to these controllers is to add a feedforward compensation based on some disturbance estimation technique Yang et al. (2010). Various disturbance estimation techniques have been proposed and practiced, such as the Disturbance OBserver (DOB) Ohishi et al. (1987); Chen et al. (2000); Umeno et al. (1993), the unknown input observer in disturbance accommodation control Johnson (1968, 1971), and the extended state observer Han (1995); Gao et al. (2001). However, when disturbances frequently exceed the control saturation, these controllers fail to guarantee stability under actuator saturation Gao and Cai (2016).
To this end, model predictive control (MPC) Camacho and Alba (2013) is often applied due to its capability in dealing with constraints Gao and Cai (2016). It formulates a series of constrained optimization problems over receding time horizons based on predictions of the disturbed platform. A prediction method (e.g., autoregressive moving average) is required to forecast future disturbances based on the estimations of current disturbances (from DOBs). However, DOBs often require sufficient system modelling, which could be difficult for underwater robots due to hydrodynamic effects. Otherwise, the disturbance estimations are lumped with large modelling uncertainties, which in fact are not functions of time and thus are not suitable for time-series prediction methods. On the other hand, current DOBs might have insufficient capability in estimating fast time-varying disturbances, since their convergence analysis often assumes time-invariant disturbances. In addition, such separated processes of disturbance estimation, disturbance prediction, and control optimization might not be able to produce estimations and control signals that are mutually robust and that jointly optimize performance, as evidenced in Brahmbhatt and Hays (2017); Karkus et al. (2018).
2.2 Understanding Recurrent Policy Networks
Recurrent Neural Network (RNN) memory (i.e., hidden states) is often in the form of a high-dimensional vector in a continuous space and is recursively updated through gating networks. There has been some work on visualizing and understanding the learnt RNN Karpathy et al. (2015). RNN models have been linked to iterated function systems in Barnsley (2014), which further shows the relationship between the independent constraints on the state dynamics and the universal clustering behavior of the network states. Many others use training data to show the clustering and the correspondences between network internal states Cleeremans et al. (1989).
There has been a body of research on extracting finite-state machines from trained RNNs. Crutchfield has reported that the minimal finite-state machine could be induced from periodic sampling with a single decision boundary Crutchfield and Young (1988). An approach that forces the learning process to develop automaton representations has been proposed in Frasconi et al. (1996), which adds a regularization to constrain the weight space. Omlin has used hints to learn a finite-state automaton for second-order recurrent networks Omlin and Giles (1992). Learning fully binary networks Hubara et al. (2016), where activation functions (and/or weights) are binary, is an effort orthogonal to those mentioned above. A query-based approach has been proposed to extract a deterministic finite-state machine that characterizes the internal dynamics of hidden states Weiss et al. (2017).
Koul has proposed Quantized Bottleneck Network (QBN) insertion in Koul et al. (2018) for extracting a finite-state machine from discrete action networks. The QBNs are autoencoders in which the latent encoding is quantized. Given a trained RNN policy, the QBNs are trained to encode and quantize the hidden states and observations in a supervised manner.
3 Control Problem and Scope
The optimal control problems considered in this study involve a free-floating platform (a rigid body) under translational motion only; its orientation is kept stable by its large restoring forces and its sufficiently large torque capacity for heading control. The position of this platform is denoted as . The platform’s velocities and accelerations are denoted by and , respectively. It is assumed that and are observable without errors in this study; nevertheless, RL approaches are in general robust to reasonable observation noise.
Then the platform’s dynamics (also referred to as the system) is given by
(1) 
where is the inertia matrix and is the vector of the gravity and buoyancy forces. This matrix , vector, and external disturbances are assumed unknown to the controller (the trained DOBnet). The platform control is saturated at an upper bound and a lower bound , where and are dimensionwise operators. Let and the platform dynamics in discrete time can be written as
(2) 
In the remainder of this paper, the time indices in equations are in parentheses and the ones in figures are subscripts for compactness.
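As an illustration of the saturated discrete-time dynamics described above, the following is a minimal sketch only. It assumes a point mass with a scalar mass in place of the inertia matrix, a semi-implicit Euler update, and symmetric saturation bounds; the names `saturate` and `step`, the default bounds, and the step size are hypothetical and are not taken from eqs. (1) and (2).

```python
def saturate(u, u_min, u_max):
    """Clip each control component to [u_min, u_max] (dimension-wise)."""
    return [max(u_min, min(u_max, ui)) for ui in u]


def step(pos, vel, u, d, g=(0.0, 0.0, 0.0), mass=1.0,
         u_min=-1.0, u_max=1.0, dt=0.1):
    """One assumed Euler step of a saturated point mass.

    pos, vel : current position and velocity (3 components each)
    u        : commanded control force before saturation
    d        : external disturbance force at this step
    g        : net gravity and buoyancy force (assumed constant)
    """
    u_sat = saturate(u, u_min, u_max)
    acc = [(us + gi + di) / mass for us, gi, di in zip(u_sat, g, d)]
    new_vel = [v + a * dt for v, a in zip(vel, acc)]
    new_pos = [p + v * dt for p, v in zip(pos, new_vel)]
    return new_pos, new_vel


# A disturbance of 2 along the first axis cannot be cancelled by a control
# that saturates at 1, so the platform still accelerates along that axis.
pos, vel = step([0.0] * 3, [0.0] * 3, u=[-5.0, 0.0, 0.0], d=[2.0, 0.0, 0.0])
```

The usage line shows why excessive disturbances matter: the commanded control is clipped, and the residual force keeps pushing the platform regardless of how large the command is.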
The external disturbances are described by the disturbance forces, which are time-variant and are superpositions of sinusoidal functions as
(3) 
where , and denotes the number of components, which is unknown and may vary across environments. The parameters (, , and ) of each component are assumed to be uniformly and randomly sampled from given intervals and then fixed in each environment in this paper. One instantiation of all parameters of all components is referred to as one disturbance pattern.
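The sampling of one disturbance pattern can be sketched as follows. This is an illustration under stated assumptions: each component is parameterized by an amplitude, a period, and a phase (as named in Section 6.1), the sampling intervals and the maximum number of components are hypothetical placeholders, and the sine form is one plausible reading of eq. (3) rather than a reproduction of it.

```python
import math
import random


def sample_pattern(rng, max_components=3,
                   amp=(1.0, 2.0), period=(2.0, 8.0),
                   phase=(0.0, 2.0 * math.pi)):
    """Sample one disturbance pattern (one environment).

    Returns a list of (amplitude, period, phase) tuples. The number of
    components and all intervals are hypothetical placeholders.
    """
    n = rng.randint(1, max_components)
    return [(rng.uniform(*amp), rng.uniform(*period), rng.uniform(*phase))
            for _ in range(n)]


def disturbance(pattern, t):
    """Evaluate the superposition of sinusoids at time t (single axis)."""
    return sum(a * math.sin(2.0 * math.pi * t / T + phi)
               for a, T, phi in pattern)


rng = random.Random(0)          # fixed seed for repeatability
pattern = sample_pattern(rng)   # parameters stay fixed within an episode
trace = [disturbance(pattern, 0.1 * k) for k in range(50)]
```

Because the parameters are fixed once sampled, the disturbance within an episode is a deterministic function of time, which is what distinguishes one environment (POMDP) from another.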
In the remainder of this paper, one sampled disturbance pattern is viewed as one environment for the free-floating body. The terms “disturbances” and “disturbance forces” are used interchangeably. Note that the external disturbances considered are excessive for the free-floating platform, the definition of which is given as follows.
Definition 1 (Excessive external disturbances)
Excessive external disturbances are those defined in eq. (3), where the amplitudes () exceed the platform control saturation ( and ).
Problem 1 (Optimal control)
Find one controller that chooses an action for the system described in eq. (2) at time in response to the current observation , such that the discounted summation of collected rewards is maximized. The summation is taken in expectation over episodes and is defined as , where is a reward function (additive inverse of the tracking error, defined later), denotes the number of steps in an episode, and is a discount factor that prioritizes near-term rewards over future rewards Nagabandi et al. (2018).
The tracking error is defined as
(4) 
where denotes the L norm. In each episode, the environment is randomly sampled and is characterized by excessive disturbances in eq. (3).
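For concreteness, the objective in Problem 1 can be sketched as below. The sketch assumes a Euclidean norm for the tracking error, a target at the origin, and the usual form of a discounted return in which the reward at step t is weighted by the discount factor raised to the power t; the function names and the discount value are hypothetical.

```python
import math


def reward(pos, target=(0.0, 0.0, 0.0)):
    """Additive inverse of the tracking error (Euclidean norm assumed)."""
    return -math.sqrt(sum((p - q) ** 2 for p, q in zip(pos, target)))


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one episode, accumulated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Rewards collected along a short, hypothetical position trajectory.
trajectory = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
episode_return = discounted_return([reward(p) for p in trajectory], gamma=0.5)
```

The expectation over episodes in Problem 1 would then be estimated by averaging such returns over many randomly sampled environments.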
Classical RL approaches often implicitly assume that the disturbances are independent and identically distributed (i.i.d.), possibly conditioned on the platform state Sæmundsson et al. (2018). If not conditioned on , is marginalized over and , and is then described as , where . These disturbance models lead to a single-POMDP description of the controlled platform and are sufficient when disturbances are small. However, the excessiveness of the disturbances makes these models of unsuitable for disturbance rejection, as evidenced in Wang et al. (2019). The following analysis shows that the controlled systems in Problem 1 are better described by multiple POMDPs.
For a th pattern of disturbance superposition, each component of is a function that exhibits periodicity, which can be described as a Markovian chain. The index is dropped if no ambiguity is caused. The Markovian chain is given as
(5) 
where the index of indicates the variety of disturbance patterns. Let denote the space of and the space of possible .
Let , where might vary across environments. Then the platform dynamics can be rewritten in a partially observable Markovian chain as
(6) 
where is the observation function, showing that is observable while is not directly observable. Here the observability is in statistical sense (not in control sense). Let denote the space of all possible . Each transition function defines a POMDP , where is the trained current control network. Let denote the set of all possible .
The control network is targeted to solve Problem 1 (i.e., all ). Key to is the integration of a disturbance observer into existing RL frameworks. This observer not only estimates the unobservable state but also infers the transition function . Both and are critical to the control subnetwork. Our previous work has proposed a DOBnet for this purpose Wang et al. (2019). The DOBnet outperforms existing control and RL approaches. However, the understanding of the learnt DOBnet remains unsatisfactory. Therefore, the scope of this paper, stated below, concerns the understanding of the learnt DOBnet. For simplicity, the reduced version of the DOBnet is studied in this paper.
Scope 1 (Analysis of DOBnets)
Inductive reasoning of the mechanism on how the learnt DOBnet responds to different unobservable external excessive disturbances (i.e., to different POMDPs).
4 DOBnet
Estimating the disturbance forces, their transition functions, and their predictions is key in solving a randomly sampled from . In DOBnets, these estimations are encoded in a latent feature space. The features have to be mutually robust between the controller and the observer. The DOBnet developed in our previous work is composed of a disturbance-behaviour observer subnetwork and a controller subnetwork. For simplicity, this paper investigates the reduced version consisting of a single-layer GRU, as shown in Fig. 1. Both subnetworks are jointly optimized for mutual robustness and unified optimization. The observer subnetwork imitates the classical DOB mechanisms and is enhanced with the flexibility of GRUs, instead of only providing the estimation of the lumped disturbances up to the current time. The encoding (shown in Fig. 1) is supposed to represent the disturbance behaviour that is key to the controller subnetwork.
The full DOBnet is constructed based on the classical actor-critic architecture Lu et al. (2016): the network outputs actions and critics (also referred to as cost-to-go) associated with the previous state and action. The policy is trained using simulated sine-wave disturbances. Multiple control and RL algorithms have been tested and compared in Wang et al. (2019); the results demonstrate that the proposed DOBnet significantly improves the rejection of excessive disturbances.
5 A: Extracting Key Moore Machine Network
The proposed A approach aims to abstract the control mechanism captured in the trained DOBnet for solving continuous-control problems. It consists of two procedures: quantization and abstraction. The A involves two critical improvements to the “Quantized Bottleneck Network Insertion” (QBNI) Koul et al. (2018), the latter of which is used to generate a finite-state automaton of a trained policy network.
Definition 2 (Finite-state automaton Cheng and Krishnakumar (1993))
A finite-state automaton is an abstract machine whose state is assigned as one of a finite number of states at any given time. It is also referred to as a finite-state machine. The transition between states is determined by discrete action and observation, which is often given by a table.
The finitestate automata in this paper are all deterministic.
Definition 3 (Moore machine network Koul et al. (2018))
A Moore machine network is a standard deterministic finite-state machine whose states are labeled by their output values (controls in this paper). An MMN is fully characterized by finite sets of states, observations, and actions, a transition function, and a policy that maps states to actions, where the policy and the transition function are represented by neural networks.
The QBNI algorithm together with the PES works well for grouping hidden states and observations (and thus reducing the number of states in an MMN). However, the effectiveness of the PES heavily depends on the number of actions, which has to be limited. At least one state is related to each unique action Paull and Unger (1959); therefore, the number of possible actions has to be reduced for revealing the interplay in Scope 1. In the case of Atari games, the possible actions are often fewer than (e.g., “fire”, “move left/right”, “jump”). As pointed out in the introduction, the problems studied here involve multiple POMDPs, leading to a large number of states and transitions in the obtained MMN.
The first improvement is in the quantization, where continuous-discrete interfaces are optimized for actions, reducing the number of quantized actions given acceptable DOBnet performance loss. The second improvement is the abstraction of key states and transitions in the MMN based on the evaluation of attention.
5.1 Continuous-Discrete Interfaces
The proposed A approach first learns continuous-discrete interfaces for observations and actions, respectively. Such interfaces commonly exist in hybrid system modelling Lu et al. (2015) for solving the control of often-saturated systems. The observation and action interfaces in the quantized DOBnet are shown in Fig. 4 (better viewed in color) and are denoted as Observation Quantization (OQ) and Action Quantization (AQ), respectively. Each quantization block consists of a continuous-to-discrete interface and a discrete-to-continuous interface.
The components in the blue dashed rectangle and the ones in the green dash-dotted rectangle then correspond to the discrete-event subsystem and the mapping from the discrete hidden state to the continuous control, respectively. This will be discussed further in Section 6. In this paper, all interfaces are built upon neural networks, the detailed structures of which are shown in Section 6. In fact, each quantization block is an autoencoder from the perspective of machine learning.
In general, an autoencoder consists of an encoder and a decoder, where the decoder aims to reconstruct the original inputs to the encoder. Autoencoders have been widely used to reduce data dimension using neural networks Hinton and Salakhutdinov (2006) and are often trained in a supervised manner.
One straightforward approach to obtaining the interfaces is to evenly quantize the observation and action spaces; however, it is not clear how the quantization levels should be chosen. Also, the importance of an action (observation) to the controlled platform is not uniform across the action (observation) space. The states and actions that attract most attention from the optimally controlled platform are often subsets of the entire state and action spaces, respectively. We are interested in interfaces that are optimized with respect to these subsets.
In this paper, to produce a continuous-to-discrete interface, the output of the encoder is quantized through a combination of a level activation layer (denoted as Tanh*) and a quantization layer. As in Koul et al. (2018), the Tanh* layer restricts the outputs to the range of and offers gradient near a valued input, which allows a quantization level at during training. The Tanh* activation function is given as Koul et al. (2018)
(7) 
With Tanh*, the quantization layer offers level quantization valued at .
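A minimal sketch of such an activation and quantization pair is given below. It assumes the three-level form that, to our reading, Koul et al. (2018) use: an activation of 1.5 tanh(x) + 0.5 tanh(-3x) followed by rounding to the nearest of -1, 0, and +1. Both the formula and the level values should be checked against that reference; the function names are hypothetical.

```python
import math


def tanh_star(x):
    """Assumed Tanh* activation: bounded in (-1, 1) and flat near x = 0."""
    return 1.5 * math.tanh(x) + 0.5 * math.tanh(-3.0 * x)


def quantize(y):
    """Round an activation in (-1, 1) to the nearest level in {-1, 0, 1}."""
    return int(round(y))


def encode(pre_activations):
    """Map a vector of encoder pre-activations to a discrete code."""
    return tuple(quantize(tanh_star(x)) for x in pre_activations)


code = encode([3.0, 0.1, -2.5])   # one value per quantization neuron
n_codes = 3 ** len(code)          # number of distinct codes for 3 neurons
```

The flat region of the activation around zero is what makes the middle level reachable: small pre-activations are mapped close to zero and therefore round to 0 rather than to one of the extremes. With three levels per neuron, the number of representable codes grows as a power of three in the number of quantization neurons.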
With the continuousdiscrete interfaces inserted, the full quantized DOBnet is illustrated in Fig. 2. In the remainder of this paper, the original DOBnet is referred to as “continuous DOBnet” to distinguish from the quantized DOBnet.
Training: The QBNI algorithm, suggested in Koul et al. (2018), does not work well for learning OQ and AQ, since the number of quantized actions should also be minimized for effective reduction in obtaining a key MMN. Therefore, a three-step training approach is used to train both OQ and AQ. The number of neurons in the encoder layer of AQ determines the cardinality of the set of all possible discrete actions. The cardinality is since each quantization neuron has levels. On the one hand, a large leads to less optimality loss from the quantization, compared with the continuous DOBnet. On the other hand, a small results in fewer action choices and thus fewer MMN states after reduction by PES. Therefore, the number of discrete actions should be minimized to reduce the number of states in the MMN. The number of neurons in the quantization layer should be chosen such that the performance degradation is restricted to a reasonable level (e.g., ).
Step One: The continuous DOBnet is first trained by the Advantage Actor Critic (A2C) Mnih et al. (2016), as shown in Wang et al. (2019). A2C uses synchronous gradient descent for optimizing policy networks and executes multiple instances of the environments in parallel threads. This parallelism provides a more reliable estimation of the critics during training.
Step Two: A data set of observations and actions from a large number of episodes is collected by running the trained continuous DOBnet. Note that in each episode, a disturbance pattern is randomly generated, which is i.i.d. with respect to the pattern in any other episode. Then OQ and AQ are trained on the observation and action data, respectively, through supervised learning. Since the data is collected using the optimal DOBnet, it reflects the nonuniform distribution of attention in the action and observation spaces.
Step Three: The trained OQ and AQ are inserted into the trained continuous DOBnet to obtain the quantized DOBnet, as shown in Fig. 2 (HQ is deactivated). However, the performance of the quantized DOBnet is not close to that of the continuous DOBnet (worse by ). The entire quantized DOBnet is therefore fine-tuned in an RL fashion, as in Step One. The quantization layer introduces functions that are non-differentiable. During the training, a straight-through estimator for gradients, as suggested in Bengio et al. (2013), is adopted. The estimator simply treats the quantization function as an identity function during backpropagation and passes on the gradients without any change. The results shown in Section 6 suggest that the performance of the quantized DOBnet resulting from the three-step training is close to that of the continuous DOBnet.
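The straight-through estimator used in Step Three can be illustrated with a hand-written forward and backward pass for a single quantization neuron. This is a sketch under assumptions: the activation is the three-level Tanh* form attributed above to Koul et al. (2018), the loss is a hypothetical squared error against a target level, and no automatic differentiation library is used.

```python
import math


def tanh_star(x):
    """Assumed Tanh* activation (three-level form)."""
    return 1.5 * math.tanh(x) + 0.5 * math.tanh(-3.0 * x)


def tanh_star_grad(x):
    """Analytical derivative of the assumed Tanh* activation."""
    sech2 = lambda z: 1.0 - math.tanh(z) ** 2
    return 1.5 * sech2(x) - 1.5 * sech2(3.0 * x)


def forward(x):
    """Forward pass: activation followed by rounding to {-1, 0, 1}."""
    return int(round(tanh_star(x)))


def ste_grad(x, target):
    """Gradient of (q - target)**2 with respect to x under the
    straight-through estimator: the rounding step is treated as the
    identity in the backward pass, so its local gradient is 1."""
    q = forward(x)
    dloss_dq = 2.0 * (q - target)
    dq_dy = 1.0                      # straight-through: identity gradient
    return dloss_dq * dq_dy * tanh_star_grad(x)


# The true derivative of rounding is zero almost everywhere, which would
# block learning; the estimator still yields a useful descent direction.
g = ste_grad(0.5, target=1.0)        # q = 0 here, so g is negative
x_new = 0.5 - 0.5 * g                # one gradient step increases x
```

In this example the quantized output is 0 while the target level is 1, and the estimated gradient is negative, so a gradient step moves the pre-activation toward the region that quantizes to 1.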
5.2 Key Moore Machine Network
The data sets of the discrete hidden states, the discrete observations, and the discrete actions are collected while solving Problem 1 in multiple randomly generated environments. In addition, the transitions between consecutive pairs of the quantized hidden states are also recorded.
Then unique states are found and indexed for each data set, resulting in an MMN. Let denote the cardinality of the state space of the MMN and the cardinality of its observation space. The transition function of this MMN is then constructed as a transition matrix of that captures the transitions evidenced in the data. In general, and are larger than necessary.
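The indexing and table construction described above can be sketched as follows. The record format is an assumption made for illustration: each record is a tuple of the previous quantized hidden state, the quantized observation, the next quantized hidden state, and the quantized action emitted in that next state. The function name `build_mmn` is hypothetical.

```python
def build_mmn(records):
    """Index unique states and observations and build a transition table.

    records : iterable of (h_prev, obs, h_next, action) tuples, where each
              element is a hashable discrete code (assumed format).
    Returns (state_ids, obs_ids, table, action_of), where table[s][o] is
    the next state index, or None if that pair was never observed.
    """
    records = list(records)
    state_ids, obs_ids, action_of = {}, {}, {}
    for h_prev, obs, h_next, action in records:
        for h in (h_prev, h_next):
            state_ids.setdefault(h, len(state_ids))
        obs_ids.setdefault(obs, len(obs_ids))
        # Moore machine: the output (action) is a label of the state.
        action_of[state_ids[h_next]] = action
    table = [[None] * len(obs_ids) for _ in state_ids]
    for h_prev, obs, h_next, _ in records:
        table[state_ids[h_prev]][obs_ids[obs]] = state_ids[h_next]
    return state_ids, obs_ids, table, action_of


records = [
    ((0, 1), (1,), (1, 1), (1,)),
    ((1, 1), (0,), (0, 1), (-1,)),
    ((0, 1), (0,), (0, 1), (-1,)),
]
state_ids, obs_ids, table, action_of = build_mmn(records)
```

Entries left as None correspond to state and observation pairs that never occur in the data, which is why the resulting machine is only partially specified before any reduction is applied.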
A reduced but equivalent MMN can be obtained by a standard finite state machine reduction technique (i.e., PES in this paper), which is able to group hidden states and observations if a common transition and action can be found. Each group of the hidden states is referred to as a state in the reduced MMN and each group of the observations is referred to as an observation in the reduced MMN. This reduced MMN is able to show how states, observations, and actions are related to problems, as shown in Koul et al. (2018).
However, Problem 1 subject to various environments is better described by multiple randomly sampled POMDPs. The number of states and observations in the reduced MMN is still too large to induce an explainable relationship among states, actions, and environments. In fact, the systems (controlled by the quantized DOBnet) visit different portions of the reduced MMN in different episodes (i.e., under various disturbance patterns), as illustrated in Fig. 3. As shown in Section 6, the number of states in the reduced MMN was still quite large () compared to the Atari games investigated in Koul et al. (2018).
In order to understand the interplay between disturbances and control strategies, we propose a Key Moore Machine Network (KMMN), which ignores some states and transitions in the reduced MMN. Some of the states and observations are unique to an episode (i.e., a POMDP), while others attract more attention from a number of episodes.
Definition 4 (Key Moore machine network)
A key Moore machine network is a finite-state automaton that consists only of the key states and the transitions between key states. The key states are those MMN states that attract sufficient attention from the controlled systems in a number of environments. The attention of a state is defined as the number of episodes that visit this state. A transition between two key states is available if a concatenated transition can be found in the reduced MMN.
The relation between the KMMN and the reduced MMN is shown in Fig. 3, where the MMN states other than key states are referred to as relay states. Note that the transitions in the KMMN are different from the MMN transitions. A KMMN transition may involve multiple MMN transitions. One MMN transition corresponds to one step defined in the POMDPs. Since we are interested in the interplay between the control strategies and the environments (i.e., disturbances, POMDPs), we extract key MMN states that are commonly visited by a number of POMDPs. The KMMN greatly reduces the number of states and transitions, providing a baseline for inductive learning of the interplay. To find the KMMN, the step of obtaining the reduced MMN is necessary. Otherwise, the chance of having states with sufficient attention is quite low.
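The selection of key states and the creation of KMMN transitions can be sketched as below. Several choices here are assumptions made for illustration: each episode is given as a sequence of reduced-MMN state indices, the attention threshold `min_attention` is a hypothetical parameter, concatenated transitions are taken from the observed episode trajectories (passing through relay states) rather than searched for in the full MMN graph, and self-loops are omitted.

```python
def extract_kmmn(episodes, min_attention):
    """Select key states by attention and link them by concatenated
    transitions.

    episodes      : list of state-index sequences, one per episode
    min_attention : minimum number of episodes that must visit a state
                    for it to count as a key state (hypothetical)
    Returns (key_states, edges), where edges is a set of (src, dst) pairs.
    """
    # Attention of a state: number of episodes that visit it at least once.
    attention = {}
    for seq in episodes:
        for s in set(seq):
            attention[s] = attention.get(s, 0) + 1
    key_states = {s for s, n in attention.items() if n >= min_attention}

    # Dropping relay states from each trajectory concatenates the MMN
    # transitions between consecutive key-state visits.
    edges = set()
    for seq in episodes:
        visited = [s for s in seq if s in key_states]
        for src, dst in zip(visited, visited[1:]):
            if src != dst:
                edges.add((src, dst))
    return key_states, edges


episodes = [[0, 5, 1, 6, 0], [0, 7, 1], [1, 8, 0, 9]]
key_states, edges = extract_kmmn(episodes, min_attention=2)
```

In this toy example, states 0 and 1 are visited by all three episodes and become key states, while states 5 to 9 are each unique to one episode and act as relay states. The resulting edges between states 0 and 1 in both directions form a cycle, which is the kind of pattern the KMMN is intended to expose.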
6 Implementation and Results
This section first outlines the simulation details of the platform and disturbances. Then, the implementation of the learning interfaces (AQ, OQ, and HQ) and the results are presented. Finally, the obtained MMN and KMMN are summarized, together with the switching mechanism captured in the DOBnet.
6.1 Platform and Disturbances
As described in the problem formulation, the platform is assumed to be stable in orientation. Only translational motion and control are considered; thus, the platform has a dimensional state space (positions and linear velocities) and a dimensional action space. In order to analyze the results more intuitively, the characteristics (mass, control, gravity and buoyancy forces, and disturbance forces) of the platform are scaled down such that the mass of the simulated platform is [kg]. Then, the control saturation is given as .
Each episode contains steps with second per step. In each episode, the platform starts at a random position with a random velocity, and it is controlled to reach a given position (the origin), aiming to keep its position within a range (as small as possible) of the origin against unknown excessive disturbances. In these simulations, the external disturbances are exerted in all three directions in the inertial frame. In each axis, the disturbance is sinusoidal, and the disturbance superposition is given as
(8) 
where
(9) 
and
denotes a uniform distribution in the range
. According to the problem setting, the amplitudes of disturbances exceed the control limits by . The purpose of the DOBnet training is to enable the trained network to deal with unknown time-varying disturbances; thus, the values of the amplitude, period, and phase are randomly sampled in each training or testing episode.
6.2 Learning Interfaces
The interfaces for action (AQ) are illustrated in Fig. 4, which consists of linear layers, quantization layer, and hyperbolic tangent (denoted as Tanh) activation layers. One of the activation layers is a level activation layer (defined in eq. (7) and denoted as Tanh*). The encoder component of the autoencoder is a continuous-to-discrete interface, while the decoder component is a discrete-to-continuous interface. The interfaces for action and observation share a similar autoencoder structure with different numbers of neurons in the linear layers and the quantization layer. In Fig. 4, the numbers and symbols in parentheses show the input, the output, and the number of neurons for OQ.
The neuron numbers were manually picked such that the quantized DOBnet performs similarly to its continuous counterpart. As pointed out earlier, the number of neurons in the encoding layer of AQ is critical. The aim is to minimize this number without losing much optimality in the resultant quantized DOBnet. It was manually picked via a trial-and-error approach. The neuron number was first set to ; however, the resultant performance was not satisfactory: the collected reward (negative) was nearly doubled. Then, the neuron number was set to and , respectively. It was found that is sufficient for retaining optimality. The continuous DOBnet and quantized DOBnet exhibit on average difference in rewards collected in an episode. The number of neurons in OQ is also critical; choices of , , , and were tested and it was found that is appropriate for the DOBnet. The choice of neuron numbers has been studied in the field of neural architecture search and can possibly be solved via RL Zoph and Le (2016); however, it is outside the scope of this paper.
Since the disturbances frequently exceed the control saturation, the platform inevitably oscillates and so does the error of position regulation. The DOBnet requires some steps to collect sufficient data to infer the environment in the hidden state. Here the maximum tracking error from to is used as one criterion to show the effectiveness of the learnt DOBnets. It is referred to as the regulation error and given as
(10) 
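As an illustration of the regulation error described above, the following Python sketch computes the maximum tracking error over a window of steps. It assumes that the tracking error is the Euclidean distance of the platform position from the origin (the target), and the window bounds `start` and `end` as well as the toy trajectory are hypothetical placeholders, not values from the paper.

```python
import math


def regulation_error(positions, start, end):
    """Largest distance to the origin over steps start..end (inclusive).

    Assumption: the tracking error at a step is the Euclidean norm of the
    3D position, since the target position is the origin.
    """
    window = positions[start:end + 1]
    return max(math.sqrt(x * x + y * y + z * z) for x, y, z in window)


# Toy trajectory (hypothetical): one (x, y, z) position per step.
trajectory = [
    (3.0, 4.0, 0.0),  # initial transient, excluded by the window below
    (1.0, 2.0, 2.0),
    (0.0, 0.5, 0.0),
    (0.6, 0.0, 0.8),
]
err = regulation_error(trajectory, start=1, end=3)
print(err)
```

Starting the window after the first steps excludes the initial transient, during which the network is still gathering observations.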
The D trajectories from both quantized and continuous DOBnets for the same problems (i.e., the same POMDPs defined in eq. (3)) are illustrated in Figures 5 and 6, respectively. The transparent red and blue spheres respectively represent the regulation errors from the quantized and continuous DOBnets. Clearly, the quantized DOBnet was able to achieve trajectories similar to those of the continuous DOBnet. Furthermore, the regulation error did not increase much.
6.3 Moore Machine Networks
Once the interfaces for actions and observations were trained, another set of simulations using the quantized DOBnet was conducted. A data set of the GRU hidden states was collected from episodes. In each episode, the disturbance pattern was randomly generated according to eq. (6.1). Following Koul et al. (2018), the autoencoder for quantizing hidden states is illustrated in Fig. 7, which consists of linear layers, quantization layer, and Tanh activation layers, where one of the activation layers is Tanh*.
The data collected was used to train HQ in a supervised manner. Different from usual loss functions, the importance of recursive stability was emphasized. The loss function used has two terms; the first one is standard and the second one regulates the recursive stability. The loss function is defined as
(11) 
where was set as
. Using stochastic gradient descent approach with the learning rate
, the training error (mean square error) was . The HQ network was inserted into the quantized DOBnet, as suggested in Koul et al. (2018), resulting in the fully quantized DOBnet. The rewards collected in each episode by the quantized DOBnet are compared with those collected by the continuous DOBnet in Fig. 8, showing about degradation averaged over all episodes. As shown in Fig. 9, the averaged regulation error exhibited increase.
Then another data set was collected from simulations of episodes using the quantized DOBnet (with HQ inserted). Each episode has samples of observations, hidden states, current actions, and previous actions. The transitions between hidden states given observations and actions were also recorded. It was found that the number of unique hidden states was and the number of unique observations was , suggesting that the system controlled by the quantized DOBnet in multiple environments did visit a large number of discrete hidden states. The number of unique actions was , the maximum of which is .
Considering the transitions between discrete hidden states as an incompletely specified sequential switching function, the hidden states and observations were grouped by PES Paull and Unger (1959). The number of unique groups of hidden states in the reduced MMN was reduced to and the number of observation groups was . We refer to each of these groups as a state or an observation in the MMN. It is nearly impossible to gain insight into the interplay between environments and control strategies, due to the large number of transitions and states. A portion of the MMN that highlights the transitions and states visited by two episodes is illustrated in Fig. 3, where the key states were obtained as described in the following subsection.
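The view of the recorded data as an incompletely specified machine can be sketched in Python as follows. The sketch assumes the recorded transitions are available as (hidden state, observation, next hidden state) triples; the record format, the string labels, and the function names are hypothetical. It only builds the transition table and lists the unspecified entries; the state grouping of Paull and Unger (1959) itself is not implemented here.

```python
from collections import defaultdict


def build_transition_table(records):
    """Map (hidden_state, observation) to the set of observed next states."""
    table = defaultdict(set)
    for h, o, h_next in records:
        table[(h, o)].add(h_next)
    return table


def unique_counts(records):
    """Return the sets of unique hidden states and unique observations."""
    states = {h for h, _, _ in records} | {h_next for _, _, h_next in records}
    observations = {o for _, o, _ in records}
    return states, observations


# Toy records (hypothetical): quantized hidden states and observations
# are represented by short string labels.
records = [
    ("h0", "o1", "h1"),
    ("h1", "o2", "h2"),
    ("h2", "o1", "h1"),
    ("h0", "o1", "h1"),
]
table = build_transition_table(records)
states, observations = unique_counts(records)

# Entries never seen in the data are "don't care" entries, which is what
# makes the machine incompletely specified.
unspecified = sorted(
    (h, o) for h in states for o in observations if (h, o) not in table
)
print(len(states), len(observations), unspecified)
```

The unspecified entries are what give a minimization procedure freedom to merge states, since two states can be grouped when their specified entries do not conflict.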
6.4 Key Moore Machine Network
The goal of the KMMN is to extract some shared control logic used by the learnt DOBnet to solve different POMDPs defined in eq. (3), and thus to show the interplay between the control and disturbances. Here data from episodes were studied. The sufficient attention was defined as “ attention”. In other words, to qualify as a key state in the KMMN, a state must attract attention from at least episodes out of .
We found that key states were picked by those episodes. One of the key states is the initial state, since in all episodes the hidden state always started at zero. The key states found are shown in Table 1, which summarizes the key state indices, the quantized encodings, and the decoded actions. It was found that the action at the beginning of each episode was almost zero, while the actions associated with the other key states were always at the control saturation. This phenomenon is discussed further later in this section.
Key state index  Quantized encoding  Decoded action 

The transitions between the key states in episodes (out of ) converged to some cyclic patterns shown in Fig. 10. Figure 10 shows examples, where the first examples did not exhibit clear converged patterns. The remaining examples exhibited three cyclic transition patterns, highlighted by green solid arrows. In all examples, the state started from State and the system took a number of transitions to enter one of the cyclic patterns. This is because, at the beginning of each simulation (episode), the DOBnet needed to interact with the environment to gain observations for estimating the key aspects of the inherent POMDP (i.e., disturbances and their transfer functions). The following analysis partially reveals how the hidden states are related to controls and disturbances.
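One simple way to check whether a sequence of key states has converged to a cyclic pattern is sketched below in Python. This is an illustrative heuristic, not the procedure used in the paper: it looks for the shortest cycle that the tail of the sequence repeats at least `min_repeats` times. The function name, the threshold, and the toy sequences are hypothetical.

```python
def terminal_cycle(seq, min_repeats=2):
    """Return the shortest cycle repeated at the end of `seq`, or None.

    The tail of length period * min_repeats must be periodic with the
    given period for the cycle to be accepted.
    """
    n = len(seq)
    for period in range(1, n // min_repeats + 1):
        tail = seq[n - period * min_repeats:]
        if all(tail[i] == tail[i + period] for i in range(len(tail) - period)):
            return tail[:period]
    return None


# Toy key-state sequences (hypothetical): the first one starts at state 0,
# passes through state 4, and then settles into the cycle 1 -> 2 -> 3.
converged = [0, 4, 1, 2, 3, 1, 2, 3, 1, 2, 3]
unsettled = [0, 4, 1, 2]

print(terminal_cycle(converged))
print(terminal_cycle(unsettled))
```

The returned cycle starts at whichever state the tail window begins with, so two episodes that share a cycle may report rotations of the same list; comparing cycles up to rotation would be needed to group episodes by pattern.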
Considering the action associated with each state in the KMMN, it was found that the learnt DOBnet behaved similarly to a hybrid controller where switchings occur. These switchings exhibited cyclic patterns because the disturbance in each direction was periodic. Each switching pattern indicated a disturbance pattern. As shown in Fig. 11, the disturbances in the three directions are illustrated in red, green, and blue, respectively. The additive inverses of the controls associated with the states are also illustrated. Note that the values of the controls in and directions were offset by and , respectively, for clear illustration. It was found that the states in the KMMN were only activated when the disturbances were close to the control saturation, as shown in Fig. 11. By inspecting the controls and unknown disturbances, it was shown that the obtained actions were synchronized with the disturbance forces.
Some episodes exhibited similar converged transition patterns, as shown in Fig. 10 (c), (d), (f), and (h). However, the way the system entered the cyclic patterns varied across environments. The system shown in Fig. 10(c) entered the cyclic pattern (referred to as cycle) through State directly, while the system shown in Fig. 10(h) entered the cycle through State after visiting State . In Fig. 10(d), the system visited States and , and then entered the cycle at State . As illustrated in Fig. 11, in cases (c), (d), (f), and (h), the disturbance forces in and directions had similar frequencies and phases, while the disturbance force in direction had a different frequency and phase. These examples show that the DOBnet was able to estimate disturbances and their inherent governing behavior.
In Fig. 10(e) and (f), the systems exhibited another two cyclic patterns. Interestingly, the episodes shown in Fig. 10(g) and (e) each exhibited two cyclic patterns. For example, the system in Fig. 10(g) first entered the cycle (shown in blue solid lines) through State and then entered the second cycle at State and stayed in the second cycle.
The first two examples in Fig. 10 did not exhibit clear cyclic patterns. It is possible that the key states found did not capture the states that were crucial to these two examples. The definition of sufficient attention should be explored further in future research.
The system in Fig. 10(e) entered a binary switching pattern. With careful examination of the disturbances in Fig. 11(e), we found that the components of the randomly generated disturbances had similar periods and phases. Therefore, the two states in the KMMN were sufficient to capture the periodic shifts. Overall, the key states found are strongly correlated with the disturbance patterns and with the time instants at which the disturbance forces were close to control saturation. The phases between disturbances change as a function of time, as shown in Fig. 11, which is strongly tied to the change of the hidden states and the associated actions. Therefore, the observer designed in the DOBnet and learned together with the control subnetwork was able to estimate such shifts in the phases and magnitudes of the disturbances.
7 Discussion
As pointed out in Zuo et al. (2010); Benzaouia et al. (2010); Yuan and Wu (2015); Dong et al. (2010), a controlled platform whose control often reaches control saturation can be described by a switching-control-regulated system. This kind of system can be characterized by
where is the discrete state, governs the switching between the discrete states (referred to as “modes” in hybrid control), and defines the transition function of the continuous state . The controlled platform can then be depicted by the structure in Fig. 12.
The relation between the quantized DOBnet and the hybrid-system control can be found by comparing Figures 12 and 2. The components in the blue dashed rectangle in Fig. 2 correspond to the discrete-event subsystem, which is represented as the blue rounded rectangle in Fig. 12. The components in the green dash-dotted rectangle in Fig. 2 correspond to the mapping between the discrete hidden states and the continuous controls, i.e., the discrete-to-continuous interface and the controller shown in Fig. 12.
Note that in classical hybrid modelling and control, the red dashed arrow to the controller is necessary, but it is not kept in the quantized DOBnet. Therefore, the quantized DOBnet only captures the discrete-event subsystem, which partially describes the interplay between the control strategy and the environments. The DOBnet is able to estimate the discrete-event subsystem online and generate a sufficient representation of it for effective control.
Cyclic switchings were found in the learnt DOBnet, showing that the control policy is able to capture for the position regulation problem in different environments (different POMDPs). In Fig. 11, the controls between switchings were not depicted for clarity of illustration, which may reflect . The continuous control based on feedback from continuous observation is missing in this study and should be included in future research.
8 Conclusion & Future Work
This paper proposes an attention-based abstraction approach for finding a key Moore machine network, which reveals the switching mechanism that has been captured in the DOBnet and is key to excessive disturbance rejection. This method is effective in abstracting the control logic used in solving different POMDPs. Interestingly, such switching mechanisms have been manually designed for controller development in the existing literature. This finding may offer a bridge between DOBnets and hybrid systems for better analysis.
In the future, more effort will be devoted to a new definition of sufficient attention to better capture the control mechanisms common in solving multiple POMDPs. Also, the continuous controls should be characterized to show how the system is guided between switchings, for the purpose of fully understanding the control network in the language of hybrid control. Another interesting direction for future work is to investigate the possibility of using the switching mechanism obtained through inductive learning as distilled knowledge for transfer learning.
References
 Åström and Wittenmark (2013) Åström, K.J., Wittenmark, B., 2013. Adaptive control. Courier Corporation.
 Barnsley (2014) Barnsley, M.F., 2014. Fractals everywhere. Academic press.
 Barreto and et al. (2017) Barreto, A., et al., 2017. Successor features for transfer in reinforcement learning, in: Advances in neural information processing systems, pp. 4055–4065.
 Bengio et al. (2013) Bengio, Y., Léonard, N., Courville, A., 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 .
 Benzaouia et al. (2010) Benzaouia, A., Akhrif, O., Saydy, L., 2010. Stabilisation and control synthesis of switching systems subject to actuator saturation. International Journal of Systems Science 41, 397–409.

 Brahmbhatt and Hays (2017) Brahmbhatt, S., Hays, J., 2017. Deepnav: Learning to navigate large cities, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3087–3096.
 Camacho and Alba (2013) Camacho, E.F., Alba, C.B., 2013. Model predictive control. Springer Science & Business Media.
 Chen et al. (2000) Chen, W.H., Ballance, D.J., Gawthrop, P.J., O’Reilly, J., 2000. A nonlinear disturbance observer for robotic manipulators. IEEE Transactions on industrial Electronics 47, 932–938.
 Cheng and Krishnakumar (1993) Cheng, K.T., Krishnakumar, A.S., 1993. Automatic functional test generation using the extended finite state machine model, in: 30th ACM/IEEE Design Automation Conference, IEEE. pp. 86–91.
 Cleeremans et al. (1989) Cleeremans, A., ServanSchreiber, D., McClelland, J.L., 1989. Finite state automata and simple recurrent networks. Neural computation 1, 372–381.
 Crutchfield and Young (1988) Crutchfield, J.P., Young, K., 1988. Computation at the onset of chaos, in: The Santa Fe Institute, Westview, Citeseer.
 Dong et al. (2010) Dong, C., Hou, Y., Zhang, Y., Wang, Q., 2010. Model reference adaptive switching control of a linearized hypersonic flight vehicle model with actuator saturation. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering 224, 289–303.
 Doyle et al. (1989) Doyle, J.C., Glover, K., Khargonekar, P.P., Francis, B.A., 1989. State-space solutions to standard H2 and H∞ control problems. IEEE Transactions on Automatic Control 34, 831–847.
 Edwards and Spurgeon (1998) Edwards, C., Spurgeon, S., 1998. Sliding mode control: theory and applications. Crc Press.

 Frasconi et al. (1996) Frasconi, P., Gori, M., Maggini, M., Soda, G., 1996. Representation of finite state automata in recurrent radial basis function networks. Machine Learning 23, 5–32.
 Gao and Cai (2016) Gao, H., Cai, Y., 2016. Nonlinear disturbance observer-based model predictive control for a generic hypersonic vehicle. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering 230, 3–12.
 Gao (2014) Gao, Z., 2014. On the centrality of disturbance rejection in automatic control. ISA transactions 53, 850–857.
 Gao et al. (2001) Gao, Z., Huang, Y., Han, J., 2001. An alternative paradigm for control system design, in: Decision and Control, 2001. Proceedings of the 40th IEEE Conference on, IEEE. pp. 4578–4585.
 Ghafarirad et al. (2014) Ghafarirad, H., Rezaei, S.M., Zareinejad, M., Sarhan, A.A., 2014. Disturbance rejectionbased robust control for micropositioning of piezoelectric actuators. Comptes Rendus Mécanique 342, 32–45.
 Gu et al. (2016a) Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R.E., Levine, S., 2016a. Qprop: Sampleefficient policy gradient with an offpolicy critic. arXiv preprint arXiv:1611.02247 .
 Gu et al. (2016b) Gu, S., Lillicrap, T., Sutskever, I., Levine, S., 2016b. Continuous deep qlearning with modelbased acceleration, in: International Conference on Machine Learning, pp. 2829–2838.

 Gunning (2017) Gunning, D., 2017. Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA), nd Web 2.
 Han (1995) Han, J., 1995. The "extended state observer" of a class of uncertain systems. Control and Decision 1.
 Hinton and Salakhutdinov (2006) Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. science 313, 504–507.
 Hubara et al. (2016) Hubara, I., Courbariaux, M., Soudry, D., ElYaniv, R., Bengio, Y., 2016. Binarized neural networks, in: Advances in neural information processing systems, pp. 4107–4115.
 Johnson (1968) Johnson, C., 1968. Optimal control of the linear regulator with constant disturbances. IEEE Transactions on Automatic Control 13, 416–421.
 Johnson (1971) Johnson, C., 1971. Accomodation of external disturbances in linear regulator and servomechanism problems. IEEE Transactions on automatic control 16, 635–644.
 Karkus et al. (2018) Karkus, P., Hsu, D., Lee, W.S., 2018. Particle filter networks: Endtoend probabilistic localization from visual observations. arXiv preprint arXiv:1805.08975 .
 Karpathy et al. (2015) Karpathy, A., Johnson, J., FeiFei, L., 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078 .
 Koul et al. (2018) Koul, A., Greydanus, S., Fern, A., 2018. Learning finite state representations of recurrent policy networks. arXiv preprint arXiv:1811.12530 .
 Li et al. (2014) Li, S., Yang, J., Chen, W.H., Chen, X., 2014. Disturbance observerbased control: methods and applications. CRC press.
 Lillicrap et al. (2015) Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 .
 Lu and Liu (2017) Lu, W., Liu, D., 2017. Active task design in adaptive control of redundant robotic systems, in: Australasian Conference on Robotics and Automation, ARAA.
 Lu and Liu (2018) Lu, W., Liu, D., 2018. A frequencylimited adaptive controller for underwater vehiclemanipulator systems under large wave disturbances, in: The World Congress on Intelligent Control and Automation.
 Lu et al. (2015) Lu, W., Zhu, P., Ferrari, S., 2015. A hybridadaptive dynamic programming approach for the modelfree control of nonlinear switched systems. IEEE Transactions on Automatic Control 61, 3203–3208.
 Lu et al. (2016) Lu, W., Zhu, P., Ferrari, S., 2016. An approximate dynamic programming approach for modelfree control of switched systems. IEEE Transactions on Automatic Control .
 Mnih et al. (2016) Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K., 2016. Asynchronous methods for deep reinforcement learning, in: International conference on machine learning, pp. 1928–1937.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al., 2015. Humanlevel control through deep reinforcement learning. Nature 518, 529.
 Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R.S., Levine, S., 2018. Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning, in: Robotics and Automation (ICRA), 2018 IEEE International Conference on, IEEE. pp. 7579–7586.
 Oh et al. (2016) Oh, J., Chockalingam, V., Singh, S., Lee, H., 2016. Control of memory, active perception, and action in minecraft. arXiv preprint arXiv:1605.09128 .
 Ohishi et al. (1987) Ohishi, K., Nakao, M., Ohnishi, K., Miyachi, K., 1987. Microprocessorcontrolled dc motor for loadinsensitive position servo system. IEEE Transactions on Industrial Electronics , 44–49.
 Omlin and Giles (1992) Omlin, C.W., Giles, C.L., 1992. Training secondorder recurrent neural networks using hints, in: Machine Learning Proceedings 1992. Elsevier, pp. 361–366.
 Paull and Unger (1959) Paull, M.C., Unger, S.H., 1959. Minimizing the number of states in incompletely specified sequential switching functions. IRE Transactions on Electronic Computers , 356–367.
 Ranganathan and Campbell (2003) Ranganathan, A., Campbell, R.H., 2003. A middleware for contextaware agents in ubiquitous computing environments, in: ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing, Springer. pp. 143–161.
 Read (2011) Read, C., 2011. BP and the Macondo spill: the complete story. Springer.
 Sæmundsson et al. (2018) Sæmundsson, S., Hofmann, K., Deisenroth, M.P., 2018. Meta reinforcement learning with latent variable gaussian processes. arXiv preprint arXiv:1803.07551 .
 Samek et al. (2017) Samek, W., Wiegand, T., Müller, K.R., 2017. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296 .
 Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P., 2015a. Trust region policy optimization, in: International Conference on Machine Learning, pp. 1889–1897.
 Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P., 2015b. Highdimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 .
 Skogestad and Postlethwaite (2007) Skogestad, S., Postlethwaite, I., 2007. Multivariable feedback control: analysis and design. volume 2. Wiley New York.

 Umeno et al. (1993) Umeno, T., Kaneko, T., Hori, Y., 1993. Robust servosystem design with two degrees of freedom and its application to novel motion control of robot manipulators. IEEE Transactions on Industrial Electronics 40, 473–485.
 Wang et al. (2019) Wang, T., Lu, W., Yan, Z., Liu, D., 2019. Dobnet: Actively rejecting unknown excessive time-varying disturbances. arXiv preprint arXiv:1907.04514 .
 Waslander and Wang (2009) Waslander, S., Wang, C., 2009. Wind disturbance estimation and rejection for quadrotor position control, in: AIAA Infotech@ Aerospace Conference and AIAA Unmanned… Unlimited Conference, p. 1983.
 Weiss et al. (2017) Weiss, G., Goldberg, Y., Yahav, E., 2017. Extracting automata from recurrent neural networks using queries and counterexamples. arXiv preprint arXiv:1711.09576 .
 Woolfrey et al. (2016) Woolfrey, J., Liu, D., Carmichael, M., 2016. Kinematic control of an autonomous underwater vehiclemanipulator system (auvms) using autoregressive prediction of vehicle motion and model predictive control, in: Robotics and Automation (ICRA), 2016 IEEE International Conference on, IEEE. pp. 4591–4596.
 Woolfrey et al. (2019) Woolfrey, J., Lu, W., Liu, D., 2019. A control method for joint torque minimization of redundant manipulators handling large external forces. Journal of Intelligent & Robotic Systems , 1–14.
 Xie and Guo (2000) Xie, L.L., Guo, L., 2000. How much uncertainty can be dealt with by feedback? IEEE Transactions on Automatic Control 45, 2203–2217.
 Yang et al. (2010) Yang, J., Li, S., Chen, X., Li, Q., 2010. Disturbance rejection of ball mill grinding circuits using dob and mpc. Powder Technology 198, 219–228.
 Yuan and Wu (2015) Yuan, C., Wu, F., 2015. Switching control of linear systems subject to asymmetric actuator saturation. International Journal of Control 88, 204–215.
 Zoph and Le (2016) Zoph, B., Le, Q.V., 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 .
 Zuo et al. (2010) Zuo, Z., Ho, D.W., Wang, Y., 2010. Fault tolerant control for singular systems with actuator saturation and nonlinear perturbation. Automatica 46, 569–576.