I. Introduction
The ability to teach robotic agents using expert demonstration of tasks promises exciting developments across several sectors of industry. Indirect learning approaches formulate this as a search problem within a solution space of plans, where some notional (unknown) reward function
induces the demonstrated behaviour. A key learning problem is then to estimate this reward function. This
reward inference approach is commonly known as inverse reinforcement learning (IRL). The IRL approach to apprenticeship learning [abbeel2004apprenticeship]
aims to match the frequency (counts) of features encountered in the learner’s behaviour with those observed in demonstrations. This technique provides necessary and sufficient conditions when the reward function is linear in the features encoding execution states, but results in ambiguities in associating optimal policies with reward functions or feature counts. An elegant reformulation of this using the principle of maximum entropy resolves ambiguities and results in a single optimal stochastic policy. Methods for maximum entropy IRL
[ziebart2008maximum, wulfmeier2015maximum, levine2011nonlinear] identify reward functions using maximum likelihood estimation, under the assumption that the probability of seeing a given trajectory is proportional to the exponential of the total reward along that path. Unfortunately, these methods are fundamentally frequentist and thus struggle to cope with
repetitive suboptimal demonstrations, as they assume that frequent appearance implies relevance: if a feature is seen repeatedly across demonstration trajectories, it is deemed valuable, as are policies that result in observations of these features. This makes these approaches unsuitable for a broad class of tasks that require exploratory actions or environment identification during demonstration, e.g. an expert using an ultrasound scan to locate a tumour. Obtaining useful ultrasound images requires contact with a deformable body (see Fig. 1) at an appropriate position and contact force, with image quality affected by the amount of ultrasound gel between the body and the probe, and air pockets that obscure object detection. This means that human demonstrations are frequently and inherently suboptimal, requiring that a demonstrator actively search for target objects, while attempting to locate a good viewpoint position and appropriate contact force. This class of demonstration violates many of the assumptions behind maximum entropy IRL.
This work introduces a pairwise ranking model of reward that addresses these limitations. Instead of assigning reward based on the maximum entropy model, we attribute reward using a ranking model. Here, we assume that, in general, an expert acts to improve their current state. This means that it is likely that a state at a later stage in a demonstrated trajectory is more important than one seen at an earlier stage. We make use of this fact to generate feature pairs and train a probabilistic temporal ranking model from image pixels.
Experimental results show that this pairwise feature ranking successfully recovers reward maps from demonstrations in tasks requiring significant levels of exploration alongside exploitation (where maximum entropy IRL fails), and obtains similar performance to maximum entropy IRL when optimal demonstrations are available.
We illustrate the value of our approach in a challenging ultrasound scanning application, where demonstrations inherently contain a searching process, and show that we can train a model to find a tumour-like mass in an imaging phantom (an imaging phantom is an object that mimics the physical responses of biological tissue, and is commonly used in medical imaging to evaluate and analyse imaging devices). Ultrasound imaging is a safe and low-cost sensing modality of significant promise for surgical robotics, and is already frequently used for autonomous needle steering and tracking [LIANG2010173, 6630795]. Autonomous visual servoing systems have been proposed in support of teleoperated ultrasound diagnosis [988970, 6224974]
, but these techniques rely on hand-designed anatomical target detectors. The scanner introduced in this work is fully autonomous, and relies entirely on a reward signal learned from demonstration, in what we believe is a first for medical imaging. Importantly, the proposed pairwise ranking model provides more signal for learning, as a greater number of comparisons can be generated from each demonstration trajectory. This means that we can train a more effective prediction model from pixels than with maximum entropy IRL, which in turn opens up a number of avenues towards self-supervised learning for medical imaging and diagnosis.
In summary, the primary contributions of this paper are:

- a trajectory state ranking reward model that empirically performs better than maximum entropy approaches for a class of suboptimal demonstrations requiring a degree of discovery in addition to a reward maximisation phase, and

- a method for autonomous ultrasound scanning using image sequence demonstrations.
II. Related work
As mentioned previously, apprenticeship learning [abbeel2004apprenticeship]
is an alternative to direct methods of imitation learning
[Bagnell20155921] or behaviour cloning, and is currently dominated by approaches making use of maximum entropy assumptions.

II-A Maximum entropy inverse reinforcement learning
Maximum entropy or maximum likelihood inverse reinforcement learning models the probability of a user preference for a given trajectory as proportional to the exponential of the total reward along the path [ziebart2008maximum],
p(τ) ∝ exp( Σ_t r(s_t, a_t) )    (1)
Here, s_t denotes a state, a_t an action, and r(s_t, a_t) the reward obtained for taking an action in a given state. It is clear that this reward model can be maximised by any number of reward functions. levine2011nonlinear use a Gaussian process prior to constrain the reward, while wulfmeier2015maximum backpropagate directly through the reward function using a deep neural network prior. Maximum entropy inverse reinforcement learning approaches are typically framed as iterative policy search, where policies are identified to maximise the reward model. This allows for the incorporation of additional policy constraints and inductive biases towards desirable behaviours, as in relative entropy search [boularias2011relative], which uses a relative entropy term to keep policies near a baseline, while maximising reward feature counts. Maximum entropy policies can also be obtained directly, bypassing reward inference stages, using adversarial imitation learning [ho2016generative, finn2016connection, fu2017learning, ghasemipour2019divergence], although reward prediction is itself useful for medical imaging applications, and safety concerns limit online policy search here. In contrast, this paper focuses primarily on developing an improved reward model that serves as a replacement for the maximum entropy assumption. In light of this, we focus on reward inference directly, and leave policy search extensions using the proposed approach to future work.
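To make the maximum entropy model in Equation (1) concrete, the following minimal sketch (ours, not the authors' implementation) scores trajectories by total reward; because the partition function is shared, it cancels when comparing two trajectories:

```python
import math

def trajectory_return(rewards):
    """Total reward accumulated along a path, sum_t r(s_t, a_t)."""
    return sum(rewards)

def preference_probability(rewards_a, rewards_b):
    """Probability that trajectory A is preferred over trajectory B under
    the maximum entropy model, p(tau) ∝ exp(total reward): the shared
    partition function cancels in the ratio."""
    ra, rb = trajectory_return(rewards_a), trajectory_return(rewards_b)
    # Equivalent to exp(ra) / (exp(ra) + exp(rb)), computed stably.
    return 1.0 / (1.0 + math.exp(rb - ra))
```

Note how a trajectory that repeatedly visits high-reward features dominates the comparison, which is exactly the property that penalises exploratory demonstrations.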
Although maximum entropy IRL is ubiquitous, alternative reward models have been proposed. For example, angelov2019composing train a neural reward model using demonstration sequences, to schedule high-level policies in long-horizon tasks. Here, they capture overhead scene images, and train a network to predict a number between 0 and 1, assigned in increasing order to each image in a demonstration sequence. This ranking approach is similar to the pairwise ranking method we propose, but, as will be shown in later results, is limited by its rigid assumption of linearly increasing reward. MajumdarRSS17 propose flexible reward models that explicitly account for human risk sensitivity. Time-contrastive networks [sermanet2018time] learn disentangled latent representations of video using time as a supervisory signal. Here, time-synchronised images taken from multiple viewpoints are used to learn a latent embedding space where similar images (captured from different viewpoints) are close to one another. This embedding space can then be used to find policies from demonstrations. Time-contrastive networks use a triplet ranking loss, and are trained using positive and negative samples (based on frame timing margins).
II-B Preference-based inverse reinforcement learning
Preference-based ranking of this form is widely used in inverse reinforcement learning to rate demonstrations [wirth2017survey], and preference elicitation [braziunas2006preference] is a well-established area of research. For example, brochu2010bayesian use Bayesian optimisation with a pairwise ranking model to allow users to procedurally generate realistic animations. sugiyama2012preference use preference-based inverse reinforcement learning for dialog control. Here, dialog samples are annotated with ratings, which are used to train a preference-based reward model. brown2019extrapolating make use of this approach to improve robot policies through artificial trajectory ranking using increasing levels of injected noise.
Unlike brown2019extrapolating, which uses preference ranking over trajectories, our work uses preference ranking within trajectories, under the assumption that a demonstrator generally acts to improve or maintain their current state. We use a Bayesian ranking model [burke2017rapid] that accounts for potential uncertainty in this assumption, and is less restrictive than the linear increasing model of angelov2019composing. Bayesian ranking models are common in other fields – for example, TrueSkill^{TM} [herbrich2007trueskill]
is widely used for player performance modelling in online gaming settings, but has also been applied to train image-based style classifiers in fashion applications [kiapour2014hipster] and to predict the perceived safety of street scenes using binary answers to the question “Which place looks safer?” [naik2014streetscore].

II-C Bayesian optimisation and Gaussian processes for control
Given an appropriate reward model, autonomous ultrasound scanning requires a policy that balances both exploration and exploitation for active viewpoint selection or informative path planning. Research on active viewpoint selection is concerned with agents that choose viewpoints which optimise the quality of the visual information they sense. Similarly, informative path planning involves an agent choosing actions that lead to observations which most decrease uncertainty in a model. Gaussian processes (GPs) are useful models for informative path planning because of their inclusion of uncertainty, data-efficiency, and flexibility as nonparametric models.
binney2012branch use GPs with a branch and bound algorithm, while cho2018informative perform informative path planning using GP regression and a mutual information action selection criterion. More general applications of GPs to control include PILCO [deisenroth2011pilco], where models are optimised to learn policies for reinforcement learning control tasks, and the work of ling2016gaussian, which introduces a GP planning framework that uses GP predictions in H-stage Bellman equations.
These Bayesian optimisation schemes are well-established methods for optimisation of an unknown function, and have been applied to many problems in robotics, including policy search [7989380], object grasping [yi2016active], and bipedal locomotion [calandra2014].
By generating policies dependent on predictions for both reward value and model uncertainty, Bayesian optimisation provides a mechanism for making control decisions that can both progress towards some task objective and acquire information to reduce uncertainty. GPs and Bayesian optimisation are often used together, with a GP acting as the surrogate model for a Bayesian optimisation planner, as in the mobile robot path planning approaches of [martinez2009bayesian] and [6907763]. Our work takes a similar approach, using GP-based Bayesian optimisation for path planning in conjunction with the proposed observation ranking reward model.
III. Probabilistic temporal ranking
This paper incorporates additional assumptions about the structure of demonstration sequences, to allow for improved reward inference. We introduce a reward model that learns from pairwise comparisons sampled from demonstration trajectories. Here we assume that an observation or state seen later in a demonstration trajectory should typically generate greater reward than one seen at an earlier stage. Below, we first describe a fully probabilistic ranking model that allows for uncertainty quantification and inference from few demonstrations, before introducing a maximum likelihood approximation that can be efficiently trained in a fully end-to-end fashion, and is more suited to larger training sets.
III-A Fully probabilistic model
We build on the pairwise image ranking model of burke2017rapid, replacing pre-trained object recognition image features with a latent state, z_t, learned using a convolutional variational autoencoder (CVAE),

z_t ∼ N( μ_z(x_t), Σ_z(x_t) )    (2)

that predicts mean, μ_z, and diagonal covariance, Σ_z, for an input observation x_t captured at time t (assuming image inputs of fixed dimension).
Rewards are modelled using a Gaussian process prior,

r ∼ N( 0, K_ZZ + Σ_n )    (3)
Here, we use Z and r to denote the latent states and reward pairs corresponding to training observations. Z is a matrix formed by vertically stacking latent training states, and K_ZZ a covariance matrix formed by evaluating a Matern 3/2 kernel function

k(z_i, z_j) = (1 + √3 d/ℓ) exp(−√3 d/ℓ),  with d = ‖z_i − z_j‖,    (4)

for all possible combinations of latent state pairs (z_i, z_j) sampled from the rows of Z.
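The kernel evaluation above can be sketched as follows (a minimal illustration; the lengthscale value is arbitrary, and variable names are ours):

```python
import math

def matern32(z_i, z_j, lengthscale=1.0):
    """Matern 3/2 kernel between two latent states z_i and z_j."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_i, z_j)))
    s = math.sqrt(3.0) * d / lengthscale
    return (1.0 + s) * math.exp(-s)

def gram_matrix(Z, lengthscale=1.0):
    """Covariance matrix K_ZZ, evaluated over all pairs of rows of Z."""
    return [[matern32(z_i, z_j, lengthscale) for z_j in Z] for z_i in Z]
```

The resulting Gram matrix is symmetric with unit diagonal, and covariance decays with latent-space distance, encoding the assumption that similar-looking observations receive similar rewards.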
ℓ is a length scale parameter with a Gamma-distributed prior, and Σ_n is a diagonal heteroscedastic noise covariance matrix, with diagonal elements drawn from a Half-Cauchy prior. At prediction time, reward predictions for an image observation can be made by encoding the image to produce a latent state, and conditioning the Gaussian in Equation (3) [williams2006gaussian]. Using this model, the generative process for a pairwise comparison outcome, g, between two input observation rewards r_t1 and r_t2 at time steps t1 and t2, is modelled using a Bernoulli trial over the sigmoid of the difference between the rewards,
g ∼ Bernoulli( sigmoid(r_t2 − r_t1) )    (5)
III-B Reward inference using temporal observation ranking
The generative model above is fit to demonstration sequences using automatic differentiation variational inference (ADVI) [kucukelbir2017automatic] by sampling a number of observation pairs from each demonstration sequence, which produce a comparison outcome g = 1 if t2 > t1, and g = 0 if t2 < t1.
Intuitively, this comparison test, which uses time as a supervisory signal (Fig. 2), operates as follows. Assume that an image captured at time step t2 has greater reward than an image captured at t1. This means that the sigmoid of the difference between the rewards is likely to be greater than 0.5, which leads to a higher probability of returning a comparison outcome g = 1. Importantly, this Bernoulli trial allows some slack in the model: when the rewards are similar and the sigmoid of their difference is close to 0.5, there is a greater chance that a comparison outcome of g = 0 is generated by accident. This means that the proposed ranking model can deal with demonstration trajectories where the reward is non-monotonic. Additional slack in the model is obtained through the heteroscedastic noise model, Σ_n, which also allows for uncertainty in inferred rewards to be modelled.
Inference under this model amounts to using the sampled comparison outcomes from a demonstration trajectory to find rewards that generate similar comparison outcomes, subject to the Gaussian process constraint that images with similar appearance should exhibit similar rewards. After inference, we make reward predictions by encoding an input image, and evaluating the conditional Gaussian process at this latent state.
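The pair sampling and comparison generation described above can be sketched as follows (a minimal illustration under the stated assumptions; function names are ours):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_training_pairs(n_frames, n_pairs, rng):
    """Sample index pairs (t1, t2) with t1 < t2 from one demonstration.

    Time is the supervisory signal, so each pair is labelled g = 1: the
    later observation is assumed (noisily) to have greater reward.
    """
    return [tuple(sorted(rng.sample(range(n_frames), 2))) + (1,)
            for _ in range(n_pairs)]

def generate_outcome(r_t1, r_t2, rng):
    """Generative model of Equation (5): a Bernoulli trial over the
    sigmoid of the reward difference.  When the rewards are similar, the
    'wrong' outcome can occur by chance, giving the model its slack."""
    return 1 if rng.random() < sigmoid(r_t2 - r_t1) else 0
```

Inference reverses this process: rewards are adjusted so that the generated outcomes match the time-ordered labels, subject to the GP smoothness constraint.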
III-C Maximum likelihood neural approximation
Given that learning from demonstration typically aims to require only a few trials, numerical inference under the fully Bayesian generative model described above is tractable, particularly if a sparse Gaussian process prior is used. However, in the case where a greater number of demonstrations or comparisons is available, we can perform maximum likelihood inference under the model above in an end-to-end fashion, using the architecture in Fig. 3 (see Appendix for model parameters and training details). Here, we can replace the Gaussian process with a single-layer fully connected network (FCN), since single-layer FCNs are known to approximate Gaussian processes [neal1996priors], and minimise a binary cross entropy loss over the expected comparison outcome alongside a variational autoencoder (VAE) objective,
L = BCE(g, ĝ) − E_{q_φ(z|x)}[ log p_θ(x|z) ] + KL( q_φ(z|x) ‖ p(z) )    (6)
using stochastic gradient descent. Here, L denotes the overall loss, p(z) is a standard normal prior over the latent space, q_φ(z|x) denotes the variational encoder, with parameters φ, p_θ(x|z) represents the variational decoder, with parameters θ, and ĝ is the comparison output logit (sigmoid). g is a comparison outcome label and x denotes a training sample image, with z a sample from the latent space. Weight sharing is used for both the convolutional VAEs and FCNs. Once trained, the reward model is provided by encoding the input observation, and then predicting the reward using the FCN. This allows for rapid end-to-end training using larger datasets and gives us the ability to backpropagate the comparison supervisory signal through the autoencoder, potentially allowing for improved feature extraction in support of reward modelling. However, this comes at the expense of uncertainty quantification, which is potentially useful for the design of risk-averse policies that need to avoid regions of uncertainty.
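A minimal per-sample sketch of the combined objective in Equation (6), assuming a Gaussian decoder (so the reconstruction term becomes a squared error) and equal weighting between terms:

```python
import math

def bce(g, g_hat, eps=1e-9):
    """Binary cross entropy between comparison label g and prediction g_hat."""
    return -(g * math.log(g_hat + eps) + (1 - g) * math.log(1.0 - g_hat + eps))

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to the N(0, I) prior,
    summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def ptr_loss(g, g_hat, x, x_hat, mu, log_var):
    """Overall loss: comparison BCE + squared reconstruction error + KL
    (the squared error corresponds to a Gaussian decoder assumption)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return bce(g, g_hat) + recon + kl_to_standard_normal(mu, log_var)
```

Minimising this jointly drives the reward head to respect the temporal comparison labels while the encoder retains enough information to reconstruct observations.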
IV. Gaussian process model predictive control
Once the reward model has been learned, we use it in a Bayesian optimisation framework, selecting actions that drive an agent to a desired state s_d, drawn from a set of possible states S, using a confidence bound objective function that seeks to trade off expected reward returns against information gain or uncertainty reduction. It should be noted that alternative control strategies could also be used with the proposed reward model.
Here, we learn a mapping between reward and endeffector positions using a surrogate Gaussian process model with a radial basis function kernel,
r(s) ∼ GP( 0, k(s, s′) )    (7)
The Gaussian process is trained on pairs consisting of each visited state (in our experiments these are 3D Cartesian endeffector positions) and the corresponding reward predicted using the imagebased reward model. Actions are then chosen to move to a desired state selected using the objective function,
s_d = argmax_{s ∈ S} [ μ(s) + κ σ(s) ]    (8)
Here, μ(s) and σ(s) are the mean and standard deviation of the Gaussian process, and κ is a hyperparameter controlling the exploration-exploitation trade-off of the policy. This objective function is chosen in order to balance the competing objectives of visiting states that are known to maximise reward with gaining information about values of states for which the model is more uncertain. In our ultrasound imaging application, actions are linear motions to a desired Cartesian state.
V. Results
The probabilistic temporal ranking reward model is evaluated using three tasks. First, we compare pairwise reward learning against existing inverse reinforcement learning approaches, using a grid world environment where demonstrations are provided by an optimal policy learned using value iteration. This is followed by experiments where policies are suboptimal, requiring that an agent first search for a target location, before moving towards it.
Finally, we demonstrate the use of the proposed reward model on a much harder task, where we learn a reward model from kinesthetic demonstrations of ultrasound scanning (Fig. 1) using a deformable imaging phantom constructed from a soft plastic casing filled with ultrasound gel, with a tumour-like mass (a roughly 30 mm x 20 mm blob of Blu-tack in a container of dimensions 200 mm x 150 mm x 150 mm) suspended within. Here, we show that the proposed reward model not only learns to associate target objects in an ultrasound scan (semi-supervised), but also to capture new scans, successfully moving a manipulator to a position and orientation with a contact force that results in a scan containing the target object of interest.
V-A Grid world – optimal demonstrations
The first experiment considers a simple grid world, where a Gaussian point attractor is positioned at some unknown location. Our goal is to learn a reward model that allows an agent (capable of moving up, down, left and right) to move towards the target location.
We generate 5 demonstrations (grid positions) from random starting points, across 100 randomised environment configurations with different goal points. We then evaluate performance over 100 trials in each configuration, using a policy obtained through tabular value iteration using the reward model inferred from the 5 demonstrations. This policy is optimal, as the target location is known, so for all demonstrations the agent moves directly towards the goal, as illustrated for the sample environment configuration depicted in Fig. 4.
Method                                      Reward
GP-PTR-IRL                                  9.51 ± 4.92
GP-ME-IRL [levine2011nonlinear]             9.58 ± 4.90
GP-Increasing-IRL [angelov2019composing]    7.39 ± 5.72
Table I shows the averaged total returns obtained for trials in environments when rewards are inferred from optimal demonstrations using probabilistic temporal ranking (GP-PTR-IRL; we use PyMC3 [10.7717/peerjcs.55] to build probabilistic reward models and ADVI [kucukelbir2017automatic] for model fitting), a Gaussian process maximum entropy approach [levine2011nonlinear] (GP-ME-IRL) and an increasing linear model assumption [angelov2019composing] (GP-Increasing-IRL). Value iteration is used to find a policy using the mean inferred rewards.
In the optimal demonstration case, policies obtained using both the maximum entropy and probabilistic temporal ranking approaches perform equally well, although the pairwise ranking model assigns more neutral rewards to unseen states (Fig. 4). Importantly, as the proposed model is probabilistic, the uncertainty in predicted reward can be used to restrict a policy to regions of greater certainty by performing value iteration using an appropriate acquisition function (e.g. Equation 8) instead of the mean reward. This implicitly allows for risk-based policies: by weighting uncertainty higher, we could negate the neutrality of the ranking model (risk-averse). Alternatively, we could tune the weighting to actively seek out uncertain regions with perceived high reward (risk-seeking).
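The tabular value iteration used to turn inferred rewards into a policy can be sketched as follows (a standard textbook implementation; grid size, discount factor and boundary handling are illustrative assumptions):

```python
def value_iteration(reward, gamma=0.95, n_iters=200):
    """Tabular value iteration on a grid world with up/down/left/right
    actions; moves off the grid leave the agent in place."""
    h, w = len(reward), len(reward[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    V = [[0.0] * w for _ in range(h)]
    for _ in range(n_iters):
        V = [[reward[r][c] + gamma * max(
                  V[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
                  for dr, dc in moves)
              for c in range(w)] for r in range(h)]
    return V
```

Running this on a mean inferred reward map yields values that decay with distance from the goal, from which a greedy policy moves directly towards the target.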
V-B Grid world – suboptimal demonstrations
Our second experiment uses demonstrations that are provided by an agent that first needs to explore the environment, before exploiting it. Here, we use the Gaussian process model predictive control policy of Section IV to generate demonstrations, and repeat the experiments above. As shown in Fig. 5, this policy may need to cover a substantial portion of the environment before locating the target. Table II shows the averaged total returns obtained for trials in environments when rewards are inferred from suboptimal demonstrations using the probabilistic temporal ranking, the Gaussian process maximum entropy approach and the increasing linear model assumption. Here, value iteration is used to find the optimal policy using the inferred rewards.
Method                                      Reward
GP-PTR-IRL                                  7.42 ± 4.82
GP-ME-IRL [levine2011nonlinear]             3.31 ± 4.24
GP-Increasing-IRL [angelov2019composing]    2.77 ± 4.30
In this suboptimal demonstration case, policies obtained using the maximum entropy approach regularly fail, while the probabilistic temporal ranking continues to perform relatively well. Fig. 5 shows a sample environment used for testing. The suboptimal behaviour of the exploring model predictive control policies used for demonstration can result in frequent visits to undesirable states, which leads to incorrect reward attribution under a maximum entropy model. Probabilistic temporal ranking avoids this by using the looser assumption that states generally improve over time.
V-C Autonomous ultrasound scanning
For our final experiment, we demonstrate the use of probabilistic temporal ranking in a challenging ultrasound scanning application. Here, we capture 10 kinesthetic demonstrations (see Fig. 6) of a search for a target object using a compliant manipulator, and use only ultrasound image sequences (2D trapezoidal crosssectional scans) to learn a reward model. Our goal is to use this reward model within a control policy that automatically searches for and captures the best image of a target object.
This task is difficult because it involves a highly uncertain and dynamic domain. Obtaining stable ultrasound images requires contact with a deformable imaging phantom at an appropriate position and contact force, with image quality affected by the thickness of the ultrasound gel between the phantom and the probe, while air pockets within the phantom object can obscure object detection. Moreover, air pockets and gel can move in response to manipulator contact. This means that kinesthetic demonstrations are inherently suboptimal, as they require that a demonstrator actively search for target objects, while attempting to locate a good viewpoint position and appropriate contact force. As in realworld medical imaging scenarios, the demonstrator is unable to see through the phantom object from above, so demonstrations are based entirely on visual feedback from an ultrasound monitor.
Fig. 7 shows the predicted reward sequence for a sample expert demonstration trace held out from model training. It is clear that the ranking reward model captures the general improvements in image quality that occur as the demonstrator searches for a good scanning view, and that some searching is required before a good viewpoint is found. Importantly, the slack in the pairwise ranking model, combined with the model assumption that similar images result in similar rewards, allows for these peaks and dips in reward to be modelled.
We qualitatively assessed the image regions and features identified using the reward model using saliency maps (Fig. 8), which indicate that the proposed approach has learned to associate the target object with reward.
We compare probabilistic temporal ranking with a Gaussian process maximum entropy inverse reinforcement learning approach. For both models we use the same latent feature vector (extracted using a standalone variational autoencoder following the architecture in Fig. 3), and the same Bayesian optimisation policy to ensure a fair comparison. The acquisition function of the Bayesian optimisation policy is configured to be relatively conservative and to explore the entire search space (a predefined volume above the phantom) before exploiting, to ensure convergence to a good viewpoint. We compare the two approaches by evaluating the final image captured during scanning, and investigating the reward traces associated with each model. Trials were repeated 15 times for each approach, alternating between each, and ultrasound gel was replaced after 10 trials. Each trial ran for approximately 5 minutes, and was stopped when the robot pose had converged to a stable point, or after 350 frames had been observed. A high quality ultrasound scan is one in which the contours of the target object stand out as high intensity, where the object is centrally located in a scan, and imaged clearly enough to give some idea of the target object size (see Fig. 1).
As shown in Fig. 9, the probabilistic temporal ranking model consistently finds the target object in the phantom, and also produces substantially clearer final images. The maximum entropy approach fails more frequently than the ranking approach, and when detection is successful, tends to find off-centre viewpoints, and only images small portions of the target object.
It is particularly interesting to compare the reward traces for the probabilistic temporal ranking model to those obtained using maximum entropy IRL when the Bayesian optimisation scanning policy is applied. Fig. 10 overlays the reward traces obtained for each trial. The maximum entropy reward is extremely noisy throughout trials, indicating that it has failed to adequately associate image features with reward. Similar images fail to consistently return similar rewards, so the Bayesian optimisation policy struggles to converge to an imaging position with a stable reward score. In contrast, the reward trace associated with the pairwise ranking model contains an exploration phase where the reward varies substantially as the robot explores potential viewpoints, followed by a clear exploitation phase where an optimal viewpoint is selected and a stable reward is returned.
We believe that maximum entropy reward inference fails for two primary reasons. First, probabilistic temporal ranking produces substantially more training data, as each pair of images sampled (50 000 pairs) from a demonstration provides a supervisory signal. In contrast, the maximum entropy approach treats an entire trajectory as a single data point (10 trajectories), and thus needs to learn from far fewer samples. Moreover, the maximum entropy reward assumes that frequently occurring features are a sign of a good policy, which means that it mistakenly treats undesirable frames seen during the scan’s searching process as frames of high reward.
Fig. 11 shows the predicted reward over the search volume (a 50 mm x 50 mm x 30 mm region above the imaging phantom) for a pairwise ranking trial, determined as part of the Bayesian optimisation search for images with high reward. Here, we capture images at 3D end-effector locations according to the Bayesian optimisation policy, and predict the reward over the space of possible end-effector states using (7). Importantly, the Gaussian process proxy function is able to identify an ultrasound positioning region associated with high reward. This corresponds to a position above the target object, where the contact force with the phantom is firm enough to press through air pockets, but light enough to maintain a thin, airtight layer of gel between the probe and phantom.
Fig. 11: (a) Scan volume, (b) Reward map.
When comparing with human scanning, a primary challenge we have yet to overcome is that of spreading ultrasound gel smoothly over a surface. Human demonstrators implicitly spread ultrasound gel evenly over a target as part of the scanning process so as to obtain a high quality image. The Gaussian process policy used in this work is unable to accomplish this, which means scans are still noisier than those taken by human demonstrators. Moreover, human operators typically make use of scanning parameters like image contrast, beam width and scanning depth, which we kept fixed for these experiments. Nevertheless, the results presented here show extensive promise for the development of targeted automatic ultrasound imaging systems, and open up new avenues towards semisupervised medical diagnosis.
VI. Conclusion
This work introduces an approach to inverse optimal control or inverse reinforcement learning that infers rewards using a pairwise ranking approach. Here, we take advantage of the fact that demonstrations, whether optimal or suboptimal, generally involve steps taken to improve upon an existing state. Results show that leveraging this to infer reward through a ranking model is more effective than common IRL methods in suboptimal cases where demonstrations require a period of discovery in addition to reward exploitation.
This paper also shows how the proposed reward inference model can be used for a challenging ultrasound imaging application. Here, we learn to identify image features associated with target objects using kinesthetic scanning demonstrations that are suboptimal, as they inevitably require a search for an object and a position or contact force that returns a good image. Using this within a policy that automatically searches for positions and contact forces that maximise a learned reward allows us to automate ultrasound scanning (see https://sites.google.com/view/ultrasoundscanner for videos and higher resolution scan images).
Acknowledgments
This work is supported by funding from the Turing Institute, as part of the Safe AI for surgical assistance project. We are particularly grateful to the Edinburgh RAD group and Dr Paul Brennan for valuable discussions and recommendations.
Appendix
Maximum likelihood architecture parameters
Convolutional VAE
  Batch size          128
  Training epochs     100
  RMS prop optimiser  learning rate = 0.001
  Input dims
Encoder
  Conv 1
  Conv 2              kernel, relu, strides 2
  Conv 3              kernel, relu, strides 2
  Conv 4              kernel, relu, strides 2
  Dense FC            neurons, relu
  Dense FC            output (mean, variance)
Decoder
  Dense FC            neurons, relu
  Conv transpose 1    kernel, relu, strides 2
  Conv transpose 2    kernel, relu, strides 2
  Conv transpose 3    kernel, relu, strides 2
  Conv transpose 4    kernel, relu, strides 2
  Output dims
Reward predictor
  Dense FC            linear, output dims 1