
Learning robotic ultrasound scanning using probabilistic temporal ranking

by Michael Burke, et al.

This paper addresses a common class of problems where a robot learns to perform a discovery task based on example solutions, or human demonstrations. For example, consider the problem of ultrasound scanning, where the demonstration requires that an expert adaptively searches for a satisfactory view of internal organs, vessels or tissue and potential anomalies while maintaining optimal contact between the probe and surface tissue. Such problems are currently solved by inferring notional rewards that, when optimised for, result in a plan that mimics demonstrations. A pivotal assumption, that plans with higher reward should be exponentially more likely, leads to the de facto approach for reward inference in robotics. While this approach of maximum entropy inverse reinforcement learning leads to a general and elegant formulation, it struggles to cope with frequently encountered sub-optimal demonstrations. In this paper, we propose an alternative approach to cope with the class of problems where sub-optimal demonstrations occur frequently. We hypothesise that, in tasks which require discovery, successive states of any demonstration are progressively more likely to be associated with a higher reward. We formalise this temporal ranking approach and show that it improves upon maximum-entropy approaches to perform reward inference for autonomous ultrasound scanning, a novel application of learning from demonstration in medical imaging.





I Introduction

The ability to teach robotic agents using expert demonstration of tasks promises exciting developments across several sectors of industry. Indirect learning approaches formulate this as a search problem within a solution space of plans, where some notional (unknown) reward function induces the demonstrated behaviour. A key learning problem is then to estimate this reward function. This reward inference approach is commonly known as inverse reinforcement learning (IRL).

The IRL approach to apprenticeship learning [abbeel2004apprenticeship] aims to match the frequency (counts) of features encountered in the learner’s behaviour with those observed in demonstrations. This technique provides necessary and sufficient conditions when the reward function is linear in the features encoding execution states, but results in ambiguities in associating optimal policies with reward functions or feature counts. An elegant reformulation of this using the principle of maximum entropy resolves ambiguities and results in a single optimal stochastic policy. Methods for maximum-entropy IRL [ziebart2008maximum, wulfmeier2015maximum, levine2011nonlinear] identify reward functions using maximum likelihood estimation, under the assumption that the probability of seeing a given trajectory is proportional to the exponential of the total reward along a given path. Unfortunately, these methods are fundamentally frequentist and thus struggle to cope with repetitive sub-optimal demonstrations, as they assume that frequent appearance implies relevance: if a feature is seen repeatedly across demonstration trajectories, it is deemed valuable, as are policies that result in observations of these features. This makes these approaches unsuitable for a broad class of tasks that require exploratory actions or environment identification during demonstration, such as an expert using an ultrasound scan to locate a tumour.

Obtaining useful ultrasound images requires contact with a deformable body (see Fig. 1) at an appropriate position and contact force, with image quality affected by the amount of ultrasound gel between the body and the probe, and air pockets that obscure object detection. This means that human demonstrations are frequently and inherently sub-optimal, requiring that a demonstrator actively search for target objects, while attempting to locate a good viewpoint position and appropriate contact force. This class of demonstration violates many of the assumptions behind maximum entropy IRL.

Fig. 1: This work considers the task of learning to search for and capture an image of a target object (2.) suspended in a scattering material (3.) housed within a deformable container (4.). Our goal is to learn a reward signal from demonstrations that allows us to move an ultrasound sensor (1.) to positions that produce clear images of the target object. High quality ultrasound images (right) captured by a human demonstrator show high intensity contour outlines, centre the target object of interest, and generally provide some indication of target object size.

This work introduces a pairwise ranking model of reward that addresses these limitations. Instead of assigning reward based on the maximum entropy model, we attribute reward using a ranking model. Here, we assume that, in general, an expert acts to improve their current state. This means that it is likely that a state at a later stage in a demonstrated trajectory is more important than one seen at an earlier stage. We make use of this fact to generate feature pairs and train a probabilistic temporal ranking model from image pixels.

Experimental results show that this pairwise feature ranking successfully recovers reward maps from demonstrations in tasks requiring significant levels of exploration alongside exploitation (where maximum entropy IRL fails), and obtains similar performance to maximum entropy IRL when optimal demonstrations are available.

We illustrate the value of our approach in a challenging ultrasound scanning application, where demonstrations inherently contain a searching process, and show that we can train a model to find a tumour-like mass in an imaging phantom (an object that mimics the physical responses of biological tissue, and is commonly used in medical imaging to evaluate and analyse imaging devices). Ultrasound imaging is a safe and low cost sensing modality of significant promise for surgical robotics, and is already frequently used for autonomous needle steering and tracking [LIANG2010173, 6630795]. Autonomous visual servoing systems have been proposed in support of teleoperated ultrasound diagnosis [988970, 6224974], but these techniques rely on hand designed anatomical target detectors. The scanner introduced in this work is fully autonomous, and relies entirely on a reward signal learned from demonstration, in what we believe is a first for medical imaging. Importantly, the proposed pairwise ranking model provides more signal for learning, as a greater number of comparisons can be generated from each demonstration trajectory. This means that we can train a more effective prediction model from pixels than with maximum entropy IRL, which in turn opens up a number of avenues towards self-supervised learning for medical imaging and diagnosis.

In summary, the primary contributions of this paper are

  • a trajectory state ranking reward model that empirically performs better than maximum entropy approaches for a class of sub-optimal demonstrations requiring a degree of discovery in addition to a reward maximisation phase,

  • a method for autonomous ultrasound scanning using image sequence demonstrations.

II Related work

As mentioned previously, apprenticeship learning [abbeel2004apprenticeship] is an alternative to direct methods of imitation learning [Bagnell-2015-5921] or behaviour cloning, and is currently dominated by approaches making use of maximum entropy assumptions.

II-A Maximum entropy inverse reinforcement learning

Maximum entropy or maximum likelihood inverse reinforcement learning models the probability of a user preference for a given trajectory as proportional to the exponential of the total reward along the path [ziebart2008maximum],

$$p(\tau) \propto \exp\Big(\sum_{(s,a) \in \tau} r(s,a)\Big). \tag{1}$$

Here, $\tau$ denotes a trajectory, $s$ denotes a state, $a$ an action, and $r(s,a)$ the reward obtained for taking an action in a given state. It is clear that this reward model can be maximised by any number of reward functions. [levine2011nonlinear] use a Gaussian process prior to constrain the reward, while [wulfmeier2015maximum] backpropagate directly through the reward function using a deep neural network prior. Maximum entropy inverse reinforcement learning approaches are typically framed as iterative policy search, where policies are identified to maximise the reward model. This allows for the incorporation of additional policy constraints and inductive biases towards desirable behaviours, as in relative entropy search [boularias2011relative], which uses a relative entropy term to keep policies near a baseline, while maximising reward feature counts. Maximum entropy policies can also be obtained directly, bypassing reward inference stages, using adversarial imitation learning [ho2016generative, finn2016connection, fu2017learning, ghasemipour2019divergence], although reward prediction is itself useful for medical imaging applications and safety concerns limit online policy search here. In contrast, this paper focuses primarily on developing an improved reward model that serves as a replacement for the maximum entropy assumption. In light of this, we focus on reward inference directly, and leave policy search extensions using the proposed approach to future work.
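The maximum entropy assumption can be illustrated with a minimal sketch. It assumes a small, finite set of hypothetical candidate trajectories (the function name and the reward values are illustrative and not taken from the paper), and normalises the exponentiated total rewards so that they form a distribution.

```python
import math

def maxent_trajectory_probs(trajectory_rewards):
    """Probability of each candidate trajectory, proportional to exp(total reward)."""
    totals = [sum(step_rewards) for step_rewards in trajectory_rewards]
    largest = max(totals)  # subtract the maximum for numerical stability
    weights = [math.exp(total - largest) for total in totals]
    normaliser = sum(weights)
    return [weight / normaliser for weight in weights]

# Hypothetical per-step rewards for three candidate trajectories.
candidates = [
    [0.0, 0.5, 1.0],  # total reward 1.5
    [0.0, 0.0, 0.5],  # total reward 0.5
    [0.2, 0.1, 0.2],  # total reward 0.5
]
probs = maxent_trajectory_probs(candidates)
```

Under this model a trajectory whose total reward is one unit higher is e times more likely, which is the sense in which higher-reward plans are "exponentially more likely".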

Although maximum entropy IRL is ubiquitous, alternative reward models have been proposed. For example, angelov2019composing train a neural reward model using demonstration sequences, to schedule high level policies in long horizon tasks. Here, they capture overhead scene images, and train a network to predict a number between and , assigned in increasing order to each image in a demonstration sequence. This ranking approach is similar to the pairwise ranking method we propose, but, as will be shown in later results, is limited by its rigid assumption of linearly increasing reward. Majumdar-RSS-17 propose flexible reward models that explicitly account for human risk sensitivity. Time contrasted networks [sermanet2018time] learn disentangled latent representations of video using time as a supervisory signal. Here, time synchronised images taken from multiple viewpoints are used to learn a latent embedding space where similar images (captured from different viewpoints) are close to one another. This embedding space can then be used to find policies from demonstrations. Time contrasted networks use a triplet ranking loss, and are trained using positive and negative samples (based on frame timing margins).

II-B Preference-based inverse reinforcement learning

Preference-based ranking of this form is widely used in inverse reinforcement learning to rate demonstrations [wirth2017survey], and preference elicitation [braziunas2006preference] is a well established area of research. For example, brochu2010bayesian use Bayesian optimisation with a pairwise ranking model to allow users to procedurally generate realistic animations. sugiyama2012preference use preference-based inverse reinforcement learning for dialog control. Here, dialog samples are annotated with ratings, which are used to train a preference-based reward model. brown2019extrapolating make use of this approach to improve robot policies through artificial trajectory ranking using increasing levels of injected noise.

Unlike brown2019extrapolating, which uses preference ranking over trajectories, our work uses preference ranking within trajectories, under the assumption that a demonstrator generally acts to improve or maintain their current state. We use a Bayesian ranking model [burke2017rapid] that accounts for potential uncertainty in this assumption, and is less restrictive than the linear increasing model of angelov2019composing. Bayesian ranking models are common in other fields; for example, TrueSkill™ [herbrich2007trueskill] is widely used for player performance modelling in online gaming settings, but has also been applied to train image-based style classifiers in fashion applications [kiapour2014hipster] and to predict the perceived safety of street scenes using binary answers to the question “Which place looks safer?” [naik2014streetscore].

II-C Bayesian optimisation and Gaussian processes for control

Given an appropriate reward model, autonomous ultrasound scanning requires a policy that balances both exploration and exploitation for active viewpoint selection or informative path planning. Research on active viewpoint selection is concerned with agents that choose viewpoints which optimise the quality of the visual information they sense. Similarly, informative path planning involves an agent choosing actions that lead to observations which most decrease uncertainty in a model. Gaussian processes (GP) are useful models for informative path planning because of their inclusion of uncertainty, data-efficiency, and flexibility as non-parametric models.

binney2012branch use GPs with a branch and bound algorithm, while cho2018informative perform informative path planning using GP regression and a mutual information action selection criterion. More general applications of GPs to control include PILCO [deisenroth2011pilco], where models are optimised to learn policies for reinforcement learning control tasks, and the work of ling2016gaussian, which introduces a GP planning framework that uses GP predictions in H-stage Bellman equations.

These Bayesian optimisation schemes are well-established methods for optimisation of an unknown function, and have been applied to many problems in robotics including policy search [7989380], object grasping [yi2016active], and bipedal locomotion [calandra2014].

By generating policies dependent on predictions for both reward value and model uncertainty, Bayesian optimisation provides a mechanism for making control decisions that can both progress towards some task objective and acquire information to reduce uncertainty. GPs and Bayesian optimisation are often used together, with a GP acting as the surrogate model for a Bayesian optimisation planner, as in the mobile robot path planning approaches of [martinez2009bayesian] and [6907763]. Our work takes a similar approach, using GP-based Bayesian optimisation for path planning in conjunction with the proposed observation ranking reward model.

III Probabilistic temporal ranking

This paper incorporates additional assumptions around the structure of demonstration sequences, to allow for improved reward inference. We introduce a reward model that learns from pairwise comparisons sampled from demonstration trajectories. Here we assume that an observation or state seen later in a demonstration trajectory should typically generate greater reward than one seen at an earlier stage. Below, we first describe a fully probabilistic ranking model that allows for uncertainty quantification and inference from few demonstrations, before introducing a maximum likelihood approximation that can be efficiently trained in a fully end-to-end fashion, and is more suited to larger training sets.

III-A Fully probabilistic model

We build on the pairwise image ranking model of burke2017rapid, replacing pre-trained object recognition image features with a latent state, $x_t$, learned using a convolutional variational autoencoder (CVAE),

$$x_t \sim \mathcal{N}\big(\mu(z_t), \Sigma(z_t)\big), \tag{2}$$

that predicts mean, $\mu(z_t)$, and diagonal covariance, $\Sigma(z_t)$, for input observation $z_t$ captured at time $t$ (assuming image inputs of dimension $w \times h$).

Rewards are modelled using a Gaussian process prior,

$$\begin{bmatrix} r \\ r_t \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X,X) + \Sigma_n & K(X,x_t) \\ K(x_t,X) & K(x_t,x_t) \end{bmatrix}\right). \tag{3}$$

Here, we use $X$ and $r$ to denote states and reward pairs corresponding to training observations. $X$ is a matrix formed by vertically stacking latent training states, and $K(X,X)$ a covariance matrix formed by evaluating a Matern32 kernel function

$$k(x_i, x_j) = \left(1 + \frac{\sqrt{3}\,\lVert x_i - x_j \rVert}{l}\right)\exp\left(-\frac{\sqrt{3}\,\lVert x_i - x_j \rVert}{l}\right) \tag{4}$$

for all possible combinations of latent state pairs $x_i, x_j$, sampled from the rows of $X$. $l$ is a length scale parameter with a Gamma distributed prior, and $\Sigma_n$ is a diagonal heteroscedastic noise covariance matrix, with diagonal elements drawn from a Half Cauchy prior. At prediction time, reward predictions $r_t$ for image observations $z_t$ can be made by encoding the image to produce latent state $x_t$, and conditioning the Gaussian in Equation (3) [williams2006gaussian].

Using this model, the generative process for a pairwise comparison outcome, $g$, between two input observation rewards $r_{t_1}$ and $r_{t_2}$ at time steps $t_1$ and $t_2$, is modelled using a Bernoulli trial over the sigmoid of the difference between the rewards,

$$g \sim \mathrm{Ber}\big(\mathrm{Sig}(r_{t_2} - r_{t_1})\big). \tag{5}$$
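As a rough illustration of the two building blocks described above, the following sketch evaluates a Matern 3/2 kernel over a handful of latent states and computes the Bernoulli comparison probability from a reward difference. The latent vectors, the fixed length scale and the function names are assumptions made for illustration; in the model described here the length scale and noise terms carry priors and are inferred rather than fixed.

```python
import math

def matern32(x_i, x_j, length_scale=1.0):
    """Matern 3/2 kernel (unit variance) between two latent state vectors."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))
    scaled = math.sqrt(3.0) * distance / length_scale
    return (1.0 + scaled) * math.exp(-scaled)

def covariance_matrix(latent_states, length_scale=1.0):
    """Kernel evaluated for all pairs of rows in the stacked latent states."""
    return [[matern32(a, b, length_scale) for b in latent_states]
            for a in latent_states]

def comparison_probability(reward_later, reward_earlier):
    """Probability of outcome 1: sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(reward_later - reward_earlier)))

# Hypothetical 2D latent states: two similar images and one dissimilar image.
X = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]]
K = covariance_matrix(X, length_scale=1.0)
```

Nearby latent states receive a covariance close to one, so their rewards are tied together, while distant states are nearly independent.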


III-B Reward inference using temporal observation ranking

The generative model above is fit to demonstration sequences using automatic differentiation variational inference (ADVI) [kucukelbir2017automatic] by sampling a number of observation pairs from each demonstration sequence, which produce a comparison outcome $g = 1$ if $t_2 > t_1$, and $g = 0$ if $t_2 < t_1$.

Fig. 2: Time is used as a supervisory signal, by sampling image pairs at times $t_1$, $t_2$, and setting $g = 1$ if $t_2 > t_1$, and $g = 0$ otherwise.

Intuitively, this comparison test, which uses time as a supervisory signal (Fig. 2), operates as follows. Assume that an image captured at time step $t_2$ has greater reward than an image captured at $t_1$. This means that the sigmoid of the difference between the rewards is likely to be greater than 0.5, which leads to a higher probability of returning a comparison outcome $g = 1$. Importantly, this Bernoulli trial allows some slack in the model: when the sigmoid of the difference between the rewards is closer to 0.5 (that is, when the two rewards are similar), there is a greater chance that a comparison outcome of $g = 0$ is generated by accident. This means that the proposed ranking model can deal with demonstration trajectories where the reward is non-monotonic. Additional slack in the model is obtained through the heteroscedastic noise model, $\Sigma_n$, which also allows for uncertainty in inferred rewards to be modelled.
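The pair sampling and the resulting Bernoulli likelihood can be sketched as follows. This is a minimal illustration, assuming a single demonstration indexed by frame number and a hypothetical list of per-frame rewards; the helper names are not from the paper, and the variational inference step itself is omitted.

```python
import math
import random

def sample_temporal_pairs(num_frames, num_pairs, rng):
    """Sample frame index pairs and label them using time as supervision."""
    pairs = []
    for _ in range(num_pairs):
        t1, t2 = rng.sample(range(num_frames), 2)  # two distinct time steps
        outcome = 1 if t2 > t1 else 0
        pairs.append((t1, t2, outcome))
    return pairs

def comparison_log_likelihood(rewards, pairs):
    """Log-likelihood of the sampled outcomes under the sigmoid ranking model."""
    total = 0.0
    for t1, t2, outcome in pairs:
        p_one = 1.0 / (1.0 + math.exp(-(rewards[t2] - rewards[t1])))
        total += math.log(p_one) if outcome == 1 else math.log(1.0 - p_one)
    return total

rng = random.Random(0)
pairs = sample_temporal_pairs(num_frames=5, num_pairs=20, rng=rng)

# Hypothetical rewards: generally improving, with one dip (non-monotonic).
improving = [0.0, 0.4, 0.2, 0.9, 1.5]
worsening = list(reversed(improving))
ll_improving = comparison_log_likelihood(improving, pairs)
ll_worsening = comparison_log_likelihood(worsening, pairs)
```

A reward sequence that generally improves over time explains the time-derived labels better than its reverse, even though it contains a dip, which is the slack the Bernoulli trial provides.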

Inference under this model amounts to using the sampled comparison outcomes from a demonstration trajectory to find rewards that generate similar comparison outcomes, subject to the Gaussian process constraint that images with similar appearance should exhibit similar rewards. After inference, we make reward predictions by encoding an input image, and evaluating the conditional Gaussian process at this latent state.

III-C Maximum likelihood neural approximation

Given that learning from demonstration typically aims to require only a few trials, numerical inference under the fully Bayesian generative model described above is tractable, particularly if a sparse Gaussian process prior is used. However, in the case where a greater number of demonstrations or comparisons is available, we can perform maximum likelihood inference under the model above in an end-to-end fashion, using the architecture in Fig. 3 (see Appendix for model parameters and training details). Here, we can replace the Gaussian process with a single layer fully connected network (FCN), $r_\psi(x)$, with parameters $\psi$, since single layer FCNs are known to approximate Gaussian processes [neal1996priors], and minimise a binary cross entropy loss over the expected comparison outcome alongside a variational autoencoder (VAE) objective,

$$\mathcal{L} = -\big[g \log h + (1 - g)\log(1 - h)\big] + \sum_{z \in \{z_{t_1}, z_{t_2}\}} \Big(\mathrm{KL}\big(q_\theta(x \mid z)\,\|\,p(x)\big) - \mathbb{E}_{q_\theta(x \mid z)}\big[\log p_\phi(z \mid x)\big]\Big), \tag{6}$$

using stochastic gradient descent. Here, $\mathcal{L}$ denotes the overall loss, $p(x)$ is a standard normal prior over the latent space, $q_\theta(x \mid z)$ denotes the variational encoder, with parameters $\theta$, $p_\phi(z \mid x)$ represents the variational decoder, with parameters $\phi$, and $h = \mathrm{Sig}\big(r_\psi(x_{t_2}) - r_\psi(x_{t_1})\big)$ is the comparison output (the sigmoid of the reward difference). $g$ is a comparison outcome label and $z$ denotes a training sample image, with $x$ a sample from the latent space. Weight sharing is used for both the convolutional VAEs and FCNs. Once trained, the reward model is provided by encoding the input observation, and then predicting the reward using the FCN. This allows for rapid end-to-end training using larger datasets and gives us the ability to backpropagate the comparison supervisory signal through the autoencoder, potentially allowing for improved feature extraction in support of reward modelling. However, this comes at the expense of uncertainty quantification, which is potentially useful for the design of risk-averse policies that need to avoid regions of uncertainty.
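The comparison term of this objective can be sketched in isolation. The example below is a simplification under stated assumptions: the VAE terms are dropped, the latent states are hypothetical hand-made vectors rather than encoder outputs, the reward network is reduced to a linear map, and full-batch gradient descent stands in for stochastic gradient descent. It is intended only to show how the binary cross entropy over sigmoid reward differences is minimised.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def reward(weights, latent):
    """Linear stand-in for the reward network (an assumption of this sketch)."""
    return sum(w * x for w, x in zip(weights, latent))

def make_pairs(latents):
    """All ordered pairs of time steps, labelled 1 when the second is later."""
    return [(latents[t1], latents[t2], 1.0 if t2 > t1 else 0.0)
            for t1 in range(len(latents))
            for t2 in range(len(latents)) if t1 != t2]

def bce_loss(weights, pairs):
    """Mean binary cross entropy over comparison outcomes."""
    total = 0.0
    for x1, x2, g in pairs:
        h = sigmoid(reward(weights, x2) - reward(weights, x1))
        total -= g * math.log(h) + (1.0 - g) * math.log(1.0 - h)
    return total / len(pairs)

def train(pairs, dim, learning_rate=0.5, steps=200):
    """Full-batch gradient descent on the comparison loss."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x1, x2, g in pairs:
            h = sigmoid(reward(weights, x2) - reward(weights, x1))
            for k in range(dim):
                grad[k] += (h - g) * (x2[k] - x1[k]) / len(pairs)
        weights = [w - learning_rate * gk for w, gk in zip(weights, grad)]
    return weights

# Hypothetical latents: first feature trends upward in time, second alternates.
latents = [[t / 9.0, 0.1 if t % 2 == 0 else -0.1] for t in range(10)]
pairs = make_pairs(latents)
initial_loss = bce_loss([0.0, 0.0], pairs)
weights = train(pairs, dim=2)
final_loss = bce_loss(weights, pairs)
```

With zero weights every comparison is a coin flip, so the loss starts at log 2; training assigns positive weight to the feature that increases over the demonstration.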


Fig. 3: Neural (maximum likelihood) model. Sampled images are auto-encoded, and a reward network predicts corresponding rewards. The sigmoid of the difference between these rewards produces a comparison outcome probability. Weight sharing is indicated by colour. The network is trained using a joint variational autoencoder and binary cross entropy loss.

IV Gaussian process model predictive control

Once the reward model has been learned, we use it in a Bayesian optimisation framework, selecting an action $a_t$ that drives an agent to a desired state $s_t$, drawn from a set of possible states $S$ using an upper confidence bound objective function that seeks to trade off expected reward returns against information gain or uncertainty reduction. It should be noted that alternative control strategies could also be used with the proposed reward model.

Here, we learn a mapping between reward and end-effector positions using a surrogate Gaussian process model with a radial basis function kernel,

$$r(s) \sim \mathcal{GP}\big(0, k_{\mathrm{RBF}}(s, s')\big). \tag{7}$$

The Gaussian process is trained on pairs consisting of each visited state (in our experiments these are 3D Cartesian end-effector positions) and the corresponding reward predicted using the image-based reward model. Actions are then chosen to move to a desired state selected using the objective function,

$$s_t = \arg\max_{s \in S}\; \mu(s) + \beta\,\sigma(s). \tag{8}$$

Here, $\mu(s)$ and $\sigma(s)$ are the mean and standard deviation of the Gaussian process, and $\beta$ is a hyperparameter controlling the exploration-exploitation trade-off of the policy. This objective function is chosen in order to balance the competing objectives of visiting states that are known to maximise reward with gaining information about values of states for which the model is more uncertain. In our ultrasound imaging application, actions are linear motions to a desired Cartesian state.
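The state selection step can be sketched as below. The sketch assumes that the confidence bound objective takes the form of the surrogate mean plus a weighted standard deviation, and that the Gaussian process predictions for each candidate state are already available; the candidate positions and predicted values are hypothetical.

```python
def select_next_state(candidates, means, stds, beta):
    """Pick the candidate maximising mean + beta * standard deviation."""
    scores = [mean + beta * std for mean, std in zip(means, stds)]
    best = max(range(len(candidates)), key=lambda index: scores[index])
    return candidates[best], scores[best]

# Hypothetical 3D end-effector positions with surrogate predictions.
candidates = [(0.10, 0.00, 0.05), (0.12, 0.02, 0.05), (0.30, 0.10, 0.05)]
means = [0.8, 0.5, 0.1]   # predicted reward at each candidate
stds = [0.05, 0.10, 0.90]  # predictive uncertainty at each candidate

exploit_state, _ = select_next_state(candidates, means, stds, beta=0.0)
explore_state, _ = select_next_state(candidates, means, stds, beta=2.0)
```

With a weighting of zero the policy returns to the best known viewpoint, whereas a larger weighting sends it to the poorly explored position, which mirrors the exploration and exploitation phases discussed in the text.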

Fig. 4: Reward inference from optimal demonstrations (panels include (a) Ground truth and (d) GP-Increasing-IRL). Demonstration trajectories are marked in red, and the colour map indicates the reward for each grid position. Probabilistic temporal ranking and maximum entropy models have similar relative reward values, and policies trained using these rewards perform near identically. A linearly increasing reward model attributes reward more evenly across a demonstration, resulting in sub-optimal policy performance here.
Fig. 5: Reward inference from sub-optimal demonstrations (panels include (a) Ground truth and (d) GP-Increasing-IRL). Demonstration trajectories are marked in red, and the colour map indicates the reward for each grid position. Both the linearly increasing and maximum entropy reward models induce local maxima that result in sub-optimal policies.

V Results

The probabilistic temporal ranking reward model is evaluated using three tasks. First, we compare pairwise reward learning against existing inverse reinforcement learning approaches, using a grid world environment where demonstrations are provided by an optimal policy learned using value iteration. This is followed by experiments where policies are sub-optimal, requiring that an agent first search for a target location, before moving towards it.

Finally, we demonstrate the use of the proposed reward model on a much harder task, where we learn a reward model from kinesthetic demonstrations of ultrasound scanning (Fig. 1) using a deformable imaging phantom constructed using a soft plastic casing filled with ultrasound gel, and with a tumour-like mass (a roughly 30 mm x 20 mm blob of Blu tack original in a container of dimensions 200 mm x 150 mm x 150 mm) suspended within. Here, we show that the proposed reward model not only learns to associate target objects in an ultrasound scan (semi-supervised), but also to capture new scans, successfully moving a manipulator to a position and orientation with a contact force that results in a target scan containing a target object of interest.

V-A Grid world – optimal demonstrations

The first experiment considers a simple grid world, where a Gaussian point attractor is positioned at some unknown location. Our goal is to learn a reward model that allows an agent (capable of moving up, down, left and right) to move towards the target location.

We generate 5 demonstrations (grid positions) from random starting points, across 100 randomised environment configurations with different goal points. We then evaluate performance over 100 trials in each configuration, using a policy obtained through tabular value iteration on the reward model inferred from the 5 demonstrations. The demonstration policy is optimal, as the target location is known, so for all demonstrations the agent moves directly towards the goal, as illustrated for the sample environment configuration depicted in Fig. 4.

GP-PTR-IRL 9.51 4.92
GP-ME-IRL [levine2011nonlinear] 9.58 4.90
GP-Increasing-IRL [angelov2019composing] 7.39 5.72
TABLE I: Averaged total returns using VI policy trained using inferred reward from optimal demonstrations.

Table I shows the averaged total returns obtained across these trials and environments when rewards are inferred from optimal demonstrations using probabilistic temporal ranking (GP-PTR-IRL), a Gaussian process maximum entropy approach [levine2011nonlinear] (GP-ME-IRL) and an increasing linear model assumption [angelov2019composing] (GP-Increasing-IRL). We use PyMC3 [10.7717/peerj-cs.55] to build probabilistic reward models and ADVI [kucukelbir2017automatic] for model fitting. Value iteration is used to find a policy using the mean inferred rewards.
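The policy extraction step can be illustrated with a small tabular value iteration sketch. The grid size, discount factor, goal position and Gaussian-shaped reward map are assumptions chosen for illustration (standing in for the mean inferred reward), and the reward is taken to be received on entering a cell; none of these settings are taken from the experiments.

```python
import math

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, size):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < size and 0 <= new_col < size:
        return (new_row, new_col)
    return state

def q_value(state, action, reward, values, gamma):
    """Reward on arrival plus discounted value of the next cell."""
    row, col = step(state, action, len(reward))
    return reward[row][col] + gamma * values[row][col]

def value_iteration(reward, gamma=0.9, sweeps=200):
    size = len(reward)
    values = [[0.0] * size for _ in range(size)]
    for _ in range(sweeps):
        values = [[max(q_value((r, c), a, reward, values, gamma) for a in ACTIONS)
                   for c in range(size)] for r in range(size)]
    return values

def rollout(start, reward, values, gamma=0.9, steps=20):
    """Follow the greedy policy implied by the value table."""
    state, path = start, [start]
    for _ in range(steps):
        best = max(ACTIONS, key=lambda a: q_value(state, a, reward, values, gamma))
        state = step(state, best, len(reward))
        path.append(state)
    return path

# Hypothetical reward map: a Gaussian bump centred on the goal cell.
size, goal, width = 5, (3, 4), 1.5
reward_map = [[math.exp(-((r - goal[0]) ** 2 + (c - goal[1]) ** 2) / (2 * width ** 2))
               for c in range(size)] for r in range(size)]
values = value_iteration(reward_map)
path = rollout((0, 0), reward_map, values)
```

In this sketch the greedy policy moves to the goal and then remains there by pressing against the adjacent wall, so the total return depends directly on how well the reward map localises the goal.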

In the optimal demonstration case, policies obtained using both the maximum entropy and probabilistic temporal ranking approach perform equally well, although the pairwise ranking model assigns more neutral rewards to unseen states (Fig. 4). Importantly, as the proposed model is probabilistic, the uncertainty in predicted reward can be used to restrict a policy to regions of greater certainty by performing value iteration using an appropriate acquisition function (e.g. Equation 8) instead of the mean reward. This implicitly allows for risk-based policies: by weighting uncertainty higher, we could negate the neutrality of the ranking model (risk-averse). Alternatively, we could tune the weighting to actively seek out uncertain regions with perceived high reward (risk-seeking).

V-B Grid world – sub-optimal demonstrations

Our second experiment uses demonstrations that are provided by an agent that first needs to explore the environment, before exploiting it. Here, we use the Gaussian process model predictive control policy of Section IV to generate demonstrations, and repeat the experiments above. As shown in Fig. 5, this policy may need to cover a substantial portion of the environment before locating the target. Table II shows the averaged total returns obtained across trials and environments when rewards are inferred from sub-optimal demonstrations using the probabilistic temporal ranking, the Gaussian process maximum entropy approach and the increasing linear model assumption. Here, value iteration is used to find the optimal policy using the inferred rewards.

GP-PTR-IRL 7.42 4.82
GP-ME-IRL [levine2011nonlinear] 3.31 4.24
GP-Increasing-IRL [angelov2019composing] 2.77 4.30
TABLE II: Averaged total returns using VI policy trained using inferred reward from sub-optimal demonstrations.

In this sub-optimal demonstration case, policies obtained using the maximum entropy approach regularly fail, while the probabilistic temporal ranking continues to perform relatively well. Fig. 5 shows a sample environment used for testing. The sub-optimal behaviour of the exploring model predictive control policies used for demonstration can result in frequent visits to undesirable states, which leads to incorrect reward attribution under a maximum entropy model. Probabilistic temporal ranking avoids this by using the looser assumption that states generally improve over time.

Fig. 6: Kinesthetic demonstrations of ultrasound scanning involve an active search process, with the user exploring different positions and contact forces before identifying a target object of interest and an optimal viewpoint.
Fig. 7: Predicted reward trace for a held-out unseen ultrasound scanning sequence provided by a demonstrator.
Fig. 8: Phantom for scanning and model saliency map. The trained ranking model has learned to associate the target object (yellow heatmap) with high rewards.
Fig. 9: Final images obtained after policy convergence, for (a) the probabilistic temporal ranking reward and (b) the maximum entropy reward, clearly show that images obtained using probabilistic temporal ranking are much clearer and capture the target object far more frequently than the maximum entropy reward. Correct detections are circled in green, failures marked with a red cross. (Images are best viewed electronically, with zooming. See anonymous companion site for higher resolution images.)
Fig. 10: Reward traces show that the probabilistic temporal ranking reward is stable enough for the BO robot policy to explore the volume of interest (varying reward) before exploiting (stable reward). The maximum entropy reward is extremely noisy, indicating that it has failed to consistently associate high quality ultrasound image features with reward.

V-C Autonomous ultrasound scanning

For our final experiment, we demonstrate the use of probabilistic temporal ranking in a challenging ultrasound scanning application. Here, we capture 10 kinesthetic demonstrations (see Fig. 6) of a search for a target object using a compliant manipulator, and use only ultrasound image sequences (2D trapezoidal cross-sectional scans) to learn a reward model. Our goal is to use this reward model within a control policy that automatically searches for and captures the best image of a target object.

This task is difficult because it involves a highly uncertain and dynamic domain. Obtaining stable ultrasound images requires contact with a deformable imaging phantom at an appropriate position and contact force, with image quality affected by the thickness of the ultrasound gel between the phantom and the probe, while air pockets within the phantom object can obscure object detection. Moreover, air pockets and gel can move in response to manipulator contact. This means that kinesthetic demonstrations are inherently sub-optimal, as they require that a demonstrator actively search for target objects, while attempting to locate a good viewpoint position and appropriate contact force. As in real-world medical imaging scenarios, the demonstrator is unable to see through the phantom object from above, so demonstrations are based entirely on visual feedback from an ultrasound monitor.

Fig. 7 shows the predicted reward sequence for a sample expert demonstration trace held out from model training. It is clear that the ranking reward model captures the general improvements in image quality that occur as the demonstrator searches for a good scanning view, and that some searching is required before a good viewpoint is found. Importantly, the slack in the pairwise ranking model, combined with the model assumption that similar images result in similar rewards, allows for these peaks and dips in reward to be modelled.

We qualitatively assessed the image regions and features identified by the reward model using saliency maps (Fig. 8), which indicate that the proposed approach has learned to associate the target object with reward.

We compare probabilistic temporal ranking with a Gaussian process maximum entropy inverse reinforcement learning approach. For both models we use the same latent feature vector (extracted using a stand-alone variational autoencoder following the architecture in Fig. 3), and the same Bayesian optimisation policy to ensure a fair comparison. The acquisition function of the Bayesian optimisation policy is configured to be relatively conservative and to explore the entire search space (a predefined volume above the phantom) before exploiting, to ensure convergence to a good viewpoint. We compare the two approaches by evaluating the final image captured during scanning, and investigating the reward traces associated with each model.
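The explore-then-exploit behaviour described above can be illustrated with a minimal sketch. It assumes an upper confidence bound (UCB) acquisition over a Gaussian process with a squared-exponential kernel; the acquisition function actually used is not specified in this section, and the function names, `length_scale`, `noise_var` and `kappa` are hypothetical choices made for illustration only.

```python
import numpy as np

PRIOR_VAR = 1.0  # assumed prior signal variance of the GP


def rbf_kernel(a, b, length_scale=10.0):
    """Squared-exponential kernel between two sets of 3D positions (mm)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return PRIOR_VAR * np.exp(-0.5 * sq_dist / length_scale ** 2)


def gp_posterior(x_train, y_train, x_query, noise_var=1e-2):
    """GP posterior mean and standard deviation of reward at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = np.clip(PRIOR_VAR - np.sum(v ** 2, axis=0), 1e-12, None)
    return mean, np.sqrt(var)


def next_probe_position(x_train, y_train, candidates, kappa=3.0):
    """Pick the candidate maximising mean + kappa * std (UCB).

    A large kappa weights uncertainty heavily, so unvisited parts of
    the search volume are probed before the policy exploits.
    """
    mean, std = gp_posterior(x_train, y_train, candidates)
    return candidates[np.argmax(mean + kappa * std)]
```

With a single observation at one corner of the volume, this sketch proposes the far corner next, because the posterior uncertainty there is largest.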

Trials were repeated 15 times for each approach, alternating between the two, and ultrasound gel was replaced after 10 trials. Each trial ran for approximately 5 minutes, and was stopped when the robot pose had converged to a stable point, or after 350 frames had been observed. A high quality ultrasound scan is one in which the contours of the target object stand out as high intensity, where the object is centrally located in a scan, and imaged clearly enough to give some idea of the target object size (see Fig. 1).
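The stopping rule can be sketched as follows. Only the 350-frame cap comes from the text; the convergence test (all recent end-effector positions within a small radius of their mean), the `window` length and the `tol_mm` threshold are assumptions introduced for illustration.

```python
import numpy as np


def should_stop(poses, max_frames=350, window=10, tol_mm=1.0):
    """Return True once the trial should end.

    poses: sequence of 3D end-effector positions (mm), one per frame.
    window and tol_mm are hypothetical convergence settings.
    """
    poses = np.asarray(poses, dtype=float)
    if len(poses) >= max_frames:
        return True  # frame budget exhausted
    if len(poses) < window:
        return False  # not enough history to judge convergence
    recent = poses[-window:]
    # largest distance of any recent pose from the recent mean
    spread = np.max(np.linalg.norm(recent - recent.mean(axis=0), axis=1))
    return bool(spread < tol_mm)
```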

As shown in Fig. 9, the probabilistic temporal ranking model not only finds the target object in the phantom consistently, but also produces substantially clearer final images. The maximum entropy approach fails more frequently than the ranking approach and, when detection is successful, tends to find off-centre viewpoints and to image only small portions of the target object.

It is particularly interesting to compare the reward traces for the probabilistic temporal ranking model to those obtained using maximum entropy IRL when the Bayesian optimisation scanning policy is applied. Fig. 10 overlays the reward traces obtained for each trial. The maximum entropy reward is extremely noisy throughout trials, indicating that it has failed to adequately associate image features with reward. Similar images fail to consistently return similar rewards, so the Bayesian optimisation policy struggles to converge to an imaging position with a stable reward score. In contrast, the reward trace associated with the pairwise ranking model contains an exploration phase where the reward varies substantially as the robot explores potential viewpoints, followed by a clear exploitation phase where an optimal viewpoint is selected and a stable reward is returned.
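One way to quantify the difference between a noisy trace and an explore-then-exploit trace is a rolling standard deviation. The sketch below is an illustrative analysis only and is not part of the reported evaluation; `window` and `tol` are hypothetical settings.

```python
import numpy as np


def rolling_std(trace, window=20):
    """Standard deviation of the reward over each sliding window."""
    trace = np.asarray(trace, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(trace, window)
    return windows.std(axis=1)


def first_stable_frame(trace, window=20, tol=0.05):
    """Start index of the final run of low-variance windows, else None.

    A trace that never settles (as described for the maximum entropy
    reward) returns None; a trace with a clear exploitation phase
    returns the frame at which that phase begins.
    """
    stds = rolling_std(trace, window)
    unstable = np.flatnonzero(stds >= tol)
    if len(unstable) == 0:
        return 0
    start = unstable[-1] + 1
    return int(start) if start < len(stds) else None
```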

We believe that maximum entropy reward inference fails for two primary reasons. First, probabilistic temporal ranking produces substantially more training data, as each pair of images sampled (50 000 pairs) from a demonstration provides a supervisory signal. In contrast, the maximum entropy approach treats an entire trajectory as a single data point (10 trajectories), and thus needs to learn from far fewer samples. Second, the maximum entropy reward assumes that frequently occurring features are a sign of a good policy, which means that it mistakes undesirable frames seen during the scan’s searching process for frames of high reward.
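To make the difference in training set size concrete, the sketch below samples time-ordered frame pairs from a handful of demonstrations. The sampling scheme and the logistic pairwise loss are assumptions chosen for illustration, standing in for the ranking likelihood defined earlier in the paper; the trajectory lengths in the usage note are hypothetical.

```python
import numpy as np


def sample_temporal_pairs(trajectory_lengths, n_pairs, rng):
    """Sample (trajectory, earlier frame, later frame) index triples.

    Every trajectory must contain at least two frames.
    """
    lengths = np.asarray(trajectory_lengths)
    traj = rng.integers(0, len(lengths), size=n_pairs)
    n = lengths[traj]
    first = rng.integers(0, n)
    offset = rng.integers(1, n)  # non-zero, so the frames differ
    second = (first + offset) % n
    return traj, np.minimum(first, second), np.maximum(first, second)


def ranking_nll(reward_earlier, reward_later):
    """Mean negative log-likelihood that later frames outrank earlier ones.

    Uses a logistic model, -log sigmoid(r_later - r_earlier), as an
    assumed stand-in for the paper's ranking likelihood.
    """
    diff = np.asarray(reward_later) - np.asarray(reward_earlier)
    return float(np.mean(np.logaddexp(0.0, -diff)))
```

For example, `sample_temporal_pairs([300] * 10, 50_000, np.random.default_rng(0))` draws 50 000 ordered pairs from 10 trajectories of 300 frames each, whereas a trajectory-level method would see only 10 data points.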

Fig. 11 shows the predicted reward over the search volume (a 50 mm x 50 mm x 30 mm region above the imaging phantom) for a pairwise ranking trial, determined as part of the Bayesian optimisation search for images with high reward. Here, we capture images at 3D end-effector locations according to the Bayesian optimisation policy, and predict the reward over the space of possible end-effector states using (7). Importantly, the Gaussian process proxy function is able to identify an ultrasound positioning region associated with high reward. This corresponds to a position above the target object, where the contact force with the phantom is firm enough to press through air pockets, but light enough to maintain a thin, air-tight layer of gel between the probe and phantom.

(a) Scan volume (b) Reward map
Fig. 11: A visualisation of the reward map (b) inferred by our method (Section IV) during scanning shows that it attributes high rewards (green) when the probe is pressed against the container directly above the target. The scan volume, or the support of the reward map, is illustrated using a green wireframe in the setup (a).

When comparing with human scanning, a primary challenge we have yet to overcome is that of spreading ultrasound gel smoothly over a surface. Human demonstrators implicitly spread ultrasound gel evenly over a target as part of the scanning process so as to obtain a high quality image. The Gaussian process policy used in this work is unable to accomplish this, which means scans are still noisier than those taken by human demonstrators. Moreover, human operators typically make use of scanning parameters like image contrast, beam width and scanning depth, which we kept fixed for these experiments. Nevertheless, the results presented here show extensive promise for the development of targeted automatic ultrasound imaging systems, and open up new avenues towards semi-supervised medical diagnosis.

VI Conclusion

This work introduces an approach to inverse optimal control or reinforcement learning that infers rewards using a pairwise ranking approach. Here, we take advantage of the fact that demonstrations, whether optimal or sub-optimal, generally involve steps taken to improve upon an existing state. Results show that leveraging this to infer reward through a ranking model is more effective than common IRL methods in sub-optimal cases where demonstrations require a period of discovery in addition to reward exploitation.

This paper also shows how the proposed reward inference model can be used for a challenging ultrasound imaging application. Here, we learn to identify image features associated with target objects using kinesthetic scanning demonstrations that are sub-optimal, as they inevitably require a search for an object and position or contact force that returns a good image. Using this within a policy that automatically searches for positions and contact forces that maximise a learned reward allows us to automate ultrasound scanning. Videos and higher resolution scan images accompany the paper.


This work is supported by funding from the Turing Institute, as part of the Safe AI for surgical assistance project. We are particularly grateful to the Edinburgh RAD group and Dr Paul Brennan for valuable discussions and recommendations.


Maximum likelihood architecture parameters

Convolutional VAE
Batch size 128
Training epochs
RMS prop optimiser learning rate=0.001,
Input dims
Conv 1 kernel, relu, strides 2
Conv 2 kernel, relu, strides 2
Conv 3 kernel, relu, strides 2
Conv 4 kernel, relu, strides 2
Dense FC neurons, relu
Dense FC output (mean, variance)
Dense FC neurons, relu
Conv transpose 1 kernel, relu, strides 2
Conv transpose 2 kernel, relu, strides 2
Conv transpose 3 kernel, relu, strides 2
Conv transpose 4 kernel, relu, strides 2
Output dims
Reward predictor
Dense FC linear, output dims 1