Ultrasound (US) imaging provides excellent temporal resolution and is a non-invasive modality, which together make it ideal for image guidance of procedures, including radiation therapy (RT). RT requires precise and conformal application of radiation dose in space, for which organ motion due to internal body movements (e.g. breathing) is a major challenge [1]. Tracking the treatment target location in US is a promising approach, provided that real-time and accurate tracking algorithms can be developed. Image-based tracking techniques proposed in the literature include block matching [2], optical flow [3, 4], and supporter models [5]. Nevertheless, these methods are either slow or often require substantial parameter tuning to optimize for a particular image and landmark appearance. We propose herein a solution based on Convolutional Neural Networks (CNNs), which are fast at inference time and whose adaptation to new data distributions is often straightforward given annotations. CNNs perform particularly well on classification tasks, but tracking with CNNs has so far remained more of a challenge. To the best of our knowledge, CNNs have only been applied to the ultrasound tracking problem in [6], as a metric learning framework that aims to minimize the distance between patches containing the same landmark at their center. However, this model is not fully convolutional and is therefore relatively slow. Furthermore, this approach fails to account for any temporal information, which is crucial when similar or repetitive structures exist, such as the many vessels in the liver.
Recently, fully-convolutional Siamese (SiameseFC) networks for similarity learning have been applied successfully to tracking problems in natural camera-image scenes [7, 8]. These methods aim to learn the similarity between a template image that contains a specific object of interest and a search image in which a similar-looking object is to be found. For this purpose, two identical CNNs are trained on their respective template and search images to represent arbitrary objects in an embedding that enables effective comparison. Cross-correlation is applied to produce a similarity score map, from which the maximum value is chosen as the predicted landmark location.
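As a minimal NumPy sketch of this scoring step (the embedding arrays below are random toy stand-ins for the CNN features of template and search image, not actual network outputs):

```python
import numpy as np

def xcorr_score_map(template_emb, search_emb):
    """Slide the template embedding over the search embedding and sum
    channel-wise products (cross-correlation), giving a 2D score map."""
    th, tw, _ = template_emb.shape
    sh, sw, _ = search_emb.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search_emb[i:i + th, j:j + tw] * template_emb)
    return out

# Toy check: the search embedding contains an exact copy of the
# template at offset (2, 3), where the score map then peaks.
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 4, 8))
search = 0.1 * rng.standard_normal((16, 16, 8))
search[2:6, 3:7] = template
scores = xcorr_score_map(template, search)
peak = np.unravel_index(np.argmax(scores), scores.shape)
```

In the actual network this correlation runs as a single convolutional layer on learned embeddings, so the whole search region is scored in one forward pass.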
The original SiameseFC [8] aims to detect a specific object or objects in a given image. We propose herein to adapt this method for finding targeted anatomical locations in consecutive frames. We achieve this by learning the image similarity between corresponding target locations via a customized ground-truth representation and loss definition. To promote temporal consistency, we augment the similarity maps with a location prior based on the entire preceding tracked path.
SiameseFC for similarity learning. Our SiameseFC was adapted from the method described in [8] for learning inter-frame similarity from annotated landmark locations, as illustrated in Fig. 1. An identical CNN $\varphi$ is applied to both the template image $z$ (which contains the object of interest) and the search image $x$ to extract representative embeddings that can be compared effectively. This comparison is implemented as a cross-correlation layer (denoted by the operator $\star$) between the sliding template embedding and the search-region embedding, which results in the similarity function
$$f(z, x) = \varphi(z) \star \varphi(x).$$
We defined the template as the landmark region in the first (annotated) frame. A sufficiently large search region around this point in each subsequent frame is then used as the search image. As the embedding $\varphi$, we used the convolutional stage of the AlexNet architecture [9]. In [8], the CNN employs a pixel-wise logistic loss on the similarity map, whose ground truth is generated by setting pixels within a given radius of the landmark to +1, and elsewhere to -1. We first attempted this approach in its native form, weighting each pixel by its class cardinality or with a Gaussian function centered at the manual annotation. Among the variations tested initially, only one particular combination was successful, where we
- generate the ground truth as a 2D Gaussian map centered at the ground-truth landmark location $p^\star$, i.e.
$$y(p) = \exp\left(-\frac{\|p - p^\star\|^2}{2\sigma^2}\right),$$
- and use an L2-loss to compare this with the tracking output $f$, i.e.
$$\ell = \sum_{p \in \mathcal{D}} \big(f(p) - y(p)\big)^2,$$
where $p$ corresponds to individual pixel locations in the score map $\mathcal{D}$.
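The soft ground truth and pixel-wise L2-loss described above can be sketched as follows; the map size, center, and sigma in the example are illustrative values, not the trained settings:

```python
import numpy as np

def gaussian_gt_map(shape, center, sigma):
    """2D Gaussian ground-truth map with value 1 at the landmark
    `center` (row, col) and a smooth falloff controlled by sigma."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def l2_loss(score_map, gt_map):
    """Pixel-wise L2 loss between predicted similarity map and GT."""
    return np.sum((score_map - gt_map) ** 2)

# Illustrative 33x33 map with the landmark at its center.
gt = gaussian_gt_map((33, 33), center=(16, 16), sigma=3.0)
```

Unlike the binary +1/-1 labels, this target decays smoothly with distance from the landmark, so near-misses are penalized less than distant false activations.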
At the training stage, we randomly pair annotated images from the same sequence, using frames that show the same annotated landmark at different time instances as template and search images. At the prediction stage, the template is taken from the annotation in the first frame, and all subsequent frames in the sequence serve as search images.
Temporal consistency model for landmark tracking. With the above approach, a similarity metric for finding a landmark in subsequent frames can be learned well. However, in this naive form it incorporates no temporal information. In preliminary results, we saw this as a main limitation, especially when similar anatomical features come into proximity and the tracker switches to such false targets. We propose to augment the similarity maps with a location prior, as illustrated in Fig. 2. To that end, we build a temporal consistency model $P_t$ based on the history of all predicted landmark positions in the preceding frames. It acts as a location prior and a confidence model of where a target location is most likely to be expected, regularizing the predicted similarity score $f_t$ and helping to avoid predicting landmarks in unlikely regions. At time $t$, we first update the location prior at all positions $p$ as a running average, i.e.
$$P_t(p) = \frac{(t-1)\,P_{t-1}(p) + g(p;\hat{p}_{t-1})}{t},$$
where $g(\cdot\,;\hat{p}_{t-1})$ is a map encoding the previously predicted position $\hat{p}_{t-1}$, e.g. a normalized Gaussian centered at it.
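One plausible NumPy sketch of such a running-average location prior; encoding each predicted position as a normalized Gaussian, and the width used, are assumptions for illustration:

```python
import numpy as np

def position_map(shape, loc, sigma=5.0):
    """Gaussian map encoding one predicted landmark position,
    normalized to sum to 1 (sigma in pixels is an assumed choice)."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (rows - loc[0]) ** 2 + (cols - loc[1]) ** 2
    g = np.exp(-d2 / (2.0 * sigma ** 2))
    return g / g.sum()

def update_prior(prior, prev_loc, t, shape):
    """Running average over the history of predicted positions
    (t >= 1): a sketch of the location-prior update."""
    g = position_map(shape, prev_loc)
    if prior is None:          # first prediction initializes the prior
        return g
    return ((t - 1) * prior + g) / t
```

Because every past prediction contributes with equal weight, the prior slowly concentrates around the region the landmark actually visits over breathing cycles.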
A temporal regularizer $w_t$ is then used to weight the similarity map, updating it with temporal consistency as
$$\tilde{f}_t(p) = (1 - w_t)\, f_t(p) + w_t\, P_t(p),$$
where the parameter $w_t$ determines the weighting of the location prior at time $t$. To avoid penalizing new positions during early tracking iterations, while the model is first being constructed, this weight is set to increase with time as
$$w_t = c\left(1 - e^{-t/\tau}\right),$$
where the constant $c$ balances the maximum contribution of $P_t$ in $\tilde{f}_t$, and the constant $\tau$ defines how fast $w_t$ grows. We set this growth rate empirically to approximately the duration of one breathing cycle.
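A sketch of this temporal-consistency update, assuming a saturating-exponential growth of the weight; the values of c and tau below are illustrative, not the tuned settings:

```python
import math
import numpy as np

def temporal_weight(t, c=0.5, tau=60.0):
    """Prior weight w_t: near 0 for early frames, saturating at c for
    t >> tau (tau on the order of one breathing cycle, in frames)."""
    return c * (1.0 - math.exp(-t / tau))

def regularize(score_map, prior, t, c=0.5, tau=60.0):
    """Blend the similarity map with the location prior according to
    the time-dependent weight w_t."""
    w = temporal_weight(t, c, tau)
    return (1.0 - w) * score_map + w * prior
```

Early on, the similarity map dominates and new positions are not penalized; once the prior is well populated, unlikely regions are suppressed by up to a factor c.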
Given anatomical constraints, the landmark is assumed not to move farther than a predefined distance $d$ between two frames. Accordingly, the maximum of $\tilde{f}_t$ within a radius $d$ of the previous location is chosen as the landmark location $\hat{p}_t$, i.e.
$$\hat{p}_t = \arg\max_{p\,:\,\|p - \hat{p}_{t-1}\| \leq d} \tilde{f}_t(p).$$
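The constrained maximum search can be sketched as below (radius in pixels; the map values are toy numbers):

```python
import numpy as np

def constrained_argmax(score_map, prev_loc, d):
    """Maximum of the (regularized) score map within radius d pixels
    of the previous landmark location prev_loc = (row, col)."""
    rows, cols = np.mgrid[0:score_map.shape[0], 0:score_map.shape[1]]
    near = (rows - prev_loc[0]) ** 2 + (cols - prev_loc[1]) ** 2 <= d ** 2
    masked = np.where(near, score_map, -np.inf)   # exclude far pixels
    i, j = np.unravel_index(np.argmax(masked), score_map.shape)
    return int(i), int(j)

# A strong distractor outside the radius is ignored in favor of the
# best location near the previous one.
m = np.zeros((20, 20))
m[2, 2] = 5.0        # distractor far from prev_loc
m[10, 11] = 1.0      # weaker peak near prev_loc
```

This hard gating complements the soft prior: even a high similarity score cannot pull the track beyond an anatomically plausible displacement.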
Implementation. Our method was implemented in TensorFlow [10], with network training and all experiments run on an Nvidia GeForce GTX TITAN X GPU. Based on initial empirical tests, we employed a batch size of 16 images and a fixed learning rate, and trained for 100 epochs with the Adam optimizer. The ground-truth Gaussian width $\sigma$, the constants $c$ and $\tau$ of the temporal weight, and the maximum inter-frame displacement $d$ (in mm) were likewise set empirically.
3 Results and Discussion
Dataset. We applied our method to 2D liver US sequences provided by the Challenge on Liver Ultrasound Tracking (CLUST) [11], which was organized to benchmark the localization of anatomical landmarks under respiratory motion in the liver. The dataset contains 2D liver US sequences from four different clinical centers, with durations ranging from 60 to 330 seconds, at varying spatial and temporal resolutions. Each sequence has one or more landmarks annotated in the first frame, which are to be tracked over the remaining frames. CLUST provides 24 sequences as the training set, from which we used 20 for training our CNN and the remaining 4 for validation. The test set consists of 40 sequences with a total of 85 landmarks for tracking, annotated on the initial frames. The corresponding annotations for the remaining frames are inaccessible to participants and are evaluated by the organizers upon submission. The mean, standard deviation, and 95th percentile of the errors are reported, calculated as the Euclidean distance between the manual ground truth and the predicted landmark at each annotated frame.
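These summary statistics can be computed as below; the toy coordinates and the pixel spacing used are only for illustration:

```python
import numpy as np

def clust_stats(pred, gt, mm_per_px=0.27):
    """Euclidean tracking errors in mm and the summary statistics
    reported by CLUST: mean, standard deviation, 95th percentile.
    The pixel spacing here is the resampled resolution we use."""
    diff = (np.asarray(pred) - np.asarray(gt)) * mm_per_px
    err = np.linalg.norm(diff, axis=1)
    return err.mean(), err.std(), np.percentile(err, 95)

# Toy check with two annotated frames (pixel coordinates).
mean, std, p95 = clust_stats([[13, 14], [20, 20]], [[10, 10], [20, 20]])
```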
Patches. To normalize anatomical feature sizes, we resampled all the images to 0.27 mm/pixel, which is the maximum resolution available in the dataset. The template image for each sequence was created by cropping a 127x127 region around the initial annotation in the first frame of the sequence. This region was considered to contain the spatial context required for tracking. Search images were cropped in all subsequent frames as a 407x407 region around the position of the initial landmark. This size includes a margin for the maximum liver motion possible due to anatomical constraints.
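The cropping of fixed-size patches might look as follows; zero-padding at image borders is our assumption for how out-of-image pixels are handled:

```python
import numpy as np

def crop_centered(img, center, size):
    """Crop a size x size patch centered at `center` (row, col),
    zero-padding where the patch extends past the image border;
    e.g. size=127 for templates and size=407 for search regions."""
    half = size // 2
    out = np.zeros((size, size), dtype=img.dtype)
    r0, c0 = center[0] - half, center[1] - half
    rs, re = max(r0, 0), min(r0 + size, img.shape[0])
    cs, ce = max(c0, 0), min(c0 + size, img.shape[1])
    out[rs - r0:re - r0, cs - c0:ce - c0] = img[rs:re, cs:ce]
    return out
```

Cropping all search regions around the initial landmark position (rather than the last prediction) keeps patch extraction simple and independent of tracking errors.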
Loss. In preliminary tests on our validation set of 4 sequences (two of which contain 2 annotations to track, thus 6 landmarks in total), the original SiameseFC design [8] with its binary ground truth and logistic loss function did not converge to viable similarity maps. Distant pixels were activated in the results, suggesting that the model could not discriminate the under-represented true-positive locations from the false-negative ones. Weighting the loss function by class cardinality to counter class imbalance did not solve the problem. The logistic loss appears suboptimal for US images, in which false-negative regions can easily have features very similar to the true-positive ones, contrary to the natural scenes used in the original SiameseFC. To address this, we applied an L2-loss to a probabilistic ground truth, a 2D Gaussian function centered at the desired landmark, which provides smooth and differentiable boundaries between classes. This resulted in substantially lower errors on our validation set, confirming that a similarity map for landmarks was learned.
Test set. We submitted our method for evaluation on the test set of the open CLUST challenge, obtaining an error of 1.34±2.57 mm (mean ± standard deviation). Despite the overall high accuracy, the errors are quite large for a few test sequences; see Fig. 3(b). A visual inspection reveals that these occur when very similar features lie in close proximity and the tracked location abruptly switches to the false one (see the example in Fig. 5). We could not find any frame in the validation set with this source of error. We believe this error is due to the lack of motion features in the CNN part of our method.
|Method|Mean (mm)|Std (mm)|95% (mm)|
|Shepard A., et al.|0.72|1.25|1.71|
|Williamson T., et al.|0.74|1.03|1.85|
|Hallack A., et al.|1.21|3.17|2.82|
|SiameseFC + regularization|1.34|2.57|2.95|
|Makhinya M. & Goksel O.|1.44|2.80|3.62|
|Ihle F. A.|2.48|5.09|15.13|
|Nouri D. & Rothberg A.|3.35|5.21|14.19|
In contrast to conventional SiameseFC, our proposed Gaussian soft ground truth with L2-loss is able to learn the US tracking problem despite similar-looking false-positive alternatives. Our regularization effectively penalizes misleading similarities with a location prior built from a relatively simple form of temporal information. In contrast, many methods that outperform ours in CLUST use sophisticated motion models or priors, such as motion dynamics through Kalman filtering, which could also be incorporated into our method in the future. Alternatively or in addition, a long short-term memory (LSTM) unit could be incorporated into our approach to integrate similarity maps throughout sequences.
Relevance. The achieved error ranges and 95th-percentile errors are well within typical RT treatment margins of 5 to 10 mm. Given that speed and adaptability to different datasets are of utmost importance for RT image guidance, we believe our CNN-based approach can provide an ideal solution, as CNNs run fast, especially on a GPU. The average inference time of our proposed method is 9.4 ms per frame, much shorter than the inter-frame interval of the US sequences employed.
Landmark tracking in US sequences is a challenging and important topic given its relevance in clinical settings. While previous methods achieve errors comparable to human observers, they are often slow and have difficulty generalizing to different data distributions. While CNNs can tackle these obstacles in other fields, their application to tracking in medical images has been under-explored. We propose herein an adaptation of SiameseFC to accurately learn similarity maps from a landmark in US to a search region, which we augment with a location prior for temporal consistency. With these contributions, competitive results are achieved using fast, extendable CNNs. Future directions include more sophisticated motion models and LSTMs for temporal consistency.
-  PJ Keall et al., "The management of respiratory motion in radiation oncology report of AAPM Task Group 76," Med Phys, vol. 33(10), pp. 3874–3900, 2006.
-  AJ Shepard et al., “A block matching based approach with multiple simultaneous templates for the real-time 2D ultrasound tracking of liver vessels,” Med Phys, vol. 44(11), pp. 5889–5900, 2017.
-  M Makhinya and O Goksel, “Motion tracking in 2D ultrasound using vessel models and robust optic-flow,” in Proc. MICCAI CLUST, 2015, p. 20.
-  T Williamson et al., “Ultrasound-based liver tracking utilizing a hybrid template/optical flow approach,” Int J Comp Asst Rad Surg, vol. 13(10), pp. 1605–1615, 2018.
-  E Ozkan, C Tanner, M Kastelic, O Mattausch, M Makhinya, and O Goksel, “Robust motion tracking in liver from 2D ultrasound images using supporters,” Int J Comp Asst Rad Surg, vol. 12(6), pp. 941–950, 2017.
-  D Nouri and A Rothberg, “Liver ultrasound tracking using a learned distance metric,” in Proc. MICCAI CLUST, 2015, pp. 5–12.
-  J Valmadre et al., “End-to-end representation learning for correlation filter based tracking,” in Procs IEEE CVPR, 2017, pp. 5000–5008.
-  L Bertinetto et al., “Fully-convolutional Siamese networks for object tracking,” in Procs ECCV, 2016, pp. 850–865.
-  A Krizhevsky, I Sutskever, and GE Hinton, "Imagenet classification with deep convolutional neural networks," in Adv Neural Inf Process Syst, 2012, pp. 1097–1105.
-  M Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015.
-  V De Luca et al., “Evaluation of 2D and 3D ultrasound tracking algorithms and impact on ultrasound-guided liver radiotherapy margins,” Med Phys, 2018.