Delineating Bone Surfaces in B-Mode Images Constrained by Physics of Ultrasound Propagation

01/07/2020 ∙ by Firat Ozdemir, et al. ∙ ETH Zurich 6

Bone surface delineation in ultrasound is of interest due to its potential in diagnosis, surgical planning, and post-operative follow-up in orthopedics, as well as the potential of using bones as anatomical landmarks in surgical navigation. We herein propose a method to encode the physics of ultrasound propagation into a factor graph formulation for the purpose of bone surface delineation. In this graph structure, unary node potentials encode the local likelihood for being a soft tissue or acoustic-shadow (behind bone surface) region, both learned through image descriptors. Pair-wise edge potentials encode ultrasound propagation constraints of bone surfaces given their large acoustic-impedance difference. We evaluate the proposed method in comparison with four earlier approaches, on in-vivo ultrasound images collected from dorsal and volar views of the forearm. The proposed method achieves an average root-mean-square error and symmetric Hausdorff distance of 0.28mm and 1.78mm, respectively. It detects 99.9 scanline error (distance to annotations) of 0.39mm.



There are no comments yet.


page 1

page 2

page 4

page 9

page 11

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

National Ambulatory Medical Care Survey and American Academy of Orthopaedic Surgeons report that each year around 6.8 million patients seek medical attention for bone fractures in the US, among which almost 900,000 require hospitalization. According to International Osteoporosis Foundation, approximately 1.6 million hip fractures occur worldwide and it is expected to increase 3 to 4 fold by 2050 [Gullberg1997, Cooper1992]. Additionally, vast proportion of vertebral fractures are unrecognized; up to 45% in the Americas and 29% in Europe [JBMR5650200402]. In Asia, underdiagnosis is even more significant due to the population mostly living in rural areas [mithal2009asian], leading to limitations in access to hospitals and necessary diagnosis equipment. Due to the high vulnerability to orthopedic conditions in daily life, there is an increasing interest and focus on improving orthopedic imaging and other complementary technologies.

Bone surface localization and segmentation are beneficial for many orthopedic procedures including diagnosis, surgical planning, intra-surgical guidance [ciganovic2018registration], and follow-up care. While accurate diagnosis is essential for best treatment, additional imaging contrast or visual anatomical aids would allow for better surgical planning and intra-operative guidance; both improving surgical outcomes and enabling less invasive surgical alternatives.

Although X-ray radiography or computed tomography (CT) can reveal orthopedic diagnostic information, they are also major sources of synthetic radiation [deGonzalez2004345]. Ionizing radiation should be particularly avoided for vulnerable patients, e.g., children and pregnant women [BrennerEllistonHallEtAl2001]. Furthermore, conventional clinical X-ray imaging shows insufficient contrast for certain muscoskeletal tissue conditions such as fibrosis [cady1983ultrasonic, Pillen2011MuscleUS]

. Ultrasound (US) may provide an additional or alternative imaging modality for diagnosis and treatment. Albeit its low signal-to-noise ratio (SNR), US is a non-ionizing, real-time, low-cost, portable and hence widely accessible imaging option.

In US imaging, acoustic impedance difference between layers of tissues cause US signal to reflect, inducing the hyperechoic bands seen in B-mode images, see Fig. 1. Since the impedance difference between the bones and their surrounding tissues is relatively large, almost all incident acoustic power reflects, hence casting a shadow behind the bones, with any tissue below this point practically invisible as illustrated in Fig. 1. The thickness of the observed hyperechoic band at a bone surface depends on multiple factors including the imaging wavelength and the orientation of the US transducer relative to the bone surface [jain2004understanding]. Although there are US radio-frequncy (RF) raw data based bone localization techniques, e.g. [wen2007Enhancement, hussain2014robust], RF data access is not commonly available in commercial US systems, let alone its realtime streaming. Therefore, most techniques in the literature utilize commonly available beamformed, demodulated, and dynamic-range adjusted brightness (B)-mode images.

Fig. 1: Sample US image showing (a) layers of hyperechoic bands between soft tissue layers and the hyperechoic appearance (red arrow) of the bone surface, (b) the shadow below the bone surface.

Given the increased B-mode intensity values at bone surfaces, earlier studies in localizing bone surfaces use gradient or edge based approaches [daanen2004fully]. Phase symmetry (PS) [kovesi1999image] is the aggregation of US images filtered with different orientation and scale log-Gabor kernels; and PS was shown in [hacihaliloglu2009bone] to greatly emphasize the hyperechoic bands visible at bone surfaces. Unfortunately, many hyperechoic bands between soft tissue layers show locally similar properties to bone surfaces, hence also yielding a high PS response, causing false positives. Despite its low specificity, PS has shown to provide satisfactory bone localization performance within manually selected regions of interest [hacihaliloglu2009bone]. Later works expanded PS to 3D US [hacihaliloglu2015automatic], automatized its parametrization [automaticHacihaliloglu2011]

, and used phase tensors for registering statistical shape models to 3D US 

[hacihaliloglu2013statistical]. With the aim of suppressing false detections, confidence in phase-symmetry (CPS) [quader2014confidence] was proposed by combining PS with attenuation and shadowing maps extracted from US image, resulting in so-called confidence maps [karamalis2012ultrasound]. In [Baka2016]

, a machine learning approach is suggested to estimate a likelihood map of bone surface, where the likelihood map is then used in the cost term for a dynamic programming delineation of the bone surface. Later, in 


additional features were extracted with a random forest classifier to segment bone surfaces. Further studies utilizing dynamic programming and combining phase tensors suggest additional improvements 

[hacihaliloglu2018localization]. In [villa2018fcn], a fully convolutional network (CNN) is trained on the confidence map [quader2014confidence]

, PS, and the B-mode ultrasound images to delineate bone surfaces. Later, additional deep learning based methods fused with phase tensors were proposed to accurately segment bone surfaces 

[alsinan2019automatic, puyang2018simultaneous]. In a recent study, state-of-the-art bone surface delineation accuracy is shown to be achieved with a standard CNN-based end-to-end learning approach [ciganovic2019deep], hence substantially reducing the inference time. It is noteworthy to emphasize on the influence of available datasets in works utilizing CNNs, where works with relatively smaller datasets ( images) report substantially inferior results when a CNN is trained using US images alone [alsinan2019automatic, puyang2018simultaneous].

In this work, we model the physics of US propagation and interaction with bone surfaces for delineating such surfaces. We encode such information in a graph formulation, where we define the bone surface as the posterior interface between two graph labels for soft tissue lying above and acoustic shadow cast below the bone surface. Pairwise potentials are used with directed edges to encode US propagation, where vertical edges encourage the label interface to align with the hyperechoic surfaces visible in the image, while enforcing an acoustically plausible order of labels. In particular, when going downward, once the shadowing is initiated, then no tissue shall be visible anymore. Horizontal edges encourage spatial smoothness of the resulting labels and hence bone delineations. Using a set of handcrafted features and annotated US images, we train binary classifiers for assigning unary potentials to the graph nodes at each image pixel. Solving this factor graph with directed edges, the bone surface is delineated as the border between tissue and shadow labels. We compare this with our implementations of [hacihaliloglu2013statistical] and [quader2014confidence], as well as with our earlier work [ozdemir2016graphical], which uses a separate additional label for the bone surface and jump edges at all graph nodes to enforce a given thickness of bone surface appearance. In contrast, we propose herein a novel bone surface definition that yields itself to a binary graph labeling problem, validated through extensive evaluations and comparisons. We also investigate an extended set of features, such as local texture, curvature, and Haar-like features.

Ii Material

For evaluations, 37 US images from a diverse range of anatomical regions, including forearm (radius, ulna), shoulder (acromion, humerus tip), leg (fibula, tibia, malleolus), hip (iliac crest), jaw (mandible, rasmus) and fingers (phalanges) from one volunteer were collected, which we will refer to as diverseUS dataset. In addition, 415 US images from the dorsal and volar sides of the left forearm from another volunteer were collected, which will be referred to as forearmUS dataset. Experimental data was acquired using a SonixTouch machine (Ultrasonix, Richmond, Canada) and an L14-5 linear transducer. The former dataset (diverseUS) was bilinearly resampled for isotropic one pixel per wavelength resolution within the scope of our earlier work [ozdemir2016graphical]. For both datasets, an imaging frequency of 6.66 or 10 Mhz was used depending on the body location. The collected B-mode images had a depth within  mm with a pixel resolution within  mm. Gold-standard (GS) bone surface delineations were annotated following the guidelines in [jain2004understanding]

. DiverseUS dataset additionally includes soft tissue and acoustic shadow annotations. Hence, diverseUS is used for all the training and hyperparameter optimization whereas forearmUS is used only for evaluations, without any further training or parameter tuning to also study generalizibility.

Fig. 2: Underneath the bone surfaces (red arrows), one can observe noise such as reverberations (blue arrows) or multiple reflections (green arrows) which may create ambiguity to infer shadowing and thus the bone surface above as the structure-of-interest.

Iii Methodology

When localizing and segmenting bone surfaces on US, there are three different structures of interest; Bone surface that is visible as a hyperechoic band perpendicular to the axis of US propagation, Shadow below a bone surface due to reflected and highly attenuated US signal, and remaining soft Tissues that are not inside the bone nor are shadowed. Below we first introduce the classifiers we use for determining the likelihood of the above classes per image region. Then, we present the proposed factor graph structure and potentially applicable formulations to impose the physics of US propagation.

Iii-a Feature-based Classification of B-mode Image Pixels

Although both bone surfaces and soft tissue interfaces may appear similarly in US images, near complete reflection and posterior attenuation of US signal leads to a shadowed region underneath bone surfaces. Its identification in the presence of noise may become ambiguous as shown in Fig. 2; for instance, shadow not appearing completely hypoechoic due to US artifacts such as multiple reflections and reverberations. We resolve such ambiguities in a probabilistic, learning-based framework based on relevant US image features. For this given binary classification problem, we use LogitBoost [Friedman98additivelogistic], which is a logistic additive method similar to AdaBoost, but requiring less computational power. As weak learners we use simple trees, leading to boosted decision forests for the local classification task.

Features are extracted at varying scales from 2D image patches as well as from 1D column vectors aligned with the propagation axis of US, as listed in Table 

I and shown for a sample set in Fig. 3. For this given binary classification problem, we use LogitBoost [Friedman98additivelogistic], which is a logistic additive method similar to AdaBoost, but requiring less computational power. As weak learners we use simple trees, leading to boosted decision forests for the local classification task. Features we utilize are described below.

            image                   confidence map            Rayleigh fit error            patch variance             patch median

ord. CW local stat.    ord. CW local stat.              LBP                     extended LBP         CW cumulative mean
    CW cumulative std.           curvature map              Haar features           Attenuation feature         Shadowing feature
Fig. 3: For an example US image, a representative subset of corresponding feature maps are shown.

Local Patch Statistics: Lower-order statistics such as mean and variance can separate T and S

regions thanks to the speckle noise and various tissue-layer reflections present within the former but not the latter. Additionally, entropy and higher-order statistics such as skewness and kurtosis may help with separating noise in

S regions from T

. Furthermore, we compute median, standard deviation, and energy (i.e., sum of squared intensities) within local patches. Eight features extracted at

patch scales yield a total of features governing local patch statistics.

Random-Walks based Statistics: Based on the random-walks framework [grady2006random], confidence maps () from [karamalis2012ultrasound] aim to quantify the confidence in the information seen in a US image at pixel resolution. Assuming the top row of a US image to be in contact with the transducer, each pixel’s virtual “strength” to reach the transducer is solved. Confidence maps are computed through random walks on an undirected graph with higher edge costs for diagonal and horizontal connections, since only a few transducer elements are assumed to contribute to a given scanline of the US image with diagonal signal transmission being less likely. Taking confidence maps as its basis, attenuation () and shadowing () metrics were proposed for enhancing bone surfaces in US [quader2014confidence]. Having confidence map for image , the attenuation metric at position is given by


where corresponds to the neighborhood patch around at scale and thus stands for the minimum of the confidence map value within . is the normalizing factor to map to [0, 1]. Similarly, the shadowing metric is computed by


where normalizes to [0, 1]. We include {, , } along with in our feature space.

Column-wise (CW) Local and Cumulative Statistics: Since signal propagation is in the axial plane along the scanlines, we run different kernels along this axis to emphasize characteristics of different interfaces at varying scanlengths. Using Gaussian kernels at 3 different scales, we compute smoothed 1D gradients of up to 4 orders at a given image location along (vertical) US propagation axis, which we call CW local statistics. These yield to = features from gradient convolutions. While lower order characteristics look similar, combination of higher-order gradients can be important for differentiating the classes.

Although local characteristics are important, cumulatively observing all pixels along the scanline below a given point can be instrumental for determining the label of that pixel. For example, commonly hypoechoic regions such as blood vessels can be distinguished from shadow by observing any non-hypoechoic location below that region. Therefore, we also apply cumulative scanline-long filters where all axial (vertical) information below a given pixel is summarized with column-wise features: sum, mean, and standard deviation of all pixels below the given pixel down to bottom of the image. These 3 features are called cumulative statistics.

Local Binary Patterns (LBP): LBP [ojala1996comparative] feature is known to be a powerful texture descriptor [1717463]. Aside from low computational complexity, its invariance to monotonic intensity changes (e.g., due to changing US imaging gain or time-gain-compensation settings) makes LBP attractive for US texture discrimination. LBP descriptors are typically represented in 8 bits, where the intensity of each pixel is thresholded with its 8-connected immediate neighbors in a consistent order to generate a binary vector. In [froba2004face], Modified Census Transform (MCT) was proposed for a robust threshold value as the mean of a neighborhood, adding invariance to illumination change and noise.

In addition to the pixel region used in LBP and MCT, we propose to also include every second pixel in each neighborhood like in a checker-board pattern. Then, for an extended LBP feature, the differences of pixel pairs that are symmetric with respect to the center pixel are thresholded by 0. For an extended MCT, the differences of these opposing pairs are thresholded with the difference between the center pixel intensity and the mean intensity of the 25 pixels in the given neighborhood. These features remain invariant to slow intensity changes while being sensitive to spikes at multiple orientations and widths. These small variations could be descriptive when discriminating bright reflections at bone surfaces from soft tissue and shadow regions. These 4 features above are extracted at a single scale.

Speckle Characterization: In B-mode images, even in the absence of specular reflections, coherent noise from scattering cause the typical US speckle texture. Fully-developed speckle intensities can be characterized by Rayleigh, Nakagami, or distributions of similar nature. Such characterization may help to differentiate tissue speckle “noise” (texture) from other potential noise source, e.g., in shadowed regions S, the latter likely containing different distributions due to different nature of origin, e.g., electric noise. Accordingly, we use the fitness of a patch histogram to Rayleigh distribution as a feature as follows:


where the first term is the normalized histogram of the pixel intensities within patch centered at , and the second term is the maximum-likelihood fitted distribution to this histogram. We implemented in Matlab as


where is the maximum likelihood estimation of the Rayleigh scale parameter for the given data [johnson2005univariate] and

is the Rayleigh probability density function 

[evans2000statistical] for the given parameter. Due to high computational demand, we used a single patch size, leading to a single feature.

Local Texture Response: Spatio-temporal information from an image can be extracted using Gabor filters tuned at different frequencies and spatial orientations of interest. For filter frequency and orientation , a set of Gabor filters


can be defined, where is the 2D zero-mean Gaussian functions characterized by different variances , and are the components of . A Gabor filtered image, called the local texture map, is included among our features.

Local Curvature Response: Image curvatures may be useful for discriminating soft tissue layers and bone surfaces. We include image curvature in 2D as a feature as defined in [schnabel1995active] as


where and are divergence and gradient operators, respectively.

Haar-Like Features: Haar-like features [viola2001rapid] are widely used in fast object detection tasks, thanks to their fast computation by exploiting integral image properties. We extract Haar-like features at 5 different scales in the form of center-surround features [1038171], where a smaller square is subtracted from a larger co-centric square.

Feature Types Feature Group Names Scale #
Intensity Pixel intensity 1px 1
Local patch statistics
Mean, median, variance, standard deviation
skewness, kurtosis, entropy, energy
3,6,12 24
Random Walks
Confidence Map [karamalis2012ultrasound],
Shadowing, Shadowing, Attenuation [quader2014confidence]
Column-wise (CW) local CW local statistics (order 5,11,31 12
& cumulative statistics Sum, mean, standard deviation - 3
Local Binary Patterns
(normal & extended) Local Binary Patterns
[ojala1996comparative] and
Modified Census Transform [froba2004face]
- 4
Speckle characteristics Rayleigh fit error 12 1
Local texture Gabor filter response (= =Hz) =mm 1
Local curvature Curvature response - 1
Haar-like features Center-surround Haar features 2,5,9,15,25,30px 5
TABLE I: List of 56 features extracted at different kernel space-scales for US transmit wavelength () and pixels (px).

Iii-B Graph Formalization of US Interaction with Bone Surface

Although the learned classifiers above may already suppress some false-positive detections and false-negative delineation gaps, these utilize potentially ambiguous and sometimes contradictory local image descriptions. Indeed, incorporating global image context can help in resolving local ambiguities, and hence is essential for resolving false bone surface detections. For instance, a tissue T appearance inside a shadow S region can only be explained and resolved by either concluding the appearance being an artifact in the shadow (hence correcting the apperance to label S) or realizing that the falsely-presumed bone surface supposedly casting that shadow is potentially a tissue interface instead (hence correcting the region lying above the actual shadow to label T). CPS is an approach to combine PS with global US properties through linear weighting. Although this may emphasize bone surfaces, the resulting segmentation performance was found in [ozdemir2016graphical] to be inferior to PS, due to CPS responding only when there already is a PS response. Alternatively, global contextual constraints can be incorporated naturally and more accurately through a graph formalization of this segmentation problem. Below we describe two alternative approaches; with and without an explicit bone surface label, where the nature and advantages are comparatively described and evaluated in the following sections.

Graphical Model and Pairwise Edge Potentials. We represent the US image as a Markov random field (MRF) with the following energy


where and are the unary nodal and pairwise edge cost functions and is the neighbourhood of node . A 4-connected edge configuration is used for the graph representation and connectivity of the pixels in the image as shown in Fig. 4(i).

For spatial regularization of labels, a trivial option is a Potts-like pairwise model, where any edge connecting nodes and will be penalized higher when the labels differ, i.e. . This is seen in Table II(a) when and . This table assumes directed edges. For MRFs with undirected edges, one can consider only the lower triangle of .

Fig. 4: Pairwise edge connection representations for (i) the Potts model, and our proposed (ii) directed lateral and scanline edges as well as (iii) bone factor graph (BFG).
(a) (b) (c)
T 1 1
B 1 1
S 1 1
T 0 0
B 0
S 0
TABLE II: Pairwise cost definitions for (a) horizontal H, (b) vertical V, and (c) jump J edges. discourages undesired label neighbourhood, which would disobey the physcially-expected TBS label order. ensures a bone surface before shadow. and ensure a limited thickness for the bone surface. parameters control the spatial regularization.

Iii-B1 US Propagation Edges

In order to enforce the strict propagation prior of US, lateral edges, also referred to as horizontal (H), and edges along the US scanline, similarly referred to as vertical (V), are herein proposed to have different pairwise energy definitions. For no bone in the scanline, all nodes should be tissue T. For a top-to-bottom traversal of the US image in depth, if and once a bone surface is encountered, the labels in this said scanline should strictly follow the order: T, B, and S, since after the bone surface only shadow can be visible. Such an order can be enforced in a graphical model, only using directed edges; i.e., allowing . For defining such directed edges, we use factor graphs [kschischang2001factor], with which we encode the above-mentioned scanline order as a propagation prior using direction-dependent weights between vertically-neighbouring nodes.

With the motivation above, a pairwise cost scheme for vertical neighbours shown in Table II(b) was proposed in [ozdemir2016graphical], summarized below for completeness. Large penalties (represented with ) prohibit undesired vertical neighbourhoods; i.e. preventing BT, SB, and ST assuming directed edges pointing the US propagation direction (downward). ensures a layer of bone surface B to exist at any TS transition. Parameters and allow to weight encountering a bone interface with respect to the continuation of the same label as shown in Table II(b). For the horizontal edges, no meaningful physics priors can be foreseen, but the anatomical connectivity can be utilized as a geometric prior for spatial regularization with Potts potentials where homogeneity, i.e.,  can be penalized lower when than otherwise, as seen in Table II(a). Here we defined a separate penalty for S and T pixels compared to for B, since a large class imbalance is expected. Note that in a typical image there would be large regions of shadow S and tissue T, in comparison to bone surface B pixels. Therefore, we use a larger horizontal weight for bone surface, which requires stronger agreement among unary potentials for continuity in the lateral direction.

Iii-B2 Jump Edge

Despite the above constraints, graph solutions with a surface stretching over several centimeters thickness (e.g., with two or more layers of tissue reflections being combined as a thick bone surface) are still valid solutions. In contrast, in B-mode images the bone surfaces appear as hyperechoic bands of finite thickness, due to the limited (pulse-length related) axial resolution of ultrasound among other factors [jain2004understanding]. Accordingly, we enforce bone surface B to vertically be a finite-width layer, the thickness of which can be defined as a function of US frequency . In order to enforce such given thickness, we use additional jump (J) edges that connect each node to another one precisely nodes below, as shown in Fig. 4(iii) and with the costs given in Table II(c). A large penalty (similarly to ) prohibits any direct transition from T to S within nodes, thus requiring a label B at an horizon of – effectively putting a minimum vertical thickness constraint of on the posterior of B. Additionally, prevents the two ends of a jump edge being both B, effectively constraining the maximum vertical thickness of any posterior B band to . These two above constraints combined ensure all detected bone surfaces B to be exactly pixels thick vertically. The cost term in Table II(c) acts similarly and concurrently to to ensure the logical order of TBS.

The factor-graph model combining J, V, and H edge definitions as shown in Fig. 4(iii) for bone surface segmentation in US images was called bone factor graph (BFG) in [ozdemir2016graphical]. This, however, has several shortcomings in design, paramterization, and computation: First, the bone thickness parameter above is hard-coded into the graph formulation and thus it needs to be empirically defined a priori, which is not a trivial task. Moreover, this surface thickness would in practice not be invariant, but would depend on depth from surface, focusing depth, surface inclination, aberrations, image post-processing operations, etc. Furthermore, the given jump edges above increase the total number of edges by . Indeed, given that the actual bone surface (not the US appearance) is a layer of infinitesimal thickness, any need for a separate label B to that end is actually questionable. To address these points we herein propose an alternative simplified graph structure, where the bone surface is encoded in the edges rather than a separate label, as described below.

Iii-B3 Bone-Responsive Edges

We modify the above model by removing the jump edges as well as discarding the label B, and instead encoding bone surface likelihood on vertical TS edges. This yields a simplified directed graph as seen in Fig. 4(ii). In this work, we use the pixel-wise PS response to define the bone-responsive edge potentials locally; i.e., the vertical edge potential is redefined as for T and S, where is a weighting constant as shown in Table III and


where is the phase-symmetry response at the image position of node , is the normalization factor to scale PS values across a given image to the range , and regulates an exponential decay. For horizontal edges we use the earlier Potts-like spatial smoothing model, as shown in Table III(a). Since the vertical edges are conditioned on a spatially varying function (PS, in this scenario) as in conditional random fields, we refer to our proposed model above as conditional bone graph (CBG). An overview of CBG framework is shown in Fig. 5.

Iii-C Implementation

PS is computed as the sum over different orientations and scales as follows [hacihaliloglu2009bone]


where is an orientation dependent noise threshold, , and and are, respectively, the real and imaginary components of image filtered (convolved) with the 2D Log-Gabor filter . In other words,


where is the imaginary unit,

represents the inverse Fourier transform operator, and

can be customized in spectral domain as


where parameters , , , and allow to define the filter orientation, center frequency, scaling factor, and angular bandwidth of the band-pass filter, respectively.

T 1
S 1
TABLE III: Pairwise cost definitions for horizontal H and vertical V edges for the proposed conditional bone graph (CBG). discourages undesired label neighbourhood, which would disobey the physcially-expected TS label order. parameters control the spatial regularization. embeds the local bone surface likelihood for regulating TS transitions.

Given these potential definitions, we seek the maximum a posteriori (MAP) solution for a given graphical model. Note that an exact inference may not always be possible, since submodularity cannot be guaranteed with conditional edge potentials and the directed edges prevent the use of many graph solvers. Accordingly, all graphical models given above were implemented in OpenGM and solved using Sequential Tree-Reweighted Message Passing (TRW-S), which uses a dual ascent algorithm and allows for an approximate inference regardless of problem definition.

Iv Experimental Setup and Evaluation

For evaluations and comparisons, we optimized the proposed algorithms in multiple aspects, including hyperparameter and feature selection (details are provided in the Appendix).

Fig. 5: Overview of the presented framework CBG, where the components within the red contour are valid only for the proposed BFG configuration.

Iv-a Performance Metrics

As gold-standard (GS) bone surface for evaluations, only the bone surfaces visible with sufficient confidence were annotated. Hence, non-annotated GS image columns (US scanlines) indicate no visible bone surface, either due to the actual in-existence of a bone or due to a potential ambiguity in its existence or confident localization. Based on such under-segmentation as GS, we compute common metrics such as root-mean-square error (RMSE), mean Euclidean distance (MED) and one-way Hausdorff distance (oHD) only from the GS with guaranteed bones to the automatic delineation with potential errors. Furthermore, in order to compute symmetric Hausdorff distance (sHD), we evaluate delineations only for US scanlines where GS annotations exist. Provided that Hausdorff distance can be too prohibitive, we also report -percentiles for oHD and sHD; denoted as oHD, and sHD, respectively.

Although RMSE, MED, oHD, and sHD measure different aspects of segmentation performance, we found them to be incomplete for assessing bone surface delineations with respect to GS annotations. With RMSE and oHD, major errors at some scanlines can be compensated with more accurate segmentations found at neighboring scanlines; and sHD may be unfair by introducing a very large penalty, e.g., for a single falsely-detected pixel at a far distance. To assess the segmentation performance for each scanline separately, we therefore utilize two additional metrics: For each scanline where both manually (GS) and automatically segmented bone surface exists, the average distance error to all predictions along the scanline is measured, and their mean over the image is reported as mean scanline error (MSE). To quantify false detections, i.e. false negative rate, we report the ratio of the number of scanlines with only GS but no delineation output to the total number of scanlines with GS annotations, called herein the miss percentage (MP).

Iv-B Parameter Optimization

After a set of initial empirical tests and in view of earlier works such as [hacihaliloglu2009bone, automaticHacihaliloglu2011], we fixed the parameters regarding Log-Gabor filter scale (), ratio , and ensemble tree size (). For the remaining 8 parameters of CBG we ran a grid-based parameter optimization with respect to bone surface delineation. These 8 parameters are the number of orientations in PS computation (), decay parameter (), angular frequency () which we reported in spatial units as , weighting coefficient of pairwise potentials (), and pairwise edge cost parameters (), and half the angle of coverage () of the Log-Gabor filter, i.e.


Based on preliminary experiments, viable parameter ranges shown in Table IV were used in a grid search. We used a single combined metric as optimization objective, namely the mean metric given by the mean of conventional metrics RMSE, oHD, and sHD, as these assess various aspects of a delineation.

Range [25, 75] px [, ] [1, 3] [0.01, 10] [0.1, 5] [0.1, 1] [0.1, 1] [0.1, ]
TABLE IV: Tested range of optimized parameters.

We followed a bootstrapping based approach to generate different subsets of samples from diverseUS to run cross-validation for parameter optimization. In a pseudo-random manner, we generated 5 different sets , such that each sample from diverseUS is left out once. Then, we ran a 6-fold cross-validation on each for the parameter optimization. For each parameter setting the average mean metric across all crossfolds was computed to identify the best parameter set that maximizes the average crossfold performance within .

Iv-C Method Standardization

Some methods and bone localization algorithms in the literature, such as PS, do not necessarily aim for a complete and accurate delineation of the bone surface, but rather act as a filter aiming to enhance the visibility of bone surface in an US image. Hence, their output may additionally contain tissue interfaces and/or they may return multiple-pixel thick bone surfaces, which would both skew the proposed evaluation metrics to their disadvantage. In our preliminary tests, such simple baseline approaches compared very unfavorably to our proposed methods for the given metrics. To enable a fair quantitative comparison and to maximize the delineation performance of these algorithms, we introduce and compare two standardization techniques as post-processing steps: For PS we apply

 [automaticHacihaliloglu2011], with which the detection with the highest response along each scanline is picked as the bone surface. We also employ where first a morphological thinning [thinningLam1992] is applied on any segmentation result, and then for each scanline the deepest (lowermost) bone surface detection is selected as the output delineation [ozdemir2016graphical]. We apply postprocessing to PS and CPS. For BFG, any bone surface response is always pixel thick, so for any hairline delineation we simply pick the mid-pixel of BFG output vertically. In CBG, no post-processing is needed since the bone surface is given implicitly as a thin layer at the interface between the regions with posterior T and S labels.

V Results

V-a Implementation

Computations were performed on an Intel i7 4.00GHz CPU with 16 GB available RAM. We refactored our earlier BFG implementation [ozdemir2016graphical] leading to speed improvements in feature extraction and classifier training.

From the results in Table V, it can be seen that half of the hyperparameters have the same optimal values across different sets . Hence, CBG hyperparameters are set accordingly as , , , , , , , and .

Given that the parameter optimization of all methods were conducted on the diverseUS dataset, test images from forearmUS dataset were bilinearly interpolated to match the pixel resolution of diverseUS. For the quantitative evaluations, we resize the segmentation results to match the initial GS resolution of forearmUS. This is done by bilinearly resizing the segmentation distance map and thresholding based on the forearmUS GS resolution.

Set RMSE oHD sHD mean
25 /3 3 0.01 5 0.1 0.1 100 0.56 1.59 3.37 1.84
25 /12 1 0.1 5 0.1 1 100 0.56 1.45 3.28 1.76
25 /3 3 0.01 5 0.1 0.5 100 0.42 1.35 2.77 1.51
25 /3 3 0.1 5 0.1 1 100 0.67 1.82 3.33 1.94
25 /12 1 1 5 0.1 0.5 100 0.79 1.96 2.88 1.88
TABLE V: For each subset , the optimal parameter set () and the average cross-validation scores in [mm] with the proposed CBG.
Methods [mm] [mm] [mm] [mm] [%] [mm] [mm] [mm]
PS [ozdemir2016graphical] 0.34 1.03 3.89 0.68 0.0 0.24 0.74 2.16
PS 3.31 6.76 13.75 8.48 0.0 2.59 6.14 13.22
CPS [quader2014confidence, ozdemir2016graphical] 1.74 3.58 4.79 1.22 40.76 1.39 3.17 2.25
BFG [ozdemir2016graphical] 0.56 1.47 2.31 0.68 0.02 0.42 1.14 1.76
CBG 0.28 0.75 1.78 0.39 0.10 0.20 0.56 1.26
UNet [ciganovic2019deep] 0.18 0.46 1.15 0.14
Inter-Annotator [ciganovic2019deep] 0.91 2.56 2.69 0.13
TABLE VI: Delineation performance measured by RMSE, one-way Hausdorff distance (oHD), symmetric Hausdorff distance (sHD), mean scanline error (MSE), miss percentage (MP), mean Euclidean distance (MED), and 95 percentile oHD (oHD) and sHD (sHD) on forearmUS dataset. UNet (that was trained on another forearm US dataset) and inter-annotator scores reported in [ciganovic2019deep] are also presented herein as a reference.

V-B Evaluation Results

We evaluated bone surface delineation performance of the compared methods on the forearmUS dataset. In Table VI we present a quantitative comparison of the proposed CBG with BFG [ozdemir2016graphical], PS [automaticHacihaliloglu2011], CPS [quader2014confidence], and UNet [ciganovic2019deep]. PS and CPS required no training and were parameter-optimized on diverseUS. BFG and CBG were trained and parameter-optimized with hold-out sets on diverseUS. UNet was trained on another dataset of only forearm images. Considering a fair comparison given a limited training set with potential domain-shift, we compare below the methods that can cope with such practical limitations.

It is seen that PS delineation results can be greatly improved by a simple post-processing step (), over its () alternative used in earlier works [automaticHacihaliloglu2011]. Indeed, for some metrics (RMSE, MED, oHD, and MP), PS even outperforms BFG on the forearm dataset, contrary to earlier findings in [ozdemir2016graphical]; although BFG still yields a substantially improved () sHD compared to PS. Using only the diverseUS training, our proposed method CBG achieves the best performance in all metrics except MP. The improvement of CBG over PS for these metrics range from (MED) to (sHD), average being . Similarly, CBG improves our earlier proposed method BFG on average for these metrics. For all metrics except MSE, both BFG and CBG achieve performance superior to inter-annotator variation reported for this dataset in [ciganovic2019deep]

. Since neural networks require a large training set and are susceptible to domain shift, the UNet would have been quite disadvantaged if trained on diverseUS, due to its limited 37 images, among which only few are of the forearm region. We therefore presented here UNet results trained with another forearm dataset of 1385 images from a different subject, with which UNet then expectedly outperforms all other methods substantially, in agreement with 

[ciganovic2019deep]. Given that it is not always practical to annotate and train on large sets of a targeted anatomy, UNet results reported in [ciganovic2019deep] are herein only presented as a reference. Other works in literature using a similar experiment setup (SonixTouch US machine) report UNet results of 2.43 mm [alsinan2019automatic] and 0.39 mm [puyang2018simultaneous] for MED metric when only B-mode US images were used as input, where training set sizes were 300 and 415 images, respectively.

In Fig. 6

, a collection of qualitative results are presented, where the bone surface delineations are demonstrated for images yielding the best, median, and worst scores for four quantitative metrics. Results for our proposed method CBG are on the odd columns. As a comparison and a baseline, bone surface delineation of PS

are shown in alternating columns. It is seen that PS results in many false positive detections, even within the visible bone surface scanlines (GS), which is reflected in its high sHD error. For instance, PS often detects a bone surface per scanline regardless the existence of a real bone, mainly due to hyperechoic bands within the soft tissue, e.g., subcutaneous tissue. Such false detections can be mostly avoided in BFG and CBG since to similar patterns are commonly visible and hence can be learned by the classifiers from diverseUS. Another common pitfall observed in PS is that detections can bounce abruptly between deep and shallow regions. While there can be multiple bone tissues as well as gaps in between them in a given US image that justify such jumps, a continuous surface is often expected throughout a single bone surface. Both BFG and CBG can often avoid such false detections thanks to globally optimized factor graph constraints. It is also seen in Fig. 6 first row that despite appearing very similar to bone surface, the connective tissue between two bones (radius and ulna) is often correctly identified as background in CBG thanks to graphical formulation. Upon inspection of the worst performing image samples, one can notice that false negatives for CBG occur when there is shadow appearing region with a smooth and continuous hyperechoic band above; e.g., first and last columns in Fig. 6. Based on visual inspection, metrics sHD and MSE were found to be more representative of perceived delineation quality, whereas oHD was not very correlated with visually pleasing bone surface delineations.

0.06 0.08 0.13 0.33 0.13 0.58 0.04 0.13
0.17 0.21 0.45 0.66 0.58 3.95 0.20 0.66
8.76 0.13 12.70 0.92 16.59 19.36 9.84 0.13

Fig. 6: Images yielding the best, median, and worst scores for four main evaluation metrics for our proposed method (CBG) and for PS for comparison. The red and blue overlays show the method segmentations and the gold standard annotations, respectively. The quantitative score of each image is displayed below itself.

Vi Conclusions

Earlier works on bone surface localization in US images such as using maximum phase symmetry (PS) [automaticHacihaliloglu2011] or intensity gradient [jain2004understanding]

often required many assumptions and manual interactions. Deep learning based techniques have already shown strong improvements in bone surface delineation problem. However, such methods require large annotated datasets, which are often not available for a targeted anatomical structure. For automatic bone surface delineations, we have herein proposed a method to incorporate US physics as constraints into image analysis. We have introduced a graph formalization, in which the US interaction with bone surfaces and resulting shadows are encoded as graph edge potentials. We have used an ensemble training method to compute unary potentials of labels for soft tissue and acoustic shadow. The transition between the two posterior labels localize the bone surface in a given image. In the future we aim to formulate our US physics constraints within the scope of recurrent neural networks. We believe imaging physics shall be incorporated more in image analysis tasks, and this work to be a step in that direction.

Vii Acknowledgements

We thank Dr. Andreas Schweizer for annotations, and the Swiss National Science Foundation and a Highly Specialized Medicine grant from Zurich Department of Health for funding.


Viii Appendix

Viii-a Feature Selection

Some of the features proposed earlier in Section III-A may not be necessary for a particular classification task, or when used together, some features may not bring significant gain to the learned model. This is not favourable due to additional computation time needed for each feature. There are various methods for feature selection for learning algorithms. Trying every feature subset combination is intractable, due to the large number of possible combinations ( for features). A common method is the greedy backward elimination, where the feature with the least influence on the trained model is removed from the feature space one at a time and a new model is trained, sequentially eliminating features until a single feature is left or another stopping criterion (e.g., a large performance drop) is reached. Out-of-bag (OOB) error of a classifier is a common metric for estimating the influence of each feature on the classification task.

Since some of our features are extracted efficiently at multiple scales at once, we group all scales of a given feature in the feature selection and determine their group performance as the maximum OOB importance in that feature group, which is called the group OOB importance. In each iteration of our feature elimination process, the remaining features (groups) are used to train both shadow and tissue classifiers on a chosen training set . Then, the feature group with the lowest group OOB importance is determined separately for each classifier for elimination from that classifier in the next iteration.

In Fig. 7, we show the performance drop in the mean metric value (mean) for the unused samples of diverseUS in subset when greedy feature selection is done for both shadow and soft tissue classifiers. Note that removed features vary between the two classifiers. The leftmost bar represents mean when all feature groups are used, and each consecutive bar indicates mean after the removal of the feature group in the axis label underneath that bar. In addition to the delineation-accuracy based approach for the feature subset selection, one can also check for any potential computation-time based feature (group) elimination for a large speed gain in feature extraction at the expense of minor loss of accuracy. In Fig. 8, we show the feature extraction time for both the most relevant 7 feature groups based on greedy feature selection (cf. Fig. 7) as well as all feature groups with more than  ms computation time. Surprisingly, the union of the most relevant 7 feature groups for the two classifiers (S & T) consist of 8 unique feature groups, which are confidence map, CW cumulative mean, CW local statistics, (log-)shadowing feature, patch entropy, energy and kurtosis, and Rayleigh fit error.

In Fig. 7, it can be observed that mean increases substantially when less than 7 feature groups are left in the feature selection, while it is relatively stationary with more groups. Following our intuition earlier, it is not surprising to see that the column-wise cumulative mean feature is among the most descriptive features for both of the classifiers albeit having very little computational footprint. Another significant feature for both classifiers is the log-Shadowing feature, which is derived from shadowing feature, used in earlier methods such as CPS [quader2014confidence]. Among patch statistics, path entropy is found to be the most relevant for the classifiers. This can be attributed to significant difference of the information content between the soft tissue and acoustic shadow. It can be observed in Fig. 8 (right) that Rayleigh fit error has the largest computational cost while also being of large importance for the classifiers, especially for the shadow label. This is followed by patch kurtosis and skewness features having the same feature extraction time, even though kurtosis is found to be substantially more important than skewness.

One can expect feature importance to vary slightly based on the dataset properties, however, the conducted greedy analysis as well as the average feature computation time can provide intuition for different experiments setups, e.g., applications prioritizing time vs. performance.

Fig. 7: Change in mean metric (mean of RMSE, oHD, sHD) for after sequentially removing the least out-of-bag important feature groups for LogitBoost trained classifiers for (top) shadow and (bottom) soft tissue label. Both labels have the same mean metric values by design, but the removal order of feature groups do change for the two classifiers.
Fig. 8: The pie chart (left) depicts the relative computation times of the union of the most important 7 feature groups for both classifiers S & T. Although it is among the top feature groups, computation time of CW local statistics is not displayed due to being negligibly low compared to the rest. Note that log-shadowing and shadowing features are grouped as (log-/Shadowing) as they can be computed together with little overhead. (Right) Mean computation times of feature groups with 30 ms computation.