Discovering Clinically Meaningful Shape Features for the Analysis of Tumor Pathology Images

12/09/2020 ∙ by Esteban Fernández Morales, et al. ∙ The University of Texas at Dallas 0

With the advanced imaging technology, digital pathology imaging of tumor tissue slides is becoming a routine clinical procedure for cancer diagnosis. This process produces massive imaging data that capture histological details in high resolution. Recent developments in deep-learning methods have enabled us to automatically detect and characterize the tumor regions in pathology images at large scale. From each identified tumor region, we extracted 30 well-defined descriptors that quantify its shape, geometry, and topology. We demonstrated how those descriptor features were associated with patient survival outcome in lung adenocarcinoma patients from the National Lung Screening Trial (n=143). Besides, a descriptor-based prognostic model was developed and validated in an independent patient cohort from The Cancer Genome Atlas Program program (n=318). This study proposes new insights into the relationship between tumor shape, geometrical, and topological features and patient prognosis. We provide software in the form of R code on GitHub:



There are no comments yet.


page 4

page 12

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Lung cancer is the leading cause of cancer death for both females and males (siegel2020cancer). Lung adenocarcinoma (ADC) is heterogeneous in morphological features, highly volatile in prognosis, and compromises half of all lung cancer cases (matsuda2015morphological; wang2018comprehensive). While mortality rates have rapidly declined, lung cancer in 2017 has caused more deaths than breast, prostate, colorectal, and brain cancer combined (siegel2020cancer).

With the advanced imaging technology, digital hematoxylin and eosin (H&E)-stained pathology imaging of tumor tissue slides is becoming a routine clinical procedure for cancer diagnosis. This process produces massive imaging data that capture histological details in high spatial resolution. Recent developments in deep-learning methods have enabled us to automatically detect and characterize the tumor regions in pathology images on a large scale (wang2018comprehensive). Through these developments, wang2018comprehensive discovered a relationship between tumor shape and patient survival outcome in lung cancer patients. Although numerous studies have associated morphological features with patient prognosis (wang2014novel; yu2016predicting; luo2017comprehensive), this study was the first to incorporate shape into a prognostic model. Furthermore, this association reflects multiple studies in brain and breast cancer (kilday1993classifying; pohlman1996quantitative), though further systemic research with additional techniques to represent and quantify tumor shape is necessary.

Existing approaches to describe the shape and boundary of regions have been the objective of shape analysis (wirth2001shape; yang2008survey). Shape representations characterize and visualize regions, while shape descriptors quantify the regions themselves. These shape representations, either one or two-dimensional, involve the regions or their contours. From these shape representations, we can compute shape descriptors to quantify the shape, geometry, and topology of the original regions. In general, shape descriptors should be translation, rotation, and scale-invariant (yang2008survey). Additionally, they should be application-dependent with a low computational complexity (zhang2004review).

In this paper, we develop a segmentation procedure and analysis pipeline to study the relationship between tumor shape and patient survival outcome. The analysis pipeline, based on wang2018comprehensive

, can be summarized in four parts: image pre-processing, feature extraction, survival analysis, and predictive performance. Image pre-processing relies on an automated tumor recognition system to convert raw pathology images into machine-readable representations; the results lead to a feature extraction process that computes shape, geometric, and topological features from the extracted tumors in the pathology images. A regularized Cox proportional-hazards (CoxPH) model, dependent on the shape-based features, is used as an objective prognostic method. The evaluation of its predictive performance, adjusted for clinical variables, concludes the analysis pipeline. We summarize these steps in Figure 


Figure 1: Analysis pipeline with segmentation procedure and survival analysis.

We organize the remaining of this paper as follows. Section 2 introduces the segmentation procedure that allows us to extract tumors from pathology images. Section 3 presents the shape representations used to represent tumors and compute shape-based features; these features being defined in Section 4. In section 5, we evaluate the prognostic performance of these features, while developing a shape-based prognostic model that varies from the one presented in wang2018comprehensive. We discuss the results and conclude the paper in Section 6.

2 Tumor Segmentation

We developed a processing procedure to extract tumors in raw pathology images. Section 2.1 introduces the matrix representation of a whole-slide image that is used by the procedure and referred to as the reconstructed heat map. The intermediate tissue segmentation step is outlined in Section 2.2, while the main tumor segmentation step is detailed in Section 2.3.

2.1 Reconstructed Heat Map

Automatic tumor recognition systems, such as the one developed in wang2018comprehensive, create machine-readable representations of whole-slide images. A matrix results from the pathology image given to the recognition system or other imaging tools, where each entry corresponds to an empty region, non-malignant tissue, or tumor tissue in the raw sample. We refer to this matrix as a reconstructed heat map, which we can use to segment the tumor regions within the pathology image. For example, an automatic tumor recognition systems can be used with H&E stained pathology images (see Figure 2), where we can obtain a reconstructed heat map, as shown in Figure 3 (top-left). We continue the processing procedure by segmenting the tissue regions.

Figure 2: An example of an H&E stained pathology image from a lung cancer patient in the TCGA cohort.
Figure 3: Steps of tumor segmentation procedure. Top-left figure shows the reconstructed heat map of the H&E stained pathology image. Bottom-right and bottom-left figures show the identified tissue regions and segmented tumors, respectively. The key on the right shows the integer coding for each tumor.

2.2 Tissue Segmentation

Since a pathology image can contain multiple tissue samples, disconnected tissue regions are identified and segmented through standard morphological operations, usually based on a four-connectivity (gonzalez2020digital). We show this segmentation in Figure 3 (bottom-left). Furthermore, we remove the tissue regions with an area less than one-fourth of the largest tissue region to remove the effects of small tissue samples (wang2018comprehensive). For each tissue region, we can finalize the processing procedure by segmenting the tumor regions.

2.3 Tumor Segmentation

Individual tumor regions, which also contain empty and non-malignant categories within them, were extracted from each tissue region with a similar procedure as the tissue segmentation in Section 2.2. Similarly, to reduce the influence of small tumors, we disregard tumor regions with an area less than and consider them as “islands” to analyze, separately. We create two separate segmented matrices from the tumor regions and the non-malignant categories, respectively; the non-malignant categories make up the holes within the tumors. An integer coding is used to differentiate each tumor and their respective holes. We show the segmented tumor regions from the continued example in Figure 3 (bottom-left).

3 Shape Representations

From the segmented tumor regions, detailed in Section 2.3, we can represent tumors with various one and two-dimensional techniques. These representation techniques should have invariance properties (rotation, scaling, translation), low computational complexity, and being application-independent (yang2008survey). Section 3.1 and the beginning of Section 3.2 make up the primary shape representations, while the remaining sections make up the derived shape representations.111The derived shape representations depend on the primary representations. A visualization of these are shown in Figures 4 and 5, while Table 1 summarizes the introduced notation. We begin by introducing the binary matrix, which is used to represent individual tumors in the segmented matrix.

Name Notation Support
Region Matrix
Polygon Chain
Convex Hull Chain
Bounding Box Chain
Chain Code
Curvature Chain Code
Radial Lengths
Normalized Radial Lengths
Smoothed Radial Lengths
Table 1: Summary of original and derived shape representations.
Figure 4: Examples for the -dimensional shape representations, based on a tumor region in the NLST cohort. Top two make up the primary representations: region matrix and polygon chain, respectively. Bottom two make up the derived representations: convex hull and minimum bounding box, respectively. We overlay the polygon chain on the derived representations.
Figure 5: Examples for the -dimensional shape representations, based on the tumor region in Figure 4. All four make up the derived shape representations: chain code, curvature chain code, radial lengths, smoothed radial lengths, respectively. We overlay the mean for the two bottom representations.

3.1 Binary Matrix

Let be a binary matrix representing an arbitrary -by- image, containing a -connected region, where the foreground and background are composed of ones and zeros, respectively. Additionally, we can represent each pixel in the image as a point in a -dimensional discrete plane, that is, each entry can be denoted as a point . Furthermore, to differentiate between foreground and background points, let be the indicator function for an image matrix given by

The indicator function and distribution of points ’s will be used to recreate the region’s contour in a two dimensional Cartesian plane, known as the polygon chain.

3.2 Polygon Chain

We can find the entries in the image matrix that make up the region’s boundary to, subsequently, obtain the equivalent points by using the Moore-Neighbor tracing algorithm, modified by Jacob’s stopping criteria (gonzalez2020digital). The algorithm has three arguments:

  1. starting boundary point,

  2. direction to traverse the boundary (clockwise or counter-clockwise), and

  3. the pixel connectivity.

Clearly, since the image matrix contains a -connected component, the pixel connectivity must be four. Furthermore, in our case, the boundary shall be traced in a clockwise direction. For the starting boundary point, we shall choose the point that represents the lowest left-most entry in the region. That is, let

be the collection of points that make up the region and are located in the lowest -coordinate such that

Applying the Moore-Neighbor tracing algorithm to the region matrix, results in the points that make up the boundary of the region, specifically, where the boundary begins at the lowest left-most area of the region and traverse through the boundary in a clockwise direction.

From the boundary points, we can create a sequence of points, known as the closed polygon chain, that represents the boundary of the region by creating a closed and simple polygon. Let be the total number of points that make up the region and be the collection of points that make up the closed polygon chain of the region such that

  1. is the number of boundary points,

  2. , for , and

Through the polygon chain, we can derive further two- and one-dimensional shape representations that can be used to compute specific descriptors.

3.2.1 Convex Hull Chain

The convex hull of an arbitrary shape is defined as the smallest convex polygon that encloses the shape. Therefore, we can introduce the collection of points which create a closed and simple polygon, known as the convex hull chain and denoted by , such that

  1. the convex polygon created by encloses the shape,

  2. the number of points in is less than or equal to , and

  3. each point .

Since the polygon chain forms a closed and simple polygon, we can use the linear time algorithm introduced in lee1983finding to obtain the points that make up the convex hull of the shape to, subsequently, create the convex hull chain using the directional and starting point requirements in Section 3.2.

3.2.2 Bounding Box Chain

The minimum area rectangle, or MAR, encloses an arbitrary shape with the smallest area possible (freeman1975determining). Let the minimum bounding box chain, denoted as , be the vector that makes up of the collection of points in the minimum area rectangle. To determine these points, we make use of the theorem in freeman1975determining which states that the minimum area rectangle has a side collinear with the side of the convex polygon it encloses.

That is, to obtain the MAR of the shape formed by the polygon chain, we need the convex hull chain. Additionally, at least one side of the convex hull, formed by two points and , will align with one side of the MAR. As a result, the linear-time algorithm presented in toussaint1983solving can be applied to to construct . This algorithm obtains the points and decreases the running time of the original algorithm in freeman1975determining from to by using rotating calipers (toussaint1983solving).

3.3 Chain Code

The slope of a shape’s contour can be approximated by the directional changes between two consecutive boundary points. These directional changes can be encoded to, essentially, assign a number (from to ) to each possible relative direction resulting in an encoding list, each element known as the chain code, that provides a compact representation of the shape’s contour (wirth2001shape; lecture8slides). Let be a vector representing the chain codes of the polygon chain where each entry corresponds to a direction in the -way split of the unit circle and determined by a series of steps.

First, we determine the angle between the vector composed of the difference between the two consecutive points and the -axis, that is, let be the resulting angle where is the difference. Since , we have to transform the angle to

such that . As a result, we can now determine the corresponding chain code

Clearly, we can see that this procedure splits the unit circle into eight equal parts. Additionally, if the directional change does not exactly align within the eight splits, then the rounding operator will approximate the chain code to the nearest integer.

3.3.1 Curvature Chain Code

Let be a vector representing the curvature chain code such that each entry is formed from a transformation of the difference between two consecutive chain codes, that is, let and

This simple chain code derivation estimates the curvature and contains information on the convexity of the shape 


3.4 Radial Lengths

Let be a vector of radial lengths, that is, each entry , for , is the Euclidean distance from the boundary point to the shape’s centroid (See Appendix A.1). Clearly, the radial lengths are not scale-invariant (as the Euclidean distance is not). Therefore, to properly analyze the structure of , the individual radial lengths must be normalized.

Let be the maximum radial length in such that we can introduce a vector of normalized radial lengths, denoted as , where each entry , . By normalizing the radial lengths, we have obtain a -dimensional signal that is scale-invariant and which we can use to analyze the fine details of the shape’s contour (wirth2001shape).

3.4.1 Smoothed Radial Lengths

While the normalized radial lengths make up a scale-invariant representation of the shape’s contour, this can still produce extra noise that impacts shape descriptors. We can remove any noise in the -dimensional signal of the shape by smoothing the normalized radial lengths; specifically, we apply a mean filter with a size equal to of the shape’s perimeter to the normalized radial lengths (pohlman1996quantitative). Let the window size be of the shape’s perimeter and be a vector representing the smoothed radial lengths such that each entry

4 Feature Extraction

From the shape representations in Section 3, we can compute shape-based features, known as shape descriptors. These features allow us to quantify the shape, geometry, and topology of a tumor and are categorized into either shape, boundary, or topological features.

The continuation of the processing procedure in Section 2, relies on converting the segmented tumors in Section 2.3 to individual binary matrices. Furthermore, The polygon chain for the contour of each tumor is created and the derived shape representations are also obtained. From these results, we can compute the tumor-level features of each category.

These categories and their respective features are introduced, below, while Table 2 shows an example of some of the features extracted from the tumor regions in Figure 3 (bottom-left). In Section 4.1, we introduce the shape features while in Sections 4.2 and 4.3 we introduce the boundary and topological features, respectively. The dependencies for these features are summarized in Figure 6.

Tumor Thick- # of Perimeter Major Axis Minor Axis Fibre Fibre
ID ness Holes Perimeter Area Length Length Length Width
1 37 873 5801.6652 159203.0 644.0000 404.00000 55.961409 2844.8712
2 24 328 6928.8683 96094.5 608.2032 501.90189 27.963134 3436.4710
3 5 26 639.1371 3242.0 117.2818 67.34740 10.489217 309.0793
4 5 24 667.2376 2821.5 106.7748 63.32211 8.683261 324.9355
5 4 17 470.8528 2458.0 113.0000 38.00000 10.949922 224.4765
6 3 19 509.9239 2016.5 105.1036 47.72370 8.170879 246.7911
Table 2: Computed features from the tumor regions in Figure 3. Each row corresponds to a tumor where the Tumor ID is based on the integer coding found on the key of Figure 3.
Figure 6: Dependencies of shape descriptors and representations. Colored boxes refer to the categories of the shape descriptors. Green denotes geometrical, grey topological, and orange shape.

4.1 Shape Features

The shape features characterize the geometry and shape of tumor, relying on the -dimensional shape representations, that is, the binary matrix , polygon chain , convex hull chain , and bounding box chain

. We begin by introducing the region moments, features derived from regular statistical moments.

4.1.1 Region Moments

Since we can view the entries in the region matrix as points in a -dimensional discrete plane, we can view the points as a statistical distribution in a -dimensional space (lecture8slides). From this statistical distribution, we can calculate the concrete physical properties, known as statistical moments, of the region (lecture8slides).

We start with the low-order moments, denoted as and known as the ordinary moments of order and , given by

and identical to standard statistical moments; that is, takes the moment in the -direction and the moment in the -direction.

Clearly, ordinary moments are not translation-invariant as the values depend on the -coordinates. Therefore, central moments, denoted as , measure characteristic properties with respect to the region’s centroid which result in translation-invariant features. The central moments are, hence, given by

where the centroid is determined by

Although central moments are translation-invariant, they are impacted by the area of the region and are, thus, not scale-invariant.

Normalizing the central moments by some uniform factor allows the computation of translation- and scale-invariant features (lecture8slides). This normalization procedure yields the normalized central moments, denoted as , given by

such that . With the normalized central moments, we can derive the linear combinations known as Hu’s moments, denoted as , which are invariant to translation, rotation, and scaling (lecture8slides). Although Hu’s moments are linear, their formulation cannot be generalized. Therefore, we only give the first three of Hu’s moments given by

4.1.2 Aspect Ratio

The aspect ratio is the relationship between the dimensions of the image and is given by

where and are the width and length of the image, respectively (wirth2001shape). Unlike other descriptors, the aspect ratio does not depend on the shape of the region within the image.

4.1.3 Net Area

The net area is the number of pixels that make up the region:

4.1.4 Thickness

The thickness of a region, without holes, is determined by the number of of erosion steps until the region disappears:

where is the sequence of eroded region matrices such that , , and , for  (wirth2001shape).

4.1.5 Elongation

The elongation of a region is the relationship between the area and the squared-thickness, given by

we note that this descriptor becomes distorted as the curvature of the region increases (wirth2001shape).

4.1.6 Perimeter

The perimeter is the length of the region’s contour, an approximation to number of pixels that make up the region:

4.1.7 Filled Area

The filled area is the approximate area of the polygon that represents the region, estimated using Gauss’s area formula:

This descriptor is also known as the area of connected region, that is, the area of the region without holes (lecture8slides).

4.1.8 Compactness

The compactness of a region is a dimensionless quantity that increases as the irregularity of the shape increases, given by the relationship between perimeter and area (castleman1996digital). That is, the formulae is given by

4.1.9 Roundness

The roundness of a region is the compactness normalized against a filled circle, that is, the relationship between the area of the region and the area of a circle with the same perimeter (wirth2001shape). The formulae is given by

4.1.10 Fibre Length and Width

The fibre length and width are, clearly, another form of measurements for a shape’s length and width, respectively. The fibre length is given by

and the fibre width by

4.1.11 Circularity

The circularity of a region is the relationship between the area of the region and the area of a circle with same convex perimeter:

This descriptor decreases as the shapes becomes less circular, but is relatively insensitive to irregular boundaries (wirth2001shape).

4.1.12 Convexity

The convexity of a region is the relationship between the perimeter of the region’s convex hull and the perimeter of the region:

Therefore, this descriptor quantifies how much the shape differs from a convex shape as well as accounting for any irregularities along the boundary of the shape (wirth2001shape).

4.1.13 Density

The density of a region is a measurement of solidity, obtained from the ratio of the area of the region to the area of the region’s convex hull (wirth2001shape). That is,

4.1.14 Major Axis Length

The major axis length is the length of the line segment that runs along the longest part of the region and is, clearly, an approximate measurement of the region’s length (wirth2001shape; lecture8slides). The formulae is given by

where and .

4.1.15 Major Axis Angle

The major axis angle describes the direction of the major axis, relative to the -axis, and is given by

4.1.16 Minor Axis Length

Opposite to the major axis length, the minor axis length is the length of the line segment that runs along the shortest part of the region (lecture8slides). Clearly, an approximate measurement of the region’s width (wirth2001shape). The formulae is given by

where and .

4.1.17 Eccentricity

The eccentricity of a region is the relationship between the major and minor axis; this descriptor is another way to measure the elongation of a region and is given by

We notice that as the eccentricity approaches one, the shape of the region becomes less elongated and roughly square or circular (wirth2001shape).

4.1.18 Rectangularity

The rectangularity of a region is the relationship between the area of a region and the area of the region’s minimum bounding box, given by

4.1.19 Curl

The curl of a region quantifies how much its shape is “curled up”, given by

We can see that as the descriptor decreases, the degree to which the region is “curled up” increases (wirth2001shape).

4.2 Boundary Features

The boundary features characterize the contour of the tumor, relying on the -dimensional shape representations, that is, the chain code , curvature chain code , radial lengths , smoothed radial lengths . We begin by introducing the contour-equivalent of the region moments: the boundary moments. We rely on the radial lengths , but yang2008survey notes that any -dimensional representation works.

4.2.1 Boundary Moments

Using the radial lengths as a -dimensional representation, we can reduce the dimension of the shape with boundary moments (yang2008survey; sonka2014image). Similar to region moments, we can estimate both the moment and central moment by

Additionally, we also have two normalized moments given by

which are invariant to translation, rotation, and scaling. From these moments, less noise-sensitive descriptors can be obtained and are given by

4.2.2 Bending Energy

The total bending energy of a region is, theoretically, the energy necessary to bend a rod the shape of the region (wirth2001shape). As a result, the minimum value for this descriptor is for a circle of radius  (wirth2001shape). This robust descriptor is given by

4.2.3 Total Absolute Curvature

The total absolute of a curvature for a region is given by

where the minimum is obtained for all convex shapes (wirth2001shape).

4.2.4 Radial Mean and Standard Deviation

The average of the normalized radial lengths is given by

and can measure macroscopic changes in the shape’s contour. Similarly, the standard deviation is given by

and can indicate fine contour changes.

4.2.5 Circularity

Another derivation for the circularity of a region, using the normalized radial lengths, is given by

where and are the mean and standard deviation of the radial lengths, respectively. This derivation increases as the shape becomes more circular (haralick1974measure).

4.2.6 Entropy

The entropy relies on the -bin histogram of the radial lengths and is a probabilistic measure that quantifies how well the radial lengths can be estimated (kilday1993classifying; wirth2001shape). The formulae is given by

where is the entry of the -bin histogram222 kilday1993classifying relied on the

-bin probability histogram to compute the entropy of tumor regions.

. This descriptor incorporates the roundness and roughness of a shape (kilday1993classifying).

4.2.7 Area Ratio

The area ratio measures how much of the shape is outside the circle with a radius defined by the radial mean, describing the macroscopic characteristics of the shape (kilday1993classifying). The formulae is given by

4.2.8 Zero-crossing Count

In conjunction with the area ratio, the zero-crossing count measures the number of times the shape crosses the contour of the circle with a radius defined by the radial mean. This descriptor captures the fine detail of the shape’s contour and is given by

4.2.9 Average Roughness Index

The average roughness index divides the contour of the shape into small segments of equal length, computing a roughness measure on each segment and averaging the results (kilday1993classifying). That is,

where is the number of points in a fixed interval and

4.2.10 Margin Fluctuation

The margin fluctuation is the standard deviation of the differences between the normalized radial lengths and their corresponding smoothed radial length, that is,

where and

This descriptor was shown to be greater for spiculated shapes in giger1994computerized.

4.3 Topological Features

The topological features are not impacted by degradations or deformations in the shape (wirth2001shape). Therefore, they are invariant to rotation, scaling, and translation. These makes them useful global shape descriptor, that can characterize the structure of a region (wirth2001shape; lecture8slides). Features based on topological data analysis, have been a rising focus of shape analysis. Although, in this section, we introduce only three simple topological features.

4.3.1 Number of Holes

The number of holes within a region is a simple and robust measurement, given by

where is the matrix that contains the holes.

4.3.2 Number of Connected Components

The number of connected components is, trivially, the number of regions in the image based on a -connectivity or -connectivity (gonzalez2020digital) and is given by

4.3.3 Euler Number

The Euler number is the number of connected components minus the number of holes:

This simple descriptor is invariant to rotation, scaling, translation (wirth2001shape).

5 Real Data Analysis

We studied the association between tumor shape and prognostic outcomes by applying univariate and multivariate survival analysis techniques to real data. Section 5.1 describes the univariate and multivariate survival analyses, while Section 5.2 concludes the chapter by developing and validating a prognostic model to predict patient survival outcomes.

Two independent cohorts provided the pathology images with the corresponding clinical data. Specifically, the National Lung Screening Trial (NLST) provided images for lung ADC patients, while The Cancer Genome Atlas (TCGA) program provided images for lung ADC patients.333The NLST website ( provides additional information on how the images, based on H&E-stained slides from blocks of tissues, were obtained. We considered the NLST dataset for initial analyses and training of the prognostic model while the TCGA dataset was used only for validation purposes. Table 3 presents a clinical summary of the patients in each cohort.

NLST (training) TCGA (validation) -value
Number of Patients 143 318
Age 64.015.19 64.5210.28 0.4819
    Male 80 (55.9%) 137 (43.1%)
    Female 63 (44.1%) 181 (56.9%)
Smoking Status
    Yes 75 (52.4%) 214 (67.3%)
    No 68 (47.6%) 104 (32.7%)
    I 95 (66.4%) 174 (54.7%)
    II 15 (10.5%) 78 (24.5%)
    III 23 (16.1%) 45 (14.2%)
    IV 10 (7%) 21 (6.6%)
Values are either mean standard deviation, or number (percentage).
Table 3: Patient characteristics of training and validation datasets. Hypothesis testing was performed to test the differences in the two patient cohort; two sample

-test (unequal variances) was used for age and chi-square test for remaining variables.

The automated tumor recognition system, based on a convolutional neural network (CNN) and developed in

wang2018comprehensive, created a reconstructed heat map from each pathology image. Furthermore, each reconstructed heat map was processed using the procedure in Section 2, resulting in their shape representations and shape-based features. The aforementioned features were computed for each tumor region. Only the primary tumor regions were kept, that is, the largest tumor region in the largest tissue sample of each slide image.

We note that, for each analysis, the response is overall survival, defined as the period from diagnosis until death or last contact. Additionally, whenever possible, we adjust for clinical variables (age, gender, smoking status, and stage) and account for patients with multiple slide images by clustering patient identifiers. If clustering is unavailable, we summarize the features at the patient-level, using the median. Survival analysis was performed with R software, version 4.02, and R packages survival (version 3.2-7) and glmnet (version 4.0-2) (glmnet-survival; R; survival-package). The results were considered significant if the two-tailed p-value and the analyses were based on the methodology in wang2018comprehensive.

5.1 Survival Analysis

We fit a univariate Cox proportional-hazards model (CoxPH) to each scaled feature444Features were scaled (standardized) by their respective mean and standard deviation., adjusting for age, gender, smoking status, and stage as well as clustering by patient identifiers, to study the association between each individual feature and prognostic outcome. The results, using the NLST dataset, are summarized in Table 4. Only eight features were statistically significant (p-value ) with patient survival outcome. Out of the eight, seven were shape features, one was a topological feature, while none were boundary features. All of these features were correlated with poor survival outcome with hazard ratios (HRs) ). Serving as a negative control, the aspect ratio and major axis angle were not correlated with patient prognosis (-value ; Table 4).

Coefficient Hazard ratio (HR) Standard error (SE) Robust SE -value
Aspect Ratio -0.0403 0.9605 0.1177 0.1059 0.7031
Net Area 0.2547 1.2901 0.0946 0.1321 0.0538
Elongation -0.2347 0.7908 0.1613 0.1416 0.0976
Thickness 0.2577 1.2939 0.1017 0.1236 0.0370
Number of Holes 0.2432 1.2753 0.0920 0.0974 0.0125
Polygon-Based Perimeter 0.3404 1.4055 0.1178 0.1380 0.0136
Polygon-Based Area 0.2631 1.3009 0.0962 0.1230 0.0324
Compactness 0.0349 1.0355 0.1263 0.1223 0.7756
Roundness 0.1258 1.1340 0.1212 0.1357 0.3539
Circularity 0.0857 1.0895 0.1271 0.1531 0.5754
Convexity 0.0800 1.0833 0.1244 0.1428 0.5752
Density 0.2305 1.2593 0.1332 0.1528 0.1314
Major Axis Length 0.3738 1.4532 0.1148 0.1363 0.0061
Major Axis Angle -0.0744 0.9283 0.1117 0.1147 0.5169
Minor Axis Length 0.2767 1.3188 0.1132 0.1317 0.0356
Rectangularity 0.1342 1.1436 0.1213 0.1440 0.3514
Eccentricity -0.2324 0.7926 0.1168 0.1207 0.0542
Fibre Length 0.2481 1.2816 0.1023 0.1182 0.0358
Fibre Width 0.3336 1.3959 0.1178 0.1375 0.0153
Curl 0.0238 1.0240 0.1215 0.1198 0.8428
Total Absolute Curvature 0.1121 1.1186 0.1200 0.1628 0.4912
Bending Energy 0.1099 1.1162 0.1203 0.1623 0.4984
Radial Mean 0.0862 1.0901 0.1226 0.1343 0.5209
Radial SD -0.0670 0.9352 0.1188 0.1161 0.5642
Radial-Based Circularity 0.0720 1.0747 0.1172 0.1195 0.5466
Radial Entropy -0.0507 0.9506 0.1174 0.1118 0.6503
Area Ratio -0.0474 0.9537 0.1203 0.1226 0.6992
Zero-crossing Count 0.1040 1.1096 0.1127 0.1275 0.4146
Average Roughness Index -0.1156 0.8908 0.1552 0.2609 0.6577
Margin Fluctuation -0.0669 0.9353 0.1104 0.1203 0.5785
Bolding signifies features with -value .
Table 4: Univariate analysis of shape-based features in NLST training dataset. A Cox proportional-hazards (CoxPH) model was fitted to each scaled feature, clustering by patient identifier (Patient ID) and adjusting for clinical variables (see Section 5.1).

All shape-based features, scaled like in the univariate analysis, were selected to build a multivariate Cox proportional-hazards model with a lasso penalty; using Harrell’s concordance measure, the penalty coefficient was determined through -fold cross validation (glmnet-survival; wang2018comprehensive). We chose the penalty coefficient such that this value was within standard error of the value that gave the minimum mean cross-validated error (glmnet-survival). The mean cross-validated errors for various penalty coefficients are shown in Figure 7 (left).

Figure 7: Results from the regularized Cox proportional-hazards (CoxPH) model. The left figure shows the mean cross-validated errors, based on Harrell’s concordance measure, for each penalty coefficient . The right figure shows the importance of each feature kept by the regularized CoxPH model, based on the magnitude of each coefficient.

Similarly, we summarized the standardized coefficients kept by the regularized CoxPH model in Table 5. Scaling the feature before fitting the model, allowed us to obtain the standardized coefficients. This allowed us to compare the regularized coefficients with the coefficients obtained in the univariate analysis (Table 4). Additionally, we can also show the importance of each feature in Figure 7 (right), based on the magnitudes of each regularized coefficient. Clustering of patient identifiers was not used in the regularized Cox proportional-hazards model; instead, we summarized the features at the patient-level. The shape-based prognostic model, based on the predicted risk score of the regularized CoxPH model, was developed, and its predictive performance validated as follows.

Coefficient exp(Coefficient)
Aspect Ratio
Net Area
Elongation -0.3276 0.7206
Number of Holes 0.0754 1.0783
Polygon-Based Perimeter 0.5106 1.6663
Polygon-Based Area
Roundness 0.2139 1.2385
Major Axis Length
Major Axis Angle -0.1956 0.8223
Minor Axis Length 0.0888 1.0929
Eccentricity -0.2600 0.7710
Fibre Length
Fibre Width
Total Absolute Curvature
Bending Energy
Radial Mean
Radial SD
Radial-Based Circularity
Radial Entropy
Area Ratio
Zero-crossing Count 0.0054 1.0054
Average Roughness Index 0.3468 1.4145
Margin Fluctuation
Table 5: Multivariate analysis of shape-based features in NLST training dataset. A regularized Cox proportional-hazards (CoxPH) model with a lasso penalty was fitted to each scaled feature. The penalty coefficient was chosen based on -fold cross validation. The resulting coefficient values can be compared to the univariate coefficients summarized in Table 4.

5.2 Predictive Performance

Since the multivariate model in Section 5.1 was based on the patient-level features, similar processing was applied to the TCGA cohort. Additionally, the features were centered and scaled based on the means and standard deviations of the features in the NLST cohort. These transformations prevent bias and allow us to independently validate the prognostic performance of this model. A shape-based risk score was computed for each patient to assign patients into predicted high-risk and low-risk groups, using the median as a cutoff (wang2018comprehensive). The survival curves for the predicted high-risk and low-risk groups were estimated using the Kaplan-Meier method and are shown in Figure 8, as well as the results of the -rank test to compare the survival difference between risk groups. We associate a worse survival outcome with the predicted high-risk group (-rank test, -value , Figure 8).

Figure 8: Survival curves, estimated using the Kaplan-Meier method, for both high-risk and low-risk groups. A -rank test was performed to test the difference between these two curves. The -value of the -rank test is overlaid and the

pointwise confidence intervals are also shown for each risk group.

The prognostic performance of the shape-based risk score was validated by a multivariate CoxPH model and the resulting coefficients are summarized in Table 6. After adjusting for clinical variables, including age, gender, smoking status, and stage, the predicted risk groups independently predicted prognosis (high-risk vs. low-risk, HR , CI , -value , Table 6).

Hazard Ratio (HR) with confidence interval (CI) -value
High-Risk vs. Low-Risk 2.11 (1.24–3.59) 0.0059
Age 1.02 (0.99–1.04) 0.2008
Female vs. Male 1.66 (0.94–2.96) 0.0829
Smoker vs. Non-Smoker 0.81 (0.48–1.36) 0.4200
Stage II vs. Stage I 2.48 (1.34–4.59) 0.0038
Stage III vs. Stage I 4.98 (2.59–9.61)
Stage IV vs. Stage I 3.48 (1.46–8.28) 0.0047
Bolding signifies features with p-value .
Table 6: Multivariate analysis of the predicted risk group. A Cox proportional-hazards (CoxPH) model was fitted to test the predictive performance of the predicted risk group, adjust for clinical variables and based on the TCGA cohort.

6 Discussion and Conclusion

A whole-slide image processing procedure was developed to extract tumor regions and compute shape-based descriptors from them. This procedure facilitated the analyses performed to study the association between tumor shape and patient survival outcome. While this procedure successfully extracted and visualized tumor regions, further inspection is necessary to ensure the extracted objects accurately represent the tumors in the whole-slide image. Specifically, various morphological operations (gonzalez2020digital) could be used in the procedure to improve the main tumor representation.

The univariate analysis discovered eight statistically significant features (at the level of significance); of these eight features, seven attempted to quantify the size and length of the tumor: thickness, polygon-based perimeter and area, major and minor axis length, and fibre length and width (Table 4). The only topological feature proved to be significant: the number of holes, which quantifies the degradations within the tumors. We note that none of these seven features were based on the derived shape representations; specifically, the thickness feature was based on the binary matrix, while the remaining features were based on the polygon chain. Therefore, with the features used, the derived shape representations were ineffective in diversifying the tumor shape. Interestingly, the results align with some of the significant features found in wang2018comprehensive such as the area, perimeter, major axis length, minor axis length, and the number of holes.

The multivariate analysis used regularization to avoid overfitting and selected nine features: perimeter, average roughness index, elongation, eccentricity, roundness, major axis angle, minor axis length, number of holes, and zero-crossing count (ordered from largest to smallest importance; Figure 7). Out of the nine features three were significant in the univariate analysis: perimeter, minor axis length, and the number of holes. Interestingly, the major axis angle, a negative control, was kept in the regularized model. Furthermore, the radial-based features, average roughness index, and zero-crossing count, insignificant in the univariate analysis, were also kept by the model. This indicates that further research regarding the derived shape representations and their features should be conducted. The resulting regularized model was used to compute patient risk scores to, subsequently, dichotomize patients into high-risk and low-risk groups. Through an independent cohort, the predicted risk score proved to be a significant prognostic factor, alongside other clinical variables (Table 6).

Tumor regions obtained from whole-slide images tend to have a complex shape and boundary. These complexities can, potentially, introduce noise into the shape representations. As a result, this noise disrupts the computed features. This issue could be resolved by introducing noise-resistant shape representations and features. We also cannot disregard the -dimensional spatial information that is lost due to two-dimensional imaging procedures. Therefore, we can apply feature extraction to other imaging techniques, such as computed tomography (CT) colonography, magnetic resonance imaging (MRI), and positron emission tomography (PET)/CT colonography, to produce more comprehensive and diverse shape-based features that further quantify the tumor shape, geometry, and topology as well as improving the performance of current risk-based prognostic models (wang2018comprehensive).

In this study, we developed a whole-slide image processing procedure and shape-based prognostic model to evaluate the relationship between tumor shape, geometry, and topology and patient survival outcome. Our image processing procedure, efficiently, extracted tumor regions, and computed shape-based features from pathology images. This procedure can be adapted to include more shape representations and features; it can also be adapted to other cancer types such as the brain, breast, and kidney cancer. The univariate and multivariate provided new insights into the relationship between tumor shape and patient prognosis. Additionally, our results supported the findings in wang2018comprehensive. Our shape-based prognostic model, adapted from wang2018comprehensive, predicted patient risk scores. These risk scores served as a prognostic factor, independent of other clinical variables. In all, the proposed processing and model development pipeline provided an objective prognostic method as well as contributing to the shape analysis of tumors.

Appendix A Additional Formulas

a.1 Centroid

The centroid of a shape , also known as the center of gravity, is given by


is the signed area of the shape, obtained using Gauss’s area formula. For a region, represented in a two-dimensional discrete plane, the centroid is given by

where is the number of points that make up the region.555 We can see that the coordinates of the centroid are the means of the -coordinates that make up the region.