1 Introduction
Lung cancer is the leading cause of cancer death for both females and males (siegel2020cancer). Lung adenocarcinoma (ADC) is heterogeneous in morphological features, highly volatile in prognosis, and compromises half of all lung cancer cases (matsuda2015morphological; wang2018comprehensive). While mortality rates have rapidly declined, lung cancer in 2017 has caused more deaths than breast, prostate, colorectal, and brain cancer combined (siegel2020cancer).
With the advanced imaging technology, digital hematoxylin and eosin (H&E)stained pathology imaging of tumor tissue slides is becoming a routine clinical procedure for cancer diagnosis. This process produces massive imaging data that capture histological details in high spatial resolution. Recent developments in deeplearning methods have enabled us to automatically detect and characterize the tumor regions in pathology images on a large scale (wang2018comprehensive). Through these developments, wang2018comprehensive discovered a relationship between tumor shape and patient survival outcome in lung cancer patients. Although numerous studies have associated morphological features with patient prognosis (wang2014novel; yu2016predicting; luo2017comprehensive), this study was the first to incorporate shape into a prognostic model. Furthermore, this association reflects multiple studies in brain and breast cancer (kilday1993classifying; pohlman1996quantitative), though further systemic research with additional techniques to represent and quantify tumor shape is necessary.
Existing approaches to describe the shape and boundary of regions have been the objective of shape analysis (wirth2001shape; yang2008survey). Shape representations characterize and visualize regions, while shape descriptors quantify the regions themselves. These shape representations, either one or twodimensional, involve the regions or their contours. From these shape representations, we can compute shape descriptors to quantify the shape, geometry, and topology of the original regions. In general, shape descriptors should be translation, rotation, and scaleinvariant (yang2008survey). Additionally, they should be applicationdependent with a low computational complexity (zhang2004review).
In this paper, we develop a segmentation procedure and analysis pipeline to study the relationship between tumor shape and patient survival outcome. The analysis pipeline, based on wang2018comprehensive
, can be summarized in four parts: image preprocessing, feature extraction, survival analysis, and predictive performance. Image preprocessing relies on an automated tumor recognition system to convert raw pathology images into machinereadable representations; the results lead to a feature extraction process that computes shape, geometric, and topological features from the extracted tumors in the pathology images. A regularized Cox proportionalhazards (CoxPH) model, dependent on the shapebased features, is used as an objective prognostic method. The evaluation of its predictive performance, adjusted for clinical variables, concludes the analysis pipeline. We summarize these steps in Figure
1.We organize the remaining of this paper as follows. Section 2 introduces the segmentation procedure that allows us to extract tumors from pathology images. Section 3 presents the shape representations used to represent tumors and compute shapebased features; these features being defined in Section 4. In section 5, we evaluate the prognostic performance of these features, while developing a shapebased prognostic model that varies from the one presented in wang2018comprehensive. We discuss the results and conclude the paper in Section 6.
2 Tumor Segmentation
We developed a processing procedure to extract tumors in raw pathology images. Section 2.1 introduces the matrix representation of a wholeslide image that is used by the procedure and referred to as the reconstructed heat map. The intermediate tissue segmentation step is outlined in Section 2.2, while the main tumor segmentation step is detailed in Section 2.3.
2.1 Reconstructed Heat Map
Automatic tumor recognition systems, such as the one developed in wang2018comprehensive, create machinereadable representations of wholeslide images. A matrix results from the pathology image given to the recognition system or other imaging tools, where each entry corresponds to an empty region, nonmalignant tissue, or tumor tissue in the raw sample. We refer to this matrix as a reconstructed heat map, which we can use to segment the tumor regions within the pathology image. For example, an automatic tumor recognition systems can be used with H&E stained pathology images (see Figure 2), where we can obtain a reconstructed heat map, as shown in Figure 3 (topleft). We continue the processing procedure by segmenting the tissue regions.
2.2 Tissue Segmentation
Since a pathology image can contain multiple tissue samples, disconnected tissue regions are identified and segmented through standard morphological operations, usually based on a fourconnectivity (gonzalez2020digital). We show this segmentation in Figure 3 (bottomleft). Furthermore, we remove the tissue regions with an area less than onefourth of the largest tissue region to remove the effects of small tissue samples (wang2018comprehensive). For each tissue region, we can finalize the processing procedure by segmenting the tumor regions.
2.3 Tumor Segmentation
Individual tumor regions, which also contain empty and nonmalignant categories within them, were extracted from each tissue region with a similar procedure as the tissue segmentation in Section 2.2. Similarly, to reduce the influence of small tumors, we disregard tumor regions with an area less than and consider them as “islands” to analyze, separately. We create two separate segmented matrices from the tumor regions and the nonmalignant categories, respectively; the nonmalignant categories make up the holes within the tumors. An integer coding is used to differentiate each tumor and their respective holes. We show the segmented tumor regions from the continued example in Figure 3 (bottomleft).
3 Shape Representations
From the segmented tumor regions, detailed in Section 2.3, we can represent tumors with various one and twodimensional techniques. These representation techniques should have invariance properties (rotation, scaling, translation), low computational complexity, and being applicationindependent (yang2008survey). Section 3.1 and the beginning of Section 3.2 make up the primary shape representations, while the remaining sections make up the derived shape representations.^{1}^{1}1The derived shape representations depend on the primary representations. A visualization of these are shown in Figures 4 and 5, while Table 1 summarizes the introduced notation. We begin by introducing the binary matrix, which is used to represent individual tumors in the segmented matrix.
Name  Notation  Support 

Region Matrix  
Polygon Chain  
Convex Hull Chain  
Bounding Box Chain  
Chain Code  
Curvature Chain Code  
Radial Lengths  
Normalized Radial Lengths  
Smoothed Radial Lengths 
3.1 Binary Matrix
Let be a binary matrix representing an arbitrary by image, containing a connected region, where the foreground and background are composed of ones and zeros, respectively. Additionally, we can represent each pixel in the image as a point in a dimensional discrete plane, that is, each entry can be denoted as a point . Furthermore, to differentiate between foreground and background points, let be the indicator function for an image matrix given by
The indicator function and distribution of points ’s will be used to recreate the region’s contour in a two dimensional Cartesian plane, known as the polygon chain.
3.2 Polygon Chain
We can find the entries in the image matrix that make up the region’s boundary to, subsequently, obtain the equivalent points by using the MooreNeighbor tracing algorithm, modified by Jacob’s stopping criteria (gonzalez2020digital). The algorithm has three arguments:

starting boundary point,

direction to traverse the boundary (clockwise or counterclockwise), and

the pixel connectivity.
Clearly, since the image matrix contains a connected component, the pixel connectivity must be four. Furthermore, in our case, the boundary shall be traced in a clockwise direction. For the starting boundary point, we shall choose the point that represents the lowest leftmost entry in the region. That is, let
be the collection of points that make up the region and are located in the lowest coordinate such that
Applying the MooreNeighbor tracing algorithm to the region matrix, results in the points that make up the boundary of the region, specifically, where the boundary begins at the lowest leftmost area of the region and traverse through the boundary in a clockwise direction.
From the boundary points, we can create a sequence of points, known as the closed polygon chain, that represents the boundary of the region by creating a closed and simple polygon. Let be the total number of points that make up the region and be the collection of points that make up the closed polygon chain of the region such that

is the number of boundary points,

, for , and

Through the polygon chain, we can derive further two and onedimensional shape representations that can be used to compute specific descriptors.
3.2.1 Convex Hull Chain
The convex hull of an arbitrary shape is defined as the smallest convex polygon that encloses the shape. Therefore, we can introduce the collection of points which create a closed and simple polygon, known as the convex hull chain and denoted by , such that

the convex polygon created by encloses the shape,

the number of points in is less than or equal to , and

each point .
Since the polygon chain forms a closed and simple polygon, we can use the linear time algorithm introduced in lee1983finding to obtain the points that make up the convex hull of the shape to, subsequently, create the convex hull chain using the directional and starting point requirements in Section 3.2.
3.2.2 Bounding Box Chain
The minimum area rectangle, or MAR, encloses an arbitrary shape with the smallest area possible (freeman1975determining). Let the minimum bounding box chain, denoted as , be the vector that makes up of the collection of points in the minimum area rectangle. To determine these points, we make use of the theorem in freeman1975determining which states that the minimum area rectangle has a side collinear with the side of the convex polygon it encloses.
That is, to obtain the MAR of the shape formed by the polygon chain, we need the convex hull chain. Additionally, at least one side of the convex hull, formed by two points and , will align with one side of the MAR. As a result, the lineartime algorithm presented in toussaint1983solving can be applied to to construct . This algorithm obtains the points and decreases the running time of the original algorithm in freeman1975determining from to by using rotating calipers (toussaint1983solving).
3.3 Chain Code
The slope of a shape’s contour can be approximated by the directional changes between two consecutive boundary points. These directional changes can be encoded to, essentially, assign a number (from to ) to each possible relative direction resulting in an encoding list, each element known as the chain code, that provides a compact representation of the shape’s contour (wirth2001shape; lecture8slides). Let be a vector representing the chain codes of the polygon chain where each entry corresponds to a direction in the way split of the unit circle and determined by a series of steps.
First, we determine the angle between the vector composed of the difference between the two consecutive points and the axis, that is, let be the resulting angle where is the difference. Since , we have to transform the angle to
such that . As a result, we can now determine the corresponding chain code
Clearly, we can see that this procedure splits the unit circle into eight equal parts. Additionally, if the directional change does not exactly align within the eight splits, then the rounding operator will approximate the chain code to the nearest integer.
3.3.1 Curvature Chain Code
Let be a vector representing the curvature chain code such that each entry is formed from a transformation of the difference between two consecutive chain codes, that is, let and
This simple chain code derivation estimates the curvature and contains information on the convexity of the shape
(wirth2001shape).3.4 Radial Lengths
Let be a vector of radial lengths, that is, each entry , for , is the Euclidean distance from the boundary point to the shape’s centroid (See Appendix A.1). Clearly, the radial lengths are not scaleinvariant (as the Euclidean distance is not). Therefore, to properly analyze the structure of , the individual radial lengths must be normalized.
Let be the maximum radial length in such that we can introduce a vector of normalized radial lengths, denoted as , where each entry , . By normalizing the radial lengths, we have obtain a dimensional signal that is scaleinvariant and which we can use to analyze the fine details of the shape’s contour (wirth2001shape).
3.4.1 Smoothed Radial Lengths
While the normalized radial lengths make up a scaleinvariant representation of the shape’s contour, this can still produce extra noise that impacts shape descriptors. We can remove any noise in the dimensional signal of the shape by smoothing the normalized radial lengths; specifically, we apply a mean filter with a size equal to of the shape’s perimeter to the normalized radial lengths (pohlman1996quantitative). Let the window size be of the shape’s perimeter and be a vector representing the smoothed radial lengths such that each entry
4 Feature Extraction
From the shape representations in Section 3, we can compute shapebased features, known as shape descriptors. These features allow us to quantify the shape, geometry, and topology of a tumor and are categorized into either shape, boundary, or topological features.
The continuation of the processing procedure in Section 2, relies on converting the segmented tumors in Section 2.3 to individual binary matrices. Furthermore, The polygon chain for the contour of each tumor is created and the derived shape representations are also obtained. From these results, we can compute the tumorlevel features of each category.
These categories and their respective features are introduced, below, while Table 2 shows an example of some of the features extracted from the tumor regions in Figure 3 (bottomleft). In Section 4.1, we introduce the shape features while in Sections 4.2 and 4.3 we introduce the boundary and topological features, respectively. The dependencies for these features are summarized in Figure 6.
Tumor  Thick  # of  Perimeter  Major Axis  Minor Axis  Fibre  Fibre  

ID  ness  Holes  Perimeter  Area  Length  Length  Length  Width 
1  37  873  5801.6652  159203.0  644.0000  404.00000  55.961409  2844.8712 
2  24  328  6928.8683  96094.5  608.2032  501.90189  27.963134  3436.4710 
3  5  26  639.1371  3242.0  117.2818  67.34740  10.489217  309.0793 
4  5  24  667.2376  2821.5  106.7748  63.32211  8.683261  324.9355 
5  4  17  470.8528  2458.0  113.0000  38.00000  10.949922  224.4765 
6  3  19  509.9239  2016.5  105.1036  47.72370  8.170879  246.7911 
4.1 Shape Features
The shape features characterize the geometry and shape of tumor, relying on the dimensional shape representations, that is, the binary matrix , polygon chain , convex hull chain , and bounding box chain
. We begin by introducing the region moments, features derived from regular statistical moments.
4.1.1 Region Moments
Since we can view the entries in the region matrix as points in a dimensional discrete plane, we can view the points as a statistical distribution in a dimensional space (lecture8slides). From this statistical distribution, we can calculate the concrete physical properties, known as statistical moments, of the region (lecture8slides).
We start with the loworder moments, denoted as and known as the ordinary moments of order and , given by
and identical to standard statistical moments; that is, takes the moment in the direction and the moment in the direction.
Clearly, ordinary moments are not translationinvariant as the values depend on the coordinates. Therefore, central moments, denoted as , measure characteristic properties with respect to the region’s centroid which result in translationinvariant features. The central moments are, hence, given by
where the centroid is determined by
Although central moments are translationinvariant, they are impacted by the area of the region and are, thus, not scaleinvariant.
Normalizing the central moments by some uniform factor allows the computation of translation and scaleinvariant features (lecture8slides). This normalization procedure yields the normalized central moments, denoted as , given by
such that . With the normalized central moments, we can derive the linear combinations known as Hu’s moments, denoted as , which are invariant to translation, rotation, and scaling (lecture8slides). Although Hu’s moments are linear, their formulation cannot be generalized. Therefore, we only give the first three of Hu’s moments given by
4.1.2 Aspect Ratio
The aspect ratio is the relationship between the dimensions of the image and is given by
where and are the width and length of the image, respectively (wirth2001shape). Unlike other descriptors, the aspect ratio does not depend on the shape of the region within the image.
4.1.3 Net Area
The net area is the number of pixels that make up the region:
4.1.4 Thickness
The thickness of a region, without holes, is determined by the number of of erosion steps until the region disappears:
where is the sequence of eroded region matrices such that , , and , for (wirth2001shape).
4.1.5 Elongation
The elongation of a region is the relationship between the area and the squaredthickness, given by
we note that this descriptor becomes distorted as the curvature of the region increases (wirth2001shape).
4.1.6 Perimeter
The perimeter is the length of the region’s contour, an approximation to number of pixels that make up the region:
4.1.7 Filled Area
The filled area is the approximate area of the polygon that represents the region, estimated using Gauss’s area formula:
This descriptor is also known as the area of connected region, that is, the area of the region without holes (lecture8slides).
4.1.8 Compactness
The compactness of a region is a dimensionless quantity that increases as the irregularity of the shape increases, given by the relationship between perimeter and area (castleman1996digital). That is, the formulae is given by
4.1.9 Roundness
The roundness of a region is the compactness normalized against a filled circle, that is, the relationship between the area of the region and the area of a circle with the same perimeter (wirth2001shape). The formulae is given by
4.1.10 Fibre Length and Width
The fibre length and width are, clearly, another form of measurements for a shape’s length and width, respectively. The fibre length is given by
and the fibre width by
4.1.11 Circularity
The circularity of a region is the relationship between the area of the region and the area of a circle with same convex perimeter:
This descriptor decreases as the shapes becomes less circular, but is relatively insensitive to irregular boundaries (wirth2001shape).
4.1.12 Convexity
The convexity of a region is the relationship between the perimeter of the region’s convex hull and the perimeter of the region:
Therefore, this descriptor quantifies how much the shape differs from a convex shape as well as accounting for any irregularities along the boundary of the shape (wirth2001shape).
4.1.13 Density
The density of a region is a measurement of solidity, obtained from the ratio of the area of the region to the area of the region’s convex hull (wirth2001shape). That is,
4.1.14 Major Axis Length
The major axis length is the length of the line segment that runs along the longest part of the region and is, clearly, an approximate measurement of the region’s length (wirth2001shape; lecture8slides). The formulae is given by
where and .
4.1.15 Major Axis Angle
The major axis angle describes the direction of the major axis, relative to the axis, and is given by
4.1.16 Minor Axis Length
Opposite to the major axis length, the minor axis length is the length of the line segment that runs along the shortest part of the region (lecture8slides). Clearly, an approximate measurement of the region’s width (wirth2001shape). The formulae is given by
where and .
4.1.17 Eccentricity
The eccentricity of a region is the relationship between the major and minor axis; this descriptor is another way to measure the elongation of a region and is given by
We notice that as the eccentricity approaches one, the shape of the region becomes less elongated and roughly square or circular (wirth2001shape).
4.1.18 Rectangularity
The rectangularity of a region is the relationship between the area of a region and the area of the region’s minimum bounding box, given by
4.1.19 Curl
The curl of a region quantifies how much its shape is “curled up”, given by
We can see that as the descriptor decreases, the degree to which the region is “curled up” increases (wirth2001shape).
4.2 Boundary Features
The boundary features characterize the contour of the tumor, relying on the dimensional shape representations, that is, the chain code , curvature chain code , radial lengths , smoothed radial lengths . We begin by introducing the contourequivalent of the region moments: the boundary moments. We rely on the radial lengths , but yang2008survey notes that any dimensional representation works.
4.2.1 Boundary Moments
Using the radial lengths as a dimensional representation, we can reduce the dimension of the shape with boundary moments (yang2008survey; sonka2014image). Similar to region moments, we can estimate both the moment and central moment by
Additionally, we also have two normalized moments given by
which are invariant to translation, rotation, and scaling. From these moments, less noisesensitive descriptors can be obtained and are given by
4.2.2 Bending Energy
The total bending energy of a region is, theoretically, the energy necessary to bend a rod the shape of the region (wirth2001shape). As a result, the minimum value for this descriptor is for a circle of radius (wirth2001shape). This robust descriptor is given by
4.2.3 Total Absolute Curvature
The total absolute of a curvature for a region is given by
where the minimum is obtained for all convex shapes (wirth2001shape).
4.2.4 Radial Mean and Standard Deviation
The average of the normalized radial lengths is given by
and can measure macroscopic changes in the shape’s contour. Similarly, the standard deviation is given by
and can indicate fine contour changes.
4.2.5 Circularity
Another derivation for the circularity of a region, using the normalized radial lengths, is given by
where and are the mean and standard deviation of the radial lengths, respectively. This derivation increases as the shape becomes more circular (haralick1974measure).
4.2.6 Entropy
The entropy relies on the bin histogram of the radial lengths and is a probabilistic measure that quantifies how well the radial lengths can be estimated (kilday1993classifying; wirth2001shape). The formulae is given by
where is the entry of the bin histogram^{2}^{2}2 kilday1993classifying relied on the
bin probability histogram to compute the entropy of tumor regions.
. This descriptor incorporates the roundness and roughness of a shape (kilday1993classifying).4.2.7 Area Ratio
The area ratio measures how much of the shape is outside the circle with a radius defined by the radial mean, describing the macroscopic characteristics of the shape (kilday1993classifying). The formulae is given by
4.2.8 Zerocrossing Count
In conjunction with the area ratio, the zerocrossing count measures the number of times the shape crosses the contour of the circle with a radius defined by the radial mean. This descriptor captures the fine detail of the shape’s contour and is given by
4.2.9 Average Roughness Index
The average roughness index divides the contour of the shape into small segments of equal length, computing a roughness measure on each segment and averaging the results (kilday1993classifying). That is,
where is the number of points in a fixed interval and
4.2.10 Margin Fluctuation
The margin fluctuation is the standard deviation of the differences between the normalized radial lengths and their corresponding smoothed radial length, that is,
where and
This descriptor was shown to be greater for spiculated shapes in giger1994computerized.
4.3 Topological Features
The topological features are not impacted by degradations or deformations in the shape (wirth2001shape). Therefore, they are invariant to rotation, scaling, and translation. These makes them useful global shape descriptor, that can characterize the structure of a region (wirth2001shape; lecture8slides). Features based on topological data analysis, have been a rising focus of shape analysis. Although, in this section, we introduce only three simple topological features.
4.3.1 Number of Holes
The number of holes within a region is a simple and robust measurement, given by
where is the matrix that contains the holes.
4.3.2 Number of Connected Components
The number of connected components is, trivially, the number of regions in the image based on a connectivity or connectivity (gonzalez2020digital) and is given by
4.3.3 Euler Number
The Euler number is the number of connected components minus the number of holes:
This simple descriptor is invariant to rotation, scaling, translation (wirth2001shape).
5 Real Data Analysis
We studied the association between tumor shape and prognostic outcomes by applying univariate and multivariate survival analysis techniques to real data. Section 5.1 describes the univariate and multivariate survival analyses, while Section 5.2 concludes the chapter by developing and validating a prognostic model to predict patient survival outcomes.
Two independent cohorts provided the pathology images with the corresponding clinical data. Specifically, the National Lung Screening Trial (NLST) provided images for lung ADC patients, while The Cancer Genome Atlas (TCGA) program provided images for lung ADC patients.^{3}^{3}3The NLST website (https://biometry.nci.nih.gov/cdas/learn/nlst) provides additional information on how the images, based on H&Estained slides from blocks of tissues, were obtained. We considered the NLST dataset for initial analyses and training of the prognostic model while the TCGA dataset was used only for validation purposes. Table 3 presents a clinical summary of the patients in each cohort.
NLST (training)  TCGA (validation)  value  
Number of Patients  143  318  
Age  64.015.19  64.5210.28  0.4819 
Gender  
Male  80 (55.9%)  137 (43.1%)  
Female  63 (44.1%)  181 (56.9%)  
Smoking Status  
Yes  75 (52.4%)  214 (67.3%)  
No  68 (47.6%)  104 (32.7%)  
Stage  
I  95 (66.4%)  174 (54.7%)  
II  15 (10.5%)  78 (24.5%)  
III  23 (16.1%)  45 (14.2%)  
IV  10 (7%)  21 (6.6%)  
Values are either mean standard deviation, or number (percentage). 
test (unequal variances) was used for age and chisquare test for remaining variables.
The automated tumor recognition system, based on a convolutional neural network (CNN) and developed in
wang2018comprehensive, created a reconstructed heat map from each pathology image. Furthermore, each reconstructed heat map was processed using the procedure in Section 2, resulting in their shape representations and shapebased features. The aforementioned features were computed for each tumor region. Only the primary tumor regions were kept, that is, the largest tumor region in the largest tissue sample of each slide image.We note that, for each analysis, the response is overall survival, defined as the period from diagnosis until death or last contact. Additionally, whenever possible, we adjust for clinical variables (age, gender, smoking status, and stage) and account for patients with multiple slide images by clustering patient identifiers. If clustering is unavailable, we summarize the features at the patientlevel, using the median. Survival analysis was performed with R software, version 4.02, and R packages survival (version 3.27) and glmnet (version 4.02) (glmnetsurvival; R; survivalpackage). The results were considered significant if the twotailed pvalue and the analyses were based on the methodology in wang2018comprehensive.
5.1 Survival Analysis
We fit a univariate Cox proportionalhazards model (CoxPH) to each scaled feature^{4}^{4}4Features were scaled (standardized) by their respective mean and standard deviation., adjusting for age, gender, smoking status, and stage as well as clustering by patient identifiers, to study the association between each individual feature and prognostic outcome. The results, using the NLST dataset, are summarized in Table 4. Only eight features were statistically significant (pvalue ) with patient survival outcome. Out of the eight, seven were shape features, one was a topological feature, while none were boundary features. All of these features were correlated with poor survival outcome with hazard ratios (HRs) ). Serving as a negative control, the aspect ratio and major axis angle were not correlated with patient prognosis (value ; Table 4).
Coefficient  Hazard ratio (HR)  Standard error (SE)  Robust SE  value  
Aspect Ratio  0.0403  0.9605  0.1177  0.1059  0.7031 
Net Area  0.2547  1.2901  0.0946  0.1321  0.0538 
Elongation  0.2347  0.7908  0.1613  0.1416  0.0976 
Thickness  0.2577  1.2939  0.1017  0.1236  0.0370 
Number of Holes  0.2432  1.2753  0.0920  0.0974  0.0125 
PolygonBased Perimeter  0.3404  1.4055  0.1178  0.1380  0.0136 
PolygonBased Area  0.2631  1.3009  0.0962  0.1230  0.0324 
Compactness  0.0349  1.0355  0.1263  0.1223  0.7756 
Roundness  0.1258  1.1340  0.1212  0.1357  0.3539 
Circularity  0.0857  1.0895  0.1271  0.1531  0.5754 
Convexity  0.0800  1.0833  0.1244  0.1428  0.5752 
Density  0.2305  1.2593  0.1332  0.1528  0.1314 
Major Axis Length  0.3738  1.4532  0.1148  0.1363  0.0061 
Major Axis Angle  0.0744  0.9283  0.1117  0.1147  0.5169 
Minor Axis Length  0.2767  1.3188  0.1132  0.1317  0.0356 
Rectangularity  0.1342  1.1436  0.1213  0.1440  0.3514 
Eccentricity  0.2324  0.7926  0.1168  0.1207  0.0542 
Fibre Length  0.2481  1.2816  0.1023  0.1182  0.0358 
Fibre Width  0.3336  1.3959  0.1178  0.1375  0.0153 
Curl  0.0238  1.0240  0.1215  0.1198  0.8428 
Total Absolute Curvature  0.1121  1.1186  0.1200  0.1628  0.4912 
Bending Energy  0.1099  1.1162  0.1203  0.1623  0.4984 
Radial Mean  0.0862  1.0901  0.1226  0.1343  0.5209 
Radial SD  0.0670  0.9352  0.1188  0.1161  0.5642 
RadialBased Circularity  0.0720  1.0747  0.1172  0.1195  0.5466 
Radial Entropy  0.0507  0.9506  0.1174  0.1118  0.6503 
Area Ratio  0.0474  0.9537  0.1203  0.1226  0.6992 
Zerocrossing Count  0.1040  1.1096  0.1127  0.1275  0.4146 
Average Roughness Index  0.1156  0.8908  0.1552  0.2609  0.6577 
Margin Fluctuation  0.0669  0.9353  0.1104  0.1203  0.5785 
Bolding signifies features with value . 
All shapebased features, scaled like in the univariate analysis, were selected to build a multivariate Cox proportionalhazards model with a lasso penalty; using Harrell’s concordance measure, the penalty coefficient was determined through fold cross validation (glmnetsurvival; wang2018comprehensive). We chose the penalty coefficient such that this value was within standard error of the value that gave the minimum mean crossvalidated error (glmnetsurvival). The mean crossvalidated errors for various penalty coefficients are shown in Figure 7 (left).
Similarly, we summarized the standardized coefficients kept by the regularized CoxPH model in Table 5. Scaling the feature before fitting the model, allowed us to obtain the standardized coefficients. This allowed us to compare the regularized coefficients with the coefficients obtained in the univariate analysis (Table 4). Additionally, we can also show the importance of each feature in Figure 7 (right), based on the magnitudes of each regularized coefficient. Clustering of patient identifiers was not used in the regularized Cox proportionalhazards model; instead, we summarized the features at the patientlevel. The shapebased prognostic model, based on the predicted risk score of the regularized CoxPH model, was developed, and its predictive performance validated as follows.
Coefficient  exp(Coefficient)  

Aspect Ratio  –  – 
Net Area  –  – 
Elongation  0.3276  0.7206 
Thickness  –  – 
Number of Holes  0.0754  1.0783 
PolygonBased Perimeter  0.5106  1.6663 
PolygonBased Area  –  – 
Compactness  –  – 
Roundness  0.2139  1.2385 
Circularity  –  – 
Convexity  –  – 
Density  –  – 
Major Axis Length  –  – 
Major Axis Angle  0.1956  0.8223 
Minor Axis Length  0.0888  1.0929 
Rectangularity  –  – 
Eccentricity  0.2600  0.7710 
Fibre Length  –  – 
Fibre Width  –  – 
Curl  –  – 
Total Absolute Curvature  –  – 
Bending Energy  –  – 
Radial Mean  –  – 
Radial SD  –  – 
RadialBased Circularity  –  – 
Radial Entropy  –  – 
Area Ratio  –  – 
Zerocrossing Count  0.0054  1.0054 
Average Roughness Index  0.3468  1.4145 
Margin Fluctuation  –  – 
5.2 Predictive Performance
Since the multivariate model in Section 5.1 was based on the patientlevel features, similar processing was applied to the TCGA cohort. Additionally, the features were centered and scaled based on the means and standard deviations of the features in the NLST cohort. These transformations prevent bias and allow us to independently validate the prognostic performance of this model. A shapebased risk score was computed for each patient to assign patients into predicted highrisk and lowrisk groups, using the median as a cutoff (wang2018comprehensive). The survival curves for the predicted highrisk and lowrisk groups were estimated using the KaplanMeier method and are shown in Figure 8, as well as the results of the rank test to compare the survival difference between risk groups. We associate a worse survival outcome with the predicted highrisk group (rank test, value , Figure 8).
The prognostic performance of the shapebased risk score was validated by a multivariate CoxPH model and the resulting coefficients are summarized in Table 6. After adjusting for clinical variables, including age, gender, smoking status, and stage, the predicted risk groups independently predicted prognosis (highrisk vs. lowrisk, HR , CI –, value , Table 6).
Hazard Ratio (HR) with confidence interval (CI)  value  
HighRisk vs. LowRisk  2.11 (1.24–3.59)  0.0059 
Age  1.02 (0.99–1.04)  0.2008 
Female vs. Male  1.66 (0.94–2.96)  0.0829 
Smoker vs. NonSmoker  0.81 (0.48–1.36)  0.4200 
Stage II vs. Stage I  2.48 (1.34–4.59)  0.0038 
Stage III vs. Stage I  4.98 (2.59–9.61)  
Stage IV vs. Stage I  3.48 (1.46–8.28)  0.0047 
Bolding signifies features with pvalue . 
6 Discussion and Conclusion
A wholeslide image processing procedure was developed to extract tumor regions and compute shapebased descriptors from them. This procedure facilitated the analyses performed to study the association between tumor shape and patient survival outcome. While this procedure successfully extracted and visualized tumor regions, further inspection is necessary to ensure the extracted objects accurately represent the tumors in the wholeslide image. Specifically, various morphological operations (gonzalez2020digital) could be used in the procedure to improve the main tumor representation.
The univariate analysis discovered eight statistically significant features (at the level of significance); of these eight features, seven attempted to quantify the size and length of the tumor: thickness, polygonbased perimeter and area, major and minor axis length, and fibre length and width (Table 4). The only topological feature proved to be significant: the number of holes, which quantifies the degradations within the tumors. We note that none of these seven features were based on the derived shape representations; specifically, the thickness feature was based on the binary matrix, while the remaining features were based on the polygon chain. Therefore, with the features used, the derived shape representations were ineffective in diversifying the tumor shape. Interestingly, the results align with some of the significant features found in wang2018comprehensive such as the area, perimeter, major axis length, minor axis length, and the number of holes.
The multivariate analysis used regularization to avoid overfitting and selected nine features: perimeter, average roughness index, elongation, eccentricity, roundness, major axis angle, minor axis length, number of holes, and zerocrossing count (ordered from largest to smallest importance; Figure 7). Out of the nine features three were significant in the univariate analysis: perimeter, minor axis length, and the number of holes. Interestingly, the major axis angle, a negative control, was kept in the regularized model. Furthermore, the radialbased features, average roughness index, and zerocrossing count, insignificant in the univariate analysis, were also kept by the model. This indicates that further research regarding the derived shape representations and their features should be conducted. The resulting regularized model was used to compute patient risk scores to, subsequently, dichotomize patients into highrisk and lowrisk groups. Through an independent cohort, the predicted risk score proved to be a significant prognostic factor, alongside other clinical variables (Table 6).
Tumor regions obtained from wholeslide images tend to have a complex shape and boundary. These complexities can, potentially, introduce noise into the shape representations. As a result, this noise disrupts the computed features. This issue could be resolved by introducing noiseresistant shape representations and features. We also cannot disregard the dimensional spatial information that is lost due to twodimensional imaging procedures. Therefore, we can apply feature extraction to other imaging techniques, such as computed tomography (CT) colonography, magnetic resonance imaging (MRI), and positron emission tomography (PET)/CT colonography, to produce more comprehensive and diverse shapebased features that further quantify the tumor shape, geometry, and topology as well as improving the performance of current riskbased prognostic models (wang2018comprehensive).
In this study, we developed a wholeslide image processing procedure and shapebased prognostic model to evaluate the relationship between tumor shape, geometry, and topology and patient survival outcome. Our image processing procedure, efficiently, extracted tumor regions, and computed shapebased features from pathology images. This procedure can be adapted to include more shape representations and features; it can also be adapted to other cancer types such as the brain, breast, and kidney cancer. The univariate and multivariate provided new insights into the relationship between tumor shape and patient prognosis. Additionally, our results supported the findings in wang2018comprehensive. Our shapebased prognostic model, adapted from wang2018comprehensive, predicted patient risk scores. These risk scores served as a prognostic factor, independent of other clinical variables. In all, the proposed processing and model development pipeline provided an objective prognostic method as well as contributing to the shape analysis of tumors.
Appendix A Additional Formulas
a.1 Centroid
The centroid of a shape , also known as the center of gravity, is given by
where
is the signed area of the shape, obtained using Gauss’s area formula. For a region, represented in a twodimensional discrete plane, the centroid is given by
where is the number of points that make up the region.^{5}^{5}5 We can see that the coordinates of the centroid are the means of the coordinates that make up the region.
Comments
There are no comments yet.