Deriving Enhanced Geographical Representations via Similarity-based Spectral Analysis: Predicting Colorectal Cancer Survival Curves in Iowa

09/06/2018 ∙ by Michael T. Lash, et al. ∙ The University of Iowa

Neural networks are capable of learning rich, nonlinear feature representations shown to be beneficial in many predictive tasks. In this work, we use such models to explore different geographical feature representations in the context of predicting colorectal cancer survival curves for patients in the state of Iowa, spanning the years 1989 to 2013. Specifically, we compare model performance using "area between the curves" (ABC) to assess (a) whether survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical features improve predictive performance, (c) whether a simple binary representation or a richer, spectral analysis-elicited representation performs better, and (d) whether spectral analysis-based representations can be improved upon by leveraging geographically-descriptive features. In exploring (d), we devise a similarity-based spectral analysis procedure, which allows for the combination of geographically relational and geographically descriptive features. Our findings suggest that survival curves can be reasonably estimated on average, with predictive performance deviating at the five-year survival mark among all models. We also find that geographical features improve predictive performance, and that better performance is obtained using richer, spectral analysis-elicited features. Furthermore, we find that similarity-based spectral analysis-elicited representations improve upon the original spectral analysis results by approximately 40%.




1 Introduction

As machine learning has become more prevalent, powerful new technologies such as deep learning, which are capable of learning rich, non-linear representations, have also risen to the forefront of the field. The domains of public health and medicine have particularly benefited from these innovations; in this work we examine and propose deep learning methodologies applied to these areas. The focus of this work, therefore, is to explore how different geographical representations, learned through deep learning technologies, can improve survival curve predictions for colorectal cancer patients in the state of Iowa.

Figure 1 demonstrates the urgency of the problem we are addressing, showing colorectal cancer (CRC) mortality rates for patients in Iowa spanning the years 1989 to 2013; these are expressed in terms of a zipcode tabulation area (ZCTA) level of geography.

Figure 1: Colorectal cancer mortality rate by ZCTA in the state of Iowa for the years 1989 to 2013.

In Figure 1 we first observe that numerous ZCTAs have CRC mortality rates at or above 30%, indicating the particularly pernicious nature of this disease and highlighting the need for accurate survival outlook predictions at the time of diagnosis to better inform treatment decisions Zhang et al. (2015). Furthermore, Figure 1 demonstrates the geographical diversity with which CRC mortality rates are manifested: locale seems to be related to survival outlook.

The relationship between geography and survival outlook is, unfortunately, not unexpected. Physical locale reflects pertinent health-based factors, such as access to health care, and environmental factors, such as ground contaminants, among others, all of which may affect disease manifestation and survival outlook Wan et al. (2013).

Given the spatially heterogeneous manifestation of colorectal cancer mortality, a major challenge is to build spatially responsive models that can aid in accurate prediction of individual-specific colorectal cancer survival curves. For instance, rural areas may have a different variety of factors affecting colorectal cancer disease manifestation and survival than sprawling metropolitan cities. Therefore, we define and examine three geographical deep learning representation methods in this work: a simple binary representation (SBR), rich representation – spectral analysis (RR-SA), and rich representation – similarity-based spectral analysis (RR-SSA); we additionally craft two sub-representation methods that are utilized in RR-SSA.

The contributions of this work, which expand upon the results obtained in Lash, Sun, Zhou, Lynch & Street (2017), are enumerated as follows:

  1. We investigate a rich representation of spatial features through spectral analysis (RR-SA) of the underlying geographical relationship graph of the ZCTAs to address the spatial heterogeneity challenge.

  2. Modifying our RR-SA representation procedure, we explore the use of geographically descriptive features, paired with the underlying adjacency graph, to further address the spatially heterogeneous nature of the problem.

  3. We determine whether the simple binary representation (SBR) or richer, spectral analysis representation (RR-SA), or similarity-based spectral analysis representation (RR-SSA) leads to more accurate survival curve predictions.

  4. We determine whether RR-SA or RR-SSA representations lead to more accurate survival curve predictions and determine which sub-representation procedure – binary (bin) or full – produces more accurate survival curve predictions.

This work continues with a description of the problem setting, followed by our three methods of geographic representation and two sub-representation methods; we also present a graphic depicting the deep architecture of each method (Section 2). In Section 3, we describe our colorectal cancer patient dataset, containing approximately 46,000 individuals residing in Iowa at the time of their diagnosis; the dataset spans the years 1989 to 2013. Furthermore, we relate our geographical feature dataset, along with our experiments. In Section 4 we discuss works related to ours prior to concluding the paper in Section 5.

2 Learning Geographical Representations for Survival Curve Prediction

Prior to disclosing our methodology, we relate some preliminary notation, subsequently discussing and mathematically formulating the problem setting. Following this disclosure we reformulate the problem as one of Kaplan-Meier survival curve prediction before introducing and elaborating on our three methods of geographical representation learning and two sub-representations.

2.1 Preliminaries

Define $\mathcal{D} = \{(\mathbf{x}_i, e_i, t_i)\}_{i=1}^{n}$ to be a dataset of $n$ instances, where feature vector $\mathbf{x}_i \in \mathbb{R}^{d}$, event label $e_i \in \{0, 1\}$, and time of event occurrence $t_i \in \{1, \dots, T\}$; here $t_i$ represents a discrete time at which an event has occurred (i.e., $e_i = 1$) or the last discrete time at which instance $i$ is observed while an event has not occurred (i.e., $e_i = 0$). In this latter case ($e_i = 0$), when $t_i = T$ we know the event never occurs to the instance during the study period (spanning $T$ discrete time periods). If, on the other hand, $t_i < T$, then we only know that the instance did not experience the event up to $t_i$, but do not know what happened during the remaining time. Event-time data of this kind are called censored data and, more specifically, right-censored data. A censoring of instance $i$ occurs when $e_i = 0$ and $t_i < T$. We elaborate on the handling of these censored data in a subsequent subsection.

To be more concrete, each time unit $t$ may represent (as is the case in our experiments) a six-month patient follow-up period, with the first time unit designating the entrance of a patient to the study. Study entrance occurs when a diagnosis of colorectal cancer is rendered. When an instance (i.e., patient) dies from colorectal cancer ($e_i = 1$), then $t_i$ designates the time at which this event occurred. Alternatively, a patient may pass away from complications not related to their colorectal cancer, or may move elsewhere, switch doctors, or for some other reason become untrackable prior to the conclusion of the study period; in that case $e_i = 0$ and $t_i < T$, indicating a censoring.

Patient instance vectors $\mathbf{x}_i$ represent quantified measurements of pertinent patient-based features. Later in this work, we will make reference to certain feature groups of which these instance vectors are composed. Therefore, we define notation that conveniently relates to these groups. To such an end, let $G$ denote the full set of index values that reference the geographical features of $\mathbf{x}_i$; further, let $F$ denote the full set of index values of $\mathbf{x}_i$, such that $G \subset F$. We will use these index sets to make direct reference to the feature grouping components of $\mathbf{x}_i$; for instance, $\mathbf{x}_{i,G}$ is the subvector of instance $i$ housing the geographical feature values. Furthermore, using set difference notation, $\mathbf{x}_{i,F \setminus G}$ refers to the subvector of instance $i$ containing feature values that are non-geographical.

Table 1: Notation used throughout this work.

Notation                      Description
$\mathbf{x}_i$                Feature vector of instance $i$.
$e_i$                         Event label of instance $i$.
$t_i$                         Discrete time of $e_i$.
$\mathbf{y}_i$                Outcome vector of instance $i$.
$\hat{\mathbf{y}}_i$          Predicted outcome vector of instance $i$.
$G$                           Set of geographical feature index values.
$F$                           Set of all feature index values.
$\mathcal{M}$                 A map.
$g(\cdot)$                    Function that determines discrete geographic entity membership.
$P(\cdot)$                    Calculation of a probability.
$h(\cdot)$                    Neural network.
$\mathcal{L}$                 An arbitrary loss function.
$s(\cdot)$                    Output smoothing function.
$\mathbf{A}$                  Adjacency matrix constructed from $\mathcal{M}$.
$\mathbf{S}$                  SSA-elicited affinity matrix.
$\mathbf{Z}$                  Design matrix for geographical entities (descriptive geo feats).
$\mathrm{adj}(\cdot,\cdot)$   Function that determines whether two geographic entities in $\mathcal{M}$ are adjacent.
$\mathbf{V}_k$                Top $k$ eigenvectors from $\mathbf{V}$, selected based on the $k$ largest eigenvalues in $\boldsymbol{\lambda}$.
$\mathbf{c}$                  The result of applying kMeans clustering to $\mathbf{V}_k$.
$\mathrm{enrich}(\cdot)$      Function that assigns values in $\mathbf{V}_k$ to an instance.
$\mathrm{SSA}(\cdot)$         SSA procedure to produce $\mathbf{S}$.

For convenience, we provide the notation related in this and subsequent sections in Table 1.

2.2 Kaplan-Meier Re-representation

To begin elaborating on the censored nature of our data: as mentioned in the previous section, instance $i$ has an event label $e_i$ and a discrete time of event occurrence $t_i$. Given this, the goal is to transform this two-valued representation into a Kaplan-Meier survival curve (KMSC) representation Kaplan & Meier (1958). A KMSC, simply put, associates each temporal unit with the probability of the disease event not occurring at that temporal unit, dependent upon the probability of the event not occurring at the preceding temporal unit, for each instance $i$.

More formally, the KMSC re-representation is in the form of a vector, denoted $\mathbf{y}_i$, where the index values express the temporal units and the indexed values denote the respective probabilities.

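Our KMSC re-representation scheme was originally outlined by Chi et al. (2007). To instantiate the vector $\mathbf{y}_i$, the following is conducted:

$$y_{i,t} = \begin{cases} 1 & \text{if } t \le t_i, \\ 0 & \text{if } e_i = 1 \text{ and } t > t_i, \\ y_{i,t-1}\,\bigl(1 - P(t \mid t-1)\bigr) & \text{if } e_i = 0 \text{ and } t > t_i, \end{cases} \qquad (1)$$

where $P(t \mid t-1)$ denotes the conditional probability of the event occurring at $t$ provided that it has not occurred at $t-1$. As such, for patients whose CRC outcomes are known, $\mathbf{y}_i$ exhibits values that are strictly 0 and 1. On the other hand, a censored patient's vector exhibits estimates of survival probability beginning at index position $t_i + 1$; the ensuing values are conditional probability estimates.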
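The re-representation can be illustrated with a short sketch. It assumes, as an illustration only, that the conditional event probability at period t is estimated as the number of events recorded at t divided by the number of patients still under observation at t; the function name `km_targets` and its arguments are hypothetical and not taken from the original work.

```python
def km_targets(events, times, T):
    """Build one length-T survival target vector per patient.

    events[i] is 1 if patient i experienced the event at times[i],
    and 0 if the patient was last observed (censored) at times[i].
    Periods are numbered 1..T; times are assumed to lie in 1..T.
    """
    n = len(events)
    # Assumed hazard estimate: events at period t / patients at risk at t.
    hazard = [0.0] * (T + 1)
    for t in range(1, T + 1):
        at_risk = sum(1 for i in range(n) if times[i] >= t)
        deaths = sum(1 for i in range(n) if times[i] == t and events[i] == 1)
        hazard[t] = deaths / at_risk if at_risk else 0.0

    targets = []
    for e, ti in zip(events, times):
        y = []
        for t in range(1, T + 1):
            if t <= ti:
                y.append(1.0)  # patient observed alive through period t
            elif e == 1:
                y.append(0.0)  # event has already occurred
            else:
                prev = y[-1] if y else 1.0
                y.append(prev * (1.0 - hazard[t]))  # censored: estimate
        targets.append(y)
    return targets


# Three patients followed over T = 3 periods.
print(km_targets([1, 0, 0], [2, 1, 3], 3))
```

Patients with a known outcome receive vectors of 0s and 1s, while a censored patient's vector carries survival estimates from the period after censoring onward.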

2.3 Predicting Individual KMSC

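Our goal in this work is to induce an optimal hypothesis $h^{*}$ of some [presently] arbitrarily defined hypothesis class $\mathcal{H}$ that is most apt at predicting instance-specific KMSCs. We formalize this problem as:

$$h^{*} = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} \mathcal{L}\bigl(h(\mathbf{x}_i), \mathbf{y}_i\bigr), \qquad (2)$$

where $\mathcal{L}$ expresses some loss function that measures the divergence between the predicted $h(\mathbf{x}_i)$ (henceforth expressed $\hat{\mathbf{y}}_i$) and the known $\mathbf{y}_i$.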

The hypothesis class explored in this work is defined as both deep and shallow neural network architectures, the specifics of which are disclosed later in this section; we discuss the specific parameterizations employed across our experiments in the experiments section (Section 3). Deep neural network architectures are characterized by multiple hidden layers, and shallow architectures by a single hidden layer.

2.3.1 Output Smoothing

Construction of a neural network model is accomplished in layer-wise fashion, where a particular layer is composed of nodes. The first layer in a neural network is designated as the input layer, which is followed by any number of so-called hidden layers, the last of which is connected to the output layer. The output layer is somewhat unique to our problem setting of predicting KMSCs. First, the predicted probabilities elicited from the output nodes are ordered. In other words, the output $\hat{y}_{i,t}$ is ordered before $\hat{y}_{i,t+1}$ because $t$ temporally occurs before $t+1$. Second, the ordered output probabilities of these nodes should be strictly decreasing: i.e., $\hat{y}_{i,t} > \hat{y}_{i,t+1}$. The reasoning behind this "strictly decreasing" expectation is intuitive: the probability of survival, of a disease or otherwise, even after disease recovery, is never expected to go up. The loss function employed to induce multiple-output networks, such as those in our problem setting, elicits a single loss value representing the loss across all nodes, meaning the desired strictly decreasing output cannot be guaranteed. In light of this, we develop a smoothing operation, denoted $s(\cdot)$, formally expressed by


guaranteeing that the post-processed (i.e., smoothed) model output is strictly decreasing.
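As an illustration of what such a post-processing step might look like, the sketch below applies a running minimum to the ordered outputs. This is an assumed stand-in rather than the operator used in this work, and it only guarantees a non-increasing curve (ties are possible); the name `smooth` is hypothetical.

```python
def smooth(pred):
    """Return a non-increasing copy of an ordered list of survival outputs.

    Each value is replaced by the smallest value seen so far, so a later
    period can never report a higher survival probability than an earlier one.
    """
    out = []
    running = float("inf")
    for p in pred:
        running = min(running, p)
        out.append(running)
    return out


print(smooth([0.9, 0.95, 0.7, 0.8, 0.2]))
```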

2.4 Geographic Feature Representation

Although our primary concern is to elicit a that produces the most accurate predictions, the novelty of the work is to:

  1. Demonstrate that geographic-based feature representations enhance the quality of predictions.

  2. Explore whether simple binary representations or richer representations (defined shortly) produce more accurate predictions.

  3. Quantify the extent to which these representations improve predictive quality.

The details of our experiments and data are elaborated on in the next section where we explore three different geographical representations: a simple binary representation (SBR), a rich representation based on spectral analysis (RR-SA), and a rich representation employing similarity-based spectral analysis (RR-SSA).

2.4.1 Simple Binary Representation

Our simple binary representation (SBR) is minimalist in nature, the procedure consisting only of (a) determining the discrete geographic entity membership of instance $i$ and (b) re-representing such membership in binary form (referred to as one-hot encoding), thus eliciting a sparse vector-based encoding with a 1 in the index position referring to the geographic entity of which $i$ is a member, and 0s in the remaining positions.

To devise a formulation that is aptly generalizable we assume that the geographic features of instance $i$, expressed as $\mathbf{x}_{i,G}$, are defined such that the encoded values are capable of eliciting the discrete geographic unit of which $i$ is a member (e.g., coordinates). For instance, we employ ZCTAs (zipcode tabulation areas) as the discrete geographic unit in our experiments.

A formal procedure for eliciting discrete geographic unit membership can be expressed as

$$z_i = g(\mathbf{x}_{i,G}), \qquad (4)$$

where the function $g$ performs a transformation on $\mathbf{x}_{i,G}$, the geographic feature values of the instance, to some identification (ID) value, which we denote as $z_i$. This value denotes the unique, discrete geographic entity, belonging to map $\mathcal{M}$ (which we define momentarily), of which instance $i$ is a member. The values represented by $\mathbf{x}_{i,G}$, along with the information expressed in map $\mathcal{M}$, dictate the procedure used by $g$ to perform the transformation.

The specific geographic features employed in this work are (latitude, longitude) coordinate pairs and, as such, we specify a definition (referred to as Definition 2.1) of map $\mathcal{M}$ using geography defined in terms of these coordinate pairs.

Definition 2.1

Define $\mathcal{M}$ to be a map, given by

$$\mathcal{M} = \{(p_j, B_j)\}_{j=1}^{m}, \qquad (5)$$

where $p_j$ is the unique postal code of geographic unit $j$ and $B_j$ is an ordered set of (lat,lon) coordinate pairs denoting the bounding geographic region of $j$.

We characterize map as a continuous geographic region by


where ).

Provided our definition of map $\mathcal{M}$, expressed in Definition 2.1, we define $g$ to be a function that takes (lat,lon) coordinate pairs and determines whether the point is on the interior of each ZCTA. The zipcode ID, corresponding to the ZCTA having the point on the interior, is subsequently initialized as the value of $z_i$ (i.e., $z_i = p_j$). Following this outlined procedure, an additional binarization procedure, often referred to as one-hot encoding, which we denote $\mathrm{onehot}(\cdot)$, is employed to produce a sparse vector representation consisting of a single 1 in the position referencing the ZCTA to which instance $i$ belongs, and 0s in other positions.
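The two SBR steps, membership lookup followed by one-hot encoding, can be sketched as follows. The ray-casting interior test is one common way to decide whether a point lies inside a polygon and is an assumption here, as the text does not specify which test is used; the function names and the toy map with codes "A" and "B" are hypothetical.

```python
def point_in_polygon(lat, lon, boundary):
    """Ray-casting test: is (lat, lon) inside the polygon `boundary`?"""
    inside = False
    n = len(boundary)
    for a in range(n):
        lat1, lon1 = boundary[a]
        lat2, lon2 = boundary[(a + 1) % n]
        # Does this edge straddle the line of constant latitude through the point?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that line.
            cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross:
                inside = not inside
    return inside


def membership(lat, lon, geo_map):
    """Return the postal code of the first unit containing the point, else None."""
    for code, boundary in geo_map:
        if point_in_polygon(lat, lon, boundary):
            return code
    return None


def one_hot(code, geo_map):
    """Sparse 0/1 vector with a single 1 at the position of `code`."""
    return [1 if c == code else 0 for c, _ in geo_map]


# Toy map: two adjacent unit squares sharing an edge.
toy_map = [
    ("A", [(0, 0), (0, 1), (1, 1), (1, 0)]),
    ("B", [(0, 1), (0, 2), (1, 2), (1, 1)]),
]
code = membership(0.5, 1.5, toy_map)
print(code, one_hot(code, toy_map))
```

With hundreds of geographic units the resulting vector is long and almost entirely zero, which is the sparsity that the richer representations aim to avoid.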

Figure 2: SBR neural network architecture.

Figure 2 illustrates the network architecture using the SBR methodology.

We expect that models using the SBR representation will outperform models trained without geographic features.

While models induced using the SBR representation may enjoy some predictive performance improvement over hypotheses induced on strictly non-geographic features, representations that consist of richer geographic encodings, capable of modeling the continuous nature of the geographic region of study, promise to produce even better results.

2.4.2 Spectral Analysis Representation

To elicit richer geographical representations, we devise a spectral analysis approach, based on a well-known procedure referred to as spectral clustering. The method begins by first computing an adjacency matrix among the discrete geographic entities represented in $\mathcal{M}$. Subsequently, spectral analysis solves for the eigenvectors and eigenvalues of the adjacency representation, selecting the top $k$ eigenvectors, based on the $k$ largest eigenvalues. The elicited representation is an $m \times k$ matrix, where the rows refer to the $m$ discrete geographic entities (i.e., a single row refers to one of the entities). The values that compose each row are used as geographic predictive input features.

To more formally relate this spectral analysis procedure, define $\mathbf{A}$ to be the $m \times m$ affinity (i.e., adjacency, similarity) matrix, in which the $(j,l)$-th entry relates the geographic adjacency relationship among the $j$-th and $l$-th entities. We express this by

$$A_{jl} = \mathrm{adj}(B_j, B_l), \qquad (7)$$

where the function $\mathrm{adj}$ determines if $B_j$ and $B_l$ have a common element. Provided $\mathcal{M}$, related by Definition 2.1, $\mathrm{adj}$ computes whether or not $B_j$ and $B_l$ share at least one coordinate pair.

Subsequently, spectral clustering is executed by performing $k$-Means clustering, $\mathbf{c} = \mathrm{kMeans}(\mathbf{V}_k)$, where the function $\mathrm{kMeans}$ assigns one of the $k$ cluster labels to each of the $m$ entities represented in $\mathbf{V}_k$; where

$$\mathbf{V}_k = \mathrm{top}_k(\mathbf{V}, \boldsymbol{\lambda}). \qquad (8)$$

The function $\mathrm{top}_k$ searches and finds the $k$ largest values in $\boldsymbol{\lambda}$, selects the appropriate columns in the matrix $\mathbf{V}$, and creates the $m \times k$ matrix $\mathbf{V}_k$. The matrix $\mathbf{V}$, composed of eigenvectors, and corresponding vector $\boldsymbol{\lambda}$, composed of eigenvalues, are obtained by solving the system of equations related by

$$\mathbf{A}\mathbf{V} = \mathbf{V}\,\mathrm{diag}(\boldsymbol{\lambda}). \qquad (9)$$

Here, the columns of $\mathbf{V}_k$ are used as the geographical features when inducing a hypothesis $h$; we refer to this use of the matrix $\mathbf{V}_k$ as spectral analysis. The labels, $\mathbf{c}$, obtained from application of the clustering procedure, referred to as spectral clustering, are used to visualize the elicited representations obtained from our experiments, related in the next section. Spectral analysis avoids the binarized label assignments of the spectral clustering procedure, instead using the matrix $\mathbf{V}_k$ directly, which preserves cluster composition and is a richer (i.e., non-sparse) representation.

In Algorithm 1, we relate the spectral clustering process; the final line (the kMeans step) differentiates spectral clustering from spectral analysis, and omission of this line produces the spectral analysis procedure.

Algorithm 1 Spectral Clustering
1:  Obtain adjacency matrix $\mathbf{A}$ using (7).
2:  Solve (9) for $\mathbf{V}$ and $\boldsymbol{\lambda}$.
3:  Obtain $\mathbf{V}_k$ as outlined in (8).
4:  Apply kMeans clustering to $\mathbf{V}_k$ to obtain $\mathbf{c}$.
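The spectral analysis portion of this procedure (the algorithm without its clustering step) can be sketched with NumPy. The sketch assumes a symmetric 0/1 adjacency matrix and decomposes it directly, as the text describes; the function name `spectral_features` and the four-node toy graph are hypothetical.

```python
import numpy as np


def spectral_features(adjacency, k):
    """Return the k eigenvectors of a symmetric adjacency matrix with the
    largest eigenvalues, plus those eigenvalues in descending order."""
    A = np.asarray(adjacency, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)   # eigh: symmetric input, ascending order
    top = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    return eigvecs[:, top], eigvals[top]


# Toy map of four entities arranged in a row: 0-1-2-3.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
V_k, lam = spectral_features(A, k=2)
print(V_k.shape)   # one row of k features per geographic entity
```

Each row of the returned matrix is the dense geographic feature vector for one entity. Running any k-means implementation on those rows would recover the spectral clustering labels used for visualization.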

Simply put, spectral analysis is a sub-procedure of the spectral clustering process, yielded by omitting the clustering step.

Finally, for a test instance $i$, a process is executed to obtain the appropriate $k$-valued entry of $\mathbf{V}_k$ that is associated with the particular geographic entity that the test instance belongs to. Algorithm 2 outlines this procedure.

Algorithm 2 Enrich Geographic Features
1:  $z_i \leftarrow$ From (4).
2:  Using $\mathcal{M}$ find the $j$ such that $p_j = z_i$.
3:  Return column vector $(\mathbf{V}_k)_{j,:}^{\top}$.
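The lookup can be written in a few lines. The sketch below assumes the postal codes are held in a list aligned with the rows of the eigenvector matrix, and it additionally appends the retrieved values to the non-geographic features to form a model input; the name `enrich` follows Table 1, while the argument names and toy values are hypothetical.

```python
def enrich(zcta_id, postal_codes, V_k, non_geo):
    """Append the k spectral features of a patient's ZCTA to the
    patient's non-geographic feature values."""
    j = postal_codes.index(zcta_id)      # row of V_k for this geographic entity
    return list(non_geo) + list(V_k[j])


codes = ["A", "B"]
V_k = [[0.1, 0.2],
       [0.3, 0.4]]
print(enrich("B", codes, V_k, [5.0]))
```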

The deep learning network architecture outlining the spectral analysis procedure, in conjunction with learning a hypothesis, is depicted in Figure 3. (As previously mentioned, $\mathbf{x}_{i,G}$ denote (latitude, longitude) coordinate pairs.)

Figure 3: RR-SA neural network architecture.

2.4.3 Similarity-based Spectral Analysis Representation

While the above spectral analysis-based approach produces richer geographic representations, the process is capable of leveraging only geographically relational information, as the input affinity matrix must be square (i.e., $m \times m$). It may, however, be beneficial to leverage encodings that are both relational and descriptive in nature, as geographically-descriptive features, such as population demographics and types of land-use, may further enrich the spectral analysis-elicited representations.

To allow for such input matrices we adjust our spectral analysis process to first calculate the pairwise similarity among geographic entities, thus producing a square affinity matrix on which the spectral analysis procedure can be performed.

More formally, recall the previously discussed adjacency matrix $\mathbf{A}$ and define $\mathbf{Z}$ to be an $m \times q$ geographic entity design matrix whose rows represent the $m$ geographic entities and whose columns represent $q$ features. Additionally, $\mathbf{Z}$ is constructed such that the $j$th row of $\mathbf{A}$ and the $j$th row of $\mathbf{Z}$ refer to the same entity.

Using $\mathbf{A}$ and $\mathbf{Z}$, along with a similarity measure, we devise two sub-representation methods from which an $m \times m$ affinity matrix can be derived.

The first sub-representation uses a single binary feature to indicate spatial adjacency between two geographic entities along with the geographically descriptive features of each entity to yield two vectors $\mathbf{u}_j$ and $\mathbf{u}_l$. These can be formally expressed as

$$\mathbf{u}_j = A_{jl} \oplus \mathbf{Z}_j, \qquad \mathbf{u}_l = A_{lj} \oplus \mathbf{Z}_l, \qquad (10)$$

where $\oplus$ represents the concatenation of a scalar or vector with another vector (in this case, it is scalar-vector concatenation) and $\mathbf{Z}_j$ is the $j$th row of $\mathbf{Z}$. Also, note that $A_{jl} = A_{lj}$. We term this sub-representation method SSA (bin).

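The second sub-representation uses full geographic entity adjacency vectors instead of a definitive indicator of immediate spatial adjacency. This sub-representation method can be expressed by

$$\mathbf{u}_j = \mathbf{A}_j \oplus \mathbf{Z}_j, \qquad \mathbf{u}_l = \mathbf{A}_l \oplus \mathbf{Z}_l, \qquad (11)$$

where, in this case, $\oplus$ indicates two vectors being concatenated and $\mathbf{A}_j$ is the $j$th row of $\mathbf{A}$. We term this sub-representation method SSA (full).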

Subsequently, using either SSA (bin) or SSA (full), the cosine similarity, denoted as $\cos(\mathbf{u}_j, \mathbf{u}_l)$, between $\mathbf{u}_j$ and $\mathbf{u}_l$ is computed, thus producing an $m \times m$ affinity matrix $\mathbf{S}$.

Algorithm 3, which denotes the procedure as $\mathrm{SSA}(\cdot)$, fully discloses this process, using either SSA (bin) or SSA (full), while Figure 4 expresses the process in the context of the neural network architecture.

Algorithm 3 Sub-rep to affinity
1:  for $j = 1, \dots, m$ do
2:     for $l = 1, \dots, m$ do
3:        if SSA (bin) then
4:           Define $\mathbf{u}_j$ and $\mathbf{u}_l$ according to (10).
5:        else
6:           Define $\mathbf{u}_j$ and $\mathbf{u}_l$ according to (11).
7:        end if
8:        $S_{jl} \leftarrow \cos(\mathbf{u}_j, \mathbf{u}_l)$
9:     end for
10:  end for
11:  Return $\mathbf{S}$

Figure 4: RR-SSA neural network architecture.
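The sub-representation-to-affinity procedure can be sketched in plain Python. The sketch takes an adjacency matrix and a descriptive design matrix as nested lists, builds the paired vectors under either sub-representation, and fills the affinity matrix with cosine similarities. The function names, the `mode` argument, and the toy matrices are hypothetical, and returning 0 for a zero-length vector is an assumption made only to avoid division by zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ssa_affinity(A, Z, mode="full"):
    """Build an m x m affinity matrix from adjacency A and descriptive features Z."""
    m = len(A)
    S = [[0.0] * m for _ in range(m)]
    for j in range(m):
        for l in range(m):
            if mode == "bin":
                # Single adjacency indicator, then the entity's descriptive features.
                u = [A[j][l]] + list(Z[j])
                v = [A[l][j]] + list(Z[l])
            else:
                # Full adjacency row, then the entity's descriptive features.
                u = list(A[j]) + list(Z[j])
                v = list(A[l]) + list(Z[l])
            S[j][l] = cosine(u, v)
    return S


A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]          # three entities in a row
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]       # two descriptive features each
print(ssa_affinity(A, Z, mode="bin")[0][1])
```

The resulting square matrix can then be passed through the same eigen-decomposition used for RR-SA in place of the raw adjacency matrix.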

3 Predicting Colorectal Cancer Survival

We begin this section with an in-depth disclosure of the data employed in our experiments, subsequently outlining the technicalities involved in undertaking these experiments. Finally, we provide a discussion of the results elicited from performing these experiments by comparing the average predicted survival curve of each method against the average actual survival curve, leveraging a devised measure, referred to as area between the curves (ABC), discussed further on in this section.

3.1 Colorectal Cancer Survival Data for the State of Iowa

Our data were provided by the Iowa Cancer Registry (ICR), State Health Registry of Iowa (SHRI), and the Iowa Department of Public Health (IDPH). Each instance represents a patient who has been diagnosed with colorectal cancer and whose residence at the time of diagnosis is in the state of Iowa. The dataset consists of patients and, initially, features. After removing identifiers and features having a large number of instances with missing values (% missing 50%), we were left with distinct features (including unprocessed geographic coordinates). After binarizing discrete features, (excluding geographic features). When using SBR geographical re-representation, ( non geographic features and binarized geographic features), and when using the RR-SA geographic representation (where is parameterized and therefore user-dependent). When the Kaplan-Meier re-representation is applied to the dataset, we obtain vectors having elements, where each element represents the patient’s current vital status (alive or dead), or a probability of survival when an instance becomes censored, as described by (1). Each represents six months.

The distinct non-geographic features pertain to various patient-specific characteristics, which can be categorized as disease-based and demographic-based. Disease-based features include tumor grade, tumor histology and tumor marker; we show a histogram of tumor grade in Figure 5. Demographic-based features include marital status, race, and age at diagnosis; we show a histogram of age at diagnosis in Figure 6. These selected features (age and tumor grade) have been shown to be indicative of not receiving timely cancer treatment Ward et al. (2013), which we believe will help in predicting cancer survival, although analysis of such factors is beyond the scope of this work.

Figure 5: Tumor grade at diagnosis for patients in the state of Iowa: Years 1989 to 2013.
Figure 6: Age of colorectal cancer diagnosis for patients in the state of Iowa: Years 1989 to 2013.

3.2 Geographically-descriptive Features for the State of Iowa

We obtain geographically-descriptive features for the state of Iowa at the ZCTA-level of spatial granularity from the US Census Bureau’s American FactFinder 2 website. Three different geographically-descriptive features were obtained for each of the 978 ZCTAs in Iowa: population age demographics, land type, and median household income. Population age demographics and land type are categorical features and were represented in terms of proportional bins (e.g., "percentage of population aged 0-5 years").

3.3 Predictive Setting, Parameterization and Results

As outlined in the introduction, we wish to address the following:

  1. On average, can colorectal cancer survival curves be reasonably predicted for patients in the state of Iowa?

  2. Do geographic features improve the quality of predicted colorectal cancer survival curves for patients in the state of Iowa?

  3. Do richer geographical feature representations improve predictive performance more than simpler representations?

  4. Can predictive performance be further improved by altering the RR-SA procedure to accommodate adjacency-descriptive geographical feature pairings (i.e., RR-SSA)?

  5. Which RR-SSA representation improves predictive performance the most: binary (bin) or full?

To such an end, we propose to use 10-fold validation where, for each fold, we find a hypothesis $h^{*}$ for each of the following types of model:

  1. A model constructed using no geographical features (No Geo).

  2. A model constructed using SBR-derived geographical features, as outlined by Figure 2 (SBR).

  3. Models constructed using RR-SA-derived geographical features, as outlined by Figure 3, where several values of $k$ will be explored (RR-SA).

  4. Models constructed using RR-SSA-derived geographical features, as outlined by Figure 4, using the binary-based adjacency representation, where several values of $k$ will be explored (RR-SSA (bin)).

  5. Models constructed using RR-SSA-derived geographical features, as outlined by Figure 4, using the full adjacency representation, where several values of $k$ will be explored (RR-SSA (full)).

We then assess predictive performance by computing each model’s average survival curve prediction on the test set, taken over the 10 folds, as compared to the actual average survival curve, taken over all $\mathbf{y}_i$, using a measure termed area between curves (ABC) that measures the area-wise disparity between the actual and predicted curves Lash, Sun, Zhou, Lynch & Street (2017).
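One way to compute such an area-wise disparity is sketched below. It applies the trapezoidal rule to the absolute gap between two curves sampled at the same equally spaced time points; this is an assumed formulation, since the exact ABC definition is given in the cited prior work rather than here. The helper for relative improvement mirrors how percentage gains between two ABC values are typically reported. Both function names are hypothetical.

```python
def area_between_curves(actual, predicted, dt=1.0):
    """Trapezoidal estimate of the area between two equally sampled curves.

    Uses the absolute gap at each time point, so the estimate is approximate
    wherever the curves cross between samples.
    """
    gaps = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum((g0 + g1) / 2.0 * dt for g0, g1 in zip(gaps, gaps[1:]))


def relative_improvement(baseline_abc, new_abc):
    """Fractional reduction in ABC relative to a baseline (lower ABC is better)."""
    return (baseline_abc - new_abc) / baseline_abc


actual = [1.0, 0.8, 0.6]
predicted = [1.0, 0.7, 0.6]
print(area_between_curves(actual, predicted))
```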

3.3.1 Model Parameterization

Our models are constructed using TensorFlow, employing fully connected layers, trained using sigmoidal cross entropy as the loss function. The logistic activation function is used for all nodes. Each model is trained using a maximum of

epochs with batch size ranging from to . While the connectedness of the architecture, activation function, epochs, and batch size are all tunable parameters, we elect to focus on finding the optimal number of hidden layers and corresponding hidden nodes for each layer (note that epochs of , , , and were explored). Table A.1, in the Appendix section, shows the average optimal architecture for each of the models, taken over the 10 folds.

Figure 7: Actual vs. predicted average survival curves, by model. Panels: (a) no geographic features; (b) SBR; (c)-(f) RR-SA; (g)-(j) RR-SSA (bin); (k)-(n) RR-SSA (full), with each RR-SA and RR-SSA panel using a different $k$. Where specified, $k$ denotes the parameterized $k$ for spectral analysis, and ABC represents the area between curves.

3.3.2 Average Actual vs Average Predicted Survival

The results comparing the average actual survival curve against the average predicted survival curve, by model, are presented in Figure 7. Henceforth, these curves will simply be referred to as actual and predicted. In these figures we also shade the region between the actual and predicted curves and provide a value representing the total area covered by this region. We will use this measure, developed in Lash, Sun, Zhou, Lynch & Street (2017) and referred to as area between the curves (ABC for short), as a means of comparing the predictive quality of the 14 different models (where lower ABC is better).

Comparing Figure 7(a) with Figures 7(b) through 7(n) we first see that the addition of geographical features has improved the quality of the predictions, on average, as can be observed visually and by comparing ABC values. The one exception is RR-SSA (bin) at a particular value of $k$, which suggests that it is important to tune the spectral analysis $k$ value when using such representations.

Secondly, comparing Figure 7(b) with Figures 7(c) through 7(f), we observe that models using richer geographical representations (RR-SA) perform better (7(c) - 7(f)) than a model trained using a simple representation (7(b)). Furthermore, employing SSA-based representations yields even greater improvements over SBR, depending on the parameterized value of $k$.

However, there are also RR-SA model performance differences depending on the parameterized value. Interestingly, there seems to exist a non-linear relationship between and performance, with outperforming , and outperforming ; performs the best out of all models. We believe this nonlinear relationship may be accounted for by the fact that higher values of lead to more localized models, yet can also produce sparse, disjointed clusters. This point is supported by our clustering visualizations reported in Figure 8 and discussed in Section 3.3.3. These nonlinear response observations can also be extended to RR-SSA (bin) and RR-SSA (full).

Comparing RR-SA with RR-SSA representations, we can see even greater improvement in our predictions, on average. In fact, by employing RR-SSA (full), we achieve a 38.2% relative improvement in ABC value when comparing the best RR-SSA (full) result () with the best RR-SA result (). Interestingly, and perhaps not entirely unexpectedly, RR-SSA (bin) obtained less predictive improvement when compared with RR-SSA (full), but is able to improve upon the RR-SA result.

Curiously, however, depending upon the parameterized value, RR-SSA (bin) performs worse than models induced without geographical features and those induced using SBR. We conjecture that this may be attributable to the overly simple representation of geographic adjacency used in the sub-representation method of RR-SSA (bin). This is a reasonable conclusion as we can see that using a ‘‘fuller’’ representation (i.e., RR-SSA (full)) of adjacency produces uniformly improved results.

In examining the different predicted survival curves we have a few observations, summarized as follows. First, we observe that predictive performance increases are mostly realized after the five-year mark. This is, on one hand, intuitive because predicting survival at times closer to the diagnosis is easier than predicting survival at later times. On the other hand, noticeable deviation of the predicted curves occurs uniformly across all models at or around this five-year mark. Therefore, the model improvement brought about by using richer geographical representations is realized, by and large, at times beyond the five-year mark. Explaining why such a deviation is present in all models requires further investigation beyond the scope of this work.
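To make the five-year observation concrete, the following minimal sketch computes an area between a predicted and an observed survival curve, both over the whole time range and restricted to times after a chosen mark. It assumes ABC is the trapezoidal integral of the absolute difference between the two curves, which may differ from the exact definition used in this work; the function name `area_between_curves` and all curve values are hypothetical.

```python
def area_between_curves(times, predicted, observed, start=None):
    """Trapezoidal area between two curves sampled at the same time points.

    Assumption: ABC is taken here as the integral of the absolute difference
    between the curves; the definition used in the paper may differ.
    """
    area = 0.0
    for i in range(1, len(times)):
        if start is not None and times[i - 1] < start:
            continue  # skip intervals that begin before `start`
        width = times[i] - times[i - 1]
        left = abs(predicted[i - 1] - observed[i - 1])
        right = abs(predicted[i] - observed[i])
        area += 0.5 * (left + right) * width
    return area


# Hypothetical survival probabilities at yearly marks (not data from the study).
years = [0, 1, 2, 3, 4, 5, 6, 7, 8]
observed = [1.00, 0.85, 0.75, 0.68, 0.63, 0.60, 0.57, 0.55, 0.53]
predicted = [1.00, 0.85, 0.75, 0.68, 0.63, 0.60, 0.61, 0.61, 0.61]

total_abc = area_between_curves(years, predicted, observed)
late_abc = area_between_curves(years, predicted, observed, start=5)
```

In this toy example the two curves agree through year five, so the whole area accumulates afterwards and `late_abc` equals `total_abc`. Splitting the area this way is one simple means of checking where a model's error is concentrated.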

In summary, we find that

  1. On average, colorectal cancer survival curves can be reasonably predicted for patients in the state of Iowa.

  2. Geographic features improve the quality of predicted colorectal cancer survival curves for patients in the state of Iowa by 53.5%, on average (comparing models induced without geographic features with RR-SSA (full)).

  3. On average, RR-SA feature representations improve predictive performance by 15% over simple representations (SBR), and RR-SSA representations improve predictive performance by 47.2% over SBR.

  4. On average, RR-SSA feature representations improve predictive performance by 38.2% over RR-SA representations (comparing RR-SSA (full) to RR-SA).

  5. On average, RR-SSA (full) feature representations improve predictive performance by 32.1% (comparing RR-SSA (full) to RR-SSA (bin)).
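The percentages in this summary are relative improvements in ABC. As a hedged illustration of the arithmetic, the sketch below assumes that a lower ABC is better and that relative improvement is the reduction in ABC divided by the baseline ABC. The function name `relative_improvement` and the ABC values are hypothetical and are not results from this work.

```python
def relative_improvement(baseline_abc, improved_abc):
    """Fractional reduction in ABC relative to the baseline (lower ABC is better)."""
    if baseline_abc <= 0:
        raise ValueError("baseline ABC must be positive")
    return (baseline_abc - improved_abc) / baseline_abc


# Hypothetical ABC values, chosen only to illustrate the arithmetic.
baseline = 0.50
improved = 0.40
print(f"{relative_improvement(baseline, improved):.1%}")  # prints "20.0%"
```

A negative return value would indicate that the second model has a larger ABC than the baseline, that is, a worse fit.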

Figure 8: Spectral clustering results for , where color denotes cluster membership. Row one represents RR-SA results, row two represents RR-SSA (bin) results, and row three represents RR-SSA (full) clustering results.

3.3.3 Visualizing Geographic Cluster Assignment

Next, we briefly discuss the results of visualizing cluster assignment for RR-SA, RR-SSA (bin), and RR-SSA (full). These results can be observed in Figure 8, where each unique color represents a single cluster.

For RR-SA (row 1), we first note that as increases, the elicited geographic regions become more precise, yet maintain geographic continuity. However, we secondly observe that some ZCTAs are not adjacent to any other ZCTA having the same cluster assignment. This disjointedness stems from the use of an adjacency representation of the affinity matrix on which spectral clustering is performed and is not unexpected. As increases it appears that the number of disjointed ZCTAs also increases. However, we see that the number of continuous regions also increases. In other words, while disjointedness seems to increase with , the desired result of more localized continuous geographical regions is still achieved. Interestingly, when , larger Iowa cities such as Des Moines (central Iowa) and Iowa City (central-eastern Iowa) begin to emerge.
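The notion of disjointedness used above can be checked mechanically. The following sketch flags every geographic unit that has no neighbor with the same cluster assignment. The function name `disjointed_units`, the dictionary-based inputs, and the five-unit example are hypothetical; this is only one possible formalization and not necessarily how the figure was analyzed.

```python
def disjointed_units(adjacency, labels):
    """Return units with no neighbor sharing their cluster label.

    adjacency: dict mapping each unit to the set of units it borders.
    labels: dict mapping each unit to its cluster assignment.
    """
    isolated = []
    for unit, neighbors in adjacency.items():
        if not any(labels[n] == labels[unit] for n in neighbors):
            isolated.append(unit)
    return sorted(isolated)


# Hypothetical five-unit chain A-B-C-D-E (not real Iowa ZCTAs).
adjacency = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
labels = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 0}
print(disjointed_units(adjacency, labels))  # prints "['C']"
```

Note that under this definition a cluster consisting of a single unit is always counted as disjointed, so the count should be read alongside the number of clusters.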

In examining row 2 of Figure 8, representing the RR-SSA (bin) results, we see entirely different geographic clusterings than those of RR-SA. First, we find that for smaller parameterized values, cluster membership is very skewed, with a single cluster dominating the majority of the state, and the remaining cluster assignments being composed of single ZCTAs. These single-ZCTA clusters are found in the Des Moines area, the largest urban area of Iowa. As the parameterized value is increased, rural areas begin to decompose into cluster subsets; that is, as the representation is allowed to become more specific, rural areas begin to be distinguished from one another. Urban areas, such as Des Moines and Iowa City, are also ascribed membership to clusters composed of fewer geographic entities.

Looking at row 3 of Figure 8, which shows the clustering results obtained from RR-SSA (full) models, we observe different results from those of the previous two models. First, we can see that clusters often form “ring-like” patterns at some parameterized values, which is a particularly interesting artifact of this representation. Secondly, juxtaposing these results with those of the previous two rows (i.e., RR-SA and RR-SSA (bin)), we observe that this representation is somewhat of a “compromise” between RR-SA and RR-SSA (bin), in the sense that RR-SA produces mostly geographically contiguous clusterings and RR-SSA (bin) produces more geographically disparate clusterings. This is not unexpected, as the sub-representation method of RR-SSA (full) employs the full adjacency representation used in RR-SA, which is not found in RR-SSA (bin). Interestingly, RR-SSA (full) has also discovered urban areas such as Des Moines and Iowa City, but does so at smaller parameterized values than RR-SSA (bin) (e.g., RR-SSA (bin) is only able to discern areas around Des Moines, whereas RR-SSA (full) is able to discern Iowa City, Des Moines, Waterloo/Cedar Falls, Mason City, etc.). Finally, as the parameterized value is increased, we observe that the representation becomes more specific in terms of both urban and rural areas, up to a certain value. At another value, we observe that the clusterings are more disparate: approximately three different rural areas are distinguished (yellow, green, and white), and urban ZCTAs are assigned to their own unique clusters. This may suggest that urban areas are much more heterogeneous than rural areas.

4 Related Work

The topics related to and discussed throughout this work can best be categorized as disease and survival curve prediction and geographic-based predictions and representation.

There are many past works involving the prediction of diseases. These can be viewed as classification-based Khosravi et al. (2015); Belciug (2010); Ojha & Goel (2017); Sandhu et al. (2015); Gupta et al. (2011); Belciug & Gorunescu (2013); Puddu & Menotti (2012) and survival-based Cox (1992); Sharma et al. (2017); Chi et al. (2007); Gupta et al. (2011); Katzman et al. (2016); Samundeeswari & Saranya (2016). The focus of this work was on survival curve predictions. Such works can be examined by method, which include the Cox proportional hazards model (CPH) Cox (1992), which has been historically used to make such predictions, decision trees Sharma et al. (2017), and neural network-based models Chi et al. (2007); Gupta et al. (2011); Katzman et al. (2016); Samundeeswari & Saranya (2016), which are a more recent development. However, as De Laurentiis & Ravdin (1994) point out, CPH has several caveats as compared to neural network-based approaches, including the naivety of the proportional hazards assumption and the inability to capture nonlinear feature interactions. Furthermore, decision trees are constructed using greedy methodology and do not have the architectural benefits of neural networks. Hence, this work employed neural networks.

There are also many works focusing on geographic-based prediction and representation. These works focus on incorporating geographical features into the predictive process. One method of representing geography is by fine-grained lattice (i.e., grid) Khezerlou et al. (2017); Lash, Slater, Polgreen & Segre (2017); Yuan et al. (2017). Such methods are akin to our SBR representation and suffer from the same shortcomings. Spatially adaptive filters Tiwari & Rushton (2005) can tie a single feature to geography, which may be beneficial when the selected feature is particularly indicative of survival. This method would, however, still produce a binary feature representation, with the accompanying shortcomings discussed when introducing SBR. Spectral clustering has been used both to cluster social networks White & Smyth (2005) and to represent geo-spatial features Frias-Martinez & Frias-Martinez (2014); van Gennip et al. (2013), as in this work, and produces a rich (i.e., non-sparse) vector of features.

5 Conclusions and Future Work

In this work we explored the use of four different geographical feature representations to predict colorectal cancer survival curves for patients in the state of Iowa: a simple binary representation (SBR), a rich representation based on spectral analysis (RR-SA), and two representations based on similarity-based spectral analysis (RR-SSA). We show that (a) survival curves can be reasonably estimated, although predictive performance deviates near the five-year survival mark, (b) the use of geographical features generally leads to better predictions, (c) RR-SA trained models outperform those trained using SBR, (d) RR-SSA induced models generally outperform RR-SA models, and (e) RR-SSA (full) representations outperform RR-SSA (bin) representations. Future work will involve exploration of different geographical representations, particularly those learned in conjunction with . Additionally, continued exploration of domains and scenarios in which SBR, RR-SA, and RR-SSA geographic representations provide benefit should be undertaken.

6 Acknowledgements

The authors would like to thank the Iowa Cancer Registry, State Health Registry of Iowa, and the Iowa Department of Public Health for the data. The authors would also like to thank Gary Hulett and Jason Brubaker for their help in dataset construction and Prakash Nadkarni for his help with both data acquisition and the IRB process.


  • Belciug (2010) Belciug, S. (2010), ‘A two stage decision model for breast cancer detection’, Annals of the University of Craiova-Mathematics and Computer Science Series 37(2), 27–37.
  • Belciug & Gorunescu (2013) Belciug, S. & Gorunescu, F. (2013), ‘A hybrid neural network/genetic algorithm applied to breast cancer detection and recurrence’, Expert Systems 30(3), 243–254.
  • Chi et al. (2007) Chi, C.-L., Street, W. N. & Wolberg, W. H. (2007), Application of artificial neural network-based survival analysis on two breast cancer datasets, in ‘AMIA Annual Symposium Proceedings’, Vol. 2007, American Medical Informatics Association, p. 130.
  • Cox (1992) Cox, D. R. (1992), Regression models and life-tables, in ‘Breakthroughs in Statistics’, Springer, pp. 527–541.
  • De Laurentiis & Ravdin (1994) De Laurentiis, M. & Ravdin, P. M. (1994), ‘A technique for using neural network analysis to perform survival analysis of censored data’, Cancer Letters 77(2-3), 127–138.
  • Frias-Martinez & Frias-Martinez (2014) Frias-Martinez, V. & Frias-Martinez, E. (2014), ‘Spectral clustering for sensing urban land use using twitter activity’, Engineering Applications of Artificial Intelligence 35, 237–245.
  • Gupta et al. (2011) Gupta, S., Kumar, D. & Sharma, A. (2011), ‘Data mining classification techniques applied for breast cancer diagnosis and prognosis’, Indian Journal of Computer Science and Engineering (IJCSE) 2(2), 188–195.
  • Kaplan & Meier (1958) Kaplan, E. L. & Meier, P. (1958), ‘Nonparametric estimation from incomplete observations’, Journal of the American Statistical Association 53(282), 457–481.
  • Katzman et al. (2016) Katzman, J., Shaham, U., Bates, J., Cloninger, A., Jiang, T. & Kluger, Y. (2016), ‘Deep survival: A deep cox proportional hazards network’, arXiv preprint arXiv:1606.00931.
  • Khezerlou et al. (2017) Khezerlou, A. V., Zhou, X., Li, L., Shafiq, Z., Liu, A. X. & Zhang, F. (2017), ‘A traffic flow approach to early detection of gathering events: Comprehensive results’, ACM Transactions on Intelligent Systems and Technology (TIST) 8(6), 74:1–74:24.
  • Khosravi et al. (2015) Khosravi, B., Pourahmad, S., Bahreini, A., Nikeghbalian, S. & Mehrdad, G. (2015), ‘Five years survival of patients after liver transplantation and its effective factors by neural network and cox proportional hazard regression models’, Hepatitis Monthly 15(9).
  • Lash, Slater, Polgreen & Segre (2017) Lash, M. T., Slater, J., Polgreen, P. M. & Segre, A. M. (2017), A large-scale exploration of factors affecting hand hygiene compliance using linear predictive models, in ‘Healthcare Informatics (ICHI), 2017 IEEE International Conference on’, pp. 66–73.
  • Lash, Sun, Zhou, Lynch & Street (2017) Lash, M. T., Sun, Y., Zhou, X., Lynch, C. F. & Street, W. N. (2017), Learning rich geographical representations: Predicting colorectal cancer survival in the state of iowa, in ‘Bioinformatics and Biomedicine (BIBM’17), 2017 IEEE International Conference on’, IEEE, pp. 778–785.
  • Ojha & Goel (2017) Ojha, U. & Goel, S. (2017), A study on prediction of breast cancer recurrence using data mining techniques, in ‘Cloud Computing, Data Science & Engineering-Confluence, 2017 7th International Conference on’, IEEE, pp. 527–530.
  • Puddu & Menotti (2012) Puddu, P. E. & Menotti, A. (2012), ‘Artificial neural networks versus proportional hazards cox models to predict 45-year all-cause mortality in the italian rural areas of the seven countries study’, BMC Medical Research Methodology 12(1), 100.
  • Samundeeswari & Saranya (2016) Samundeeswari, E. & Saranya, P. (2016), ‘An artificial neural network model for prediction of survival time of breast cancer dataset’, International Journal of Research in Engineering and Applied Sciences 6(1), 161–168.
  • Sandhu et al. (2015) Sandhu, I. K., Nair, M., Shukla, H. & Sandhu, S. (2015), ‘Artificial neural network: As emerging diagnostic tool for breast cancer’, International Journal of Pharmacy and Biological Sciences 5(3), 29–41.
  • Sharma et al. (2017) Sharma, A., Karthik, G., Mittal, N., Sindhu, V. & Pradeep, K. (2017), A survey on predictive analysis of cancer survivability rate using machine learning algorithm, in ‘7th International Conference on Recent Trends in Engineering, Science, and Management’, pp. 271–278.
  • Tiwari & Rushton (2005) Tiwari, C. & Rushton, G. (2005), Using spatially adaptive filters to map late stage colorectal cancer incidence in iowa, in ‘Developments in Spatial Data Handling, Proceedings of the 11th International Symposium on Spatial Data Handling. Springer, Berlin, Heidelberg’, Springer, pp. 665–676.
  • van Gennip et al. (2013) van Gennip, Y., Hunter, B., Ahn, R., Elliott, P., Luh, K., Halvorson, M., Reid, S., Valasik, M., Wo, J., Tita, G. E. et al. (2013), ‘Community detection using spectral clustering on sparse geosocial data’, SIAM Journal on Applied Mathematics 73(1), 67–83.
  • Wan et al. (2013) Wan, N., Zhan, F. B., Zou, B. & Wilson, J. G. (2013), ‘Spatial access to health care services and disparities in colorectal cancer stage at diagnosis in texas’, The Professional Geographer 65(3), 527–541.
  • Ward et al. (2013) Ward, M. M., Ullrich, F., Matthews, K., Rushton, G., Goldstein, M. A., Bajorin, D. F., Hanley, A. & Lynch, C. F. (2013), ‘Who does not receive treatment for cancer?’, Journal of Oncology Practice 9(1), 20–26.
  • White & Smyth (2005) White, S. & Smyth, P. (2005), A spectral clustering approach to finding communities in graphs, in ‘Proceedings of the 2005 SIAM international conference on data mining’, SIAM, pp. 274–285.
  • Yuan et al. (2017) Yuan, Z., Zhou, X., Yang, T., Tamerius, J. & Mantilla, R. (2017), Predicting traffic accidents through heterogeneous urban data: A case study, in ‘6th International Workshop on Urban Computing (UrbComp 2017)’.
  • Zhang et al. (2015) Zhang, R., Li, N., Yang, X. & Huang, Y. (2015), ‘Data mining technology and its application in diagnosis and treatment of clinical malignant tumor’, Journal of Medical Informatics pp. 50–54.


Model             Avg. hidden layers   Avg. nodes [layer 1, layer 2]
No Geo            1.5                  [83, 30]
SBR               1.9                  [260, 122]
RR-SA,            1.5                  [82, 36]
RR-SA,            1.5                  [102, 44]
RR-SA,            1.6                  [87, 45]
RR-SA,            1.5                  [80, 44]
RR-SSA (bin),     1.6                  [82, 50]
RR-SSA (bin),     1.7                  [87, 50]
RR-SSA (bin),     1.6                  [70, 33.33]
RR-SSA (bin),     1.5                  [75, 42]
RR-SSA (full),    1.6                  [66, 45]
RR-SSA (full),    1.7                  [91, 50]
RR-SSA (full),    1.7                  [73, 41.43]
RR-SSA (full),    1.9                  [78, 42.22]

Table A.1: Average optimal architecture by model over the 10 folds (e.g., No Geo had 1.5 hidden layers, on average, where the first layer had 83 nodes, on average, and the second layer had 30 nodes, on average).

In Table A.1 we can see that, on average, the optimal architecture is relatively comparable among all models, with the exception of SBR (and, to a degree, one of the RR-SA settings). This suggests that the use of RR-SA and RR-SSA features does not affect the architectural complexity of the model. However, SBR seems to significantly increase such complexity. This is somewhat expected, as SBR is represented as a large, sparse vector, which can be contrasted with the comparatively small vectors of RR-SA and RR-SSA.
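For readers who wish to produce a summary of the same shape as Table A.1, the sketch below averages per-fold architectures into an average depth and an average node count per hidden layer. It assumes that the average for a given layer is taken only over folds whose selected network has that layer; fractional entries such as 33.33 in the table appear consistent with this convention, but that is an inference rather than something stated in the text. The function name `average_architecture` and the fold values are hypothetical.

```python
def average_architecture(fold_architectures):
    """Summarize per-fold hidden-layer sizes as (avg depth, avg nodes per layer).

    Assumption: the average for layer j is taken only over folds whose
    selected network actually has a j-th hidden layer.
    """
    n_folds = len(fold_architectures)
    avg_depth = sum(len(arch) for arch in fold_architectures) / n_folds
    max_depth = max(len(arch) for arch in fold_architectures)
    avg_nodes = []
    for j in range(max_depth):
        sizes = [arch[j] for arch in fold_architectures if len(arch) > j]
        avg_nodes.append(sum(sizes) / len(sizes))
    return avg_depth, avg_nodes


# Hypothetical selections for four folds (two one-layer, two two-layer networks).
folds = [[80], [90], [70, 30], [100, 50]]
depth, nodes = average_architecture(folds)
print(f"{depth}:{nodes}")  # prints "1.5:[85.0, 40.0]"
```

Averaging the second layer over all folds instead, counting a missing layer as zero nodes, would give a smaller value, so the convention should be stated whenever such a table is reported.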