Tropical Data Science

05/13/2020
by   Ruriko Yoshida, et al.

Phylogenomics is a new field which applies tools from phylogenetics to genome data. New technology and a growing amount of data pose new challenges for analyzing such data over a space of phylogenetic trees. Because a space of phylogenetic trees with a fixed set of labels on leaves is not Euclidean, we cannot simply apply standard tools from data science. In this paper we survey some new developments of machine learning models that use tropical geometry to analyze a set of phylogenetic trees over a tree space.


Outline

  1. Introduction

  2. Data Science Overview

    1. Supervised Learning (Inferential Statistics)

      1. Classifications

      2. Regression

  3. Phylogenetics to Phylogenomics

    1. Phylogenetic Trees

    2. Space of Phylogenetic Trees

  4. Basics in Tropical Geometry

  5. Tropical Unsupervised Learning (Tropical Descriptive Statistics)

    1. Tropical Fermat Weber Points

    2. Tropical Fréchet Means

  6. Tropical Supervised Learning (Tropical Inferential Statistics)

    1. Tropical Classifications

      1. Tropical Linear Discriminant Analysis

    2. Tropical Regression

Key Words: Machine Learning Models, Max-Plus Algebra, Phylogenetics, Phylogenomics, Tropical Geometry, Ultrametrics.

1 Introduction

With the growing amount of data available today, data science is one of the most exciting fields. It finds applications in statistics, computer science, business, biology, data security, physics, and so on. Most statistical models in data science assume that data points in an input sample are distributed over a Euclidean space if they have numerical measurements. However, in some cases this assumption fails. For example, a space of phylogenetic trees with a fixed set of leaves is a union of lower dimensional cones over $\mathbb{R}^e$, where $e = \binom{m}{2}$ and $m$ is the number of leaves [2]. Since the space of phylogenetic trees is a union of lower dimensional cones, we cannot just apply statistical models in data science to a set of phylogenetic trees [20].

There has been much work on spaces of phylogenetic trees. In 2001, Billera, Holmes, and Vogtmann (BHV) developed a notion of a space of phylogenetic trees with a fixed set of labels for leaves [4]. This space is the set of all possible phylogenetic trees with the fixed set of labels on leaves, and it is a union of orthants, where each orthant contains all phylogenetic trees with a fixed tree topology. In this space, two orthants are adjacent if the tree topology for one orthant is one nearest neighbor interchange (NNI) move away from the tree topology for the other. They also showed that this space is a CAT(0) space, so that there is a unique shortest connecting path, or geodesic, between any two points in the space. There is some work on developing machine learning models with the BHV metric. For example, Nye defined the first order principal component geodesic as the unique geodesic with the BHV metric over the tree space which minimizes the sum of residuals between the geodesic and each data point [14]. However, we cannot use a convex hull under the BHV metric for higher principal components because Lin et al. showed that the convex hull of three points with the BHV metric over the tree space can have arbitrarily high dimension [10].

In 2004, Speyer and Sturmfels showed that a space of phylogenetic trees with a given set of labels on their leaves is a tropical Grassmannian [18], which is a tropicalization of a linear space defined by a set of linear equations [20] with the max-plus algebra. The tropical metric with max-plus algebra on the tree space is known to behave very well [1, 6]. For example, in contrast to the BHV metric, the dimension of the convex hull of $s$ tropical points is at most $s - 1$.

Thus, this paper focuses on the tropical metric over tree spaces. We review some developments in statistical learning models with the tropical metric with max-plus algebra on tree spaces as well as on the tropical projective space, and we give an overview of some open problems.

2 Data Science Overview

In this section, we briefly review statistical models in data science. For more details, we recommend An Introduction to Statistical Learning with Applications in R, http://faculty.marshall.usc.edu/gareth-james/ISL/.

Data science has roughly two sub-branches: unsupervised learning and supervised learning (Figure 1). In unsupervised learning, our goal is to compute descriptive statistics to see how data points are distributed over the sample space or how data points cluster together. In statistics, unsupervised learning corresponds to descriptive statistics. In supervised learning, our goal is to predict or infer the response variable from explanatory variables. In statistics, supervised learning corresponds to inferential statistics. As with unsupervised and supervised learning, several other concepts go by different names in machine learning and in statistics. We summarize some of the differences in Table 1.

Statistics                Data Science
descriptive statistics    unsupervised learning
inferential statistics    supervised learning
response variable         target variable
explanatory variable      predictor variable, feature
Table 1: Concepts that have different names in statistics and data science.
Figure 1: Overview of data science

2.1 Basic Definitions

  1. Response variable – the variable of interest in a study or experiment. It is also called a dependent variable. In machine learning it is also called a target variable.

  2. Explanatory variable – a variable that explains changes in the response variable. It is also called an independent variable. In machine learning it is also called a feature or predictor.

2.2 Unsupervised Learning

Since unsupervised learning is descriptive, there are no response variables. In unsupervised learning, we try to learn how data points are distributed and how they are related to each other. Unsupervised methods fall mainly into two categories: clustering and dimensionality reduction.

  • Clustering – grouping data points into subsets by their “similarity”. These similarities are defined by a user. These groups are called clusters.

  • Dimensionality reduction – reducing the dimension of data points while minimizing the loss of information. One of the most commonly used methods is principal component analysis (PCA), a dimension reduction procedure via linear algebra.

2.3 Supervised Learning

Supervised learning is inferential. Thus, an input data set contains a response variable and explanatory variables. Depending on the scale of the response variable, supervised learning splits into two groups: classification and regression. In classification, the response variable has a categorical scale, and in regression, the response variable has a numerical (interval) scale.

  • Classification – the response variable is categorical. Algorithms for classification include logistic regression, support vector machines, linear discriminant analysis, classification trees, random forests, AdaBoost, etc.

  • Regression – the response variable is numerical. Algorithms for regression include linear regression, regression trees, lasso, ridge regression, random forests, AdaBoost, etc.

For more details, see the following papers:

3 Phylogenetics to Phylogenomics

In this section we review the basics of phylogenetics and the basic problems of phylogenomics.

3.1 Phylogenetic Trees

Evolutionary trees, or phylogenetic trees, show the evolutionary relationships among organisms over time through the use of tree diagrams. Phylogenetic trees consist of vertices (nodes) and edges (branches). Each node in a phylogenetic tree represents a past or present taxon or population: exterior nodes represent taxa or populations at present, and interior nodes represent their ancestors. Edges in a phylogenetic tree have weights, and the weight of each edge represents the mutation rate multiplied by the evolutionary time from an ancestor to a taxon.

In Figure 2, an exterior vertex (leaf or tip) represents a current taxon. An interior vertex represents an extinct taxon where an ancestral lineage split into two subgroups. Vertices and edges can be labeled; however, only the leaves or tips are labeled in a phylogenetic tree, because past taxa are often inferred and not exactly known. Vertices in phylogenetic trees can be DNA sequences, shared genes, or interrelated species, depending on the context of the tree. The root of the tree represents the common ancestor of all leaves.

Figure 2: Example Rooted Binary Phylogenetic Tree

Phylogenetic trees are trees. Thus, they are acyclic and connected. In terms of evolutionary biology, these properties are intuitive, as a species must evolve from something and, as time progresses, species can only evolve forward. Weights on edges in a tree represent the notion of time. The distance between two leaves, for example between Species 1 and Species 2, measures their dissimilarity with respect to time.

Let $m$ be the number of leaves on a phylogenetic tree and let $[m] := \{1, \ldots, m\}$. If the total weight of all edges in the path from the root to a leaf $i$ in a rooted phylogenetic tree is the same for all leaves $i \in [m]$, then we call the tree an equidistant tree. The height of an equidistant tree is the total weight of all edges in a path from the root to each leaf in the tree.

3.1.1 Phylogenetic Tree Reconstruction

Phylogenetic reconstruction uses genetic data to infer an evolutionary (phylogenetic) tree. The characters that change are the mutations in DNA sequences, where the DNA sequences represent a gene shared across multiple species. Trees are well suited to representing the evolutionary changes of this shared gene through node splits and leaves.

We do not discuss the details of phylogenetic tree reconstruction in this paper, but the reconstruction process involves multiple steps and techniques, and there are several types of tree reconstruction methods:

  • Maximum likelihood estimation (MLE) methods – These methods describe evolution in terms of a discrete-state continuous-time Markov process.

  • Maximum parsimony – Reconstructs the tree with the fewest evolutionary changes that explain the data.

  • Bayesian inference for trees – Uses Bayes' theorem and MCMC to estimate the posterior distribution rather than obtaining a point estimate.

  • Distance based methods – Reconstruct a tree from a distance matrix.

3.2 Space of Phylogenetic Trees

There are several ways to define a space of phylogenetic trees with different metrics. One of the best-known tree spaces is the Billera-Holmes-Vogtmann tree space. In 2001, Billera, Holmes, and Vogtmann (BHV) introduced a continuous space which models the set of rooted phylogenetic trees with edge lengths on a fixed set of leaves. In this space, edge lengths in a tree are continuous and we assign a coordinate to each interior edge. Note that unrooted trees can be accommodated by designating a fixed leaf node as the root. The BHV tree space is not Euclidean, but it is non-positively curved, and thus any two points are connected by a unique shortest path through the space, called a geodesic. The distance between two trees is defined as the length of the geodesic connecting them. We do not consider the BHV tree space in this paper; interested readers may consult [4].

Throughout this paper, we assume that all phylogenetic trees are equidistant trees. An equidistant tree is a rooted phylogenetic tree such that the sum of all branch lengths in the unique path from the root to each leaf, called the height of the tree, is the same for all leaves in the tree. In phylogenetics this assumption is fairly mild since the multispecies coalescent model assumes that all gene trees have the same height.

Example 1.

Suppose . Consider two rooted phylogenetic trees with the set of labels on the leaves in Figure 3. Note that for each tree, the sum of branch lengths in the unique path from the root to each leaf is . Therefore they are equidistant trees with their height are equal to .

Figure 3: Examples of equidistant trees with leaves with the set of labels and with their height equal to .

For the space of equidistant trees with the fixed set of labels on their leaves, the BHV tree space might not be appropriate [7]. Therefore, we consider the space of ultrametrics. To define ultrametrics and their relation to equidistant trees, we need to define dissimilarity maps.

Definition 2.

[Dissimilarity Map] A dissimilarity map $w$ is a function $w: [m] \times [m] \to \mathbb{R}_{\geq 0}$ such that

$$w(i, i) = 0 \quad \text{and} \quad w(i, j) = w(j, i) \geq 0$$

for all $i, j \in [m]$. If a dissimilarity map $w$ additionally satisfies the triangle inequality, that is:

$$w(i, j) \leq w(i, k) + w(k, j)$$

for all $i, j, k \in [m]$, then $w$ is called a metric. If there exists a phylogenetic tree $T$ such that $w(i, j)$ coincides with the total branch length of the edges in the unique path from a leaf $i$ to a leaf $j$ for all leaves $i, j \in [m]$, then we say $w$ is a tree metric. If a metric $w$ is a tree metric and $w(i, j)$ is the total branch length of all edges in the path from a leaf $i$ to a leaf $j$ for all leaves $i, j \in [m]$ in a phylogenetic tree $T$, then we say that $w$ realises the phylogenetic tree $T$, or that $w$ is a realisation of $T$.

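Since $w(i, j) = w(j, i)$ and $w(i, i) = 0$ for all $i, j \in [m]$, to simplify we write

$$w = (w(1, 2), w(1, 3), \ldots, w(m-1, m)) \in \mathbb{R}^e, \quad e = \binom{m}{2}.$$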

Example 3.

We consider equidistant trees in Figure 3. The dissimilarity map obtained from the left tree in Figure 3 is

Similarly, the dissimilarity map obtained from the right tree in Figure 3 is

Since these dissimilarity maps are obtained from phylogenetic trees, they are tree metrics.

Definition 4 (Three Point Condition).

If a metric $w$ satisfies the following condition: for every three distinct leaves $i, j, k \in [m]$, the maximum

$$\max\{w(i, j), w(i, k), w(j, k)\}$$

is achieved at least twice, then we say that $w$ satisfies the three point condition.

Definition 5 (Ultrametrics).

If a metric $w$ satisfies the three point condition, then $w$ is called an ultrametric.
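The three point condition is easy to check by machine. The following Python sketch tests it on every triple of leaves. The function name `is_ultrametric`, the dictionary layout, and both four-leaf examples are illustrative assumptions and are not taken from the paper; the function assumes its input is already a metric and checks only the three point condition.

```python
from itertools import combinations


def is_ultrametric(w, m, tol=1e-9):
    """Return True if w satisfies the three point condition on leaves 1..m.

    w is a dict mapping each pair (i, j) with i < j to the distance w(i, j).
    """
    for i, j, k in combinations(range(1, m + 1), 3):
        a, b, c = sorted([w[(i, j)], w[(i, k)], w[(j, k)]])
        # The maximum c must be attained at least twice, so b must equal c.
        if c - b > tol:
            return False
    return True


# Hypothetical equidistant tree ((1,2),(3,4)) of height 1:
# leaves 1 and 2 merge at height 0.2, leaves 3 and 4 at height 0.6.
tree_metric = {(1, 2): 0.4, (3, 4): 1.2,
               (1, 3): 2.0, (1, 4): 2.0, (2, 3): 2.0, (2, 4): 2.0}

# Perturbing one entry breaks the condition on the triple {1, 2, 3}.
perturbed = dict(tree_metric)
perturbed[(1, 3)] = 1.5

print(is_ultrametric(tree_metric, 4))  # True
print(is_ultrametric(perturbed, 4))    # False
```

Looping over all triples costs on the order of $m^3$ comparisons, which is fine for the small leaf sets used in examples.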

Theorem 6 ([8]).

A dissimilarity map $w$ is an ultrametric if and only if $w$ realises an equidistant tree with labels $[m]$. In addition, for each equidistant tree there exists a unique ultrametric. Conversely, for each ultrametric, there exists a unique equidistant tree.

Example 7.

We again consider equidistant trees in Figure 3. The dissimilarity map obtained from the left tree in Figure 3 is

Similarly, the dissimilarity map obtained from the right tree in Figure 3 is

Since these phylogenetic trees are equidistant trees, these dissimilarity maps are ultrametrics by Theorem 6.

By Theorem 6, we can regard the space of ultrametrics with labels $[m]$ as the space of all equidistant trees with the label set $[m]$. Let $\mathcal{U}_m$ be the space of ultrametrics for equidistant trees with the leaf labels $[m]$. In fact we can write $\mathcal{U}_m$ as the tropicalization of a linear space defined by linear equations.

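Let $L_m \subseteq \mathbb{R}^e$ be the linear subspace defined by the linear equations

$$x_{ij} - x_{ik} + x_{jk} = 0 \tag{1}$$

for $1 \leq i < j < k \leq m$. For the linear equations (1) defining the linear space $L_m$, the max-plus tropicalization $\mathrm{Trop}(L_m)$ of the linear space $L_m$ is the tropical linear space consisting of all $w \in \mathbb{R}^e$ such that

$$\max\{w_{ij}, w_{ik}, w_{jk}\}$$

is achieved at least twice for all $i, j, k \in [m]$. Note that this is exactly the three point condition defined in Definition 4.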

Theorem 8.

[20, Theorem 2.18] The image of $\mathcal{U}_m$ in the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ coincides with $\mathrm{Trop}(L_m)$.

For example, if $m = 4$, the space of ultrametrics $\mathcal{U}_4$ is a two-dimensional fan with 15 maximal cones.

For more details, see the following papers:

  • C. Semple and M. Steel. Phylogenetics, [17].

  • Lin et al. Convexity in Tree Spaces [11].

4 Basics in Tropical Geometry

Here we review some basics of tropical arithmetic and geometry, and we set up the notation used throughout this paper.

Definition 9 (Tropical arithmetic operations).

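Throughout this paper we perform arithmetic over the max-plus tropical semiring $(\mathbb{R} \cup \{-\infty\}, \oplus, \odot)$. Over this tropical semiring, the basic tropical arithmetic operations of addition and multiplication are defined as follows:

$$a \oplus b := \max\{a, b\}, \qquad a \odot b := a + b, \qquad \text{where } a, b \in \mathbb{R} \cup \{-\infty\}.$$

Over this tropical semiring, $-\infty$ is the identity element under addition and $0$ is the identity element under multiplication.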

Example 10.

Suppose we have . Then

Definition 11 (Tropical scalar multiplication and vector addition).

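For any scalars $a, b \in \mathbb{R} \cup \{-\infty\}$ and for any vectors $v = (v_1, \ldots, v_e), w = (w_1, \ldots, w_e) \in (\mathbb{R} \cup \{-\infty\})^e$, tropical scalar multiplication and tropical vector addition are defined as:

$$a \odot v = (a + v_1, \ldots, a + v_e), \qquad a \odot v \oplus b \odot w = (\max\{a + v_1, b + w_1\}, \ldots, \max\{a + v_e, b + w_e\}).$$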

Example 12.

Suppose we have

and . Then we have

and

Throughout this paper we consider the tropical projective torus, that is, the projective space $\mathbb{R}^e/\mathbb{R}\mathbf{1}$, where $\mathbf{1} := (1, 1, \ldots, 1)$ is the all-one vector.
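The max-plus operations and the identification of points in the tropical projective torus can be sketched in a few lines of Python. All function names and input values below are illustrative assumptions; in particular, two vectors are treated as the same point of the torus when they differ by a multiple of the all-one vector.

```python
NEG_INF = float("-inf")  # identity element for tropical addition


def t_add(a, b):
    """Tropical addition: the maximum of a and b."""
    return max(a, b)


def t_mul(a, b):
    """Tropical multiplication: the ordinary sum of a and b."""
    return a + b


def t_scale(a, v):
    """Tropical scalar multiplication: add a to every coordinate of v."""
    return [a + vi for vi in v]


def t_vec_add(v, w):
    """Tropical vector addition: coordinate-wise maximum."""
    return [max(vi, wi) for vi, wi in zip(v, w)]


def same_point_in_torus(v, w, tol=1e-9):
    """True if v and w differ by a multiple of the all-one vector."""
    diffs = [vi - wi for vi, wi in zip(v, w)]
    return max(diffs) - min(diffs) <= tol


print(t_add(3, 5), t_mul(3, 5))        # 5 8
print(t_add(NEG_INF, 4), t_mul(0, 4))  # 4 4
print(t_vec_add(t_scale(1, [0, 2, -1]), t_scale(-1, [3, 0, 4])))  # [2, 3, 3]
print(same_point_in_torus([1, 2, 3], [4, 5, 6]))  # True
```

The second printed line illustrates the two identity elements: adding $-\infty$ tropically and multiplying by $0$ tropically both leave 4 unchanged.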

Example 13.

Consider . Then let

Then over we have the following equality:

Note that is isometric to .

Example 14.

Consider . Then let

Also let . Then we have

In order to conduct a statistical analysis we need a distance measure between two vectors in the space. Thus we discuss a distance between two vectors in the tropical projective space. In fact the following distance between two vectors in the tropical projective space is a metric.

Definition 15 (Generalized Hilbert projective metric).

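For any two points $v, w \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, the tropical distance $d_{\rm tr}(v, w)$ between $v$ and $w$ is defined as:

$$d_{\rm tr}(v, w) = \max_{1 \leq i < j \leq e} \left| v_i - w_i - v_j + w_j \right| = \max_{i} \{v_i - w_i\} - \min_{i} \{v_i - w_i\}, \tag{2}$$

where $v = (v_1, \ldots, v_e)$ and $w = (w_1, \ldots, w_e)$. This distance is a metric on $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. Therefore, we call $d_{\rm tr}$ the tropical metric.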
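As a small illustration, the tropical distance can be computed either as the spread of the coordinate-wise differences or through the pairwise form, and both agree. The Python sketch below uses hypothetical input vectors and function names chosen for this example; it also checks that adding a constant to every coordinate of one point leaves the distance unchanged, as required on the tropical projective torus.

```python
from itertools import combinations


def tropical_distance(v, w):
    """Largest minus smallest coordinate of the difference v - w."""
    diffs = [vi - wi for vi, wi in zip(v, w)]
    return max(diffs) - min(diffs)


def tropical_distance_pairwise(v, w):
    """Same distance via the pairwise form max |v_i - w_i - v_j + w_j|."""
    e = len(v)
    return max(abs(v[i] - w[i] - v[j] + w[j])
               for i, j in combinations(range(e), 2))


v = [0, 0, 0]
w = [0, 3, 1]
print(tropical_distance(v, w))           # 3
print(tropical_distance_pairwise(v, w))  # 3

# Adding a constant to every coordinate does not change the distance.
shifted = [vi + 7 for vi in v]
print(tropical_distance(shifted, w))     # 3
```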

Example 16.

Suppose such that

Then the tropical distance between is

Similar to the BHV metric over the BHV tree space, we need to define a geodesic over the space of ultrametrics. In order to define a tropical geodesic we need to define a tropical polytope:

Definition 17.

Suppose we have a finite subset $V = \{v^1, \ldots, v^s\} \subset \mathbb{R}^e$. The tropical convex hull or tropical polytope of $V$ is the smallest tropically-convex subset containing $V$, written as the set of all tropical linear combinations of $V$:

$$\mathrm{tconv}(V) = \{a_1 \odot v^1 \oplus a_2 \odot v^2 \oplus \cdots \oplus a_s \odot v^s\},$$

where $a_1, \ldots, a_s \in \mathbb{R}$. A tropical line segment between two points $v^1, v^2$ is the tropical convex hull of the two points $\{v^1, v^2\}$.

Note that the length along the tropical line segment between two points $u, v$ equals the tropical distance $d_{\rm tr}(u, v)$. In this paper we define the tropical line segment between two points as the tropical geodesic between these points.
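A tropical line segment can be traced explicitly. A minimal sketch, assuming the max-plus convention used here: every point of the segment between $u$ and $v$ can be written as $(\lambda \odot u) \oplus v$ for some real $\lambda$, and the segment only bends when $\lambda$ passes one of the coordinate differences $v_i - u_i$. The function names and the input points below are illustrative assumptions.

```python
def tropical_distance(u, v):
    diffs = [ui - vi for ui, vi in zip(u, v)]
    return max(diffs) - min(diffs)


def normalize(x):
    """Representative of x in the torus whose first coordinate is 0."""
    return [xi - x[0] for xi in x]


def tropical_segment(u, v):
    """Breakpoints of the max-plus tropical line segment, listed from v to u."""
    lambdas = sorted(vi - ui for ui, vi in zip(u, v))
    points = []
    for lam in lambdas:
        # Coordinate-wise maximum of lam + u and v.
        y = normalize([max(lam + ui, vi) for ui, vi in zip(u, v)])
        if not points or y != points[-1]:
            points.append(y)
    return points


u, v = [0, 0, 0], [0, 3, 1]
points = tropical_segment(u, v)
length = sum(tropical_distance(p, q) for p, q in zip(points, points[1:]))
print(points)                           # [[0, 3, 1], [0, 2, 0], [0, 0, 0]]
print(length, tropical_distance(u, v))  # 3 3
```

The lengths of the ordinary pieces between consecutive breakpoints add up to the tropical distance between the endpoints, which is the geodesic property noted above.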

Example 18.

Suppose such that

From the previous example, the tropical distance between is

Also the tropical line segment between is a line segment between these three points:

The length of the line segment is

Example 19.

Suppose we have a set where

Then we have the tropical convex hull of is shown in Figure 4.

Figure 4: Tropical polytope of three points in .

For more details, see the following papers:

  • D. Maclagan and B. Sturmfels. Introduction to Tropical Geometry [9].

5 Tropical Unsupervised Learning

Unsupervised learning is descriptive, and not much is known yet about descriptive statistics that use tropical geometry with max-plus algebra, such as tropical Fermat-Weber (FW) points and tropical Fréchet means.

In this section we discuss tropical FW points and tropical Fréchet means: what they are, what is known, and what is not. At the end of this section, we discuss tropical principal component analysis (PCA). Throughout this section we consider the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$.

5.1 Tropical Fermat Weber Points

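Suppose we have a sample $S = \{v^1, \ldots, v^n\}$ over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. A tropical Fermat-Weber point $x^*$ minimizes the sum of distances to the given points:

$$x^* := \arg\min_{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}} \sum_{i=1}^{n} d_{\rm tr}(x, v^i). \tag{3}$$

Tropical Fermat-Weber points of a sample over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ have the following properties.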

Proposition 20.

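Suppose $S = \{v^1, \ldots, v^n\} \subset \mathbb{R}^e/\mathbb{R}\mathbf{1}$. Then the set of tropical Fermat-Weber points of the sample $S$ over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ is a convex polytope. It consists of all optimal solutions $x = (x_1, \ldots, x_e)$ to the following linear program:

$$\text{minimize } d_1 + d_2 + \cdots + d_n \quad \text{subject to } d_i \geq x_j - x_k - v^i_j + v^i_k \ \text{ for all } i \in \{1, \ldots, n\} \text{ and } j, k \in \{1, \ldots, e\}. \tag{4}$$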

From Proposition 20, there can be infinitely many tropical Fermat-Weber points of a sample.
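One way to see why a linear program is natural here: for a fixed sample point $v$, the tropical distance $d_{\rm tr}(x, v)$ is the maximum of the linear functions $x_j - x_k - v_j + v_k$ over all index pairs, so bounding it by an auxiliary variable $d$ needs only linear inequalities. The Python sketch below checks numerically that the smallest $d$ satisfying those inequalities equals the tropical distance. The function names and the sample values are illustrative assumptions, and the code does not solve the linear program itself.

```python
def tropical_distance(x, v):
    diffs = [xi - vi for xi, vi in zip(x, v)]
    return max(diffs) - min(diffs)


def smallest_feasible_d(x, v):
    """Smallest d with d >= x_j - x_k - v_j + v_k for all index pairs (j, k)."""
    e = len(x)
    return max(x[j] - x[k] - v[j] + v[k] for j in range(e) for k in range(e))


def fermat_weber_objective(x, sample):
    """Sum of tropical distances from x to every sample point."""
    return sum(tropical_distance(x, v) for v in sample)


sample = [(0, 0, 0), (0, 3, 1), (0, 2, 5)]  # hypothetical points in R^3/R1
x = (0, 2, 1)
for v in sample:
    assert smallest_feasible_d(x, v) == tropical_distance(x, v)
print(fermat_weber_objective(x, sample))  # 7
```

Passing the inequalities to any linear programming solver, with the coordinates of $x$ left unbounded, would return one optimal point; the full set of optima is the polytope described in Proposition 20.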

If we focus on the space of ultrametrics $\mathcal{U}_m$ for equidistant trees with $m$ leaves, then we have the following proposition:

Proposition 21.

If a sample $S$ lies in the space of ultrametrics $\mathcal{U}_m$, then tropical Fermat-Weber points of $S$ are in $\mathcal{U}_m$.

In [12], we showed explicitly how to compute the set of all possible Fermat-Weber points in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. However, we do not know the minimal set of inequalities needed to define the set of all tropical Fermat-Weber points of a given sample. Thus here is an open problem:

Problem 22.

What is the minimal set of inequalities needed to define the set of all tropical Fermat-Weber points of a given sample? What is the time complexity of computing the set of tropical Fermat-Weber points of a sample of $n$ points in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$? Is there an algorithm, polynomial in $n$ and $e$, to compute the vertices of the polytope of tropical Fermat-Weber points of a sample of $n$ points in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$?

For more details, see the following papers:

  • B. Lin and R. Yoshida Tropical Fermat–Weber Points [12].

5.2 Tropical Fréchet Means

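Suppose we have a sample $S = \{v^1, \ldots, v^n\}$ over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. A tropical Fréchet mean $\mu^*$ minimizes the sum of squared distances to the given points:

$$\mu^* := \arg\min_{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}} \sum_{i=1}^{n} d_{\rm tr}(x, v^i)^2. \tag{5}$$

Just as we formulated computing a tropical Fermat-Weber point as a linear programming problem, we can formulate computing a tropical Fréchet mean as a quadratic programming problem:

$$\text{minimize } d_1^2 + d_2^2 + \cdots + d_n^2 \quad \text{subject to } d_i \geq x_j - x_k - v^i_j + v^i_k \ \text{ for all } i \in \{1, \ldots, n\} \text{ and } j, k \in \{1, \ldots, e\}. \tag{6}$$

While we know some properties of tropical Fermat-Weber points, we do not know much about tropical Fréchet means. Here are some basics on tropical Fréchet means.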
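To get a feel for how the two objectives differ, the sketch below compares the Fermat-Weber objective (sum of distances) with the Fréchet objective (sum of squared distances) by brute force over a small integer grid in $\mathbb{R}^3/\mathbb{R}\mathbf{1}$. This is an illustration only: the sample, the grid bounds, and the function names are assumptions, and a grid search is not the linear or quadratic programming approach described in the text, so it can only find the best points among the grid candidates.

```python
from itertools import product


def tropical_distance(x, v):
    diffs = [xi - vi for xi, vi in zip(x, v)]
    return max(diffs) - min(diffs)


def fermat_weber_objective(x, sample):
    return sum(tropical_distance(x, v) for v in sample)


def frechet_objective(x, sample):
    return sum(tropical_distance(x, v) ** 2 for v in sample)


def grid_minimizers(objective, sample, lo, hi):
    """Best value and all minimizers among points (0, a, b), lo <= a, b <= hi."""
    values = {(0, a, b): objective((0, a, b), sample)
              for a, b in product(range(lo, hi + 1), repeat=2)}
    best = min(values.values())
    return best, sorted(x for x, val in values.items() if val == best)


sample = [(0, 0, 0), (0, 3, 1), (0, 2, 5)]  # hypothetical points in R^3/R1
fw_value, fw_points = grid_minimizers(fermat_weber_objective, sample, -1, 6)
fr_value, fr_points = grid_minimizers(frechet_objective, sample, -1, 6)
print("Fermat-Weber:", fw_value, fw_points)
print("Frechet:", fr_value, fr_points)
```

Fixing the first coordinate to 0 picks one representative per point of the torus, so the search is over two free coordinates only.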

Proposition 23.

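Suppose $S = \{v^1, \ldots, v^n\} \subset \mathbb{R}^e/\mathbb{R}\mathbf{1}$. Then the set of tropical Fréchet means of the sample $S$ over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ is a convex polytope. It consists of all optimal solutions $x = (x_1, \ldots, x_e)$ to the following quadratic program:

$$\text{minimize } d_1^2 + d_2^2 + \cdots + d_n^2 \quad \text{subject to } d_i \geq x_j - x_k - v^i_j + v^i_k \ \text{ for all } i \in \{1, \ldots, n\} \text{ and } j, k \in \{1, \ldots, e\}. \tag{7}$$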

Still we do not know much about tropical Fréchet means. First we have the following problem.

Problem 24.

If a sample lies in the space of ultrametrics $\mathcal{U}_m$, are its tropical Fréchet means in $\mathcal{U}_m$?

We still do not know how to compute tropical Fréchet means in efficient ways. So we have the following problem:

Problem 25.

Suppose we have a sample $S$ over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. Is there an algorithm to compute all tropical Fréchet means of $S$ in $\mathbb{R}^e/\mathbb{R}\mathbf{1}$?

5.3 Tropical Principal Component Analysis (PCA)

Principal component analysis (PCA) is one of the most popular methods to reduce the dimensionality of input data and to visualize them. Classical PCA takes data points in a high-dimensional Euclidean space and represents them in a lower-dimensional plane in such a way that the residual sum of squares is minimized. We cannot directly apply classical PCA to a set of phylogenetic trees because the space of phylogenetic trees with a fixed number of leaves is not Euclidean; it is a union of lower dimensional polyhedral cones in $\mathbb{R}^{\binom{m}{2}}$, where $m$ is the number of leaves.

There is a statistical method similar to PCA over the space of phylogenetic trees with a fixed set of leaves in terms of the Billera-Holmes-Vogtmann (BHV) metric.

In 2001, Billera, Holmes, and Vogtmann developed the space of phylogenetic trees with fixed labeled leaves and showed that it is a CAT(0) space [5]. Therefore, a geodesic between any two points in the space of phylogenetic trees is unique.

Shortly after that, Nye gave an algorithm in [15] to compute the first order principal component over the space of phylogenetic trees with $m$ leaves with the BHV metric.

Nye in [15] used a convex hull of two points, i.e., the geodesic, on the tree space as the first order principal component. However, this idea cannot be generalized to higher order principal components with the BHV metric since the convex hull of three points with the BHV metric over the tree space can have arbitrarily high dimension [11].

On the other hand, the tropical metric in the tree space in terms of the max-plus algebra is well-studied and well-behaved [13]. For example, the dimension of the convex hull of $s$ points in terms of the tropical metric is at most $s - 1$. Using the tropical metric, Yoshida et al. in [20] introduced a statistical method similar to PCA with the max-plus tropical arithmetic in two ways: the tropical principal linear space, that is, the best-fit Stiefel tropical linear space of fixed dimension closest to the data points in the tropical projective torus; and the tropical principal polytope, that is, the best-fit tropical polytope with a fixed number of vertices closest to the data points. The authors showed that computing the latter object can be written as a mixed-integer programming problem, and they applied the second definition to datasets consisting of collections of phylogenetic trees. Nevertheless, computing the best-fit tropical polytope exactly can be expensive due to the high dimensionality of the mixed-integer programming problem.

Definition 26.

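Let $\mathcal{P} = \mathrm{tconv}(D^{(1)}, \ldots, D^{(s)}) \subseteq \mathbb{R}^e/\mathbb{R}\mathbf{1}$ be a tropical polytope with its vertices $\{D^{(1)}, \ldots, D^{(s)}\} \subset \mathbb{R}^e/\mathbb{R}\mathbf{1}$ and let $S = \{u_1, \ldots, u_n\}$ be a sample from the space of ultrametrics $\mathcal{U}_m$. Let $\Pi_{\mathcal{P}}(S) := \sum_{i=1}^{|S|} d_{\rm tr}(u_i, u'_i)$, where $u'_i$ is the tropical projection of $u_i$ onto the tropical polytope $\mathcal{P}$. Then the vertices $D^{(1)}, \ldots, D^{(s)}$ of the tropical polytope $\mathcal{P}$ are called the $(s-1)$-th order tropical principal polytope of $S$ if the tropical polytope $\mathcal{P}$ minimizes $\Pi_{\mathcal{P}}(S)$ over all possible tropical polytopes with $s$ many vertices.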
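The objective in Definition 26 depends on the tropical projection onto a tropical polytope, which this text does not spell out. The sketch below assumes a commonly used max-plus formula: for each vertex $D$, take the largest shift $\lambda = \min_i (x_i - D_i)$ that keeps $\lambda \odot D$ coordinate-wise below $x$, then combine the shifted vertices with tropical addition. The function names, vertices, and sample points are illustrative assumptions.

```python
def tropical_distance(x, y):
    diffs = [xi - yi for xi, yi in zip(x, y)]
    return max(diffs) - min(diffs)


def tropical_projection(x, vertices):
    """Project x onto the tropical polytope spanned by the given vertices."""
    proj = [float("-inf")] * len(x)
    for d in vertices:
        lam = min(xi - di for xi, di in zip(x, d))  # largest feasible shift
        proj = [max(p, lam + di) for p, di in zip(proj, d)]
    return proj


def projection_residual(sample, vertices):
    """Sum of tropical distances from each sample point to its projection."""
    return sum(tropical_distance(u, tropical_projection(u, vertices))
               for u in sample)


vertices = [(0, 0, 0), (0, 3, 1), (0, 2, 5)]  # hypothetical vertices in R^3/R1
print(tropical_projection((0, 5, 0), vertices))   # [0, 2, 0]
print(tropical_projection((0, 3, 1), vertices))   # [0, 3, 1]
print(projection_residual([(0, 5, 0), (0, 3, 1)], vertices))  # 3
```

A point that already lies in the polytope, such as a vertex, is returned unchanged, so it contributes nothing to the residual sum.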

In [16], Page et al. developed a heuristic method to compute a tropical principal polytope, and they applied it to empirical data sets: genome data of influenza collected from New York City, Apicomplexa, and African coelacanth genome data sets.

Page et al. also showed the following theorem and lemma:

Theorem 27 ([16]).

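Let $\mathcal{P} = \mathrm{tconv}(D^{(1)}, \ldots, D^{(s)}) \subseteq \mathbb{R}^e/\mathbb{R}\mathbf{1}$ be a tropical polytope spanned by ultrametrics in $\mathcal{U}_m$. Then $\mathcal{P} \subseteq \mathcal{U}_m$, and any two points $x$ and $y$ in the same cell of $\mathcal{P}$ are also ultrametrics with the same tree topology.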

Lemma 28 ([16]).

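Let $\mathcal{P} = \mathrm{tconv}(D^{(1)}, \ldots, D^{(s)})$ be a tropical polytope spanned by ultrametrics. The origin $\mathbf{0}$ is contained in $\mathcal{P}$ if and only if the path between each pair of leaves $i, j$ passes through the root of some $D^{(k)}$.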

There are still some open problems on tropical PCA. Here is one of the questions we can work on:

Conjecture 29.

There exists a tropical Fermat-Weber point of a sample of ultrametric trees which is contained in the th order tropical PCA of the dataset for .

For more details, see the following papers:

  • R. Yoshida, L. Zhang, and X. Zhang. Tropical Principal Component Analysis and its Application to Phylogenetics. [20].

  • R. Page, R. Yoshida, and L. Zhang. Tropical principal component analysis on the space of ultrametrics. [16].

6 Tropical Supervised Learning

Not much has been done yet on tropical supervised learning, although there is some work on classification. Recently Tang et al. in [19] introduced a notion of tropical support vector machines (SVMs). In this section we discuss tropical SVMs and we introduce a notion of tropical linear discriminant analysis (LDA).

6.1 Tropical Classifications

For tropical classification, we consider the binary response variables. Suppose we have a data set given that

where and . Therefore, the response variable is binary. Thus, we can partition a sample of data points into two sets and such that

6.1.1 Tropical Support Vector Machines (SVMs)

A support vector machine (SVM) is a supervised learning model to predict a categorical response variable. For a binary response variable, a classical linear SVM classifies data points by finding a linear hyperplane that separates the data points into two groups. In this paper, a classical SVM refers to a classical linear SVM over a Euclidean space with the $L_2$ norm.

For a Euclidean space, there are two types of SVMs: hard margin SVMs and soft margin SVMs. A hard margin SVM is a model with the assumption that all data points can be separated by a linear hyperplane into two groups without errors. A soft margin SVM is a model which maximizes the margin and also allows some data points to fall on the wrong side of the hyperplane.

Similar to a classical SVM over a Euclidean space, a tropical SVM is a supervised learning model which classifies data points by finding a tropical hyperplane to separate them. In [19], in analogy with classical SVMs, Tang et al. defined two types of tropical SVMs: hard margin tropical SVMs and soft margin tropical SVMs. A hard margin tropical SVM, introduced in [3], is, similar to a classical hard margin SVM, a model to find a tropical hyperplane which maximizes the margin, that is, the minimum tropical distance from the data points to the tropical hyperplane (shown in Figure 5), and which separates these data points into open sectors. Note that an open sector of a tropical hyperplane can be seen as a tropical version of an open half space defined by a hyperplane. A soft margin tropical SVM, introduced in [19], is a model to find a tropical hyperplane which maximizes the margin but also allows some data points to fall into a wrong open sector.

The authors in [3] showed that computing a tropical hyperplane for a tropical hard margin SVM from a given sample on the tropical projective space can be formulated as a linear programming problem. Again, note that, similar to the classical hard margin SVMs, hard margin tropical SVMs assume that there exists a tropical hyperplane such that it separates all data points in the tropical projective space into each open sector (see the left figure in Figure 5).

Figure 5: A hard margin tropical SVM (LEFT) and a soft margin tropical SVM (RIGHT) with the binary response variable. A hard margin tropical SVM assumes that all data points from the given sample can be separated by a tropical hyperplane. Red square dots and blue round dots are the data points of the two classes. A hard margin tropical hyperplane for a hard margin tropical SVM is obtained by maximizing the margin in the left figure, that is, the distance from the closest data point to the tropical hyperplane (the width of the grey area around the tropical hyperplane in the left figure). A soft margin tropical hyperplane for a soft margin tropical SVM is obtained by maximizing a margin similar to that of a hard margin tropical SVM while minimizing the sum of the extra variables shown in the right figure at the same time.

In order to discuss details on tropical SVMs, we need to define a tropical hyperplane and its open sectors.

Definition 30.

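Suppose $\omega := (\omega_1, \ldots, \omega_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$. The tropical hyperplane defined by $\omega$, denoted by $H_\omega$, is the set of all points $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ such that

$$\max\{\omega_1 + x_1, \ldots, \omega_e + x_e\}$$

is attained at least twice. $\omega$ is called the normal vector of $H_\omega$.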

Definition 31.

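A tropical hyperplane $H_\omega$ divides the tropical projective space $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ into $e$ components. These components are called the open sectors of $H_\omega$ and are given by:

$$S^i_\omega := \{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} \mid \omega_i + x_i > \omega_j + x_j \ \text{ for all } j \neq i\}, \quad i = 1, \ldots, e.$$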

Example 32.

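Consider $\mathbb{R}^3/\mathbb{R}\mathbf{1}$. Then a tropical hyperplane in $\mathbb{R}^3/\mathbb{R}\mathbf{1}$ has three open sectors, as seen in Figure 5. Note that $\mathbb{R}^3/\mathbb{R}\mathbf{1}$ is isometric to $\mathbb{R}^2$.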

Now we define the tropical distance from a point to a tropical hyperplane.

Definition 33.

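The tropical distance from a point $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ to the tropical hyperplane $H_\omega$ is defined as:

$$d_{\rm tr}(x, H_\omega) := \min\{d_{\rm tr}(x, y) \mid y \in H_\omega\}.$$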

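A tropical hard margin SVM assumes that all points are separated by a tropical hyperplane and that all data points with the same category for their response variable lie in the same open sector. Thus, to compute a tropical hard margin hyperplane for a tropical SVM, we want to find the normal vector $\omega$ of a tropical hyperplane $H_\omega$ that attains

$$\max_{\omega \in \mathbb{R}^e} \ \min_{x} \left\{ (x + \omega)_{i(x)} - (x + \omega)_{j(x)} \right\},$$

where the minimum runs over all data points $x$ in the sample, and $(x + \omega)_{i(x)}$ and $(x + \omega)_{j(x)}$ are the largest and the second largest coordinates of the vector $x + \omega$.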
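The quantities in this objective are cheap to evaluate for a given normal vector. The sketch below relies on a known closed form for max-plus hyperplanes, taken here as an assumption: the tropical distance from a point $x$ to $H_\omega$ equals the largest minus the second largest coordinate of $x + \omega$, and the index of the largest coordinate names the open sector containing $x$. Function names, the zero-based sector indices, and the data are illustrative; the code evaluates the margin of a given hyperplane and does not search for the optimal one.

```python
def sector_and_distance(x, omega):
    """Return (sector index or None, tropical distance from x to H_omega)."""
    shifted = sorted(((xi + wi, i) for i, (xi, wi) in enumerate(zip(x, omega))),
                     reverse=True)
    (largest, arg_largest), (second, _) = shifted[0], shifted[1]
    distance = largest - second
    # Distance 0 means the maximum is attained twice: x lies on H_omega.
    sector = arg_largest if distance > 0 else None
    return sector, distance


def margin(sample, omega):
    """Smallest distance from any sample point to H_omega."""
    return min(sector_and_distance(x, omega)[1] for x in sample)


omega = (0, 0, 0)  # hypothetical normal vector in R^3/R1
print(sector_and_distance((3, 1, 0), omega))  # (0, 2)
print(sector_and_distance((1, 1, 0), omega))  # (None, 0)
print(margin([(3, 1, 0), (0, 1, 4)], omega))  # 2
```

A hard margin tropical SVM would then look for the $\omega$ that makes this margin as large as possible while keeping each class inside a single open sector.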

Theorem 34 ([19]).

The normal vector of the tropical hard margin for a tropical SVM is the optimal solution of the following linear programming problem:

(8)
(9)
(10)
(11)

As we discussed earlier, tropical soft margin SVMs are similar to tropical hard margin SVMs. They try to find a tropical hyperplane which maximizes the margin, but they also allow some points to be in a wrong open sector by introducing the extra variables shown in Figure 5. Tang et al. showed in [19] that a soft margin tropical hyperplane for a tropical SVM is the optimal solution of the following linear programming problem:

(12)
(13)
(14)
(15)
(16)

There are still many open questions about tropical SVMs. In general, if we use the methods developed in [19] to find a hard margin or soft margin tropical hyperplane, then we have to go through exponentially many linear programming problems. However, we do not know the exact time complexity of finding a hard margin or soft margin tropical hyperplane for a tropical SVM.

Problem 35.

What is the time complexity of computing a hard or soft margin tropical hyperplane for a tropical SVM over the tropical projective torus? Is it NP-hard?

In addition, the authors in [19] focused on tropical hyperplanes for tropical SVMs over the tropical projective torus, not over the space of ultrametrics. Again, note that the space of ultrametrics is a union of polyhedral cones in the tropical projective torus. Thus we are interested in how the space of ultrametrics and a tropical SVM over the tropical projective torus are related to each other. More specifically:

Problem 36.

Can we describe how a hard or soft margin tropical hyperplane for a tropical SVM over the tropical projective torus separates points in the space of ultrametrics in terms of geometry?

We are also interested in defining a tropical SVM over the space of ultrametrics and in developing algorithms to compute it.

Problem 37.

Define tropical hard and soft margin "hyperplanes" for tropical SVMs over the space of ultrametrics. To define them, can we use a tropical polytope instead of a tropical hyperplane? How can we compute them? Can we formulate this as an optimization problem?

For more details, see the following papers:

  • Tang, Wang, and Yoshida. Tropical Support Vector Machine and its Applications to Phylogenomics. [19].

6.1.2 Tropical Linear Discriminant Analysis (LDA)

In this section we discuss tropical linear discriminant analysis (LDA). LDA is a classical statistical method that classifies a data set into two or more classes while at the same time reducing the dimensionality.

LDA is related to PCA in a Euclidean space, and this relation is shown in Figure 6. The difference between PCA and LDA lies in how the direction of the linear plane is found.

Figure 6: There are two categories in the response variable, red and blue. The middle picture represents PCA and the right picture shows LDA on these points.

For two classes of samples, the linear space for classical LDA can be found as the optimal solution of the following optimization problem:

(17)
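For comparison with the tropical setting, the following sketch computes a discriminant direction for two classes in the Euclidean case. It assumes the standard Fisher criterion (maximize the separation of the projected class means relative to the within-class scatter), which may differ in form from Equation (17); the function name `fisher_direction` and the toy data are illustrative only.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Unit vector w proportional to Sw^{-1} (m1 - m2), where Sw is the
    within-class scatter matrix and m1, m2 are the class means."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two centered Gram matrices.
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(Sw, m1 - m2)  # assumes Sw is non-singular
    return w / np.linalg.norm(w)

# Two hypothetical classes that differ only by a shift along the first axis.
red = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
blue = red + np.array([5.0, 0.0])
w = fisher_direction(red, blue)
print(np.round(np.abs(w), 6))  # direction is along the first axis, up to sign
```

Projecting both classes onto `w` reduces the data to one dimension while keeping the classes apart, which is the behavior a tropical analogue would need to reproduce with tropical distances.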

Here we use the max-plus algebra in the tropical setting, and we consider the tropical projective space for now. Let $d_{\rm tr}$ be the tropical distance between two points in the tropical projective space. Then, by analogy with Equation (17), we can formulate the tropical linear space for tropical LDA as

(18)

Problem 38.

Can we define a tropical LDA over the tropical projective space? If so, how can we find a tropical linear space (or tropical polytope) for a tropical LDA?

Problem 39.

Can we define a tropical LDA over the space of ultrametrics?

6.2 Tropical Regression

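For a classical multiple linear regression with the observed data set

$$\{(x_1, y_1), \ldots, (x_n, y_n)\},$$

where $x_i = (x_{i1}, \ldots, x_{id}) \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, we try to find a vector $\beta = (\beta_0, \beta_1, \ldots, \beta_d)$ such that

$$y_i = \beta_0 + \sum_{j=1}^{d} \beta_j x_{ij} + \epsilon_i,$$

where $\epsilon_i \sim N(0, \sigma)$, with $N(\mu, \sigma)$ denoting the Gaussian distribution with mean $\mu$ and standard deviation $\sigma$, $y_i$ is a response variable, and $x_{i1}, \ldots, x_{id}$ are explanatory variables, with the smallest value of the following:

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{d} \beta_j x_{ij} \Big)^2.$$

(19)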

The value in Equation (19) is called the sum of squared residuals. Thus, for a classical multiple linear regression over a Euclidean space, we try to find the linear hyperplane with the smallest sum of squared residuals.
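As a small illustration of the classical case, the sketch below fits a linear hyperplane by minimizing the sum of squared residuals with NumPy. It assumes a model with an intercept term, and the function name `fit_least_squares` and the toy data are illustrative only.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (beta, ssr): the coefficients (intercept first) that minimize
    the sum of squared residuals, and the minimized value itself."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that beta[0] acts as the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    return beta, float(residuals @ residuals)

# Hypothetical noiseless data generated from y = 1 + 2x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta, ssr = fit_least_squares(X, y)
print(np.round(beta, 6), round(ssr, 6))
```

With noiseless data the fitted coefficients recover the generating line and the sum of squared residuals is zero up to rounding; with noisy data the same call returns the hyperplane with the smallest attainable value.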

For tropical regression over the tropical projective space, one can define a tropical regression "polytope" as the tropical polytope with

Nothing has yet been done on tropical regression. Thus, it would be interesting to see how one can define it in the tropical projective space as well as in the space of ultrametrics.

References

  • [1] M. Akian, S. Gaubert, N. Viorel, and I. Singer. Best approximation in max-plus semimodules. Linear Algebra Appl., 435:3261–3296, 2011.
  • [2] F. Ardila and C. J. Klivans. The Bergman complex of a matroid and phylogenetic trees. Journal of Combinatorial Theory, Series B, 96(1):38–49, 2006.
  • [3] B. Gärtner and M. Jaggi. Tropical support vector machines, 2006.
  • [4] L.J. Billera, S.P. Holmes, and K. Vogtmann. Geometry of the space of phylogenetic trees. Adv Appl Math, 27(4):733–767, 2001.
  • [5] Louis J. Billera, Susan P. Holmes, and Karen Vogtmann. Geometry of the Space of Phylogenetic Trees. Advances in Applied Mathematics, 27(4):733–767, 2001.
  • [6] G. Cohen, S. Gaubert, and J.P. Quadrat. Duality and separation theorems in idempotent semimodules. Linear Algebra Appl., 379:395–422, 2004.
  • [7] A. Gavryushkin and A.J. Drummond. The space of ultrametric phylogenetic trees. Journal of Theoretical Biology, 403:197–208, 2016.
  • [8] C.J. Jardine, N. Jardine, and R. Sibson. The Structure and Construction of Taxonomic Hierarchies. Mathematical Biosciences, 1(2):173–179, 1967.
  • [9] M. Joswig. Essentials of tropical combinatorics, 2017.
  • [10] B. Lin, B. Sturmfels, X. Tang, and R. Yoshida. Convexity in tree spaces. SIAM J. Discrete Math., 31(3):2015–2038, 2017.
  • [11] Bo Lin, Bernd Sturmfels, Xiaoxian Tang, and Ruriko Yoshida. Convexity in Tree Spaces. SIAM Journal on Discrete Mathematics, 31(3):2015–2038, 2017.
  • [12] Bo Lin and Ruriko Yoshida. Tropical Fermat–Weber Points. SIAM Journal on Discrete Mathematics, 2018. To appear. Available at arXiv:1604.04674.
  • [13] D. Maclagan and B. Sturmfels. Introduction to Tropical Geometry, volume 161 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2015.
  • [14] T. M. W. Nye. Principal components analysis in the space of phylogenetic trees. Ann. Stat., 39(5):2716–2739, 2011.
  • [15] Tom M. W. Nye. Principal Components Analysis in the Space of Phylogenetic Trees. The Annals of Statistics, 39(5):2716–2739, 2011.
  • [16] R. Page, R. Yoshida, and L. Zhang. Tropical principal component analysis on the space of ultrametrics, 2019.
  • [17] C. Semple and M. Steel. Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, 2003.
  • [18] D. Speyer and B. Sturmfels. Tropical mathematics. Mathematics Magazine, 82:163–173, 2009.
  • [19] X. Tang, H. Wang, and R. Yoshida. Tropical support vector machine and its applications to phylogenomics, 2020.
  • [20] R. Yoshida, L. Zhang, and X. Zhang. Tropical principal component analysis and its application to phylogenetics. Bulletin of Mathematical Biology, 81:568–597, 2019.