1.1 Overview of information geometry
We present a concise and modern view of the basic structures lying at the heart of Information Geometry (IG), and report some applications of those information-geometric manifolds (termed “information manifolds”) in statistics (Bayesian hypothesis testing) and machine learning (statistical mixture clustering).
By analogy to Information Theory (IT), pioneered by Claude Shannon (in 1948), which primarily considers the communication of messages over noisy transmission channels, we may define Information Sciences (IS) as the fields that study “communication” between (noisy/imperfect) data and families of models (postulated as a priori knowledge). In short, information sciences seek methods to distill information from data to models. Thus, information sciences encompass information theory but also include Probability & Statistics, Machine Learning (ML), Artificial Intelligence (AI), and Mathematical Programming, to name just a few areas.
In §5.2, we review some key milestones of information geometry and report some definitions of the field by its pioneers. A modern and broad definition of information geometry can be stated as the field that studies the geometry of decision making. This definition also includes model fitting (inference), which can be interpreted as a decision problem as illustrated in Figure 1: namely, deciding which model parameter to choose from a family of parametric models. This framework was advocated by Abraham Wald [72, 73, 17], who considered all statistical problems as statistical decision problems. Distances play a crucial role not only for measuring the goodness-of-fit of data to model (say, likelihood in statistics, classifier loss functions in ML, objective functions in mathematical programming, etc.) but also for measuring the discrepancy (or deviance) between models.
Why adopt a geometric approach?
Geometry allows one to study invariance and equivariance (footnote 1) of “figures” in a coordinate-free approach. The geometric language (e.g., ball or projection) also provides affordances that help us reason intuitively about problems. Note that although figures can be visualized (i.e., plotted in coordinate charts), they should be thought of as purely abstract objects, namely, geometric figures.
Footnote 1: For example, the triangle centroid is equivariant under affine transformation. In Statistics, the Maximum Likelihood Estimator (MLE) is equivariant. Let $t(\theta)$ denote a monotonic transformation of the model parameter $\theta$. Then we have $\widehat{t(\theta)} = t(\hat\theta)$, where the MLE is denoted by $\hat{\cdot}$.
The paper is organized as follows:
In the first part (§2), we start by concisely introducing the necessary background of differential geometry in order to define a manifold $(M, g, \nabla)$ equipped with a metric tensor $g$ and an affine connection $\nabla$. We explain how this framework generalizes the Riemannian manifolds $(M, g)$ by stating the fundamental theorem of Riemannian geometry that defines a unique torsion-free metric-compatible Levi-Civita connection from the metric tensor.
In the second part (§3), we explain the dualistic structures of information manifolds: We present the conjugate connection manifolds $(M, g, \nabla, \nabla^*)$, the statistical manifolds $(M, g, C)$ where $C$ is a cubic tensor, and show how to derive a family of information manifolds $(M, g, \nabla^{-\alpha}, \nabla^{\alpha})$ for $\alpha \in \mathbb{R}$ provided any given pair $(\nabla, \nabla^*)$ of conjugate connections. We explain how to get conjugate connections from any smooth (potentially asymmetric) distances (called divergences), present the dually flat manifolds obtained when considering Bregman divergences, and define, when dealing with a parametric family of probability models, the exponential connection ${}^e\nabla$ and the mixture connection ${}^m\nabla$ that are coupled to the Fisher information metric. We discuss the concept of statistical invariance for the metric tensor and the notion of information monotonicity for statistical divergences. It follows that the Fisher metric is the unique invariant metric (up to a scaling factor), and that the $f$-divergences are the unique separable invariant divergences.
In the third part (§4), we illustrate these information-geometric structures with two simple applications: In the first application, we consider Bayesian hypothesis testing and show how Chernoff information, which defines the best error exponent, can be geometrically characterized on the dually flat structure of an exponential family manifold. In the second application, we show how to cluster statistical mixtures sharing the same component distributions on the dually flat mixture family manifold.
Finally, we conclude in §5 by summarizing the important concepts and structures of information geometry, and by providing further references and textbooks [12, 4] on more advanced structures and applications for further reading. We mention recent studies of generic classes of distances/divergences.
At the beginning of each part, we outline its contents. A summary of notations is provided in the Notations section.
2 Prerequisite: Basics of differential geometry
In §2.1, we review the basics of Differential Geometry (DG) for defining a manifold $(M, g, \nabla)$ equipped with both a metric tensor $g$ and an affine connection $\nabla$. We explain these two independent metric/connection structures in §2.2 and in §2.3, respectively. From a connection $\nabla$, we show how to derive the notion of covariant derivative in §2.3.1, parallel transport in §2.3.2 and geodesics in §2.3.3. We further explain the intrinsic curvature and torsion of manifolds induced by the connection in §2.3.4, and state the fundamental theorem of Riemannian geometry in §2.4: The existence of a unique torsion-free Levi-Civita metric connection ${}^{LC}\nabla$ that can be calculated from the metric. Thus Riemannian geometry $(M, g)$ is obtained as a special case of the more general manifold structure $(M, g, \nabla)$: $(M, g) = (M, g, {}^{LC}\nabla)$. Information geometry shall further consider a dual structure $(M, g, \nabla^*)$ associated to $(M, g, \nabla)$, and the pair of dual structures shall form an information manifold $(M, g, \nabla, \nabla^*)$.
2.1 Overview of differential geometry
Informally speaking, a smooth $D$-dimensional manifold $M$ is a topological space that locally behaves like the Euclidean space $\mathbb{R}^D$. Geometric objects (e.g., points and vector fields) and entities (e.g., functions and differential operators) live on $M$, and are coordinate-free but can conveniently be expressed in any local coordinate system (footnote 2) of an atlas $\mathcal{A} = \{(U_i, x_i)\}_i$ of charts $(U_i, x_i)$'s (fully covering the manifold) for calculations. A $C^k$ manifold is obtained when the change of chart transformations are $C^k$. The manifold is said to be smooth when it is $C^\infty$. At each point $p \in M$, a tangent plane $T_p$ locally best linearizes the manifold. On any smooth manifold $M$, we can define two independent structures:
Footnote 2: René Descartes (1596-1650) allegedly invented the Cartesian coordinate system while wondering how to locate a fly on the ceiling from his bed. In practice, we shall use the most expedient coordinate system to facilitate calculations.
a metric tensor $g$, and
an affine connection $\nabla$.
The metric tensor $g$ induces on each tangent plane $T_p$ an inner product space that allows one to measure vector magnitudes (vector “lengths”) and angles/orthogonality between vectors. The affine connection $\nabla$ is a differential operator that allows one to define:
the covariant derivative operator which provides a way to calculate differentials of a vector field $Y$ with respect to another vector field $X$: Namely, the covariant derivative $\nabla_X Y$,
the parallel transport $\prod_c^{\nabla}$ which defines a way to transport vectors on tangent planes along any smooth curve $c$,
the notion of $\nabla$-geodesics $\gamma_\nabla$ which are defined as autoparallel curves, thus extending the ordinary notion of Euclidean straightness,
the intrinsic curvature and torsion of the manifold.
2.2 Metric tensor fields
The tangent bundle (footnote 3) of $M$ is defined as the “union” of all tangent spaces:
$$TM := \cup_p T_p = \{(p, v),\ p \in M,\ v \in T_p\}.$$
A tangent vector $v$ plays the role of a directional derivative (footnote 4), with $vf$ informally meaning the derivative of a smooth function $f$ (belonging to the space of smooth functions $\mathcal{F}(M)$) along the direction $v$.
A smooth vector field $X$ is defined as a “cross-section” of the tangent bundle: $X \in \mathfrak{X}(M) = \Gamma(TM)$, where $\mathfrak{X}(M)$ or $\Gamma(TM)$ denote the space of smooth vector fields.
Footnote 3: The tangent bundle is a particular example of a fiber bundle with base manifold $M$.
Footnote 4: Since the manifolds are abstract and not embedded in some Euclidean space, we do not view a vector as an “arrow” anchored on the manifold. Rather, vectors can be understood in several ways in differential geometry, like directional derivatives or equivalence classes of smooth curves at a point. That is, tangent spaces shall be considered abstract too, like the manifold.
A basis $B = \{b_1, \ldots, b_k\}$ of a finite $k$-dimensional vector space is a maximal linearly independent set of vectors (footnote 5). Tangent spaces carry algebraic structures of vector spaces (footnote 6). Using local coordinates on a chart $(U, x)$, the vector field $X$ can be expressed as $X = \sum_{i=1}^D X^i e_i = X^i e_i$ using Einstein summation convention on dummy indices, where $(X^i)_i$ denotes the contravariant vector components (manipulated as “column vectors” in algebra) in the natural basis $B = \{e_1 = \partial_1, \ldots, e_D = \partial_D\}$ with $\partial_i := \frac{\partial}{\partial x^i}$. A tangent plane (vector space) equipped with an inner product $\langle\cdot,\cdot\rangle$ yields an inner product space. We define a reciprocal basis $B^* = \{e^{*i} = \partial^i\}_i$ of $B = \{e_i = \partial_i\}_i$ so that vectors can also be expressed using the covariant vector components in the natural reciprocal basis. The primal and reciprocal bases are mutually orthogonal by construction, that is, $\langle e_i, e^{*j}\rangle = \delta_i^j$, as illustrated in Figure 2.
Footnote 5: A set of vectors $B = \{b_1, \ldots, b_k\}$ is linearly independent iff $\sum_{i=1}^k \lambda_i b_i = 0$ implies $\lambda_i = 0$ for all $i \in \{1, \ldots, k\}$. That is, in a linearly independent vector set, no vector of the set can be represented as a linear combination of the remaining vectors. A linearly independent vector set is maximal when we cannot add another linearly independent vector.
Footnote 6: Furthermore, to any vector space $V$, we can associate a dual covector space $V^*$ which is the vector space of real-valued linear mappings. We do not enter into details here to preserve this gentle introduction to information geometry with as little intricacy as possible.
For any vector $v$, its contravariant components $v^i$'s (superscript notation) and its covariant components $v_i$'s (subscript notation) can be retrieved from $v$ using the inner product with the use of the reciprocal and primal basis, respectively:
$$v^i = \langle v, e^{*i}\rangle, \qquad v_i = \langle v, e_i\rangle.$$
The inner product defines a metric tensor $g$ and a dual metric tensor $g^*$:
$$g_{ij} := \langle e_i, e_j\rangle, \qquad g^{*ij} := \langle e^{*i}, e^{*j}\rangle.$$
Technically speaking, the metric tensor is a $2$-covariant tensor (footnote 7) field:
$$g = g_{ij}\, dx_i \otimes dx_j,$$
where $\otimes$ is the dyadic tensor product performed on pairwise covector basis $\{dx_i\}_i$ (the covectors corresponding to the reciprocal vector basis). Let $G = [g_{ij}]$ and $G^* = [g^{*ij}]$ denote the $D \times D$ matrices. It follows by construction of the reciprocal basis that $G^* = G^{-1}$. The reciprocal basis vectors $e^{*i}$'s and primal basis vectors $e_i$'s can be expressed using the dual metric $g^*$ and metric $g$ on the primal basis vectors $e_j$'s and reciprocal basis vectors $e^{*j}$'s, respectively:
$$e^{*i} = g^{*ij} e_j, \qquad e_i = g_{ij} e^{*j}.$$
Footnote 7: We do not describe tensors in detail for the sake of brevity. A tensor is a geometric entity of a tensor space that can also be interpreted as a multilinear map. A contravariant vector lives in a vector space while a covariant vector lives in the dual covector space. We recommend this book for a concise and well-explained description of tensors.
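The metric tensor field $g$ (“metric tensor” or “metric” for short) defines a smooth symmetric positive-definite bilinear form on the tangent bundle so that for $u, v \in T_p$, $g(u, v) \in \mathbb{R}$. We can also write equivalently $g_p(u, v) = \langle u, v\rangle_p = \langle u, v\rangle_{g(p)} = \langle u, v\rangle$. Two vectors $u$ and $v$ are said to be orthogonal, denoted by $u \perp v$, iff $\langle u, v\rangle = 0$. The length of a vector is induced from the norm $\|u\|_p = \|u\|_{g(p)} = \sqrt{\langle u, u\rangle_{g(p)}}$. Using local coordinates of a chart $(U, x)$, we get the vector contravariant/covariant components, and compute the metric tensor using matrix algebra (with column vectors by convention) as follows:
$$g(u, v) = (u)_B^\top\, G_{x(p)}\, (v)_B = (u)_{B^*}^\top\, G^{-1}_{x(p)}\, (v)_{B^*},$$
since it follows from the primal/reciprocal basis that $G \times G^* = I$, the identity matrix. Thus on any tangent plane $T_p$, we get a Mahalanobis distance:
$$M_G(u, v) := \|u - v\|_G = \sqrt{\sum_{i=1}^D \sum_{j=1}^D G_{ij}\,(u^i - v^i)(u^j - v^j)}.$$
The inner product of two vectors $u$ and $v$ is a scalar (a $0$-tensor) that can be equivalently calculated as:
$$\langle u, v\rangle := g(u, v) = u^i v_i = u_i v^i.$$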
A metric tensor $g$ of manifold $M$ is said to be conformal when $\langle\cdot,\cdot\rangle_p = \kappa(p)\,\langle\cdot,\cdot\rangle_{\mathrm{Euclidean}}$. That is, when the inner product is a scalar function $\kappa(\cdot)$ of the Euclidean dot product. In conformal geometry, we can measure angles between vectors in tangent planes as if we were in a Euclidean space, without any deformation. This is handy for checking orthogonality (in charts). For example, the Poincaré disk model of hyperbolic geometry is conformal but the Klein disk model is not conformal (except at the origin).
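The relations between primal and reciprocal bases, contravariant and covariant components, and the metric matrix can be checked numerically. The following is a minimal sketch in pure Python, assuming a hypothetical non-orthogonal basis of the Euclidean plane; the basis vectors, the test vector and all function names are illustrative choices, not taken from the text.

```python
# Sketch: primal/reciprocal bases in the Euclidean plane (illustrative numbers).

def dot(u, v):
    """Euclidean dot product, playing the role of the inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical non-orthogonal primal basis {e_1, e_2}.
primal = [(1.0, 0.0), (1.0, 1.0)]

# Metric matrix G = [g_ij] with g_ij = <e_i, e_j>, and dual metric G* = G^{-1}.
G = [[dot(ei, ej) for ej in primal] for ei in primal]
G_dual = inverse_2x2(G)

# Reciprocal basis vectors e^{*i} = g^{*ij} e_j.
reciprocal = [
    tuple(sum(G_dual[i][j] * primal[j][k] for j in range(2)) for k in range(2))
    for i in range(2)
]

v = (3.0, 2.0)
contravariant = [dot(v, e_star) for e_star in reciprocal]  # v^i = <v, e^{*i}>
covariant = [dot(v, e) for e in primal]                    # v_i = <v, e_i>

# Lowering indices with the metric: v_i = g_ij v^j.
lowered = [sum(G[i][j] * contravariant[j] for j in range(2)) for i in range(2)]

# Squared length computed as v^i v_i and as (v)^T G (v).
norm2_mixed = sum(c * k for c, k in zip(contravariant, covariant))
norm2_metric = sum(
    contravariant[i] * G[i][j] * contravariant[j]
    for i in range(2) for j in range(2)
)
```

Both squared lengths agree with the plain dot product of `v` with itself, which illustrates that the inner product does not depend on whether contravariant or covariant components are used.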
2.3 Affine connections
An affine connection is a differential operator defined on a manifold that allows us to define a covariant derivative of vector fields, a parallel transport of vectors on tangent planes along a smooth curve, and geodesics. Furthermore, an affine connection fully characterizes the curvature and torsion of a manifold.
2.3.1 Covariant derivatives of vector fields
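A connection $\nabla$ defines a covariant derivative operator that tells us how to differentiate a vector field $Y$ according to another vector field $X$. The covariant derivative operator is denoted using the traditional gradient symbol $\nabla$. Thus a covariant derivative is a function:
$$\nabla : \mathfrak{X}(M) \times \mathfrak{X}(M) \to \mathfrak{X}(M),$$
that has its own special subscript notation $\nabla_X Y := \nabla(X, Y)$ for indicating that it is differentiating a vector field $Y$ according to another vector field $X$.
By prescribing $D^3$ smooth functions $\Gamma_{ij}^k = \Gamma_{ij}^k(p)$, called the Christoffel symbols of the second kind, we define the unique affine connection $\nabla$ that satisfies in local coordinates of chart $(U, x)$ the following equations:
$$\nabla_{\partial_i} \partial_j = \Gamma_{ij}^k\, \partial_k.$$
The Christoffel symbols can also be written as $\Gamma_{ij}^k := (\nabla_{\partial_i} \partial_j)^k$, where $(\cdot)^k$ denotes the $k$-th coordinate. The $k$-th component $(\nabla_X Y)^k$ of the covariant derivative of vector field $Y$ with respect to vector field $X$ is given by:
$$(\nabla_X Y)^k = X^i \left( \frac{\partial Y^k}{\partial x^i} + \Gamma_{ij}^k\, Y^j \right).$$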
The Christoffel symbols are not tensors (fields) because the transformation rules induced by a change of basis do not obey the tensor contravariant/covariant rules.
2.3.2 Parallel transport along a smooth curve
Since the manifold is not embedded (footnote 8) in a Euclidean space, we cannot add a vector $v \in T_p$ to a vector $v' \in T_{p'}$ as the tangent vector spaces are unrelated to each other without a connection (footnote 9). Thus a connection $\nabla$ defines how to associate vectors between infinitesimally close tangent planes $T_p$ and $T_{p+dp}$. Then the connection allows us to smoothly transport a vector $v \in T_p$ by sliding it (with infinitesimal moves) along a smooth curve $c(t)$ (with $c(0) = p$ and $c(1) = q$), so that the vector $v_p \in T_p$ “corresponds” to a vector $v_q \in T_q$: This is called the parallel transport. This mathematical prescription is necessary in order to study dynamics on manifolds (e.g., study the motion of a particle on the manifold). We can express the parallel transport along the smooth curve $c$ as:
$$\forall v \in T_p,\ \forall t \in [0, 1], \quad v_{c(t)} = \prod_{c(0) \to c(t)}^{\nabla} v \in T_{c(t)}.$$
Footnote 8: Whitney embedding theorem states that any $D$-dimensional Riemannian manifold can be embedded into $\mathbb{R}^{2D}$.
Footnote 9: When embedded, we can implicitly use the ambient Euclidean connection.
The parallel transport is schematically illustrated in Figure 3.
2.3.3 $\nabla$-geodesics $\gamma_\nabla$: Autoparallel curves
A connection $\nabla$ allows one to define $\nabla$-geodesics as autoparallel curves, that is, curves $\gamma$ such that we have:
$$\nabla_{\dot\gamma} \dot\gamma = 0.$$
That is, the velocity vector $\dot\gamma$ is moving along the curve parallel to itself. In other words, $\nabla$-geodesics generalize the notion of “straight Euclidean” lines.
In local coordinates $(U, x)$, with $\gamma(t) = (\gamma^k(t))_k$, the autoparallelism amounts to solving the following second-order Ordinary Differential Equations (ODEs):
$$\ddot\gamma^k(t) + \Gamma_{ij}^k\, \dot\gamma^i(t)\, \dot\gamma^j(t) = 0,$$
where $\Gamma_{ij}^k$ are the Christoffel symbols of the second kind, with:
$$\Gamma_{ij}^k = \Gamma_{ij,l}\, g^{lk}, \qquad \Gamma_{ij,k} = g_{lk}\, \Gamma_{ij}^l,$$
where $\Gamma_{ij,l}$ are the Christoffel symbols of the first kind.
Geodesics are 1D autoparallel submanifolds and $\nabla$-hyperplanes are defined similarly as autoparallel submanifolds of dimension $D - 1$. We may specify in subscript the connection that yields the geodesic $\gamma$: $\gamma_\nabla$.
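To make the geodesic ODEs concrete, here is a minimal sketch that integrates them for the flat plane written in polar coordinates $(r, \varphi)$. It assumes the standard nonzero Christoffel symbols of that chart, $\Gamma^r_{\varphi\varphi} = -r$ and $\Gamma^\varphi_{r\varphi} = \Gamma^\varphi_{\varphi r} = 1/r$, and uses a classical fourth-order Runge-Kutta integrator; the function names and step count are illustrative choices.

```python
import math

def geodesic_rhs(state):
    """Right-hand side of the geodesic ODE in polar coordinates.

    state = (r, phi, dr, dphi). The accelerations are
    ddr = -Gamma^r_{phi phi} dphi^2 and ddphi = -2 Gamma^phi_{r phi} dr dphi.
    """
    r, phi, dr, dphi = state
    ddr = r * dphi * dphi
    ddphi = -2.0 * dr * dphi / r
    return (dr, dphi, ddr, ddphi)

def rk4_step(f, state, h):
    """One classical Runge-Kutta step of size h for the system state' = f(state)."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + h * k for s, k in zip(state, k3)))
    return tuple(
        s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def shoot_geodesic(state, t_end, steps=1000):
    """Integrate the geodesic ODE from t = 0 to t = t_end."""
    h = t_end / steps
    for _ in range(steps):
        state = rk4_step(geodesic_rhs, state, h)
    return state

# Start at the Cartesian point (1, 0) with Cartesian velocity (0, 1),
# i.e. r = 1, phi = 0, dr = 0, dphi = 1 in polar coordinates.
r, phi, _, _ = shoot_geodesic((1.0, 0.0, 0.0, 1.0), t_end=1.0)
x, y = r * math.cos(phi), r * math.sin(phi)
```

Mapped back to Cartesian coordinates, the end point lies on the straight line $x = 1$, which is the expected Euclidean geodesic: the curved-looking polar ODE only reflects the choice of chart, not the geometry.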
2.3.4 Curvature and torsion of a manifold
An affine connection $\nabla$ defines a 4D (footnote 10) Riemann-Christoffel curvature tensor $R$ (expressed using components $R^i_{jkl}$ of a $(1,3)$-tensor). The coordinate-free equation of the curvature tensor is given by:
$$R(X, Y)Z := \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X,Y]} Z,$$
where $[X, Y](f) = X(Y(f)) - Y(X(f))$ ($\forall f \in \mathcal{F}(M)$) is the Lie bracket of vector fields.
A manifold $M$ equipped with a connection $\nabla$ is said to be flat (meaning $\nabla$-flat) when $R = 0$. This holds in particular when finding a particular (footnote 11) coordinate system $x$ of a chart $(U, x)$ such that $\Gamma_{ij}^k = 0$, i.e., when all connection coefficients vanish.
A manifold is torsion-free when the connection is symmetric. A symmetric connection satisfies the following coordinate-free equation:
$$\nabla_X Y - \nabla_Y X = [X, Y].$$
Using local chart coordinates, this amounts to checking that $\Gamma_{ij}^k = \Gamma_{ji}^k$. The torsion tensor is a $(1,2)$-tensor defined by:
$$T(X, Y) := \nabla_X Y - \nabla_Y X - [X, Y].$$
Footnote 10: It follows from symmetry constraints that the number of independent components of the Riemann tensor is $\frac{D^2(D^2-1)}{12}$ in $D$ dimensions.
Footnote 11: For example, the Christoffel symbols vanish in a rectangular coordinate system of a plane but not in the polar coordinate system of it.
In general, the parallel transport is path-dependent. The angle defect of a vector transported on an infinitesimal closed loop (a smooth curve with coinciding extremities) is related to the curvature. However, for a flat connection, the parallel transport does not depend on the path. Figure 4 illustrates the parallel transport along a curve for a curved manifold (the sphere manifold) and a flat manifold (the cylinder manifold, see footnote 12).
Footnote 12: The Gaussian curvature at a point of a manifold is the product of the minimal and maximal sectional curvatures: $\kappa_G := \kappa_{\min}\,\kappa_{\max}$. For a cylinder, since $\kappa_{\min} = 0$, it follows that the Gaussian curvature of a cylinder is $0$. Gauss's Theorema Egregium (meaning “remarkable theorem”) proved that the Gaussian curvature is intrinsic and does not depend on how the surface is embedded into the ambient Euclidean space.
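The path dependence of parallel transport on a curved manifold can be illustrated numerically. The sketch below assumes the unit sphere in spherical coordinates $(\theta, \varphi)$ with its usual metric connection, whose nonzero Christoffel symbols are $\Gamma^\theta_{\varphi\varphi} = -\sin\theta\cos\theta$ and $\Gamma^\varphi_{\theta\varphi} = \Gamma^\varphi_{\varphi\theta} = \cot\theta$, and transports a vector once around a latitude circle by integrating $\dot V^k + \Gamma^k_{ij}\,\dot c^i\,V^j = 0$ with a Runge-Kutta scheme. The function name and step count are illustrative.

```python
import math

def transport_along_latitude(theta0, v, steps=2000):
    """Parallel-transport v = (V^theta, V^phi) once around the latitude theta = theta0.

    The curve is c(t) = (theta0, t) for t in [0, 2*pi], so only the phi
    component of the velocity is nonzero (equal to 1).
    """
    s, c = math.sin(theta0), math.cos(theta0)

    def rhs(vec):
        v_theta, v_phi = vec
        # dV^theta/dt = -Gamma^theta_{phi phi} V^phi
        # dV^phi/dt   = -Gamma^phi_{phi theta} V^theta
        return (s * c * v_phi, -(c / s) * v_theta)

    h = 2.0 * math.pi / steps
    for _ in range(steps):
        k1 = rhs(v)
        k2 = rhs(tuple(x + 0.5 * h * k for x, k in zip(v, k1)))
        k3 = rhs(tuple(x + 0.5 * h * k for x, k in zip(v, k2)))
        k4 = rhs(tuple(x + h * k for x, k in zip(v, k3)))
        v = tuple(
            x + h / 6.0 * (a + 2.0 * b + 2.0 * cc + d)
            for x, a, b, cc, d in zip(v, k1, k2, k3, k4)
        )
    return v

# Latitude at colatitude pi/3: the vector comes back rotated by 2*pi*cos(pi/3) = pi.
v_back = transport_along_latitude(math.pi / 3.0, (1.0, 0.0))

# Equator (a geodesic): the transported vector is unchanged after one loop.
v_equator = transport_along_latitude(math.pi / 2.0, (1.0, 0.0))
```

On the latitude at colatitude $\pi/3$ the vector returns reversed, whereas on a flat cylinder all Christoffel symbols vanish in the natural coordinates and the same loop would return the vector unchanged.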
2.4 The fundamental theorem of Riemannian geometry: The Levi-Civita metric connection
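By definition, an affine connection $\nabla$ is said to be metric compatible with $g$ when it satisfies for any triple $(X, Y, Z)$ of vector fields the following equation:
$$X \langle Y, Z\rangle = \langle \nabla_X Y, Z\rangle + \langle Y, \nabla_X Z\rangle,$$
which can be written equivalently as:
$$X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla_X Z).$$
Using local coordinates and natural basis $\{\partial_i\}$ for vector fields, the metric-compatibility property amounts to checking that we have:
$$\partial_k g_{ij} = \langle \nabla_{\partial_k} \partial_i, \partial_j\rangle + \langle \partial_i, \nabla_{\partial_k} \partial_j\rangle.$$
A property of using a metric-compatible connection is that the parallel transport $\prod^{\nabla}$ of vectors preserves the metric:
$$\langle u, v\rangle_{c(0)} = \left\langle \prod_{c(0) \to c(t)}^{\nabla} u,\ \prod_{c(0) \to c(t)}^{\nabla} v \right\rangle_{c(t)} \quad \forall t.$$
That is, the parallel transport preserves angles (and orthogonality) and lengths of vectors in tangent planes when transported along a smooth curve.
The fundamental theorem of Riemannian geometry states the existence of a unique torsion-free metric compatible connection: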
Theorem 1 (Levi-Civita metric connection)
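There exists a unique torsion-free affine connection compatible with the metric called the Levi-Civita connection ${}^{LC}\nabla$.
The Christoffel symbols of the Levi-Civita connection can be expressed from the metric tensor $g$ as follows:
$${}^{LC}\Gamma_{ij}^k = \frac{1}{2}\, g^{kl} \left( \partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij} \right),$$
where $g^{ij}$ denote the matrix elements of the inverse matrix $g^{-1}$.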
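The expression of the Levi-Civita Christoffel symbols in terms of the metric lends itself to a direct numerical check. The following sketch assumes the Euclidean plane in polar coordinates $(r, \varphi)$, with metric matrix $\mathrm{diag}(1, r^2)$, and approximates the partial derivatives of the metric by central finite differences; the function names and the step size are illustrative choices.

```python
def polar_metric(x):
    """Metric matrix [g_ij] of the Euclidean plane in polar coordinates x = (r, phi)."""
    r, _phi = x
    return [[1.0, 0.0], [0.0, r * r]]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def metric_derivative(metric, x, l, h=1e-5):
    """Central finite difference of the metric matrix w.r.t. coordinate l."""
    xp, xm = list(x), list(x)
    xp[l] += h
    xm[l] -= h
    gp, gm = metric(xp), metric(xm)
    return [[(gp[i][j] - gm[i][j]) / (2.0 * h) for j in range(2)] for i in range(2)]

def levi_civita_christoffel(metric, x):
    """Return gamma[k][i][j] = 1/2 g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij})."""
    g_inv = inverse_2x2(metric(x))
    dg = [metric_derivative(metric, x, l) for l in range(2)]  # dg[l][i][j] = d_l g_ij
    gamma = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
    for k in range(2):
        for i in range(2):
            for j in range(2):
                gamma[k][i][j] = 0.5 * sum(
                    g_inv[k][l] * (dg[i][j][l] + dg[j][i][l] - dg[l][i][j])
                    for l in range(2)
                )
    return gamma

gamma = levi_civita_christoffel(polar_metric, (2.0, 0.3))
gamma_r_phiphi = gamma[0][1][1]   # expected to be close to -r = -2
gamma_phi_rphi = gamma[1][0][1]   # expected to be close to 1/r = 0.5
```

The computed symbols are symmetric in their two lower indices, as expected from a torsion-free connection, and they do not vanish even though the plane is flat, since polar coordinates are not an affine coordinate system.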
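The Levi-Civita connection can also be defined coordinate-free with the Koszul formula:
$$2 g(\nabla_X Y, Z) = X(g(Y, Z)) + Y(g(X, Z)) - Z(g(X, Y)) + g([X, Y], Z) - g([X, Z], Y) - g([Y, Z], X).$$
There exist metric-compatible connections with torsion, which are studied in theoretical physics. See for example the flat Weitzenböck connection.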
The metric tensor induces the torsion-free metric-compatible Levi-Civita connection that determines the local structure of the manifold. However, the metric does not fix the global topological structure: For example, although a cone and a cylinder have locally the same flat Euclidean metric, they exhibit different global structures.
2.5 Preview: Information geometry versus Riemannian geometry
In information geometry, we consider a pair of conjugate affine connections $\nabla$ and $\nabla^*$ (often but not necessarily torsion-free) that are coupled to the metric $g$: The structure is conventionally written as $(M, g, \nabla, \nabla^*)$. The key property is that those conjugate connections are metric compatible, and therefore the induced dual parallel transport preserves the metric:
$$\langle u, v\rangle_{c(0)} = \left\langle \prod_{c(0) \to c(t)}^{\nabla} u,\ \prod_{c(0) \to c(t)}^{\nabla^*} v \right\rangle_{c(t)}.$$
Thus the Riemannian manifold $(M, g)$ can be interpreted as the self-dual information-geometric manifold obtained for $\nabla = \nabla^* = {}^{LC}\nabla$, the unique torsion-free Levi-Civita metric connection: $(M, g) \equiv (M, g, {}^{LC}\nabla, {}^{LC}\nabla^* = {}^{LC}\nabla)$. However, let us point out that for a pair of self-dual Levi-Civita conjugate connections, the information-geometric manifold does not induce a distance. This contrasts with the Riemannian modeling $(M, g)$ which provides a Riemannian metric distance $D_\rho(p, q)$ defined by the length of the geodesic $\gamma$ connecting the two points $p = \gamma(0)$ and $q = \gamma(1)$ (shortest path):
$$D_\rho(p, q) := \int_0^1 \|\dot\gamma(t)\|_{\gamma(t)}\, dt = \int_0^1 \sqrt{g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t))}\, dt.$$
Usually, this Riemannian geodesic distance is not available in closed form (and needs to be approximated or bounded) because the geodesics cannot be explicitly parameterized (see geodesic shooting methods).
We are now ready to introduce the key geometric structures of information geometry.
3 Information manifolds
In this part, we explain the dualistic structures of manifolds in information geometry.
In §3.2, we first present the core Conjugate Connection Manifolds (CCMs) $(M, g, \nabla, \nabla^*)$, and show how to build Statistical Manifolds (SMs) $(M, g, C)$ from a CCM in §3.3.
From any statistical manifold, we can build a $1$-parameter family $(M, g, \nabla^{-\alpha}, \nabla^{\alpha})$ of CCMs, the information $\alpha$-manifolds. We state the fundamental theorem of information geometry in §3.5.
These CCM and SM structures are not related to any distance a priori but require at first a pair $(\nabla, \nabla^*)$ of conjugate connections coupled to a metric tensor $g$.
We show two methods to build an initial pair of conjugate connections.
A first method consists in building a pair of conjugate connections $({}^D\nabla, {}^D\nabla^*)$ from any divergence $D$ in §3.6.
Thus we obtain self-conjugate connections when the divergence is symmetric: $D(\theta_1 : \theta_2) = D(\theta_2 : \theta_1)$.
When the divergences are Bregman divergences (i.e., $D = B_F$ for a strictly convex and differentiable Bregman generator $F$), we obtain Dually Flat Manifolds (DFMs) in §3.7.
DFMs nicely generalize the Euclidean geometry and exhibit Pythagorean theorems.
We further characterize when orthogonal $\nabla$-projections and dual $\nabla^*$-projections of a point onto a submanifold are unique (footnote 13).
Footnote 13: In Euclidean geometry, the orthogonal projection of a point onto an affine subspace is proved to be unique using the Pythagorean theorem.
A second method to get a pair of conjugate connections $({}^e\nabla, {}^m\nabla)$ consists in defining these connections from a regular parametric family of probability distributions $\mathcal{P} = \{p_\theta(x)\}_\theta$. In that case, these ‘e’xponential connection ${}^e\nabla$ and ‘m’ixture connection ${}^m\nabla$ are coupled to the Fisher information metric ${}_{\mathcal{P}}g$. A statistical manifold $(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}C)$ can be recovered by considering the skewness Amari-Chentsov cubic tensor ${}_{\mathcal{P}}C$, and it follows a $1$-parameter family of CCMs, $(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}\nabla^{-\alpha}, {}_{\mathcal{P}}\nabla^{+\alpha})$, the statistical expected $\alpha$-manifolds. In this parametric statistical context, these information manifolds are called expected information manifolds because the various quantities are expressed from statistical expectations $E_\cdot[\cdot]$. Notice that these information manifolds can be used in information sciences in general, beyond the traditional fields of statistics. In statistics, we motivate the choice of the connections, metric tensors and divergences by studying statistical invariance criteria, in §3.9. We explain how to recover the expected $\alpha$-connections from standard $f$-divergences that are the only separable divergences that satisfy the property of information monotonicity. Finally, in §3.10, we recall the Fisher-Rao expected Riemannian manifolds that are Riemannian manifolds $(\mathcal{P}, {}_{\mathcal{P}}g)$ equipped with a geodesic metric distance called the Fisher-Rao distance, or Rao distance for short.
3.2 Conjugate connection manifolds: $(M, g, \nabla, \nabla^*)$
We begin with a definition:
Definition 1 (Conjugate connections)
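A connection $\nabla^*$ is said to be conjugate to a connection $\nabla$ with respect to the metric tensor $g$ if and only if we have for any triple $(X, Y, Z)$ of smooth vector fields the following identity satisfied:
$$X \langle Y, Z\rangle = \langle \nabla_X Y, Z\rangle + \langle Y, \nabla_X^* Z\rangle.$$
We can notationally rewrite Eq. 31 as:
$$X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla_X^* Z),$$
and further make explicit that for each point $p \in M$, we have:
$$X_p\, g_p(Y_p, Z_p) = g_p((\nabla_X Y)_p, Z_p) + g_p(Y_p, (\nabla_X^* Z)_p).$$
We check that the right-hand-side is a scalar and that the left-hand-side is a directional derivative of a real-valued function, that is also a scalar.
Conjugation is an involution: $(\nabla^*)^* = \nabla$.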
Definition 2 (Conjugate Connection Manifold)
The structure of the Conjugate Connection Manifold (CCM) is denoted by $(M, g, \nabla, \nabla^*)$, where $(\nabla, \nabla^*)$ are conjugate connections with respect to the metric $g$.
A remarkable property is that the dual parallel transport of vectors preserves the metric. That is, for any smooth curve $c(t)$, the inner product is conserved when we transport one of the vectors $u$ using the primal parallel transport $\prod_c^{\nabla}$ and the other vector $v$ using the dual parallel transport $\prod_c^{\nabla^*}$.
Property 1 (Dual parallel transport preserves the metric)
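A pair $(\nabla, \nabla^*)$ of conjugate connections preserves the metric $g$ if and only if:
$$\forall t \in [0, 1], \quad \left\langle \prod_{c(0) \to c(t)}^{\nabla} u,\ \prod_{c(0) \to c(t)}^{\nabla^*} v \right\rangle_{c(t)} = \langle u, v\rangle_{c(0)}.$$
Given a connection $\nabla$ on $(M, g)$ (i.e., a structure $(M, g, \nabla)$), there exists a unique conjugate connection $\nabla^*$ (i.e., a dual structure $(M, g, \nabla^*)$).
We consider a manifold $M$ equipped with a pair of conjugate connections $\nabla$ and $\nabla^*$ that are coupled with the metric tensor $g$ so that the dual parallel transport preserves the metric. We define the mean connection $\bar\nabla$:
$$\bar\nabla = \frac{\nabla + \nabla^*}{2},$$
with corresponding Christoffel coefficients denoted by $\bar\Gamma$. This mean connection coincides with the Levi-Civita metric connection:
$$\bar\nabla = {}^{LC}\nabla.$$
The mean connection $\bar\nabla$ is self-conjugate, and coincides with the Levi-Civita metric connection.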
3.3 Statistical manifolds: $(M, g, C)$
Lauritzen introduced this corner structure of information geometry in 1987. Beware that although it bears the name “statistical manifold,” it is a purely geometric construction that may be used outside of the field of Statistics. However, as we shall mention later, we can always find a statistical model $\mathcal{P}$ corresponding to a statistical manifold $(M, g, C)$. We shall see how we can convert a conjugate connection manifold into such a statistical manifold, and how we can subsequently derive an infinite family of CCMs from a statistical manifold. In other words, once we have a pair of conjugate connections, we will be able to build a family of pairs of conjugate connections.
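We define a totally symmetric (footnote 14) cubic $(0,3)$-tensor (i.e., $3$-covariant tensor) called the Amari-Chentsov tensor:
$$C_{ijk} := \Gamma^*_{ij,k} - \Gamma_{ij,k},$$
or in coordinate-free equation:
$$C(X, Y, Z) := \langle \nabla_X^* Y - \nabla_X Y, Z\rangle.$$
Using the local basis, this cubic tensor can be expressed as:
$$C_{ijk} = C(\partial_i, \partial_j, \partial_k) = \langle \nabla_{\partial_i}^* \partial_j - \nabla_{\partial_i} \partial_j, \partial_k\rangle.$$
Footnote 14: This means that $C_{ijk} = C_{\sigma(i)\sigma(j)\sigma(k)}$ for any permutation $\sigma$. The metric tensor is totally symmetric.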
Definition 3 (Statistical manifold)
A statistical manifold $(M, g, C)$ is a manifold $M$ equipped with a metric tensor $g$ and a totally symmetric cubic tensor $C$.
3.4 A family of conjugate connection manifolds
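For any pair $(\nabla, \nabla^*)$ of conjugate connections, we can define a $1$-parameter family of connections $\{\nabla^\alpha\}_{\alpha \in \mathbb{R}}$, called the $\alpha$-connections, such that $(\nabla^{-\alpha}, \nabla^{\alpha})$ are dually coupled to the metric, with $\nabla^0 = \bar\nabla = {}^{LC}\nabla$, $\nabla^1 = \nabla$ and $\nabla^{-1} = \nabla^*$. By observing that the scaled cubic tensor $\alpha C$ is also a totally symmetric cubic $3$-covariant tensor, we can derive the $\alpha$-connections from a statistical manifold $(M, g, C)$ as:
$$\Gamma^{\alpha}_{ij,k} = \Gamma^{0}_{ij,k} - \frac{\alpha}{2}\, C_{ijk}, \qquad \Gamma^{-\alpha}_{ij,k} = \Gamma^{0}_{ij,k} + \frac{\alpha}{2}\, C_{ijk},$$
where $\Gamma^{0}_{ij,k}$ are the Levi-Civita Christoffel symbols, and $\Gamma_{ki,j} = \Gamma_{ki}^l\, g_{lj}$ (by index juggling). Here the cubic tensor is taken with the sign convention $C_{ijk} = \Gamma^*_{ij,k} - \Gamma_{ij,k}$, so that $\alpha = 1$ recovers $\nabla$.
The $\alpha$-connection $\nabla^\alpha$ can also be defined as follows:
$$g(\nabla^{\alpha}_X Y, Z) = g({}^{LC}\nabla_X Y, Z) - \frac{\alpha}{2}\, C(X, Y, Z), \quad \forall X, Y, Z \in \mathfrak{X}(M).$$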
Theorem 2 (Family of information -manifolds)
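For any $\alpha \in \mathbb{R}$, $(M, g, \nabla^{-\alpha}, \nabla^{\alpha} = (\nabla^{-\alpha})^*)$ is a conjugate connection manifold.
The $\alpha$-connections can also be constructed directly from a pair $(\nabla, \nabla^*)$ of conjugate connections by taking the following weighted combination:
$$\nabla^{\alpha} = \frac{1 + \alpha}{2}\, \nabla + \frac{1 - \alpha}{2}\, \nabla^*.$$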
3.5 The fundamental theorem of information geometry: $\nabla$ $\kappa$-curved $\Leftrightarrow$ $\nabla^*$ $\kappa$-curved
We now state the fundamental theorem of information geometry and its corollaries:
Theorem 3 (Dually constant curvature manifolds)
If a torsion-free affine connection $\nabla$ has constant curvature $\kappa$ then its conjugate torsion-free connection $\nabla^*$ has necessarily the same constant curvature $\kappa$.
The proof is reported in  (Proposition 8.1.4, page 226). We get the following two corollaries:
Corollary 1 (Dually $\alpha$-flat manifolds)
A manifold $(M, g, \nabla^{-\alpha}, \nabla^{\alpha})$ is $\nabla^{\alpha}$-flat if and only if it is $\nabla^{-\alpha}$-flat.
Corollary 2 (Dually flat manifolds ($\alpha = \pm 1$))
A manifold $(M, g, \nabla, \nabla^*)$ is $\nabla$-flat if and only if it is $\nabla^*$-flat.
Thus once we are given a pair of conjugate connections, we can always build a $1$-parametric family of manifolds. Manifolds with constant curvature $\kappa$ are interesting from the computational viewpoint as dual geodesics have simple closed-form expressions.
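3.6 Conjugate connections from divergences: $(M, D) \equiv (M, {}^Dg, {}^D\nabla, {}^D\nabla^* = {}^{D^*}\nabla)$
Loosely speaking, a divergence $D : M \times M \to [0, \infty)$ is a smooth distance, potentially asymmetric. In order to define precisely a divergence, let us first introduce the following handy notations, where indices before the comma denote partial derivatives with respect to the first argument and indices after the comma denote partial derivatives with respect to the second argument: $\partial_{i,\cdot} f(x, y) = \frac{\partial}{\partial x^i} f(x, y)$, $\partial_{\cdot,j} f(x, y) = \frac{\partial}{\partial y^j} f(x, y)$, and $\partial_{ij,k} f(x, y) = \frac{\partial^2}{\partial x^i \partial x^j} \frac{\partial}{\partial y^k} f(x, y)$, etc.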
Definition 4 (Divergence)
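A divergence $D : M \times M \to [0, \infty)$ on a manifold $M$ with respect to a local chart $\Theta \subset \mathbb{R}^D$ is a $C^3$-function satisfying the following properties:
$D(\theta : \theta') \geq 0$ for all $\theta, \theta' \in \Theta$ with equality holding iff $\theta = \theta'$ (law of the indiscernibles),
$\partial_{i,\cdot} D(\theta : \theta')|_{\theta = \theta'} = \partial_{\cdot,j} D(\theta : \theta')|_{\theta = \theta'} = 0$ for all $i, j \in \{1, \ldots, D\}$,
$-\partial_{i,\cdot}\partial_{\cdot,j} D(\theta : \theta')|_{\theta = \theta'}$ is positive-definite.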
The dual divergence is defined by swapping the arguments:
$$D^*(\theta : \theta') := D(\theta' : \theta),$$
and is also called the reverse divergence (reference duality in information geometry). Reference duality of divergences is an involution: $(D^*)^* = D$.
The Euclidean distance is a metric distance but not a divergence. The squared Euclidean distance is a non-metric symmetric divergence. The metric tensor $g$ yields a Riemannian metric distance $D_\rho$ but it is never a divergence.
From any given divergence $D$, we can define a conjugate connection manifold following the construction of Eguchi (1983):
Theorem 4 (Manifold from divergence)
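$(M, {}^Dg, {}^D\nabla, {}^{D^*}\nabla)$ is an information manifold with (indices before the comma differentiate the first argument of $D$, indices after the comma differentiate the second argument):
$${}^Dg_{ij} := -\partial_{i,j} D(\theta : \theta')|_{\theta = \theta'} = {}^{D^*}g_{ij},$$
$${}^D\Gamma_{ij,k} := -\partial_{ij,k} D(\theta : \theta')|_{\theta = \theta'}, \qquad {}^{D^*}\Gamma_{ij,k} := -\partial_{k,ij} D(\theta : \theta')|_{\theta = \theta'}.$$
The associated statistical manifold is $(M, {}^Dg, {}^DC)$ with:
$${}^DC_{ijk} = {}^{D^*}\Gamma_{ij,k} - {}^D\Gamma_{ij,k}.$$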
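Since $\alpha\, {}^DC$ is a totally symmetric cubic tensor for any $\alpha \in \mathbb{R}$, we can derive a one-parameter family of conjugate connection manifolds:
$$\left\{ (M, {}^Dg, \alpha\, {}^DC) \equiv (M, {}^Dg, {}^D\nabla^{-\alpha}, ({}^D\nabla^{-\alpha})^* = {}^D\nabla^{\alpha}) \right\}_{\alpha \in \mathbb{R}}.$$
In the remainder, we use the shortcut $(M, D)$ to denote the divergence-induced information manifold $(M, {}^Dg, {}^D\nabla, {}^D\nabla^*)$. Notice that it follows from construction that:
$${}^D\nabla^* = {}^{D^*}\nabla.$$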
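As an illustration of the construction of Theorem 4, the following sketch recovers the metric and the two connection coefficients from a divergence by finite differences. It assumes, as an illustrative one-dimensional example not taken from the text, the Kullback-Leibler divergence between Bernoulli distributions parameterized by their success probability, and it uses the convention that the metric takes one derivative in each argument, the primal coefficient two derivatives in the first argument and one in the second, and the dual coefficient the reverse. Function names and the step size are hypothetical.

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence D(p:q) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def d1(f, x, h):
    """Central finite-difference approximation of the first derivative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def d2(f, x, h):
    """Central finite-difference approximation of the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def induced_structure(div, t, h=1e-3):
    """Metric, primal and dual connection coefficients, and cubic tensor at t.

    Derivatives with respect to `a` act on the first argument of div,
    derivatives with respect to `b` act on the second argument.
    """
    g = -d1(lambda a: d1(lambda b: div(a, b), t, h), t, h)
    gamma = -d2(lambda a: d1(lambda b: div(a, b), t, h), t, h)
    gamma_dual = -d1(lambda a: d2(lambda b: div(a, b), t, h), t, h)
    cubic = gamma_dual - gamma
    return g, gamma, gamma_dual, cubic

g, gamma, gamma_dual, cubic = induced_structure(kl_bernoulli, 0.3)
```

In this example the induced metric matches the Fisher information $1/(p(1-p))$ of the Bernoulli family, and the primal coefficient vanishes in the chosen coordinate while the dual one does not, so the two connections obtained from an asymmetric divergence differ.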
3.7 Dually flat manifolds (Bregman geometry): $(M, F) \equiv (M, {}^{B_F}g, {}^{B_F}\nabla, {}^{B_F}\nabla^* = {}^{B_{F^*}}\nabla)$
We consider dually flat manifolds that satisfy asymmetric Pythagorean theorems. These flat manifolds can be obtained from a canonical Bregman divergence.
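Consider a strictly convex smooth function $F(\theta)$ called a potential function, with $\theta \in \Theta$ where $\Theta$ is an open convex domain. Notice that the function convexity does not change by an affine transformation. We associate to the potential function $F$ a corresponding Bregman divergence (parameter divergence):
$$B_F(\theta : \theta') := F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta').$$
We write also the Bregman divergence between point $P$ and point $Q$ as $D(P : Q) := B_F(\theta(P) : \theta(Q))$, where $\theta(P)$ denotes the coordinates of a point $P$.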
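A Bregman divergence is straightforward to evaluate once the generator and its gradient are available. The sketch below assumes two classical generators chosen for illustration (they are not singled out by the text): half the squared Euclidean norm, which yields half the squared Euclidean distance, and the negative Shannon entropy on positive vectors, which yields the Kullback-Leibler divergence when both arguments sum to one. All function names are hypothetical.

```python
import math

def bregman(F, grad_F, theta, theta_prime):
    """Bregman divergence B_F(theta : theta') for a generator F with gradient grad_F."""
    diff = [a - b for a, b in zip(theta, theta_prime)]
    linear_term = sum(d * g for d, g in zip(diff, grad_F(theta_prime)))
    return F(theta) - F(theta_prime) - linear_term

def half_sq_norm(theta):
    """Generator F(theta) = 1/2 * ||theta||^2."""
    return 0.5 * sum(t * t for t in theta)

def grad_half_sq_norm(theta):
    return list(theta)

def neg_entropy(theta):
    """Generator F(theta) = sum_i theta_i log theta_i (positive coordinates assumed)."""
    return sum(t * math.log(t) for t in theta)

def grad_neg_entropy(theta):
    return [math.log(t) + 1.0 for t in theta]

p = (0.2, 0.8)
q = (0.5, 0.5)

half_sq_dist = bregman(half_sq_norm, grad_half_sq_norm, p, q)   # symmetric case
kl_pq = bregman(neg_entropy, grad_neg_entropy, p, q)            # KL(p : q)
kl_qp = bregman(neg_entropy, grad_neg_entropy, q, p)            # KL(q : p)
```

The first generator gives a symmetric divergence, whereas the second one gives two different values when the arguments are swapped, which illustrates why the reference duality matters for Bregman divergences.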
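The induced information-geometric structure is $(M, {}^Fg, {}^FC) := (M, {}^{B_F}g, {}^{B_F}C)$ with:
$${}^Fg_{ij} := -\partial_{i,\cdot}\partial_{\cdot,j} B_F(\theta : \theta')|_{\theta' = \theta} = \partial_i \partial_j F(\theta), \quad \text{i.e., } {}^Fg = \nabla^2 F(\theta),$$
$${}^F\Gamma_{ij,k}(\theta) := {}^{B_F}\Gamma_{ij,k}(\theta) = 0,$$
$${}^FC_{ijk} := {}^{B_F}C_{ijk} = \partial_i \partial_j \partial_k F(\theta).$$
Since all coefficients of the Christoffel symbols vanish (Eq. 54), the information manifold is ${}^F\nabla$-flat. The Levi-Civita connection ${}^{LC}\nabla$ is obtained from the metric tensor ${}^Fg$ (usually not flat), and we get the conjugate connection $({}^F\nabla)^*$ from $(M, {}^Fg, {}^FC)$.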
The Legendre-Fenchel transformation yields the convex conjugate $F^*$ that is interpreted as the dual potential function:
$$F^*(\eta) := \sup_{\theta \in \Theta} \left\{ \theta^\top \eta - F(\theta) \right\}.$$
Theorem 5 (Fenchel-Moreau biconjugation)
If $F$ is a lower semicontinuous (footnote 15) and convex function, then its Legendre-Fenchel transformation is involutive: $(F^*)^* = F$ (biconjugation).
Footnote 15: A function $f$ is lower semicontinuous (lsc) at $x_0$ iff $f(x_0) \leq \liminf_{x \to x_0} f(x)$. A function $f$ is lsc if it is lsc at $x$ for all $x$ in the function domain.
In a dually flat manifold, there exist two dual affine coordinate systems $\eta(\theta) = \nabla F(\theta)$ and $\theta(\eta) = \nabla F^*(\eta)$.
We have the Crouzeix identity relating the Hessians of the potential functions:
$$\nabla^2 F(\theta)\, \nabla^2 F^*(\eta) = I,$$
where $I$ denotes the $D \times D$ identity matrix. This Crouzeix identity reveals that $B = \{\partial_i\}_i$ and $B^* = \{\partial^j\}_j$ are the primal and reciprocal basis, respectively.
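The Bregman divergence can be reinterpreted using Young-Fenchel (in)equality as the canonical divergence $A_{F,F^*}$:
$$B_F(\theta : \theta') = A_{F,F^*}(\theta : \eta') = F(\theta) + F^*(\eta') - \theta^\top \eta' = A_{F^*,F}(\eta' : \theta).$$
The dual Bregman divergence $B_F^*(\theta : \theta') := B_F(\theta' : \theta) = B_{F^*}(\eta : \eta')$ yields
$${}^Fg^{ij}(\eta) = \partial^i \partial^j F^*(\eta), \qquad {}^F\Gamma^{*ijk}(\eta) = 0, \qquad {}^FC^{ijk}(\eta) = \partial^i \partial^j \partial^k F^*(\eta), \qquad \partial^l := \frac{\partial}{\partial \eta_l}.$$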
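Thus the information manifold is both ${}^F\nabla$-flat and ${}^F\nabla^*$-flat: This structure is called a dually flat manifold (DFM). In a DFM, we have two global affine coordinate systems $\theta(\cdot)$ and $\eta(\cdot)$ related by the Legendre-Fenchel transformation of a pair of potential functions $F$ and $F^*$. That is, $(M, F) \equiv (M, F^*)$, and the dual atlases are $\mathcal{A} = \{(M, \theta)\}$ and $\mathcal{A}^* = \{(M, \eta)\}$.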
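The Legendre-Fenchel duality of a dually flat manifold can be checked on a one-dimensional example. The sketch below assumes the potential $F(\theta) = \log(1 + e^\theta)$ (the log-normalizer of the Bernoulli family, chosen here for illustration), whose convex conjugate is $F^*(\eta) = \eta\log\eta + (1-\eta)\log(1-\eta)$ on $(0, 1)$. It verifies numerically the Young-Fenchel equality at dual coordinates, the identity between a Bregman divergence and its dual with swapped arguments, and the Crouzeix identity. All function names are hypothetical.

```python
import math

def F(theta):
    """Potential function F(theta) = log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))

def grad_F(theta):
    """Dual coordinate eta = F'(theta) (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-theta))

def hess_F(theta):
    eta = grad_F(theta)
    return eta * (1.0 - eta)

def F_star(eta):
    """Convex conjugate F*(eta) = eta log eta + (1 - eta) log(1 - eta)."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)

def grad_F_star(eta):
    """Primal coordinate theta = F*'(eta) (logit)."""
    return math.log(eta / (1.0 - eta))

def hess_F_star(eta):
    return 1.0 / (eta * (1.0 - eta))

def bregman(G, grad_G, a, b):
    """One-dimensional Bregman divergence B_G(a : b)."""
    return G(a) - G(b) - (a - b) * grad_G(b)

theta1, theta2 = 0.5, -1.2
eta1, eta2 = grad_F(theta1), grad_F(theta2)

young_gap = F(theta1) + F_star(eta1) - theta1 * eta1        # should vanish
primal_div = bregman(F, grad_F, theta1, theta2)             # B_F(theta1 : theta2)
dual_div = bregman(F_star, grad_F_star, eta2, eta1)         # B_{F*}(eta2 : eta1)
canonical_div = F(theta1) + F_star(eta2) - theta1 * eta2    # A_{F,F*}(theta1 : eta2)
crouzeix = hess_F(theta1) * hess_F_star(eta1)               # should equal 1
```

The three divergence values agree up to rounding, which reflects that the same dissimilarity can be computed in the $\theta$-coordinates, in the $\eta$-coordinates, or in mixed coordinates.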
In a dually flat manifold, any pair of points $P$ and $Q$ can either be linked using the $\nabla$-geodesic (that is $\theta$-straight) or the $\nabla^*$-geodesic (that is $\eta$-straight). In general, there are