# Innovation Representation of Stochastic Processes with Application to Causal Inference

Typically, real-world stochastic processes are not easy to analyze. In this work we study the representation of any stochastic process as a memoryless innovation process triggering a dynamic system. We show that such a representation is always feasible for innovation processes taking values over a continuous set. However, the problem becomes more challenging when the alphabet size of the innovation is finite. In this case, we introduce both lossless and lossy frameworks, and provide closed-form solutions and practical algorithmic methods. In addition, we discuss the properties and uniqueness of our suggested approach. Finally, we show that the innovation representation problem has many applications. We focus our attention to Entropic Causal Inference, which has recently demonstrated promising performance, compared to alternative methods.

## Authors

• 15 publications
• 19 publications
• 14 publications
01/19/2022

### Data-Driven Innovation: What Is It

The future of innovation processes is anticipated to be more data-driven...
11/06/2021

### The How and Why of Bayesian Nonparametric Causal Inference

Spurred on by recent successes in causal inference competitions, Bayesia...
05/10/2020

### Maximal Algorithmic Caliber and Algorithmic Causal Network Inference: General Principles of Real-World General Intelligence?

Ideas and formalisms from far-from-equilibrium thermodynamics are ported...
06/23/2021

### Closed-Form, Provable, and Robust PCA via Leverage Statistics and Innovation Search

The idea of Innovation Search, which was initially proposed for data clu...
10/31/2019

### Causal Inference via Conditional Kolmogorov Complexity using MDL Binning

Recent developments have linked causal inference with Algorithmic Inform...
07/20/2020

### Identification, Tracking and Impact: Understanding the trade secret of catchphrases

Understanding the topical evolution in industrial innovation is a challe...
04/18/2020

### Integer-valued autoregressive process with flexible marginal and innovation distributions

INteger Auto-Regressive (INAR) processes are usually defined by specifyi...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Consider a time-dependent stochastic process , and a corresponding realization . In this work we study an innovation representation problem of the form

 Xk≈g(Yk,xk−1) (1)

where is to be accurately described by a memoryless (independent over time) process that triggers a deterministic system, . We refer to as the innovation process of , as it summarizes all the new information that is injected to the process at time . Therefore, the representation problem in 1 is equivalent to a sequential reconstruction of a memoryless process from a given process .

Over the years, several methods have been introduced for related problems. The Gram-Schmidt procedure [2]

suggests a simple sequential method which projects every new component on the linear span of the components that where previously observed. The difference between the current component and its projection is guaranteed to be orthogonal to all previous components. Applied to a Gaussian process, orthogonality implies statistical independence and the subsequent process is therefore considered memoryless. On the other hand, non-Gaussian processes do not hold this property and a generalized form of generating a memoryless process from any given time dependent series is required. Several non-sequential methods such as Principal Components Analysis

[3][4, 5, 6] have received a great deal of attention, but we are aware of a little previous work on sequential schemes for generating memoryless innovation processes.

The importance of innovation representation spans a variety of fields. One example is dynamic system analysis in which complicated time dependent processes are approximated as independent processes triggering a dynamic system (human speech mechanism, for instance). Another common example is cryptography, where a memoryless language is easier to encrypt as it prevents an eavesdropper from learning the code by comparing its statistics with those of the serially correlated language. In Communications, Shayevitz and Feder presented the Posterior Matching (PM) principle [7], where an essential part of their scheme is to produce statistical independence between every two consecutive transmissions. Recently, innovation representation was further applied to causal inference [8]. Here, given two variables , we say that causes if can be represented as mapping of and an additional “small" independent innovation variable , i.e. . We discuss causal inference in detail in Section VI-A.

In this work we address the innovation representation problem in a broad perspective and introduce a general framework to construct memoryless processes from any given time-dependent process, under different objective functions and constraints.

## Ii Problem Formulation

Let

be a random process, described by its cumulative distribution function

. As mentioned above, we would like to construct such that:

1. can be uniquely recovered from for any .

In other words, we are looking for a sequential invertible transformation on the set of random variables

, so that the resulting variables are statistically independent.

Interestingly, we show that the two requirements can always be satisfied if we allow to take values over a continuous set. However, this property does not always hold when is restricted to take values over a finite alphabet and need to be relaxed in the general case. We discuss the continuous case in the next section, followed by a discussion on the discrete case in the remaining sections of this manuscript.

## Iii Innovation Representation for Continuous Variables

Following the footsteps of the Posterior Matching scheme [7], we define a generalized Gram-Schmidt method in the continuous case.

###### Theorem 1

Let be a random variable and be statistically independent of it. In order to shape

to a uniform distribution (and vice versa) the following applies:

1. Assume is a non-atomic distribution ( is strictly increasing) then

2. Assume

is discrete or a mixture probability distribution then

, where is the probability mass at the point .

A proof for this theorem is provided in Appendix of [7].

Going back to our problem, define as if is strictly increasing and otherwise. For a desired we construct our process by setting:

 Y1=F−1Y1(~FX1(X1)) (2)
 Yk=F−1Yk(~FXk|Xk−1(Xk|xk−1))∀k>1 (3)

Theorem 1 guarantees that is uniformly distributed and applying to it shapes it to the desired continuous distribution, . In other words, this method suggests that for every possible history of the process at a time , the transformation shapes to the same (uniform) distribution. This ensures independence of its history. The method then reshapes it to the desired distribution. It is easy to see that are statistically independent as every is independent of . Moreover, since is strictly increasing and is uniformly distributed we can uniquely recover from according to the construction of Theorem 1. Simple induction steps show that this is correct for every for .

Interestingly, we show (Appendix A) that this scheme is unique for all monotonically increasing transformations . Moreover, any non-increasing transformation that satisfies the requirements above is necessarily a measurable permutation of our scheme. A detailed discussion on these uniqueness properties is provided in Appendix A.

## Iv Innovation Representation for Discrete Variables - the Lossy Case

Let us now assume that both and take values on finite alphabet size of and respectively (for every ). Even in the simplest case, where both are binary and

is a first order non-symmetric Markov chain, it is easy to see that no transformation can meet both of the requirements mentioned above. We therefore relax our problem by replacing the uniquely recoverable requirement (

b) with mutual information maximization of . This way, we make sure that the mutual information between the two processes is maximized at any time given its history. Mutual information maximization is a well-established criterion in many applications; it is easy to show that it corresponds to minimizing the logarithmic loss, which holds many desirable properties [9, 10]. Notice that the case where is uniquely recoverable from given its past, results in achieving its maximum as desired.

Our problem is reformulated as follows: for any realization of , given any possible history the process , find a set of mapping functions to a desired distribution such that the mutual information between the two processes is maximal. For example, in the binary case where is a first order Markov process, and

is i.i.d. Bernoulli distributed,

 Yk∼Ber(βk),PXk(Xk=0)=γk (4)
 PXk|Xk−1(Xk=0|Xk−1=0)=α1
 PXk|Xk−1(Xk=0|Xk−1=1)=α2

we would like to maximize

 I(Xk;Yk|Xk−1)=γk−1I(Xk;Yk|Xk−1=0)+(1−γk−1)I(Xk;Yk|Xk−1=1). (5)

In addition, we would like to find the distribution of such that this mutual information is maximal. This distribution can be viewed as the closest approximation of the process as a memoryless process in terms of maximal mutual information with it. Notice that this problem is a concave minimization over a convex polytope-shaped set [11], and the maximum is guaranteed on to lie on one of the polytope’s vertices. Unfortunately, this problem is hard and generally there is no closed-form solution to it. Several approximations and exhaustive search solutions are available for this kind of problems, such as [12]. However, there are several simple cases in which a closed-form solution exists. One notable example is the binary case.

### Iv-a The Binary Case

Let us first consider the following problem: given two binary random variables and and their marginal distributions and we would like to find the conditional distributions such that the mutual information between and is maximal. Simple derivation shows that the maximal mutual information is:
For :

 Iβ>αmax(X;Y)=hb(β)−(1−α)hb(β−α1−α). (6)

For :

 Iβ<αmax(X;Y)=hb(β)−αhb(βα). (7)

Applying these results to the first order Markov process setup described above, and assuming all parameters are smaller than , we get that the maximal mutual information is simply:

For :

 I(Xk;Yk|Xk−1)=γk−1Iβ<α1max(X;Y)+(1−γk−1)Iβ<α2max(X;Y). (8)

For :

 I(Xk;Yk|Xk−1)=γk−1Iβ>α1max(X;Y)+(1−γk−1)Iβ<α2max(X;Y). (9)

For :

 I(Xk;Yk|Xk−1)=γk−1Iβ>α1max(X;Y)+(1−γk−1)Iβ>α2max(X;Y). (10)

It is easy to verify that is continuous in . Simple derivation shows that for the maximal mutual information is monotonically increasing in and for it is monotonically decreasing in . It can also be verified that all optimum points in the range of are local minima which leads to the conclusion that the maximum must be on the boundary of the range, or . The details of this derivation are located in Appendix B. For example, Figure 1 illustrates the shape of as a function of , for , .

Since we are interested in the that maximizes the mutual information between the two possible options, we are left with a simple decision rule

 γk−1βk=α2≶βk=α1hb(α2)−hb(α1)+α2hb(α1α2)α2hb(α1α2)+(1−α1)hb(α2−α11−α1) (11)

which determines the conditions according to which we choose our , depending on the parameters of the problem .

Further, assuming that the process is at its stationary state yields that is fixed for every and . Applying this result to the decision rule above (11), it is can be verified (Appendix B) that for we have:

 α21−α1+α2

which leads to the conclusion that .

The derivation above is easily generalized to all values of and . This results in a decision rule stating that equals the parameter closest to :

 βopt=argmaxθ∈{α1,α2,1−α1,1−α1}(12−θ). (12)

In other words, in order to best approximate a binary first order Markov process at its stationary state we set the distribution of the binary memoryless process to be similar to the conditional distribution which holds the largest entropy. Generalizing this result to an -order Markov process we have Bernoulli distributions to be mapped to a single one, . The maximization objective is therefore

 I(Xk;Yk|Xk−1)=R−1∑i=0γiI(Xk;Yk|[Xk−1…Xk−R−1]T=i) (13)

where . Notice that is either or , depending on and , as described in (67). Simple calculus shows that as in the case, the mutual information reaches its maximum on one of the inner boundaries of ’s range

 βopt=argmaxβ∈{αj}⎛⎝hb(β)−∑β<αjγjαjhb(βαj)−∑β>αjγj(1−αj)hb(β−αj1−αj)⎞⎠. (14)

Unfortunately, here it is not possible to conclude that equals the parameter closest to , as a result of the nature of our concave minimization problem.

## V Innovation Representation for Discrete Variables - the Lossless Case

The lossy approximation may not be adequate in applications where unique recovery of the original process is required. It is therefore necessary to increase the alphabet size of the output so that every marginal distribution of , given any possible history of the process, is accommodated. This problem can be formulated as follows:

Assume we are given a set of random variables, , such that each random variable is multinomial distributed, taking on values, . Notice that corresponds to , as appears in the previous section, and is the marginal alphabet size of the original process . Using the notation from previous sections, we have that corresponds to . In addition, we denote as the symbol of random variable . We would like to find a distribution where the ’s and alphabet size are unknown. In addition, we are looking for sets of conditional probabilities between every possible realization and , such that can be uniquely recoverable from , for every and . Further, we would like the entropy of to be as small as possible so that our memoryless process is as “cheap" as possible to describe. Since the transformation is invertible, we have that , for every . This means that by minimizing we actually minimize . In other words, our problem may be viewed as follows: given a set of marginal distributions for , we are looking for a joint probability distribution such that the joint entropy is minimal. This problem was recently introduced as Minimum Entropy Coupling [8] and shown to be NP hard.

Without loss of generality we assume that for all , since we can always order them this way. We also order the sets according to the smallest parameter, . Notice we have for all , as an immediate consequence.

For example, for and , it is easy to verify that is a necessary condition for to be uniquely recoverable from . Simple calculus shows that the conditional probabilities which achieve the minimal entropy are and , as appears in Figure 2 for .

### V-a Minimizing B

Let us start by finding the minimal alphabet size of the output process , such that is guaranteed to be uniquely recoverable from it. Looking at the free parameters of our problem, we first notice that defining the distribution of requires exactly parameters. Then, defining conditional probability distributions between each alphabet size and the output process takes parameters. For to be uniquely recoverable from , each value of needs to be assigned to at most a single value of (see Figure 2 for example). This means that for each of the sets, we have constraints ( possible realizations of , each of them has zero conditional probability constraints). Therefore, in order to have more free parameters than constraints we require that:

 (B−1)+R(A−1)(B−1)≥RB(A−1). (15)

 B≥R(A−1)+1. (16)

For example, assuming the takes over a binary alphabet, we get that . There exist several special cases in which it is possible to go under this lower bound, like cases where some parameters are additions or subtraction of other parameters. For example, in the binary case. Our derivation focuses on the most general case. Notice that a similar result appears in Lemma 3 of [8]. However, it is important to emphasize that an earlier conference version of our results was already published in [1] and [13], several years before [8].

### V-B The Optimization Problem

The problem above can be formulated as the following optimization problem:

 minH(Y)s.t.H(Xk|Y=yb,[Xk−1,…,Xk−R−1]T=i)=0∀i=1,…,R,b=1,…,B (17)

Unfortunately this is a concave minimization problem over a non-convex set. However, we show that this problem can also be formulated as a mixed integer problem.

### V-C Mixed Integer Problem Formulation

In order to formulate our problem as a mixed integer problem we first notice that the free parameters are all conditional probabilities, as they fully determine the outcome distribution. We use the notation to describe the conditional probability . The equality constraints we impose on our minimization objective are as follows:

• All conditional probability sets must result in the same output distribution:

 P(Y=yb)= A∑a=1P(Y=yb|Xi=xa)P(Xi=xa)=A∑a=1piabαai

for all and . Since the parameters are given, we have that

 A∑a=1piabαai−A∑a=1pjabαaj=0

for all and

• is a valid conditional distribution function:

 A∑b=1piab=1

for all and

In addition, the inequality constraints are:

• For convenience, we ask that for all :

 A∑a=1piabαai−A∑a=1pia(b+1)αai≤0for all1≤b≤B.
• Zero conditional entropy constraint: as stated above, a necessary and sufficient condition for zero conditional entropy is that for every value , in every set , there is only a single value such that . Therefore, for each of the sets, and for each of the values can take on, we define boolean variables, , that satisfy:

 piab−Tiab≤0,A∑a=1Tiab=1,Tiab∈{0,1}.

Notice that the summation ensures only a single equals one, for which . For each of the other the inequality constraint verifies that . This set of constraints can also be written using Boolean variables:

 piab−Tiab≤0∀a=1,…,A
 piAb−(1−A−1∑a=1Tiab)≤0⇔piAb+(A−1∑a=1Tiab)≤1
 Tiab∈{0,1}∀a=1,…,A.

Therefore, our minimization problem can be written as follows: define a vector of parameters

. Define and as the equality constraints in a matrix and vector forms respectively. Define and as the inequality constraints in a matrix and vector forms respectively. This leads to

 minf(z) (18)
 s.t.Aeqz=beq
 Aineqz≤bineq
 0≤z≤1
 Tiab∈{0,1}∀i,a,b

where f(z) is the entropy of the random variable in terms of and boolean indicators define which elements in correspond to . Mixed integer problems are studied broadly in the computer science community. There are well established methodologies for convex minimization in a mixed integer problem and specifically in the linear case [14, 15]

. The study of non-convex optimization in mixed integer problems is also growing quite rapidly, though there is less software available yet. The most broadly used mixed integer optimization solver is CPLEX, developed by IBM. CPLEX provides a mixed integer linear programming (MILP) solution, based on a branch and bound oriented algorithm. We use the MILP in lower bounding our objective function (

18) as described in the following sub-sections.

### V-D Greedy Solution

The entropy minimization problem can also be viewed as an attempt to minimize the entropy of a random variable on a set of discrete points representing valid solutions to the problem we defined. Let us remember that for all as stated in the previous sections.

###### Proposition 1

is not greater than

• Assume . Then, for this there must be at least two values and for which and . This contradicts the zero conditional entropy constraint.

For example, Figure 2 demonstrates the optimal lossless solution for , and . We see that (for ). Moreover, it is easy to verify that is not a feasible solution, as Proposition 1 suggests.

Therefore, a greedy algorithm would like to “squeeze” all the distribution to the values which are less constrained from above, so that it is as large as possible.

Our suggested algorithm works as follows: first set according to the bound presented in Section V-A. Then, in every step of the algorithm we set . This leaves us with a problem (of setting the remaining values of ). Given this mapping, we rearrange the remaining probabilities and repeat the previous step. We terminate once we set the smallest value, . This process ensures that in each step we increase the least constrained value of as much as possible.

We notice that the same greedy algorithm was recently introduced and studied by Kocaoglu et al. [8] in the context of causal inference. However, as stated in Section V, an earlier conference version of our results, including this algorithm, was already published in [1] (and in a more detail in [13]) several years before [8].

### V-E Lowest Entropy Bound

As discussed in the previous sections, we are dealing with an entropy minimization problem over a discrete set of valid solutions. Minimizing the entropy over this set of points can be viewed as a mixed integer non-convex minimization, which is a hard problem. In this section we introduce a lower bound to the optimal solution by relaxing the domain of solutions to a continuous set. In other words, instead of searching for ’s over a discrete set (as a result of the requirement for invertible mappings), we now search for ’s over a continuous set, which naturally includes additional unfeasible solutions. We would like to consider the smallest continuous set that includes all feasible solutions. For this purpose, we find an upper and lower bound for each of the parameters and solve the problem when may take any value in this range. This way, we relax the search over a set of valid solutions to a search in a continuous space, bounded by a polytope, and attain a lower bound to the desired entropy.

We find the boundaries for each by changing our minimization objective to a simpler linear one (minimize/maximize , for every at a time). This problem is a simple MILP as shown above. By looking at all these boundaries together we may minimize the entropy in this continuous space and find a lower bound for the minimal entropy one can expect. Theorem 2 states the constructive conditions for an optimal solution in this case.

We notice that this bound is not tight, and we even do not know how far it is from a valid minimum, as it is not necessarily a valid solution. However, it gives us a benchmark to compare our greedy algorithm against and decide if we are satisfied with it, or require more powerful tools. We also note that as increases, the number of valid solutions grows exponentially. This leads to a more packed set of solutions which tightens the suggested lower bound as we converge to a polytope over a continuous set.

###### Theorem 2

Let be a random variable over multinomial distribution, . Assume that parameters satisfy:

1. for all .

2. and (to ensure the existence of a feasible solution).

3. and for all .

Then, the minimal entropy is achieved by

1. for all .

2. for all .

3. .

for some .

A proof for this theorem is provided in Appendix C.

## Vi Applications

As mentioned above, the innovation representation problem has many applications in a variety of fields. In this section we focus on two important applications. We begin with causal inference, as recently introduced by kocaoglu et al. [8].

### Vi-a Causal Inference

Consider two random variables and . We say that causes (denote as ) if a change in the value of is a consequence of a change in the value of (and vice versa). A general solution to the causal inference problem is to conduct experiments, also called interventions (for example, [16]

). For many problems, it can be very difficult to create interventions since they require additional experiments after the original data-set was collected. Nevertheless, researchers would still like to discover causal relations between variables using only observational data, using so-called data-driven causality. However, a fundamental problem in this approach is the symmetry of the underlaying distribution; the joint distribution

may be factorized as either or . This means that we cannot infer the causal direction directly from the joint distribution, and additional assumptions must be made about the mechanisms that generate the data [17].

The most popular assumption for two-variable data-driven causality is the additive noise model (ANM) [18]. In ANM, we assume a model where is a random variable that is statistically independent of . Although restrictive, this assumption leads to strong theoretical guarantees in terms of identifiability, and provides the state of the art accuracy in real datasets. Shimizu et al. [18] showed that if is linear and the noise is non-Gaussian, then the causal direction is identifiable. Hoyer et al. [19] showed that when is non-linear, irrespective of the noise, identifiability holds in a non-adverserial setting of system parameters. Peters et al. [17] extended ANM to discrete variables.

Recently, Kocaoglu et al. [8] extended the ANM framework and introduced the Entropic Causal Inference principle. Specifically, they argue that if the true causal direction is , then the random variable satisfies where is an arbitrary function and is a “simple" random variable that is statistically independent of . The “simplicity" of is characterized by a low Rényi entropy. This means that for any model in the wrong direction, , the random variable has a greater Rényi entropy than . Kocaoglu et al. focused on two special case of Rényi entropy: , which corresponds to the cardinality of , and , which is the classical Shannon entropy. They proved an identifiability result for , showing that if the probability values are not adversarially chosen, for most functions, the true causal direction is identifiable under their model. Further, they showed that by using Shannon entropy (), they obtain causality tests that work with high probability in synthetic datasets, and slightly outperform state of the art alternative tests in real-world datasets.

The causality test was driven as follows: consider where is independent of . Let be the mapping from to when , i.e., . Then where the last equality follows from the independence of and . Thus, the conditional distributions are treated as distributions that emerge by applying some function to some unobserved variable . Then the problem of identifying with minimum entropy given the joint distribution becomes equivalent to the following: given distributions of the variables , find the distribution with minimum entropy (distribution of ) such that there exists functions which map this distribution to the observed distributions of . It can be shown that . Denote as a random variable . Then, the best lower bound on can be obtained by minimizing . Further, it is shown that it is always possible to construct a random variable that achieves this minimum. Thus the problem of finding the with minimum entropy given the joint distribution is equivalent to the problem of finding the minimum entropy joint distribution of the random variables , given the marginal distributions .

Notice that this problem is equivalent to our lossless representation problem in Section V: Given distributions (which correspond to the ’s) , we seek invertible mappings, from each of the distributions to , such that the entropy of is minimal. Since the mappings are invertible, we have the and the problem is equivalent to minimizing subject to the marginal distributions of the ’s.

Kocaoglu et al. [8] showed that this problem is NP-hard. Further, they conjectured that the identifiability result they proved for entropy also holds in this case and proposed a greedy algorithm. Interestingly, their suggested algorithm is exactly the same as ours (Section V-D). However, it is important to emphasize that our algorithm was already introduced in [1] (and later in [13]), several years before the work of Kocaoglu et al. [8].

In a more recent work, Kocaoglu et al. [20] continue the study of the proposed greedy solution. They showed that it converges to a local minimum and derived several algorithmic properties. In addition, they derived a variant of greedy algorithm which is easier to analyze.

The Entropic Causal Inference principle has gained a notable interest, mostly due its intuitive interpretation and promising empirical results. On the other hand, the proposed greedy algorithm does not guarantee to converge to the optimal solution. Moreover, it is not clear how far it is from the global minimum (or some infimum). Our suggested solutions address these concerns, as described in detail in Sections V-C and V-E.

### Vi-B The IKEA Problem

An additional application of our suggested framework comes from industrial engineering, as it deals with optimal design of mass production storage units.

Consider the following problem: a major home appliances vendor is interested in mass manufacture of storage units. These units hold a single and predetermined design plan according to the market demand. Assume that the customers market is defines by major storing types (customers) and each of these customers is characterized by a different distribution of items they wish to store. The vendor is interested in designing a single storage unit that satisfies all of his customers. In addition, the vendor would like the storage unit to be as “compact" and “cheap" as possible. We refer to this problem as the IKEA problem. Consider the customer distributions such that each customer is over a multinomial distribution with values, . We assume that all customer distributions have the same cardinality . It is easy to generalize our solution to different cardinalities. We are interested in mapping the customer distributions into a single storage distribution, , which represents the storage unit to be manufactured.

First, we would like every customer to be able to store its items exclusively; different items shall not be stored together. As in the previous sections, we use the notation to define the symbol of the random variable . For our storing units problem, we would like to find a multinomial distribution over values ( is unknown), , and sets of conditional probabilities between every and , such that can be uniquely recoverable (reversible) from for every and .

In addition, we would like the storing unit to be “compact" and “cheap". For most functionalities, a compact storing unit is rectangular shaped (closets, cabins, dressers etc.) and it is made of multiple compartments (shelves) in numerous columns. We define the number of columns in our storage unit as and the number of shelves as . Therefore, we would like to design a rectangular shaped storing unit such that given a number of columns , every customer is able to store its items exclusively and the number of shelves is minimal. The corresponding technical requirements, in terms of our problem, are discussed in the following sections.

This problem is again NP hard, for the same reasons as in the previous sections, but it can be reformulated to a set of Mixed Integer Quadratic Programming (MIQP) problems, which is an established research area with extensive software available.

### Vi-C Mixed Integer Quadratic Programming Formulation

Let us first assume we are given both the number of columns in our desired storing unit and the number of shelves . Since we require the storing unit to be rectangular, we need to find such distribution that can be partitioned to columns with no residue. Therefore, we define equivalent partitions in the size of for which each is exclusively assigned. We are interested in such distribution that the assignment can be done with no residue at all. To guarantee an exclusive assignment for a partition we introduce integer variables , indicating which of the is assigned to it. Therefore, we have

 B∑b=1Tlbβb=δl,L∑l=1Tlb=1,Tlb∈{0,1},for alll=1,…,Landb=1,…,B (19)

and the optimization objective is simply

 L∑l=1(δl−1L)2→min. (20)

Our constraints can easily be added to the mixed integer formulation presented in the previous sections and the new optimization problem is:

 minzTccTz−2LcTz (21)
 s.t.Aeqz=beq
 Aineqz≤bineq
 0≤z≤1
 Tlb∈{0,1}∀l,b

where is a vector of all parameters in our problem and .

### Vi-D Minimizing the Number of Shelves

The problem of minimizing the residue of the assignment, given the number of columns and the number of shelves, may be formulated as a MIQP. In this section we focus on finding the minimal number of shelves that guarantees zero residue. Notice that for large enough the residue goes to zero, as tends to take values on a continuous set. We also notice that the residue is a monotonically non-increasing function of , since by allowing a greater number of shelves we can always achieve the same residue by repeating the previous partitioning up to a meaningless split of one of the compartments. These two qualities allow very efficient search methods (gradient, binary etc.) to find the minimal for which the residue is “- close" to zero.

Here, we suggest the following simple binary search based algorithm for minimizing the number of shelves for a rectangular shaped storing unit:

1. Choose a large enough initial value such that applying it in the MIQP presented above results in zero residue.

2. Define a step size as .

3. Apply the MIQP with .

4. If the residue is zero repeat previous step with and . Otherwise repeat the previous step with and . Terminate if .

## Vii Lossless Innovation Representation and its Relation to the Optimal Transportation Problem

The essence of the lossless innovation representation problem is finding a single marginal distribution to be matched to multiple ones under varying costs functions. This problem can be viewed as a design generalization of a multi-marginal setup for the well-studied optimal transportation problem [21]. In other words, we suggest that the optimal transportation problem can be generalized to a design problem in which we are given not a single but multiple source probability measures. Moreover, we are interested not only in finding mappings that minimizes some cost function, but also in finding the single target probability measure that minimizes that cost.

### Vii-a The Optimal Transportation Problem

The optimal transportation problem was presented by [21] and has generated an important branch of mathematics in the last decades. The optimal transportation problem has many applications in multiple fields such as Economics, Physics, Engineering and others. The problem originally studied by Monge was the following: assume we are given a pile of sand (in ) and a hole that we have to completely fill up with that sand. Clearly the pile and the hole must have the same volume and different ways of moving the sand will give different costs of the operation. Monge wanted to minimize the cost of this operation. Formally, the optimal transportation problem is defined as follows. Let and be two seperable metric spaces such that any probability measure on (or ) is a Radon measure. Let be a Borel-measurable function. Given probability measure on and on , Monge’s optimal transportation problem is to find a mapping that realizes the infimum

 inf{∫Xc(x,T(x))dμ(x)∣∣∣T∗(μ)=ν}

where denotes the push forward of by . A map that attains the infimum is called the optimal transport map.

Notice that this formulation of the optimal transportation problem can be ill-posed in some setups where there is no “one-to-one" transportation scheme. For example, consider the case where the original pile is a Dirac measure but the hole is not shaped in this manner. A major advance in this problem is due to Kantorovich [22] who proposed the notation of a “weak solution" to the optimal transportation problem; he suggested looking for plans instead of transport maps [23]. The main difference between Kantorovich work and Monge formulation is that while the original Monge problem is restricted to transportation of the complete mass at each point on the original pile, the relaxed Kantorovich version allows splitting of masses. However, it is clear that no such result can be expected without additional assumptions on the measures and cost. The first existence and uniqueness result is due to Brenier [24]. In his work, Brenier considered the case where both the pile and the hole satisfy , and the cost function is . Then, he showed that if the probability measure of is absolutely continuous with respect to the Lebesgue measure there exists a unique optimal transport map. Following [24], many researchers started working on this problem, showing existence of optimal maps with more general costs. More recently, Pass published a series of papers discussing a multi-marginal generalization of the optimal transportation problem [25, 26, 27]. In his work, Pass considered multiple marginal distributions to be matched to a single destination with a given distribution. In his papers, Pass discussed the existence and uniqueness of solutions for both a Monge-like and Kantorovich-like multi-marginal problems, under different measures and cost functions, and the connection between the two formulations.

In our work we generalize the multi-marginal optimal transportation from a design perspective; we look at the multi-marginal optimal transportation problem not only as a minimization problem over a set of mappings but also ask ourselves what is the optimal target measure such that the cost function is minimal. We show that this problem has very broad use in many fields, especially when taking an equivalent form of multiple source measures matched to a single target. More specifically, we focus our interest on a set of mappings that allow unique recovery between the measures. That is, given the source measures and a target measure, one can uniquely recover any realization of the sources from a given realization of the target. This type of mappings hold a special interest in many applications, as it is shown throughout the previous sections.

## Viii Discussion

In this work we introduce a method to represent any stochastic process as an innovation process, under different objectives and constraints. We show that there exists a simple closed-form solution if we allow the outcome process to take values over a continuous set. However, restricting the alphabet size may cause lossy recovery of the original process. Two solutions are presented in the face of two possible objectives in the discrete case. First, assuming the alphabet size is too small to allow lossless recovery, we aim to maximize the mutual information with the original process. Alternatively, we may seek a minimal alphabet size so that a unique recovery is guaranteed, while minimizing the entropy of the resulting process. In both cases the problem is shown to be hard and several approaches are discussed. In addition, a simple closed-form solution is provided for the binary first order Markov process.

It is important to mention that our work focuses on sequential analysis of stochastic processes. This means that at every time stamp , we construct a corresponding innovation process from . This framework may be generalized to a non-sequential setup, where is analyzed as a batch. In fact, this problem is equivalent to the well-studied Independent Component Analysis, both over continuous [4] and discrete [28, 29] variables. The ICA problem has many applications in learning and inference. Recently, it was further applied to source coding and data compression [30, 31, 32, 33, 34].

The innovation representation problem has many applications, as it simplifies complex systems and allows simpler analytical and computational solutions. In this work we show that the lossless innovation representation may be applied to infer causality, as was previously shown in[8] . In addition, we introduce a practical application from the industrial world, denoted as the IKEA problem.

The problem of finding a single marginal distribution function to be fitted to multiple ones under varying costs functions can be viewed as a multi-marginal generalization of the well-studied optimal transportation problem. In other words, we suggest that the optimal transportation problem can be generalized to a design problem in which we are given not a single but multiple source distribution functions. In this case, we are interested not only in finding conditional distributions to minimize a cost function, but also in finding the optimal target distribution. We argue that this problem has multiple applications in the fields of Economics, Engineering and others.

## Appendix A

For the simplicity of the presentation we reformulate our problem as follows: assume a random variable is to be constructed from a random variable given ’s past, denoted as . Therefore, we would like to construct a memoryless random variable , with a given , such that

1. is statistically independent in .

2. can be uniquely recovered from given .

3. .

Therefore, our goal is to find such and discuss its uniqueness.

Let us first consider a special case where is uniformly distributed, for all . For to be statistically independent of it must satisfy

 FY|Xp(y|Xp=xp)=FY(y). (22)

Expanding the left hand side of (22) we have that for every ,

 FY|Xp(y|Xp=xp)=P(Y≤y|Xp=xp)=P(g(X,Xp)≤y|Xp=xp).

The second requirement suggests that is uniquely recovered from and , which implies . Assume is monotonically increasing with respect to . Then, we have that

 FY|Xp(y|Xp=xp)= P(g(X,Xp)≤y|Xp=xp)=P(X≤g−1Xp(y)|Xp=xp)=FX|Xp(g−1Xp(y)|Xp=xp) (23)

where the second equality follows from the monotonically increasing behavior of with respect to . Therefore, we are looking for a monotonically increasing transformation such that

 FX|Xp(g−1Xp(y)|Xp=xp)=FY(y)=y.

The following lemmas discuss the uniqueness of monotonically increasing mappings when is a non-atomic (Lemma 3) or atomic (Lemma 4) measures.

###### Lemma 3

Assume is a non-atomic random variable with a strictly monotonically increasing commutative distribution function (that is, takes values on a continuous set). Suppose there exists a transformation on its domain, such that

 FX(x)|x=h(y)=FY(y).

Then,

1. is unique

2. is monotonically non decreasing (increasing, if is strictly increasing).

• Let us begin by proving (1). The transformation satisfies

 FX(x)|x=h(y)=FX(h(y))=P(X≤h(y))=FY(y).

Suppose there is another transformation that satisfies the conditions stated above. Then,

 FX(x)|x=g(y)=FX(g(y))=P(X≤g(y))=FY(y).

Therefore,

 P(X≤g(y))=P(X≤h(y))∀y.

Suppose . This means that there exists at least a single where and . It follows that

 P(X≤h(~y)+δ)=P(X≤h(~y))

or in other words

 FX(h(~y))=FX(h(~y)+δ)

which contradicts the monotonically increasing behavior of where the transformation is defined.

As for (2), we have that for all . Therefore,

 F