 # Analysis for the Slow Convergence in Arimoto Algorithm

In this paper, we investigate the convergence speed of the Arimoto algorithm. By analyzing the Taylor expansion of the defining function of the Arimoto algorithm, we will clarify the conditions for the exponential or 1/N order convergence and calculate the convergence speed. We show that the convergence speed of the 1/N order is evaluated by the derivatives of the Kullback-Leibler divergence with respect to the input probabilities. The analysis for the convergence of the 1/N order is new in this paper. Based on the analysis, we will compare the convergence speed of the Arimoto algorithm with the theoretical values obtained in our theorems for several channel matrices.


## 1 Introduction

Arimoto proposed a sequential algorithm for calculating the channel capacity of a discrete memoryless channel. Based on the Bayes probability, the algorithm is given by an alternating minimization between the input probabilities and the reverse channel matrices. For an arbitrary channel matrix, the convergence of the Arimoto algorithm is proved and the convergence speed is evaluated. In the worst case, the convergence speed is of order $1/N$, and if the input distribution that achieves the channel capacity is in the interior of the set of input distributions, the convergence is exponential.

In this paper, we first consider the exponential convergence and evaluate the convergence speed. We show that there exist cases of exponential convergence even if $\lambda^*$ is on the boundary of $\Delta(X)$. Moreover, we also consider convergence of order $1/N$, which is not dealt with in the previous studies. In particular, for a specific input alphabet size we analyze the convergence of order $1/N$ in detail, and the convergence speed is evaluated by the derivatives of the Kullback-Leibler divergence with respect to the input probabilities.

As a basic idea for evaluating the convergence speed, we regard the function $F$ which defines the Arimoto algorithm as a differentiable mapping from $\Delta(X)$ to $\Delta(X)$, and notice that the capacity achieving input distribution $\lambda^*$ is a fixed point of $F$. Then, the convergence speed is evaluated by analyzing the Taylor expansion of $F$ about the fixed point $\lambda^*$.

## 2 Related works

There have been many related works on the Arimoto algorithm: for example, extensions to different types of channels, acceleration of the Arimoto algorithm, and characterizations of the Arimoto algorithm by divergence geometry. Focusing on the analysis of the convergence speed of the Arimoto algorithm, previous studies calculate the eigenvalues of the Jacobian matrix and investigate the convergence speed in the case that $\lambda^*$ is in the interior of $\Delta(X)$.

In this paper, we consider the Taylor expansion of the defining function of the Arimoto algorithm. We calculate not only the Jacobian matrix of the first order term of the Taylor expansion, but also the Hessian matrix of the second order term, and examine the convergence speed of the exponential or $1/N$ order based on the Jacobian and Hessian matrices. Because our approach to evaluating the convergence speed is very fundamental, we hope that our results can be applied to the existing works.

## 3 Channel matrix and channel capacity

Consider a discrete memoryless channel with input source $X$ and output source $Y$. Let $\mathcal{X} = \{1, \cdots, m\}$ be the input alphabet and $\mathcal{Y} = \{1, \cdots, n\}$ be the output alphabet.

The conditional probability that the output symbol $j$ is received when the input symbol $i$ was transmitted is denoted by $P_{ij}$, and the row vector $P_i$ is defined by $P_i = (P_{i1}, \cdots, P_{in})$, $i = 1, \cdots, m$. The channel matrix $\Phi$ is defined by

$$\Phi = \begin{pmatrix} P_1 \\ \vdots \\ P_m \end{pmatrix} = \begin{pmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & & \vdots \\ P_{m1} & \cdots & P_{mn} \end{pmatrix}. \tag{1}$$

We assume that for any $j$ there exists at least one $i$ with $P_{ij} > 0$. This means that there are no useless output symbols.

The set of input probability distributions on the input alphabet $\mathcal{X}$ is denoted by $\Delta(X)$, and its interior is denoted by $\Delta(X)^\circ$. Similarly, the set of output probability distributions on the output alphabet $\mathcal{Y}$ is denoted by $\Delta(Y)$.

Let $Q = Q(\lambda) = \lambda\Phi$ be the output distribution for the input distribution $\lambda = (\lambda_1, \cdots, \lambda_m)$, with components $Q_j = \sum_{i=1}^m \lambda_i P_{ij}$, $j = 1, \cdots, n$. The mutual information is defined by $I(\lambda, \Phi) = \sum_{i=1}^m \lambda_i D(P_i \| \lambda\Phi)$. The channel capacity is defined by

$$C = \max_{\lambda \in \Delta(X)} I(\lambda, \Phi). \tag{2}$$

The Kullback-Leibler divergence for two output distributions $Q, Q' \in \Delta(Y)$ is defined by

$$D(Q \| Q') = \sum_{j=1}^n Q_j \log \frac{Q_j}{Q'_j}. \tag{3}$$

The Kullback-Leibler divergence satisfies $D(Q \| Q') \ge 0$, with $D(Q \| Q') = 0$ if and only if $Q = Q'$.
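As a concrete illustration (a sketch; the function name is ours), the definition (3) and these two properties can be checked numerically:

```python
import numpy as np

def kl_divergence(q, q_prime):
    """D(Q || Q') = sum_j Q_j log(Q_j / Q'_j), with the convention 0 log 0 = 0."""
    q = np.asarray(q, dtype=float)
    q_prime = np.asarray(q_prime, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / q_prime[mask])))

# D(Q || Q') = 0 iff Q = Q', and D(Q || Q') > 0 otherwise
d_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
d_diff = kl_divergence([0.9, 0.1], [0.5, 0.5])
```

Here the divergence is taken in nats, matching the natural logarithm used throughout.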

An important proposition for investigating the convergence speed of the Arimoto algorithm is the Kuhn-Tucker condition on the input distribution $\lambda^*$ that achieves the maximum of (2).

Theorem (Kuhn-Tucker condition) In the maximization problem (2), a necessary and sufficient condition for the input distribution $\lambda^* = (\lambda^*_1, \cdots, \lambda^*_m)$ to achieve the maximum is that there exists a constant $\tilde{C}$ with

$$D(P_i \| \lambda^*\Phi) \begin{cases} = \tilde{C}, & \text{for } i \text{ with } \lambda^*_i > 0, \\ \le \tilde{C}, & \text{for } i \text{ with } \lambda^*_i = 0. \end{cases} \tag{4}$$

In (4), $\tilde{C}$ is equal to the channel capacity $C$.

Since this Kuhn-Tucker condition is a necessary and sufficient condition, all the information about the capacity achieving input distribution $\lambda^*$ can be derived from it.

## 4 Arimoto algorithm for calculating channel capacity

### 4.1 Arimoto algorithm 

A sequence of input distributions

$$\{\lambda^N = (\lambda^N_1, \cdots, \lambda^N_m)\}_{N=0,1,\cdots} \subset \Delta(X) \tag{5}$$

is defined by the Arimoto algorithm as follows. First, let $\lambda^0$ be an initial distribution taken in the interior of $\Delta(X)$, i.e., $\lambda^0_i > 0$, $i = 1, \cdots, m$. Then, the Arimoto algorithm is given by the following recurrence formula:

$$\lambda^{N+1}_i = \frac{\lambda^N_i \exp D(P_i \| \lambda^N\Phi)}{\sum_{k=1}^m \lambda^N_k \exp D(P_k \| \lambda^N\Phi)}, \quad i = 1, \cdots, m, \; N = 0, 1, \cdots. \tag{6}$$
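The recurrence (6) can be sketched in a few lines (function names and the test channel are our own illustrative choices; we assume a strictly positive channel matrix so all logarithms are well defined):

```python
import numpy as np

def kl_rows(Phi, q):
    """D(P_i || q) for every row P_i of Phi (assumes Phi > 0 entrywise)."""
    return np.einsum('ij,ij->i', Phi, np.log(Phi / q))

def arimoto(Phi, lam0, n_iter=1000):
    """Iterate (6): lam^{N+1}_i is proportional to lam^N_i exp D(P_i || lam^N Phi)."""
    lam = np.asarray(lam0, dtype=float)
    for _ in range(n_iter):
        w = lam * np.exp(kl_rows(Phi, lam @ Phi))
        lam = w / w.sum()
    # I(lam, Phi) = sum_i lam_i D(P_i || lam Phi) approximates the capacity C
    return lam, float(lam @ kl_rows(Phi, lam @ Phi))

# Binary symmetric channel with crossover probability 0.1:
# C = log 2 - h(0.1) in nats, achieved by the uniform input distribution.
Phi = np.array([[0.9, 0.1], [0.1, 0.9]])
lam, C = arimoto(Phi, [0.3, 0.7])
```

Starting from an interior but non-uniform $\lambda^0$, the iterates converge to the uniform capacity achieving distribution of this symmetric channel.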

On the convergence of this Arimoto algorithm, the following results are obtained in Arimoto's original paper. By defining

$$C(N+1, N) \equiv -\sum_{i=1}^m \lambda^{N+1}_i \log \lambda^{N+1}_i + \sum_{i=1}^m \sum_{j=1}^n \lambda^{N+1}_i P_{ij} \log \frac{\lambda^N_i P_{ij}}{\sum_{k=1}^m \lambda^N_k P_{kj}}, \tag{7}$$

the following theorems hold.

Theorem A1: If the initial input distribution $\lambda^0$ is in $\Delta(X)^\circ$, then

$$\lim_{N\to\infty} C(N+1, N) = C. \tag{8}$$

Theorem A2: If $\lambda^0 \in \Delta(X)^\circ$, then

$$0 \le C - C(N+1, N) \le \frac{\log m - h(\lambda^0)}{N}, \tag{9}$$

where $h(\lambda^0)$ is the entropy of $\lambda^0$.

Theorem A3: If the capacity achieving input distribution $\lambda^*$ is in $\Delta(X)^\circ$, then

$$0 \le C - C(N+1, N) \le K\theta^N, \tag{10}$$

where $0 < \theta < 1$ and $K > 0$ is a constant.

In Arimoto's paper, Taylor expansions of related quantities are considered, but not the Taylor expansion of the mapping $F$ itself, which is considered in this paper. Further, in the above Theorem A3, only the case $\lambda^* \in \Delta(X)^\circ$ is considered, where the convergence is exponential.

In Yu's work, the mapping $F$ and the Taylor expansion of $F$ about $\lambda = \lambda^*$ are considered, and the eigenvalues of the Jacobian matrix $J(\lambda^*)$ are calculated; however, the Hessian matrix is not considered. Further, only the case $\lambda^* \in \Delta(X)^\circ$ is considered, as in Arimoto's paper.

### 4.2 Mapping from Δ(X) to Δ(X)

Let $F = (F_1, \cdots, F_m)$ be the defining function of the Arimoto algorithm (6), i.e.,

$$F_i(\lambda) = \frac{\lambda_i \exp D(P_i \| \lambda\Phi)}{\sum_{k=1}^m \lambda_k \exp D(P_k \| \lambda\Phi)}, \quad i = 1, \cdots, m. \tag{11}$$

Then we can consider that $F$ is a differentiable mapping from $\Delta(X)$ to $\Delta(X)$, and (6) is represented by

$$\lambda^{N+1} = F(\lambda^N). \tag{12}$$

In this paper, for the analysis of the convergence speed, we assume

$$\operatorname{rank} \Phi = m. \tag{13}$$
###### Lemma 1

The capacity achieving input distribution $\lambda^*$ is unique.

Proof: By Csiszár, p.137, eq.(37), for arbitrary $\lambda \in \Delta(X)$ and $Q \in \Delta(Y)$,

$$\sum_{i=1}^m \lambda_i D(P_i \| Q) = I(\lambda, \Phi) + D(\lambda\Phi \| Q). \tag{14}$$

By the assumption (13), we see that there exists $Q^0 \in \Delta(Y)$ with

$$D(P_1 \| Q^0) = \cdots = D(P_m \| Q^0) \equiv C^0. \tag{15}$$

Substituting $Q = Q^0$ into (14), we have $C^0 = I(\lambda, \Phi) + D(\lambda\Phi \| Q^0)$. Because $C^0$ is a constant,

$$\max_{\lambda \in \Delta(X)} I(\lambda, \Phi) \iff \min_{\lambda \in \Delta(X)} D(\lambda\Phi \| Q^0). \tag{16}$$

Define $S \equiv \{\lambda\Phi \mid \lambda \in \Delta(X)\}$, then $S$ is a closed convex set, thus by Cover and Thomas, p.297, Theorem 12.6.1, the $Q^* \in S$ that achieves $\min_{Q \in S} D(Q \| Q^0)$ exists and is unique. By the assumption (13), the mapping $\lambda \mapsto \lambda\Phi$ is one to one, therefore the $\lambda^*$ with $\lambda^*\Phi = Q^*$ is unique.

###### Remark 1

Due to the equivalence (16), the Arimoto algorithm can be obtained by Csiszár, Chapter 4, “Minimizing information distance from a single measure”, Theorem 5.

###### Lemma 2

The capacity achieving input distribution $\lambda^*$ is a fixed point of the mapping $F$ in $\Delta(X)$. That is, $\lambda^* = F(\lambda^*)$.

Proof: In the Kuhn-Tucker condition (4), let us define $m_1$ as the number of indices $i$ with $\lambda^*_i > 0$, i.e.,

$$\lambda^*_i \begin{cases} > 0, & i = 1, \cdots, m_1, \\ = 0, & i = m_1+1, \cdots, m, \end{cases} \tag{17}$$

then

$$D(P_i \| \lambda^*\Phi) \begin{cases} = C, & i = 1, \cdots, m_1, \\ \le C, & i = m_1+1, \cdots, m. \end{cases} \tag{18}$$

We have

$$\sum_{k=1}^m \lambda^*_k \exp D(P_k \| \lambda^*\Phi) = \sum_{k=1}^{m_1} \lambda^*_k e^C = e^C, \tag{19}$$

hence by (11), (17), (19),

$$F_i(\lambda^*) = \begin{cases} e^{-C}\lambda^*_i e^C, & i = 1, \cdots, m_1, \\ 0, & i = m_1+1, \cdots, m \end{cases} \tag{20}$$

$$= \lambda^*_i, \quad i = 1, \cdots, m, \tag{21}$$

which shows $\lambda^* = F(\lambda^*)$.
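Lemma 2 can be sanity checked numerically (a sketch with an illustrative symmetric channel, for which $\lambda^*$ is uniform by symmetry and hence known in closed form):

```python
import numpy as np

def F(lam, Phi):
    """Defining function (11) of the Arimoto algorithm (assumes Phi > 0 entrywise)."""
    D = np.einsum('ij,ij->i', Phi, np.log(Phi / (lam @ Phi)))  # D(P_i || lam Phi)
    w = lam * np.exp(D)
    return w / w.sum()

# For a symmetric channel the capacity achieving distribution is uniform,
# and Lemma 2 says it must satisfy F(lam*) = lam*.
Phi = np.array([[0.8, 0.1, 0.1],
                [0.1, 0.8, 0.1],
                [0.1, 0.1, 0.8]])
lam_star = np.full(3, 1.0 / 3.0)
assert np.allclose(F(lam_star, Phi), lam_star)
```

Since every row of this $\Phi$ is a permutation of the others, all $D(P_i \| \lambda^*\Phi)$ coincide and the normalization in (11) reproduces $\lambda^*$ exactly, as in (19)-(21).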

The sequence $\{\lambda^N\}$ of the Arimoto algorithm converges to the fixed point $\lambda^*$, i.e.,

$$\lambda^N \to \lambda^*, \quad N \to \infty. \tag{22}$$

We will investigate the convergence speed by using the Taylor expansion of $F(\lambda)$ about $\lambda = \lambda^*$.

### 4.3 Type of index

Now, we classify the indices $i = 1, \cdots, m$ in the Kuhn-Tucker condition (4) in more detail into the following 3 types:

$$D(P_i \| \lambda^*\Phi) \begin{cases} = C, & \text{for } i \text{ with } \lambda^*_i > 0 \text{ (type I)}, \\ = C, & \text{for } i \text{ with } \lambda^*_i = 0 \text{ (type II)}, \\ < C, & \text{for } i \text{ with } \lambda^*_i = 0 \text{ (type III)}. \end{cases} \tag{23}$$

Let us define the sets of indices as follows:

$$\text{all the indices: } I \equiv \{1, \cdots, m\}, \tag{24}$$
$$\text{type I indices: } I_{\mathrm{I}} \equiv \{1, \cdots, m_1\}, \tag{25}$$
$$\text{type II indices: } I_{\mathrm{II}} \equiv \{m_1+1, \cdots, m_1+m_2\}, \tag{26}$$
$$\text{type III indices: } I_{\mathrm{III}} \equiv \{m_1+m_2+1, \cdots, m\}. \tag{27}$$

We write $|I| = m$, $|I_{\mathrm{I}}| = m_1$, $|I_{\mathrm{II}}| = m_2$, $|I_{\mathrm{III}}| = m_3$. We have $m = m_1 + m_2 + m_3$ and $I = I_{\mathrm{I}} \cup I_{\mathrm{II}} \cup I_{\mathrm{III}}$.

$I_{\mathrm{I}}$ is not empty ($m_1 \ge 1$) for any channel matrix, but $I_{\mathrm{II}}$ and $I_{\mathrm{III}}$ may be empty for some channel matrices.

### 4.4 Examples of convergence speed

Let us consider the difference in convergence speed of the Arimoto algorithm depending on the channel matrix.

For many channel matrices $\Phi$, the convergence is exponential, but for some special $\Phi$ the convergence is very slow. Let us consider the following examples, taking types I, II, III into account, where the input alphabet size is $m = 3$ and the output alphabet size is $n = 3$.

###### Example 1

(only type I) If only type I indices exist, then $m_1 = m$, hence $\lambda^*$ is in the interior $\Delta(X)^\circ$ of $\Delta(X)$. As a concrete channel matrix of this example, let us consider

$$\Phi^{(1)} = \begin{pmatrix} 0.800 & 0.100 & 0.100 \\ 0.100 & 0.800 & 0.100 \\ 0.250 & 0.250 & 0.500 \end{pmatrix}. \tag{28}$$

For this $\Phi^{(1)}$, all components of the capacity achieving distribution $\lambda^*$ are positive. See Fig.1. The vertices of the large triangle in Fig.1 are the output probability distributions $P_1, P_2, P_3$. Considering the analogy to Euclidean geometry, the triangle $P_1 P_2 P_3$ can be regarded as an “acute triangle”.

###### Example 2

(types I and II) If there are type I and type II indices but no type III indices, then $m_3 = 0$ and $\lambda^*$ is on the boundary of $\Delta(X)$. As a concrete channel matrix of this example, let us consider

$$\Phi^{(2)} = \begin{pmatrix} 0.800 & 0.100 & 0.100 \\ 0.100 & 0.800 & 0.100 \\ 0.300 & 0.300 & 0.400 \end{pmatrix}. \tag{29}$$

For this $\Phi^{(2)}$, the capacity achieving distribution $\lambda^*$ lies on the boundary of $\Delta(X)$. See Fig.2. Considering the analogy to Euclidean geometry, the triangle $P_1 P_2 P_3$ can be regarded as a “right triangle”.

###### Example 3

(types I and III) If there are type I and type III indices but no type II indices, then $m_2 = 0$ and $\lambda^*$ is again on the boundary of $\Delta(X)$. As a concrete channel matrix of this example, let us consider

$$\Phi^{(3)} = \begin{pmatrix} 0.800 & 0.100 & 0.100 \\ 0.100 & 0.800 & 0.100 \\ 0.350 & 0.350 & 0.300 \end{pmatrix}. \tag{30}$$

For this $\Phi^{(3)}$, the capacity achieving distribution $\lambda^*$ lies on the boundary of $\Delta(X)$. See Fig.3. Considering the analogy to Euclidean geometry, the triangle $P_1 P_2 P_3$ can be regarded as an “obtuse triangle”.

For the above $\Phi^{(1)}, \Phi^{(2)}, \Phi^{(3)}$, Fig.4 shows the state of convergence of the Arimoto algorithm. From this figure, we see that in Examples 1 and 3 the convergence is exponential, while in Example 2 the convergence is slower than exponential.
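The qualitative difference between Examples 1-3 can be reproduced with a short experiment (a sketch reusing the recurrence (6); the function name is ours, and a very long run is used as a stand-in for the exact $\lambda^*$):

```python
import numpy as np

def arimoto_errors(Phi, n_iter, ref_iter=100000):
    """Run recurrence (6) from the uniform distribution and record the error
    ||lam^N - lam*|| at each step, approximating lam* by a much longer run."""
    def step(lam):
        D = np.einsum('ij,ij->i', Phi, np.log(Phi / (lam @ Phi)))  # D(P_i || lam Phi)
        w = lam * np.exp(D)
        return w / w.sum()
    m = Phi.shape[0]
    lam_ref = np.full(m, 1.0 / m)
    for _ in range(ref_iter):
        lam_ref = step(lam_ref)
    lam, errs = np.full(m, 1.0 / m), []
    for _ in range(n_iter):
        lam = step(lam)
        errs.append(np.linalg.norm(lam - lam_ref))
    return np.array(errs)

Phi1 = np.array([[0.800, 0.100, 0.100], [0.100, 0.800, 0.100], [0.250, 0.250, 0.500]])
Phi2 = np.array([[0.800, 0.100, 0.100], [0.100, 0.800, 0.100], [0.300, 0.300, 0.400]])
Phi3 = np.array([[0.800, 0.100, 0.100], [0.100, 0.800, 0.100], [0.350, 0.350, 0.300]])

e1, e2, e3 = (arimoto_errors(P, 2000) for P in (Phi1, Phi2, Phi3))
```

After the same number of iterations, the error for $\Phi^{(2)}$ (the case with a type II index) remains far larger than for $\Phi^{(1)}$ and $\Phi^{(3)}$, consistent with Fig.4.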

From the above three examples, it is inferred that the Arimoto algorithm converges very slowly when a type II index exists, and converges exponentially when no type II index exists. We will analyze this phenomenon in the following.

## 5 Taylor expansion of F(λ) about λ=λ∗

We will examine the convergence speed of the Arimoto algorithm by the Taylor expansion of $F(\lambda)$ about the fixed point $\lambda = \lambda^*$. The Taylor expansion of $F(\lambda)$ about $\lambda = \lambda^*$ is

$$F(\lambda) = F(\lambda^*) + (\lambda - \lambda^*)J(\lambda^*) + \frac{1}{2!}(\lambda - \lambda^*)H(\lambda^*)\,{}^t(\lambda - \lambda^*) + o(\|\lambda - \lambda^*\|^2), \tag{31}$$

where ${}^t(\lambda - \lambda^*)$ denotes the transpose of $\lambda - \lambda^*$ and $\|\cdot\|$ denotes the Euclidean norm.

In (31), $J(\lambda^*)$ is the Jacobian matrix at $\lambda = \lambda^*$, i.e.,

$$J(\lambda^*) = \left( \left. \frac{\partial F_i}{\partial \lambda_{i'}} \right|_{\lambda=\lambda^*} \right)_{i', i = 1, \cdots, m}. \tag{32}$$

We consider in this paper that the input probability distribution $\lambda$ is a row vector, thus the Jacobian matrix is

$$J(\lambda^*) = \begin{pmatrix} \left.\dfrac{\partial F_1}{\partial \lambda_1}\right|_{\lambda=\lambda^*} & \cdots & \left.\dfrac{\partial F_m}{\partial \lambda_1}\right|_{\lambda=\lambda^*} \\ \vdots & & \vdots \\ \left.\dfrac{\partial F_1}{\partial \lambda_m}\right|_{\lambda=\lambda^*} & \cdots & \left.\dfrac{\partial F_m}{\partial \lambda_m}\right|_{\lambda=\lambda^*} \end{pmatrix} \in \mathbb{R}^{m \times m}, \tag{33}$$

i.e., $\left.\partial F_i/\partial \lambda_{i'}\right|_{\lambda=\lambda^*}$ is the $(i', i)$ component. Note that our $J(\lambda^*)$ is the transpose of the usual Jacobian matrix corresponding to column vectors.

Because $\sum_{i=1}^m F_i(\lambda) = 1$ by (11), we have the following by (33).

###### Lemma 3

Every row sum of $J(\lambda^*)$ is equal to $0$.
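Since the denominator of (11) normalizes $F(\lambda)$, the identity $\sum_{i=1}^m F_i(\lambda) = 1$ holds even off the simplex, so differentiating by each $\lambda_{i'}$ gives $\sum_i \partial F_i/\partial\lambda_{i'} = 0$ for every $i'$. This can be checked with a finite-difference sketch (illustrative channel and evaluation point chosen by us; the identity holds at any interior point, not only at $\lambda^*$):

```python
import numpy as np

def F(lam, Phi):
    """Defining function (11); normalized, so its components always sum to 1."""
    D = np.einsum('ij,ij->i', Phi, np.log(Phi / (lam @ Phi)))
    w = lam * np.exp(D)
    return w / w.sum()

Phi = np.array([[0.80, 0.10, 0.10],
                [0.10, 0.80, 0.10],
                [0.25, 0.25, 0.50]])
lam = np.array([0.2, 0.5, 0.3])

# Row i' of J holds dF_i/dlam_{i'} for i = 1..m, following convention (33).
eps = 1e-6
J = np.zeros((3, 3))
for ip in range(3):
    e = np.zeros(3)
    e[ip] = eps
    J[ip] = (F(lam + e, Phi) - F(lam - e, Phi)) / (2 * eps)
```

Each row of the central-difference Jacobian sums to zero up to the finite-difference error.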

In (31), $H(\lambda^*) = (H_1(\lambda^*), \cdots, H_m(\lambda^*))$, where $H_i(\lambda^*)$ is the Hessian matrix of $F_i$ at $\lambda = \lambda^*$, i.e.,

$$H_i(\lambda^*) = \left( \left. \frac{\partial^2 F_i}{\partial \lambda_{i'} \partial \lambda_{i''}} \right|_{\lambda=\lambda^*} \right)_{i', i'' = 1, \cdots, m}, \tag{34}$$

and $(\lambda - \lambda^*)H(\lambda^*)\,{}^t(\lambda - \lambda^*)$ is an abbreviated expression of the $m$ dimensional vector

$$\bigl( (\lambda - \lambda^*) H_1(\lambda^*)\,{}^t(\lambda - \lambda^*), \cdots, (\lambda - \lambda^*) H_m(\lambda^*)\,{}^t(\lambda - \lambda^*) \bigr).$$

###### Remark 2

The variables $\lambda_1, \cdots, \lambda_m$ satisfy the constraint $\sum_{i=1}^m \lambda_i = 1$, but in (31), (32), (34) we consider $\lambda_1, \cdots, \lambda_m$ as independent variables to have the Taylor series approximation (31). This approximation is justified as follows. By the Kuhn-Tucker condition (4), $D(P_i \| \lambda^*\Phi) \le C < \infty$, hence by the assumption put below (1), we have $Q^*_j > 0$, $j = 1, \cdots, n$. For $\delta > 0$, define $B(Q^*, \delta) \equiv \{Q \in \mathbb{R}^n \mid \|Q - Q^*\| < \delta\}$, i.e., $B(Q^*, \delta)$ is an open ball in $\mathbb{R}^n$ centered at $Q^*$ with radius $\delta$. Note that $B(Q^*, \delta)$ is free from the constraint $\sum_{j=1}^n Q_j = 1$. Taking $\delta$ sufficiently small, we can have $Q_j > 0$, $j = 1, \cdots, n$, for any $Q \in B(Q^*, \delta)$. The function $D(P_i \| Q)$ is defined for $Q$ with $Q_j > 0$, $j = 1, \cdots, n$, even if $\sum_{j=1}^n Q_j \neq 1$. Therefore, the domain of definition of $F$ can be extended to the inverse image of $B(Q^*, \delta)$ by the mapping $\lambda \mapsto \lambda\Phi$, which is an open neighborhood of $\lambda^*$ in $\mathbb{R}^m$. Then $F$ is a function of $\lambda_1, \cdots, \lambda_m$ as independent variables (free from the constraint $\sum_{i=1}^m \lambda_i = 1$). We can consider (31) to be the Taylor expansion in the independent variables $\lambda_1, \cdots, \lambda_m$, then substitute $\lambda \in \Delta(X)$ into (31) to obtain the approximation for $F(\lambda)$ about $\lambda = \lambda^*$.

Now, substituting $\lambda = \lambda^N$ into (31), then by $\lambda^{N+1} = F(\lambda^N)$ and $\lambda^* = F(\lambda^*)$, we have

$$\lambda^{N+1} = \lambda^* + (\lambda^N - \lambda^*)J(\lambda^*) + \frac{1}{2!}(\lambda^N - \lambda^*)H(\lambda^*)\,{}^t(\lambda^N - \lambda^*) + o(\|\lambda^N - \lambda^*\|^2). \tag{35}$$

Then, by putting $\mu^N \equiv \lambda^N - \lambda^*$, (35) becomes

$$\mu^{N+1} = \mu^N J(\lambda^*) + \frac{1}{2!}\mu^N H(\lambda^*)\,{}^t\mu^N + o(\|\mu^N\|^2). \tag{36}$$

By (22), we will investigate the convergence

$$\mu^N \to 0, \quad N \to \infty, \tag{37}$$

based on the Taylor expansion (36). Let

$$\mu^N_i \equiv \lambda^N_i - \lambda^*_i, \quad i = 1, \cdots, m, \tag{38}$$

denote the components of $\mu^N$, and write $\mu^N = (\mu^N_1, \cdots, \mu^N_m)$; then we have

$$\sum_{i=1}^m \mu^N_i = 0, \quad N = 0, 1, \cdots, \tag{39}$$

because $\sum_{i=1}^m \lambda^N_i = \sum_{i=1}^m \lambda^*_i = 1$.

### 5.1 Basic analysis for fast and slow convergence

For the investigation of the convergence speed, we consider the following simple case.

Let us define a real sequence $\{\mu^N\}$ by the recurrence formula

$$\mu^{N+1} = \theta\mu^N - \rho(\mu^N)^2, \quad N = 0, 1, \cdots, \tag{40}$$

$$0 < \theta \le 1, \quad \rho > 0, \quad 0 < \mu^0 < \theta/\rho. \tag{41}$$

If $0 < \theta < 1$, then we have $\mu^{N+1} < \theta\mu^N$, hence $\mu^N$ decays exponentially.

While, if $\theta = 1$, (40) becomes $\mu^{N+1} = \mu^N - \rho(\mu^N)^2$. This recurrence formula cannot be solved explicitly; however, we can see the state of convergence in Fig.5.

Because the differential coefficient of the function $f(\mu) = \mu - \rho\mu^2$ at $\mu = 0$ is $1$, the convergence speed is very slow. In fact, this convergence is slower than exponential. From Lemma 7 in section 7 below, we will see that the convergence speed is of order $1/N$, with $\mu^N$ behaving like $1/(\rho N)$.
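The two regimes of (40) can be observed numerically (a sketch; the values of $\theta$, $\rho$, $\mu^0$ are arbitrary choices satisfying (41)):

```python
def iterate(theta, rho, mu0, n):
    """Iterate (40): mu^{N+1} = theta * mu^N - rho * (mu^N)^2."""
    mu = mu0
    for _ in range(n):
        mu = theta * mu - rho * mu * mu
    return mu

# theta < 1: mu^{N+1} < theta * mu^N, so mu^N decays exponentially.
mu_exp = iterate(theta=0.5, rho=1.0, mu0=0.25, n=50)

# theta = 1: mu^{N+1} = mu^N - rho (mu^N)^2 decays only like 1/(rho N).
mu_slow = iterate(theta=1.0, rho=1.0, mu0=0.25, n=10000)
```

After 50 steps the $\theta = 0.5$ sequence is already below $\theta^{50}\mu^0 \approx 2 \cdot 10^{-16}$, while after 10000 steps the $\theta = 1$ sequence is still of order $10^{-4}$, i.e., roughly $1/(\rho N)$.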

### 5.2 On Jacobian matrix J(λ∗)

Let us consider the Jacobian matrix $J(\lambda)$ for any $\lambda \in \Delta(X)$. We are assuming $\operatorname{rank}\Phi = m$ in (13), hence $m \le n$.

We will calculate the components (32) of $J(\lambda)$.

Defining

$$D_i \equiv D(P_i \| \lambda\Phi), \quad i = 1, \cdots, m, \tag{42}$$
$$F_i \equiv F_i(\lambda), \quad i = 1, \cdots, m, \tag{43}$$

we can write (11) as

$$F_i = \frac{\lambda_i e^{D_i}}{\sum_{k=1}^m \lambda_k e^{D_k}}, \quad i = 1, \cdots, m. \tag{44}$$

From (44),

$$F_i \sum_{k=1}^m \lambda_k e^{D_k} = \lambda_i e^{D_i}, \tag{45}$$

then differentiating both sides of (45) by $\lambda_{i'}$, we have

$$\frac{\partial F_i}{\partial \lambda_{i'}} \sum_{k=1}^m \lambda_k e^{D_k} + F_i \frac{\partial}{\partial \lambda_{i'}} \sum_{k=1}^m \lambda_k e^{D_k} = \delta_{i'i} e^{D_i} + \lambda_i e^{D_i} \frac{\partial D_i}{\partial \lambda_{i'}}, \tag{46}$$

where $\delta_{i'i}$ is the Kronecker delta.

Before substituting $\lambda = \lambda^*$ into both sides of (46), we define the following symbols. Remember that the integer $m_1$ was defined in (17). See also (25).

Let us define

$$Q^* \equiv Q(\lambda^*) = \lambda^*\Phi, \tag{47}$$
$$Q^*_j \equiv Q(\lambda^*)_j = \sum_{i=1}^m \lambda^*_i P_{ij} = \sum_{i=1}^{m_1} \lambda^*_i P_{ij}, \quad j = 1, \cdots, n, \tag{48}$$
$$D^*_i \equiv D(P_i \| Q^*), \quad i = 1, \cdots, m, \tag{49}$$
$$D^*_{i',i} \equiv \left.\frac{\partial D_i}{\partial \lambda_{i'}}\right|_{\lambda=\lambda^*}, \quad i', i = 1, \cdots, m, \tag{50}$$
$$F^*_i \equiv F_i(\lambda^*), \quad i = 1, \cdots, m. \tag{51}$$
###### Lemma 4
$$\left.\sum_{k=1}^m \lambda_k e^{D_k}\right|_{\lambda=\lambda^*} = e^C, \tag{52}$$
$$\frac{\partial D_i}{\partial \lambda_{i'}} = -\sum_{j=1}^n \frac{P_{i'j} P_{ij}}{Q_j}, \quad i', i = 1, \cdots, m, \tag{53}$$
$$\left.\frac{\partial}{\partial \lambda_{i'}} \sum_{k=1}^m \lambda_k e^{D_k}\right|_{\lambda=\lambda^*} = e^{D^*_{i'}} - e^C, \quad i' = 1, \cdots, m, \tag{54}$$
$$F^*_i = \lambda^*_i, \quad i = 1, \cdots, m. \tag{55}$$

Proof: We have (52), (53) by simple calculation; see (19). (55) is the result of Lemma 2. (54) is proved as follows:

$$\begin{aligned} \left.\frac{\partial}{\partial \lambda_{i'}} \sum_{k=1}^m \lambda_k e^{D_k}\right|_{\lambda=\lambda^*} &= \sum_{k=1}^m \left.\left( \delta_{i'k} e^{D_k} + \lambda_k e^{D_k} \frac{\partial D_k}{\partial \lambda_{i'}} \right)\right|_{\lambda=\lambda^*} \\ &= e^{D^*_{i'}} + \sum_{k=1}^{m_1} \lambda^*_k e^C \left( -\sum_{j=1}^n \frac{P_{kj} P_{i'j}}{Q^*_j} \right) \\ &= e^{D^*_{i'}} - e^C \sum_{j=1}^n P_{i'j} \frac{1}{Q^*_j} \sum_{k=1}^{m_1} \lambda^*_k P_{kj} \\ &= e^{D^*_{i'}} - e^C. \end{aligned}$$

Note that $Q^*_j > 0$, $j = 1, \cdots, n$, from Remark 2.
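Formula (53) can be verified by a central finite difference in the independent-variables sense of Remark 2 (a sketch; the channel matrix, the point $\lambda$, and the index pair are arbitrary illustrative choices):

```python
import numpy as np

Phi = np.array([[0.80, 0.10, 0.10],
                [0.10, 0.80, 0.10],
                [0.25, 0.25, 0.50]])
lam = np.array([0.3, 0.3, 0.4])

def D_i(i, lam):
    """D_i = D(P_i || lam Phi), defined for any lam with (lam Phi)_j > 0."""
    Q = lam @ Phi
    return float(np.sum(Phi[i] * np.log(Phi[i] / Q)))

# Closed form (53): dD_i/dlam_{i'} = -sum_j P_{i'j} P_{ij} / Q_j, here i = 0, i' = 1.
Q = lam @ Phi
grad = -float(np.sum(Phi[1] * Phi[0] / Q))

# Central finite difference, perturbing lam_{i'} as an independent variable.
eps = 1e-6
e = np.array([0.0, eps, 0.0])
fd = (D_i(0, lam + e) - D_i(0, lam - e)) / (2 * eps)
```

The perturbed points $\lambda \pm e$ do not sum to one, which is exactly the extension of the domain described in Remark 2.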

Substituting the results of Lemma 4 into (46), we obtain the components $\left.\partial F_i/\partial \lambda_{i'}\right|_{\lambda=\lambda^*}$ of the Jacobian matrix $J(\lambda^*)$.