Nonconvex Demixing From Bilinear Measurements

September 18, 2018, by Jialin Dong, et al.

We consider the problem of demixing a sequence of source signals from the sum of noisy bilinear measurements. It is a generalized mathematical model for blind demixing with blind deconvolution, which is prevalent across the areas of dictionary learning, image processing, and communications. However, state-of-the-art convex methods for blind demixing via semidefinite programming are computationally infeasible for large-scale problems. Although existing nonconvex algorithms are able to address the scaling issue, they normally require proper regularization to establish optimality guarantees. The additional regularization yields tedious algorithmic parameters and pessimistic convergence rates with conservative step sizes. To address the limitations of existing methods, we thus develop a provable nonconvex demixing procedure via Wirtinger flow, much like vanilla gradient descent, to harness the benefits of regularization-free fast convergence with an aggressive step size and computational optimality guarantees. This is achieved by exploiting the benign geometry of the blind demixing problem, thereby revealing that Wirtinger flow enforces the regularization-free iterates to stay in a region of strong convexity with a qualified level of smoothness, where the step size can be chosen aggressively.


I Introduction

Demixing a sequence of source signals from the sum of bilinear measurements provides a generalized mathematical modeling framework for blind demixing with blind deconvolution [1, 2, 3]. It spans a wide scope of applications ranging from communication [4], imaging [5], and machine learning [6], to the recent application in the context of the Internet-of-Things for sporadic short-message communications over unknown channels [3]. Although blind demixing can be regarded as a variant of blind deconvolution [7] that extends the "single-source" setting to the "multi-source" setting, the extension is nontrivial to accomplish. The main reason is that the "incoherence" between different sources brings unique challenges to developing effective algorithms for blind demixing with theoretical guarantees [1, 2, 8]. In addition, the bilinear measurements in the blind demixing problem hamper the extension of the results for the demixing problem with linear measurements [9]. Moreover, the demixing procedure often involves solving highly nonconvex optimization problems which are generally difficult to tackle. In particular, local stationary points bring severe challenges since it is usually intractable to even check local optimality for a feasible point [10].

Despite the general intractability, recent years have seen progress on convex relaxation approaches for demixing problems. Specifically, a sharp recovery bound for convex demixing with linear measurements has been established in [11], based on the integral geometry technique for analyzing convex optimization problems with random constraints. Moreover, by lifting the original bilinear model into a linear model with a rank-one matrix, a provable convex relaxation approach for solving the blind deconvolution problem via semidefinite programming has been developed in [7]. Ling et al. in [1] further extended the theoretical analysis for blind deconvolution with a single source [7] to the blind demixing problem with multiple sources. The theoretical guarantees for blind demixing have been recently improved in [2], built on the concept of the restricted isometry property originally introduced in [12]. Despite attractive theoretical guarantees, such convex relaxation methods fail in the high-dimensional data setting due to the high computational and storage cost of solving large-scale semidefinite programming problems.

To address the scaling issue of the convex relaxation approaches, a recent line of works has investigated computationally efficient methods based on nonconvex optimization paradigms with theoretical guarantees. For high-dimensional estimation problems via nonconvex optimization methods, state-of-the-art results can be divided into two categories, i.e., local geometry and global geometry. In the line of works that focuses on the local geometry, one shows that an iterative algorithm converges rapidly to the global solution when the initialization is close to the ground truth. This line of successful works includes matrix completion [13], phase retrieval [14, 15, 10], blind deconvolution [16] and blind demixing [8]. The second line of works explores the global landscape of the objective function and aims to show that all local minima are globally optimal under suitable statistical conditions, while the saddle points can be escaped efficiently via nonconvex iterative procedures with random initialization. Successful examples include matrix sensing [17], matrix completion [18], dictionary learning [19], tensor decomposition [20], the synchronization problem [21] and learning shallow neural networks [22].

The nonconvex optimization paradigm for high-dimensional estimation has also recently been applied in the setting of blind demixing. Specifically, a nonconvex Riemannian optimization algorithm was developed in [3] by exploiting the manifold geometry of fixed-rank matrices. However, due to the complicated iterative strategies in Riemannian trust-region algorithms, it is challenging to provide high-dimensional statistical analysis for such a nonconvex strategy. Ling et al. in [8] developed a regularized gradient descent procedure to optimize the nonconvex loss function directly, in which the regularization accounts for guaranteeing incoherence. Although the regularized nonconvex procedure in [8] provides appealing computational properties with optimality guarantees, it introduces tedious algorithmic parameters that need to be carefully tuned. Moreover, the theoretical analysis in [8] provides a pessimistic convergence rate with a severely conservative step size.

In contrast, the Wirtinger flow algorithm [14], which consists of spectral initialization and vanilla gradient descent updates without regularization, turns out to yield theoretical guarantees for important high-dimensional statistical estimation problems. In particular, the optimality guarantee for phase retrieval was established in [14]. However, the theoretical results in [14] only ensure that the iterates of the Wirtinger flow algorithm remain in an $\ell_2$-ball around the ground truth, in which the step size is chosen conservatively, yielding a slow convergence rate. The statistical and computational efficiency was further improved in [15] via the truncated Wirtinger flow by carefully controlling search directions, much like regularized gradient descent. To harness all the benefits of being regularization-free, i.e., fast convergence rates with an aggressive step size and computational optimality guarantees, Ma et al. [10] recently uncovered that the Wirtinger flow algorithm (without regularization) implicitly enforces the iterates to stay within the intersection of the $\ell_2$-ball and the incoherence region, i.e., the region of incoherence and contraction, for the nonconvex estimation problems of phase retrieval, low-rank matrix completion, and blind deconvolution. By exploiting the local geometry in such a region, i.e., strong convexity and a qualified level of smoothness, the step size of the iterative algorithm can be chosen more aggressively, yielding a faster convergence rate.

In the present work, we extend the knowledge of implicit regularization in nonconvex statistical estimation problems [10] by studying the hitherto unexplored blind demixing problem. It turns out that, for the blind demixing problem, our theory suggests a more aggressive step size for the Wirtinger flow algorithm compared with the results in [8], yielding substantial computational savings. The extension is nontrivial since the "incoherence" between multiple sources in blind demixing distorts the statistical properties that hold in the single-source scenario of blind deconvolution. A similar challenge has also been observed in [1, 2] when extending the convex relaxation approach (i.e., semidefinite programming) for blind deconvolution to the setting of blind demixing. Furthermore, the noisy measurements bring additional challenges to establishing theoretical guarantees. The extra technical details involved in this paper to address these challenges are demonstrated clearly during the presentation.

Notations

Throughout this paper, $f(n) \lesssim g(n)$ or $f(n) = O(g(n))$ denotes that there exists a constant $c > 0$ such that $|f(n)| \le c\,|g(n)|$, whereas $f(n) \gtrsim g(n)$ means that there exists a constant $c > 0$ such that $|f(n)| \ge c\,|g(n)|$. $f(n) \gg g(n)$ denotes that there exists some sufficiently large constant $c > 0$ such that $|f(n)| \ge c\,|g(n)|$. In addition, the notation $f(n) \asymp g(n)$ means that there exist constants $c_1, c_2 > 0$ such that $c_1 |g(n)| \le |f(n)| \le c_2 |g(n)|$.

II Problem Formulation

In this section, we present the mathematical model of the blind demixing problem in the noisy scenario. As this problem is highly intractable without any further structural assumptions, the coupled signals are thus assumed to belong to known subspaces [1, 2, 8].

Let $\boldsymbol{A}^{*}$ denote the conjugate transpose of a matrix $\boldsymbol{A}$. Suppose we have $m$ bilinear measurements $y_j$'s, which are represented in the frequency domain as

$$y_j = \sum_{i=1}^{s} \boldsymbol{b}_j^{*} \boldsymbol{h}_i^{\natural} \boldsymbol{x}_i^{\natural *} \boldsymbol{a}_{ij} + e_j, \quad j = 1, \ldots, m, \qquad (1)$$

where $\boldsymbol{a}_{ij} \in \mathbb{C}^{N}$ and $\boldsymbol{b}_j \in \mathbb{C}^{K}$ are known design vectors, and $e_j$ is the additive white complex Gaussian noise with $e_j \sim \mathcal{CN}(0, \sigma^2 d_0^2/m)$ and $\sigma^2$ as the measurement noise variance, where $d_0^2 = \sum_{i=1}^{s} \|\boldsymbol{h}_i^{\natural}\|_2^2 \|\boldsymbol{x}_i^{\natural}\|_2^2$ [8]. Each $\boldsymbol{a}_{ij}$ is assumed to follow an i.i.d. complex Gaussian distribution, i.e., $\boldsymbol{a}_{ij} \sim \mathcal{CN}(\boldsymbol{0}, \boldsymbol{I}_N)$. The first $K$ columns of the unitary discrete Fourier transform (DFT) matrix $\boldsymbol{F} \in \mathbb{C}^{m \times m}$ with $\boldsymbol{F}\boldsymbol{F}^{*} = \boldsymbol{I}_m$ form the matrix $\boldsymbol{B} \in \mathbb{C}^{m \times K}$, whose $j$-th row is $\boldsymbol{b}_j^{*}$ [8]. Based on the above bilinear model, our goal is to simultaneously recover the underlying signals $\boldsymbol{h}_i^{\natural}$'s and $\boldsymbol{x}_i^{\natural}$'s by solving the following blind demixing problem [3, 8]

$$\underset{\{\boldsymbol{h}_i\},\{\boldsymbol{x}_i\}}{\operatorname{minimize}} \;\; f(\boldsymbol{h}, \boldsymbol{x}) := \sum_{j=1}^{m} \Big| \sum_{i=1}^{s} \boldsymbol{b}_j^{*} \boldsymbol{h}_i \boldsymbol{x}_i^{*} \boldsymbol{a}_{ij} - y_j \Big|^2. \qquad (2)$$
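To make the model concrete, the following NumPy sketch simulates the measurements (1) and evaluates the loss (2). All function names, array shapes, and the noise scaling here are our own illustrative choices, not the authors' code.

```python
import numpy as np

def generate_problem(s, K, N, m, sigma=0.0, rng=None):
    """Simulate the bilinear measurement model (1); a minimal sketch."""
    rng = np.random.default_rng(rng)
    # B: first K columns of the unitary m x m DFT matrix; b_j^* is its j-th row.
    B = np.fft.fft(np.eye(m), norm="ortho")[:, :K]
    # a_{ij} ~ CN(0, I_N): i.i.d. complex Gaussian design vectors.
    A = (rng.standard_normal((s, m, N)) + 1j * rng.standard_normal((s, m, N))) / np.sqrt(2)
    # Ground-truth signals, normalized to unit norm as in the experiments below.
    h = rng.standard_normal((s, K)) + 1j * rng.standard_normal((s, K))
    x = rng.standard_normal((s, N)) + 1j * rng.standard_normal((s, N))
    h /= np.linalg.norm(h, axis=1, keepdims=True)
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    # y_j = sum_i b_j^* h_i x_i^* a_{ij} + e_j (the noise scaling is illustrative).
    y = np.einsum("jk,ik,in,ijn->j", B.conj(), h, x.conj(), A)
    y = y + sigma * (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2 * m)
    return B, A, h, x, y

def loss(B, A, h, x, y):
    """Nonconvex least-squares objective (2)."""
    r = np.einsum("jk,ik,in,ijn->j", B.conj(), h, x.conj(), A) - y
    return np.sum(np.abs(r) ** 2)
```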

To simplify the presentation, we denote $\boldsymbol{z} = [\boldsymbol{z}_1^{\mathsf{T}}, \ldots, \boldsymbol{z}_s^{\mathsf{T}}]^{\mathsf{T}}$, where $\boldsymbol{z}_i = [\boldsymbol{h}_i^{\mathsf{T}}, \boldsymbol{x}_i^{\mathsf{T}}]^{\mathsf{T}}$ for $i = 1, \ldots, s$. We further define the discrepancy between the estimate $\boldsymbol{z}$ and the ground truth $\boldsymbol{z}^{\natural}$ as the distance function, given as

$$\operatorname{dist}^2(\boldsymbol{z}, \boldsymbol{z}^{\natural}) = \sum_{i=1}^{s} \operatorname{dist}^2(\boldsymbol{z}_i, \boldsymbol{z}_i^{\natural}), \qquad (3)$$

where $\operatorname{dist}^2(\boldsymbol{z}_i, \boldsymbol{z}_i^{\natural}) = \min_{\alpha_i \in \mathbb{C}} \big( \|\boldsymbol{h}_i/\overline{\alpha}_i - \boldsymbol{h}_i^{\natural}\|_2^2 + \|\alpha_i \boldsymbol{x}_i - \boldsymbol{x}_i^{\natural}\|_2^2 \big)$ for $i = 1, \ldots, s$. Here, $\boldsymbol{z}_i^{\natural} = [\boldsymbol{h}_i^{\natural\mathsf{T}}, \boldsymbol{x}_i^{\natural\mathsf{T}}]^{\mathsf{T}}$ and each $\alpha_i$ is the alignment parameter.
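Since the alignment parameters enter many of the statements below, here is one way to evaluate the distance (3) numerically. The scalar minimization over $\alpha_i$ is solved here with a generic optimizer over its real and imaginary parts; this is an implementation convenience of ours, not the paper's procedure.

```python
import numpy as np
from scipy.optimize import minimize

def dist2_component(h, x, h_true, x_true):
    """dist^2(z_i, z_i^natural) in (3): minimize over alpha_i in C."""
    def objective(v):
        alpha = v[0] + 1j * v[1]
        return (np.linalg.norm(h / np.conj(alpha) - h_true) ** 2
                + np.linalg.norm(alpha * x - x_true) ** 2)
    # Start from alpha = 1 (no realignment) and search over (Re, Im).
    res = minimize(objective, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
    return res.fun

def dist2(h, x, h_true, x_true):
    """Squared distance (3): sum of per-component aligned errors."""
    return sum(dist2_component(h[i], x[i], h_true[i], x_true[i])
               for i in range(h.shape[0]))
```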

III Main Results

In this section, we shall present the Wirtinger flow algorithm along with the statistical analysis for blind demixing.

III-A Wirtinger Flow Algorithm

The Wirtinger flow algorithm [14] is a two-stage approach consisting of spectral initialization and a vanilla gradient descent update procedure without regularization. Specifically, the gradient step in the second stage of Wirtinger flow is characterized by the notion of Wirtinger derivatives [14], i.e., the derivatives of real-valued functions over complex variables. For each $i = 1, \ldots, s$, denote the Wirtinger gradients of $f(\boldsymbol{z})$ with respect to $\boldsymbol{h}_i$ and $\boldsymbol{x}_i$ respectively as follows:

$$\nabla_{\boldsymbol{h}_i} f(\boldsymbol{z}) = \sum_{j=1}^{m} \Big( \sum_{k=1}^{s} \boldsymbol{b}_j^{*} \boldsymbol{h}_k \boldsymbol{x}_k^{*} \boldsymbol{a}_{kj} - y_j \Big) \boldsymbol{b}_j \boldsymbol{a}_{ij}^{*} \boldsymbol{x}_i, \qquad (4a)$$
$$\nabla_{\boldsymbol{x}_i} f(\boldsymbol{z}) = \sum_{j=1}^{m} \overline{\Big( \sum_{k=1}^{s} \boldsymbol{b}_j^{*} \boldsymbol{h}_k \boldsymbol{x}_k^{*} \boldsymbol{a}_{kj} - y_j \Big)} \boldsymbol{a}_{ij} \boldsymbol{b}_j^{*} \boldsymbol{h}_i. \qquad (4b)$$
0:  Input: $\{\boldsymbol{a}_{ij}\}$, $\{\boldsymbol{b}_j\}$, $\{y_j\}$, and the step size $\eta$.
1:  Spectral Initialization:
2:  for all $1 \le i \le s$ do in parallel
3:     Let $\sigma_1(\boldsymbol{M}_i)$, $\check{\boldsymbol{h}}_i^{0}$ and $\check{\boldsymbol{x}}_i^{0}$ be the leading singular value, left singular vector and right singular vector of the matrix $\boldsymbol{M}_i := \sum_{j=1}^{m} y_j \boldsymbol{b}_j \boldsymbol{a}_{ij}^{*}$, respectively.
4:     Set $\boldsymbol{h}_i^{0} = \sqrt{\sigma_1(\boldsymbol{M}_i)}\,\check{\boldsymbol{h}}_i^{0}$ and $\boldsymbol{x}_i^{0} = \sqrt{\sigma_1(\boldsymbol{M}_i)}\,\check{\boldsymbol{x}}_i^{0}$.
5:  end for
6:  for $t = 0, 1, \ldots, T-1$ do
7:     for all $1 \le i \le s$ do in parallel
8:        $\boldsymbol{h}_i^{t+1} = \boldsymbol{h}_i^{t} - \frac{\eta}{\|\boldsymbol{x}_i^{t}\|_2^2}\, \nabla_{\boldsymbol{h}_i} f(\boldsymbol{z}^{t})$ and $\boldsymbol{x}_i^{t+1} = \boldsymbol{x}_i^{t} - \frac{\eta}{\|\boldsymbol{h}_i^{t}\|_2^2}\, \nabla_{\boldsymbol{x}_i} f(\boldsymbol{z}^{t})$
9:     end for
10:  end for
Algorithm 1 Wirtinger flow for blind demixing

The Wirtinger flow algorithm for the blind demixing problem is presented in Algorithm 1, in which $T$ is the maximum number of iterations and the constant $\eta > 0$ is the step size.
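For concreteness, the sketch below implements Algorithm 1 on top of the data model of Section II, with the gradients transcribed from (4). The norm-scaled update and all helper names follow our reading of the algorithm and the blind deconvolution variant in [10]; treat this as an illustrative sketch rather than the authors' implementation.

```python
import numpy as np

def wirtinger_gradients(B, A, h, x, y):
    """Wirtinger gradients (4a)-(4b) of the loss (2), transcribed directly."""
    # Residual r_j = sum_k b_j^* h_k x_k^* a_{kj} - y_j.
    r = np.einsum("jk,ik,in,ijn->j", B.conj(), h, x.conj(), A) - y
    ax = np.einsum("ijn,in->ij", A.conj(), x)            # a_{ij}^* x_i
    bh = np.einsum("jk,ik->ij", B.conj(), h)             # b_j^* h_i
    grad_h = np.einsum("j,jk,ij->ik", r, B, ax)          # (4a)
    grad_x = np.einsum("j,ijn,ij->in", r.conj(), A, bh)  # (4b)
    return grad_h, grad_x

def wirtinger_flow(B, A, y, s, eta=0.1, T=500):
    """Algorithm 1: spectral initialization + vanilla gradient descent."""
    m, K = B.shape
    N = A.shape[2]
    h = np.empty((s, K), dtype=complex)
    x = np.empty((s, N), dtype=complex)
    # Spectral initialization via the top singular pair of
    # M_i = sum_j y_j b_j a_{ij}^*.
    for i in range(s):
        M = np.einsum("j,jk,jn->kn", y, B, A[i].conj())
        U, S, Vh = np.linalg.svd(M)
        h[i] = np.sqrt(S[0]) * U[:, 0]
        x[i] = np.sqrt(S[0]) * Vh[0].conj()
    # Regularization-free gradient updates (step 8 of Algorithm 1).
    for _ in range(T):
        gh, gx = wirtinger_gradients(B, A, h, x, y)
        nh = np.linalg.norm(h, axis=1, keepdims=True) ** 2
        nx = np.linalg.norm(x, axis=1, keepdims=True) ** 2
        h = h - (eta / nx) * gh
        x = x - (eta / nh) * gx
    return h, x
```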

We now provide some numerical evidence by testing the performance of the Wirtinger flow algorithm for the blind demixing problem (2). We first consider the blind demixing problem in the noiseless scenario in order to clearly demonstrate the effectiveness of the Wirtinger flow algorithm. Specifically, for each problem size considered, we generate the design vectors $\boldsymbol{a}_{ij}$'s and $\boldsymbol{b}_j$'s according to the descriptions in Section II. The underlying signals $\boldsymbol{h}_i^{\natural}$ and $\boldsymbol{x}_i^{\natural}$, $i = 1, \ldots, s$, are generated as random vectors with unit norm. With the same constant step size $\eta$ chosen in all settings, Fig. 1 shows the relative error (measured in the Frobenius norm $\|\cdot\|_F$) versus the iteration count. We observe that, in the noiseless case, Wirtinger flow with a constant step size enjoys an extraordinary linear convergence rate which rarely changes as the problem size varies.

Fig. 1: Numerical results.

In the noiseless scenario, we further demonstrate that the performance and convergence rate of the Wirtinger flow actually depend on the condition number $\kappa$ (defined in Section III-B). In this experiment, we fix the problem size and the step size, and scale the ground-truth components so that the norm of the first component is $\kappa$ times that of the second, with $\kappa \ge 1$. Fig. 1 shows the relative error versus the iteration count. As we can see, a larger $\kappa$ yields a slower convergence rate. This phenomenon may be caused by a bad initial guess for weak components via spectral initialization [8]. Moreover, the strong components may pollute the gradient directions of the weak components, which yields a slow convergence rate [8]. We further provide empirical results for the Wirtinger flow algorithm in the presence of noise, fixing the size of the source signals, the sample size, the number of users, and the step size. The underlying signals $\boldsymbol{h}_i^{\natural}$ and $\boldsymbol{x}_i^{\natural}$, $i = 1, \ldots, s$, are generated as random vectors with unit norm. Fig. 1 shows the relative error defined above versus the signal-to-noise ratio (SNR), where the SNR is defined as in [8] since it is easy to access the noiseless signal. Both the relative error and the SNR are shown on the dB scale. As we can see, the relative error scales linearly with the SNR, which implies that the Wirtinger flow is robust to the noise. The main purpose of this paper is to theoretically analyze these promising empirical observations of the Wirtinger flow algorithm for blind demixing in the noisy scenario. We will demonstrate that for this problem the Wirtinger flow algorithm can achieve fast convergence rates with an aggressive step size and computational optimality guarantees without explicit regularization.
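As a usage sketch, an experiment of this kind can be mimicked with the helpers introduced above; the problem sizes, step size, and seed below are hypothetical stand-ins for the (elided) values in the text. The rank-one products $\boldsymbol{h}_i \boldsymbol{x}_i^{*}$ are compared, since the individual factors are only identifiable up to the scaling ambiguity absorbed by the alignment parameters.

```python
import numpy as np

# Hypothetical reproduction of the noiseless experiment (sizes are ours),
# reusing generate_problem and wirtinger_flow from the sketches above.
B, A, h0, x0, y = generate_problem(s=2, K=50, N=50, m=500, sigma=0.0, rng=0)
h, x = wirtinger_flow(B, A, y, s=2, eta=0.1, T=300)

# Relative error of the recovered rank-one components in Frobenius norm.
num = sum(np.linalg.norm(np.outer(h[i], x[i].conj())
                         - np.outer(h0[i], x0[i].conj())) for i in range(2))
den = sum(np.linalg.norm(np.outer(h0[i], x0[i].conj())) for i in range(2))
print("relative error:", num / den)
```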

III-B Theoretical Results

Before stating the main theorem, we need to introduce the incoherence parameter [8], which characterizes the incoherence between $\boldsymbol{b}_j$ and $\boldsymbol{h}_i^{\natural}$ for $1 \le i \le s$ and $1 \le j \le m$.

Definition 1 (Incoherence for blind demixing).

Let the incoherence parameter $\mu$ be the smallest number such that

$$\max_{1 \le i \le s,\; 1 \le j \le m} \frac{|\boldsymbol{b}_j^{*} \boldsymbol{h}_i^{\natural}|}{\|\boldsymbol{h}_i^{\natural}\|_2} \le \frac{\mu}{\sqrt{m}}. \qquad (5)$$

The incoherence between $\boldsymbol{b}_j$ and $\boldsymbol{h}_i^{\natural}$ specifies the smoothness of the loss function in (2). Within the region of incoherence and contraction (defined in Section IV-A), which enjoys a qualified level of smoothness, the step size of the iterative refinement procedure can be chosen more aggressively according to generic optimization theory [10]. Based on the definition of incoherence, our theory shall show that the iterates of Algorithm 1 remain in the region of incoherence and contraction, which is endowed with strong convexity and a qualified level of smoothness.
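Definition 1 translates directly into code. The following sketch (with our own naming) evaluates $\mu$ for given $\boldsymbol{B}$ and channel vectors; applied to the iterates instead of the ground truth, the same quantity is how the incoherence measures plotted later can be tracked.

```python
import numpy as np

def incoherence(B, h):
    """Smallest mu in (5): max_{i,j} |b_j^* h_i| / ||h_i||_2 <= mu / sqrt(m)."""
    m = B.shape[0]
    corr = np.abs(B.conj() @ h.T)      # |b_j^* h_i|, shape (m, s)
    return np.sqrt(m) * np.max(corr / np.linalg.norm(h, axis=1))
```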

Without loss of generality, we assume $\|\boldsymbol{h}_i^{\natural}\|_2 = \|\boldsymbol{x}_i^{\natural}\|_2 = \sqrt{d_i}$ for $1 \le i \le s$ and define the condition number $\kappa = d_{\max}/d_{\min}$ with $d_{\max} = \max_i d_i$ and $d_{\min} = \min_i d_i$, so that $d_0 = \sqrt{\sum_{i=1}^{s} d_i^2}$. The main theorem is presented in the following.

Theorem 1.

Suppose the step size $\eta > 0$ is some sufficiently small constant. Then the iterates (including the spectral initialization point) in Algorithm 1 satisfy

(6a)
(6b)
(6c)

for all $0 \le t \le T$, with high probability, provided that the number of measurements $m$ is sufficiently large.

Here, we denote by $\alpha_i^{t}$ for $1 \le i \le s$ the alignment parameter such that

(7)

In addition, with high probability, the noise term is bounded by some absolute constant times the noise level $\sigma$, where $\sigma$ is defined in Section II.

Note that the assumption that $\boldsymbol{h}_i^{\natural}$ and $\boldsymbol{x}_i^{\natural}$ have the same length only serves the purpose of simplifying the presentation. Our theoretical results can easily be extended to the scenario where $\boldsymbol{h}_i^{\natural}$ and $\boldsymbol{x}_i^{\natural}$ have different sizes, in which case the requirement on the sample size changes accordingly.

Theorem 1 endorses the empirical results shown in Fig. 1. Specifically, compared to the conservative step size suggested in [8] for regularized gradient descent, our theory allows a much more aggressive step size even without regularization. According to (6a), in the noiseless scenario, the Wirtinger flow algorithm achieves $\epsilon$-accuracy within a number of iterations proportional to $\log(1/\epsilon)$ and independent of the number of measurements $m$, while the previous theory in [8] suggests an iteration count that additionally grows with $m$. In the noisy scenario, the convergence rate of the Wirtinger flow algorithm is independent of the number of measurements and related to the level of the noise. The sample complexity, with a sufficiently large constant, is comparable to the result in [8], which uses explicit regularization. However, we expect to reduce the sample complexity through a tighter analysis, e.g., eluding the control of certain intermediate terms, which is left for future work.
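To see how the step size drives the iteration count, the following back-of-the-envelope calculation makes the comparison concrete (the contraction constant $c$ is schematic notation of ours, not a constant from the paper):

```latex
% Linear contraction per iteration with step size \eta:
\operatorname{dist}(\boldsymbol{z}^{t+1}, \boldsymbol{z}^{\natural})
  \le (1 - c\,\eta)\,\operatorname{dist}(\boldsymbol{z}^{t}, \boldsymbol{z}^{\natural}),
  \qquad 0 < c\,\eta < 1.
% Reaching \epsilon-accuracy therefore requires
(1 - c\eta)^{T} \le \epsilon
  \;\Longleftrightarrow\;
  T \ge \frac{\log(1/\epsilon)}{\log\!\big(1/(1 - c\eta)\big)}
    \approx \frac{1}{c\eta}\,\log\frac{1}{\epsilon}.
% Hence a constant step size \eta \asymp 1 gives T = O(\log(1/\epsilon)),
% whereas a conservative \eta \asymp 1/m forces T = O(m \log(1/\epsilon)).
```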

For further illustrations, we plot in Fig. 1 the two incoherence measures of the gradient iterates, i.e., the incoherence with respect to the $\boldsymbol{b}_j$'s and with respect to the $\boldsymbol{a}_{ij}$'s, versus the iteration count. We observe that both incoherence measures remain bounded by befitting values for all iterations.

IV Trajectory Analysis for Blind Demixing

In this section, we prove the main theorem via trajectory analysis of the Wirtinger flow algorithm for blind demixing. We shall reveal that the iterates of Wirtinger flow, i.e., Algorithm 1, stay in the region of incoherence and contraction by exploiting the local geometry of blind demixing. The steps for proving Theorem 1 are summarized as follows.

  • Characterizing local geometry in the region of incoherence and contraction (RIC). We first characterize a region, i.e., the RIC, where the objective function enjoys restricted strong convexity and smoothness near the ground truth $\boldsymbol{z}^{\natural}$. Moreover, any point in this region satisfies the error contraction and the incoherence conditions. This will be established in Lemma 1. Provided that all the iterates of Algorithm 1 are in this region, the convergence rate of the algorithm can be further established, according to Lemma 2.

  • Constructing the auxiliary sequences via the leave-one-out approach. To justify that the Wirtinger flow algorithm enforces the iterates to stay within the RIC, we introduce the leave-one-out sequences (see the sketch after this list). Specifically, the leave-one-out sequences, denoted by $\{\boldsymbol{h}_i^{t,(l)}, \boldsymbol{x}_i^{t,(l)}\}$ for each $1 \le l \le m$, are obtained by removing the $l$-th measurement from the objective function $f$. Hence, $\boldsymbol{h}_i^{t,(l)}$ and $\boldsymbol{x}_i^{t,(l)}$ are independent of $\boldsymbol{b}_l$ and $\boldsymbol{a}_{il}$, respectively.

  • Establishing the incoherence condition via induction. In this step, we employ the auxiliary sequences to establish the incoherence condition via induction. That is, as long as the current iterate stays within the RIC, the next iterate also remains in the RIC.

    • Concentration between original and auxiliary sequences. The gap between the original and the leave-one-out sequences is established in Lemma 3 by employing the restricted strong convexity of the objective function in the RIC.

    • Incoherence condition of auxiliary sequences. Based on the fact that the original and the leave-one-out sequences are sufficiently close, we can instead bound the incoherence of the leave-one-out iterates with respect to the removed design vectors, which turns out to be much easier due to the statistical independence between them.

    • Establishing iterates in the RIC. By combining the above bounds via the triangle inequality, we arrive at the desired incoherence condition. Based on similar arguments, the other incoherence condition will be established in Lemma 4.

    • Establishing the initial point in the RIC. Lemma 5, Lemma 6 and Lemma 7 are integrated to justify that the spectral initialization point is in the RIC.
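The leave-one-out construction referenced above is mechanical: drop one measurement and rerun the same procedure on the reduced data. A sketch, reusing the hypothetical helpers from Sections II and III-A:

```python
import numpy as np

def leave_one_out(B, A, y, l):
    """Data for the leave-one-out objective f^{(l)}: the l-th measurement
    (row b_l^*, vectors a_{il}, observation y_l) is removed."""
    keep = np.arange(B.shape[0]) != l
    return B[keep], A[:, keep], y[keep]

# The auxiliary sequence z^{t,(l)} runs Algorithm 1 on the reduced data,
# and is thus statistically independent of b_l and the a_{il}'s:
# B_l, A_l, y_l = leave_one_out(B, A, y, l=0)
# h_loo, x_loo = wirtinger_flow(B_l, A_l, y_l, s=2, eta=0.1, T=300)
```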

IV-A Characterizing Local Geometry in the Region of Incoherence and Contraction

We first introduce the notion of the Wirtinger Hessian. Specifically, let $\overline{\boldsymbol{A}}$ denote the entry-wise conjugate of a matrix $\boldsymbol{A}$ and let $f_{\mathrm{clean}}$ denote the objective function in the noiseless case. The Wirtinger Hessian of $f_{\mathrm{clean}}$ with respect to $\boldsymbol{z}_i$ can be written as

(8)

The Wirtinger Hessian of $f_{\mathrm{clean}}$ with respect to $\boldsymbol{z}$ is thus represented as a block diagonal matrix whose diagonal blocks are the per-component Hessians above. Please refer to Appendix C for more details on the Wirtinger Hessian. In addition, we say $\tilde{\boldsymbol{z}}$ is aligned with $\boldsymbol{z}^{\natural}$ if the following condition is satisfied

(9)

Let $\|\boldsymbol{A}\|$ denote the spectral norm of a matrix $\boldsymbol{A}$. We have the following lemma.

Lemma 1.

(Restricted strong convexity and smoothness for the blind demixing problem). Let $\delta > 0$ be a sufficiently small constant. If the number of measurements $m$ is sufficiently large, then with high probability the Wirtinger Hessian obeys

(10)

simultaneously for all points $\boldsymbol{z}$ that satisfy

(11a)
(11b)
(11c)

where $\tilde{\boldsymbol{z}}$ is aligned with $\boldsymbol{z}^{\natural}$, and the conditions hold for all $1 \le i \le s$ and $1 \le j \le m$. Therein, the involved parameters are numerical constants.

Proof.

Please refer to Appendix B for details. ∎

Conditions (11a)-(11c) identify the local geometry of blind demixing in the noiseless scenario. Specifically, (11a) identifies a neighborhood that is close to the ground truth in $\ell_2$-norm. In addition, (11b) and (11c) specify the incoherence region with respect to the vectors $\boldsymbol{b}_j$ and $\boldsymbol{a}_{ij}$ for $1 \le i \le s$ and $1 \le j \le m$, respectively. This lemma paves the way to the proofs of Lemma 2 and Lemma 3. Specifically, the quantities of interest in these lemmas are decomposed into a part associated with the noiseless measurements and a part associated with the noise, such that Lemma 9 can be exploited to bound the first part.

Based on the local geometry in the region of incoherence and contraction, we further establish the contraction of the error measured by the distance function (3).

Lemma 2.

Suppose the number of measurements $m$ is sufficiently large and the step size $\eta > 0$ is some sufficiently small constant. Then with high probability, the distance to the ground truth contracts at the $(t+1)$-th iteration, provided that

(12a)
(12b)
(12c)
hold for some constants and a sufficiently small constant $\delta$.
Proof.

Please refer to Appendix E for details. ∎

Remark 1.

The key idea of proving Lemma 2 is to decompose the gradient (4) into the part of the pure gradient and the part relative to the noise. The pure gradient is the quantity required in Lemma 1.

As a result, if the iterate satisfies condition (12) for all $0 \le t \le T$, then there is

(13)

with high probability. In the absence of noise ($\sigma = 0$), exact recovery can be established, and the strict contraction yields a linear convergence rate. In addition, stable recovery can be achieved in the presence of noise, where the estimation error is controlled by the noise level.

IV-B Establishing Iterates in the Region of Incoherence and Contraction

In this subsection, we demonstrate that the iterates of the Wirtinger flow algorithm stay within the region of incoherence and contraction. In particular, the leave-one-out argument is introduced to address the statistical dependence between the iterates and the design vectors. Recall that the leave-one-out sequences are defined in the recipe for proving Theorem 1. For simplicity, we denote the leave-one-out iterate by $\boldsymbol{z}^{t,(l)}$, where $\boldsymbol{z}_i^{t,(l)} = [(\boldsymbol{h}_i^{t,(l)})^{\mathsf{T}}, (\boldsymbol{x}_i^{t,(l)})^{\mathsf{T}}]^{\mathsf{T}}$ for $1 \le i \le s$ and $1 \le l \le m$. We further define the alignment parameters and the aligned signals in the context of the leave-one-out sequences.

We continue the proof by induction. For brevity, for the $t$-th iterate with $0 \le t \le T$, the set of induction hypotheses on the local geometry is listed as follows:

(14a)
(14b)
(14c)
(14d)

where certain constants are sufficiently small, while the others are sufficiently large. In particular, (14a) and (14b) can also be represented with respect to the leave-one-out sequences:

(15a)
(15b)

for $1 \le l \le m$. We aim to specify that the induction hypotheses (14) hold for the $(t+1)$-th iteration with high probability, if these hypotheses hold up to the $t$-th iteration. Since (14a) has been identified in (12a), we begin with the hypothesis (14b) in the following lemma.

Lemma 3.

Suppose the number of measurements $m$ is sufficiently large and the step size $\eta > 0$ is some sufficiently small constant. Under the hypotheses (14) for the $t$-th iteration, the bound (14b) continues to hold for the $(t+1)$-th iteration with high probability.

Proof.

Please refer to Appendix F for details. ∎

Remark 2.

The key idea of proving Lemma 3 is similar to that of Lemma 2: decompose the gradient (4) in the update rule into the part of the pure gradient and the part relative to the noise. Combining Lemma 1 and Lemma 10, we finish the proof.

Before proceeding to the hypothesis (14c), let us first show the incoherence of the leave-one-out iterates with respect to $\boldsymbol{a}_{il}$ for all $1 \le i \le s$ and $1 \le l \le m$. Based on the triangle inequality, one has

(16)

where (i) arises from Lemma 2 and Lemma 3, and (ii) holds as long as the number of measurements $m$ is sufficiently large. Using the inequality (16), the standard Gaussian concentration inequality in [10] and the statistical independence, it follows that