Minimax Optimal Sparse Signal Recovery with Poisson Statistics

01/21/2015 ∙ by Mohammad H. Rohban, et al. ∙ 0

We are motivated by problems that arise in a number of applications such as Online Marketing and Explosives detection, where the observations are usually modeled using Poisson statistics. We model each observation as a Poisson random variable whose mean is a sparse linear superposition of known patterns. Unlike many conventional problems observations here are not identically distributed since they are associated with different sensing modalities. We analyze the performance of a Maximum Likelihood (ML) decoder, which for our Poisson setting involves a non-linear optimization but yet is computationally tractable. We derive fundamental sample complexity bounds for sparse recovery when the measurements are contaminated with Poisson noise. In contrast to the least-squares linear regression setting with Gaussian noise, we observe that in addition to sparsity, the scale of the parameters also fundamentally impacts ℓ_2 error in the Poisson setting. We show tightness of our upper bounds both theoretically and experimentally. In particular, we derive a minimax matching lower bound on the mean-squared error and show that our constrained ML decoder is minimax optimal for this regime.



There are no comments yet.


page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In this paper, we study sparse signal recovery problem with Poisson observations. This problem is motivated by many practical applications where the observations are the counts of an event. The mean count in these applications depends linearly on a sparse subset of parameters. Our goal is to extract the sparse subset from a potentially large number of parameters. Some of the practical applications motivating our problem include explosive identification based on photon counts in fluoroscopy [1], and eMarketing based on website traffic [2].

Motivated by these applications we propose a high-dimensional sparse signal estimation problem governed by Poisson distributed observations. We model the rate of the Poisson process as a positive mixture of known signatures. The rates for different observations are obtained from heterogeneous sensors or different measurement settings and therefore the observations are not identically distributed. Specifically, we model observations

as follows.

where is the rate of the background Poisson noise and assumed to be known. Each

is a distinct vector corresponding to the

-th sensor. The collection of these vectors form the sensing matrix, . Our goal is to recover the sparse vector, , from . The parameters are the mixing proportions for different sensors.

Our high-dimensional setting entails and the ground truth mixture parameter is assumed to be sparse with norm . We derive upper and lower bounds for sparse signal recovery for this Poisson setting. Our upper bounds are based on analyzing the performance of an constrained maximum-likelihood estimator. The optimal ML decoder is obtained by solving a convex optimization problem involving a non-linear objective function. We then analyze performance of minimax estimators, namely, optimal estimators for worst-case mean-squared error to obtain lower bounds. These lower bounds are derived by means of Fano bounds and are obtained by embedding the estimation problem into a detection setting by utilizing Gilbert-Varshamov bound [3]. Specifically, for a given sparsity level , and a parameter amplitude we show that the upper bound for the estimation error scales as . We then obtain a matching lower bound thus demonstrating the minimax optimality of our constrained ML estimator.

Dependence on Scale: The dependence of mean-squared error on scale parameter

does not arise in conventional sparse linear least squares regression setting. Indeed sample complexity is primarily determined by sparsity for many random matrix designs and sample complexity improves with scale of the ground truth parameter

for a fixed level of noise [4]. In contrast, in our problem setting, sample complexity degrades with the scale of the parameter vector. Intuitively this results from the fact that the curvature of the log likelihood function decreases with scale of the parameter vector. Indeed, for large values of , partial changes in the parameter vector translate to significantly smaller changes of the likelihood function resulting in lack of identifiability. Consequently, unlike the conventional case we inevitably have to suffer the effects of scale for the Poisson case.

Another aspect of the problem is the high dimensional sparse setting, which leads to fundamental issues such as the Hessian being singular [5, 6]. To overcome these issues we follow along the lines of sparse linear regression and consider optimizing the Poisson likelihood function under constraints. These constraints have the effect of constraining the error patterns to a cone of feasible directions (descent cone):


As a result it turns out that we need to ensure that the loss function is “well-behaved” only on the feasible cone. A recent important development 

[7] in this context is to impose strong convexity of the loss function on the feasible cone. Given suitable assumptions on the design matrix , this requirement is generally satisfied for many loss functions including least-squares losses.

Unfortunately as it turns out, our Poisson case does not lead to strongly convex losses. Strong-convexity of the loss function amounts to the assumption that we see a non-trivial change in the loss function as a result of underlying parameter variation regardless of the ground truth . This requirement can be viewed as a need for the curvature of the loss function to be non-vanishing on the cone. In the Poisson case, the perturbation in the loss function behaves linearly in large regimes (i.e. the curvature vanishes in the limit) and so the loss function is no longer strongly convex on the cone. This issue motivates us to incorporate signal strength in the definition of plausible error patterns . We then characterize the change in loss function at various amplitudes and sparsity levels in terms of the changes in the parameter vector (). We show that the loss function is strongly convex on the newly defined set of plausible error directions, as long as the elements of are bounded and satisfies the so-called

Restricted Eigenvalue (RE)

condition [7].

A wide variety of sensing matrices ranging from deterministic to random designs can be shown to satisfy RE condition. In particular our results apply to both deterministic and random sensing matrices and we present several results for both cases. We also conduct several synthetic and real-world experiments and demonstrate the tightness of the oracle bounds on error as well as the efficacy of our method. Specifically, it has been suggested in the literature that LASSO can handle exponential family noise such as that arises in our application [8]. It turns out that constrained ML estimator uniformly outperforms LASSO and has significantly superior performance in many interesting regimes.

The paper is organized as follows: In Section II, we introduce the notation and state our sparse estimation problem. Section III describes our theoretical results on the convergence of the regularized ML decoder. The numerical results for different interesting scenarios are demonstrated in Section IV. Finally, the detailed proof of the main theorems and lemmas are provided in Section VI.

I-a Related Work

Parameter estimation for non-identical Poisson distributions has been studied in the context of Generalized Linear Models (GLMs). However, our model is inherently different from the exponential family of GLM models that has been studied in [7, 9, 10, 11]. In particular the GLM model corresponding to the Poisson distributed data studied in the literature has the following form:

Therefore, the log likelihood takes the following form:

In contrast, in the setting we are interested in, the observations are modeled as follows:

and the log likelihood function has the form:

There are several important differences between the two models. We observe that imposing sparsity on in Model I corresponds to smaller number of multiplicative terms in the Poisson rates. On the other hand, being sparse in Model II results in fewer number of additive terms in the Poisson rates of the corresponding model. At a more fundamental level the loss function (negative log-likelihood) for Model I has an exponential term (). The assumptions of strong convexity on the feasible cone could be proved if elements of are independent sub-gaussian draws [7]. Consequently, unlike our case, the issue of signal amplitude no longer arises for this model.

We can view model I as an instance of a general class of sparse recovery problems. Indeed, [10] studies the convergence behavior of regularized ML estimation for exponential family distributions and GLM in this context. The bounds on error for sparse recovery of the parameter are based on the RE condition. Moreover, in order to get useful bounds on estimation error of GLM, they additionally need the natural sufficient statistic of the exponential family to be sub-gaussian. This condition could clearly be violated in our setting where the data is Poisson distributed and there is no constraint on the sensing matrix to be sub-gaussian.

More generally, [7] describes a unified framework for analysis of regularized estimators in high dimensions. They also mention extension of their framework to GLMs and describe “strong convexity” of the objective function on as a sufficient condition to obtain consistency of M-estimators under Model I. As described earlier, this requirement of strong convexity on is not consistent with our model. In addition the statistical aspects in that work requires that the components of the sensing matrix be characterized by

sub-gaussian distributions

, which we do not require here.

Statistical guarantees for sparse recovery in settings similar to model II have been provided in [12, 13, 14] in the context of photon limited measurements. They assume that the observations are distributed as follows

where elements of the signal and sensing matrix are positive, and the sensing matrix satisfies the so-called Flux Preserving assumption:

The latter assumption arises in some photon counting applications, like imaging under Poisson noise, where the total number of expected measured photons cannot be larger than the intensity of the original signal. The upper bound on reconstruction error of the constrained ML estimator is given in the paper [13]. Surprisingly, the upper bound scales linearly with the number of measurements. However, this sounds reasonable under the Flux Preserving assumption. In fact this behavior is due to the fact that for a fixed signal intensity, more measurements lead to lower SNR for each observation. As a result, unlike conventional compressive sensing bounds, the estimates do not converge to the ground truth with increasing the sample size. Nevertheless, Flux Preserving constraint does not arise in our setting and consequently the application and methods of analysis are different.

In summary the fundamental differences in the underlying model as well as the different assumptions in the sensing matrices from the previous work warrants new analysis techniques which is the subject of this paper.

I-B Applications

In the sequel, we will introduce two applications, which motivate the model described earlier:

1. Explosive Identification: In the explosive identification example, an unknown mixture of explosives is exposed to different fluorophores. The goal is then to estimate the mixture components based on the observed fluorophores photon counts. Here and could be considered as the photon counts and background emission rates for fluorophore . is measured before the exposure of the explosive mixture and is known. represents a quenching effect of explosive on fluorophore , and is the weight of explosive in the mixture [1].

2. eMarketing: In the eMarketing example [2], weekly traffic of different websites within the same market are measured. The traffic of site , denoted by , is assumed to be affected by the number of bought links to it on different advertisement websites. Each advertisement website is also assumed to contribute to the visiting rate by a fixed weight . These weights could be viewed as measures of popularity/dominance of advertisement website . Moreover, is the average traffic that visits website directly (not through intermediate advertisement website) and is acquired through online statistics of the website. is the number of backward links that the business website has bought from the advertisement website . Estimation of ’s can lead to discovery of dominant advertisement websites in some industry. Fig. 1 illustrates this eMarketing model.

Fig. 1: Illustration of eMarketing model. Node on the left are the business websites, Nodes on the right are the advertisement websites. A connecting edge, , is the number of backward links purchased by the business website from the advertisement website .

Ii Definitions and Problem Setup

Problem Definition: Let , with and . Furthermore, and . The main objective of this paper is to analyze the error of the maximum likelihood estimation of given and .

Likelihood Function: Let the objective function be the negative log likelihood of the data given the unknown parameter :

We will let , , (the harmonic average of the rates), , and throughout the paper.

The guarantees on the error of the maximum likelihood estimator relies on an assumption on the sensing matrix known as Restricted Eigenvalue Condition:

Definition 1.

Restricted Eigenvalue condition: A matrix satisfies RE, if there exists a constant , such that for any set of indices satisfying , and vector

we have:


where is the restriction of the vector to the indices in , and .

Iii Main Results

Iii-a Upper Bound on Error

We will analyze the performance of constrained Max-Likelihood estimator to obtain upper bounds. Specifically, we consider the following estimator:

Theorem 1.

Supose satisfies RE, and its elements are positive and bounded by

. Then, with probability at least

, the following error bound holds:

where .

Mean Squared Error: Note that since the rates are uniformly bounded in terms of the scale parameter and the error probability can be made arbitrarily small, we can obtain an upper bound for the mean squared error (MSE) in a straightforward manner. The MSE upper bounds has an identical scaling with respect to sparsity, scale and the number of measurements.

Upper Bound for Special Cases: Consider the case where base rates are constant and so . For this scenario we have with probability that,

In addition if we also have for a constant

, so that the harmonic mean

and , then, it follows that the error behaves as

Other Distributions

: The techniques in the proof could be used to extend this result to other heavy tailed distributions such as Gamma distribution, when the scale parameter is an affine function of

. Similar to the Poisson case, the curvature of the loss function associated with the log-likelihood of Gamma distribution vanishes with scale .

Additional Linear Constraints: The proof could be easily modified when there are additional linear constraints on the values. In that case, the constraints should be included in the set . Hence, the results can be generalized to the case where , so long as the Poisson rates are constrained to be positive.

Brief Proof Outline: The main idea of our approach is that we can show that the error can be bounded by if no feasible perturbation of magnitude around the true parameter value can lead to a decrease in the objective function . Furthermore, if is strongly convex around , an appropriate upper bound on makes the change in the objective function positive for all plausible error patterns. Finally, the upper bound on also yields an upper bound on the error. As the loss function associated with Poisson distributed data is not strongly convex on , we replace the cone (Eq. (1)) by another set , which depends on . This leads to a loss function that is well-behaved and subsequently we can derive our results.

Iii-B Sparsity Scaling

Theorem 1 describes an upper bound for estimation error for matrices belonging to RE. Since in turn depends on sparsity , the dependence of estimation error on sparsity is implicit. One difficulty in making this dependence explicit is that RE condition for matrices that are positive and bounded has not been extensively studied. The best known bound is due to [15] for positive and bounded matrices, who shows matrix constructions with satisfying RE condition with a constant that is independent of . This in turn implies that for positive bounded matrices it is sufficient to have measurements for our estimation error bounds to hold.

Nevertheless, this introduces a seemingly large gap between sparse linear regression results where the sample complexity grows linearly with sparsity and our corresponding Poisson model. It is unclear whether this gap is due to matrix construction or our proof bounding technique. To bridge this gap we consider a different matrix construction and a variant of our optimization problem111We thank the anonymous reviewer for suggesting this scheme..

In particular, we construct our matrix based on another elementary matrix . The components of are i.i.d. instantiations of a bounded sub-gaussian random variable, whose minimum and maximum values are denoted as and , respectively. The main reason for this construction is that we can now obtain designs with linear scaling, namely, for it is possible to construct matrices with a constant factor bound [16] in Eq. 2. We emphasize that it is important here for to take both positive and negative values for this to hold.

We now construct matrices based on , namely,


where denotes an all one matrix of size . Finally, we consider a variant of our constrained problem:


The main difference between Eq. 3 and Eq. 5 is that the constraint is replaced by . While this equality constraint is somewhat more restrictive it allows us the flexibility of utilizing matrices with negative components. Furthermore, it bridges the gap between the sparse linear regression setting and this problem as shown in the following corollary.

Corollary 1.

Let , , and be constructed according to Eq. (4), then the estimator , of Eq. 5 satisfies the following bound:

with probability of at least , where , , , and are positive universal constants, and satisfies the conditions of Theorem 1.

Iii-C Minimax Mean Squared Error Lower Bound

In this section, we derive a lower bound for the mean-squared error and show that the upper bounds obtained in the previous section match our lower bound. Our proof technique is based on a generalization of Fano’s inequality [17]. In particular, we show that no estimation algorithm can have a worst case mean squared error (MSE) less than :

Theorem 2.

Let and be the maximum eigenvalue of , and minimum element of , respectively. Furthermore, let be the set of feasible parameter values. If and , the minimax error rate is bounded as follows:

where is an arbitrary estimation algorithm, and and are two constants.

Matching Lower Bound: The error bound in Theorem 2 matches (order-wise) the upper bound in Theorem 1 as long as . This establishes the fact that constrained ML decoder is minimax optimal in the regime where the Poisson rates vary within a constant factor.

Brief Proof Outline: Let be a subset of of size . We consider as a set of hypotheses. It follows from Fano’s bound [17] that the worst case MSE of any estimator is lower bounded by , where is the hypothesis test error probability and is the minimum distance of any two members of . We are now left to construct an appropriate hypothesis subset. We utilize the Gilbert-Varshamov bound [3] in this context to construct a subset such that is lower bounded by for . Furthermore, can be lower bounded through the Fano bound by ensuring that KL-divergence of any two observation distributions with parameters in is no more than . The detailed proof is provided in the Appendix, Sec. VI-D.

Fig. 2: Linear scaling of average w.r.t signal intensity for three different random constructions of .

Iv Numerical Results

Iv-a Poisson Rates Harmonic Mean

In this section, we consider three different random matrice constructions and verify whether scales with . This linear scaling together with the bounds derived in Theorem 1 and Corollary 1 imply that the error matches the minimax rate up to a factor. Specifically, we consider i.i.d. copies of three different distributions to construct

. Namely, uniform distribution on

, Beta distribution with

and , and a random variable defined as



is a normally distributed random variable with

, . We take the average of over Monte Carlo runs. In each run, and are selected at random. Elements of on the support are selected from the uniform distribution on , and the support is equiprobable across all possible supports. The average value of harmonic mean is plotted against in Fig. 2. This verifies the linear scaling of w.r.t .

Iv-B Tightness of the Error Bounds

Theorem 2 characterizes the minimax optimality of constrained ML decoder in a certain regime where rates vary within a constant factor. On the other hand the upper bound in Theorem 1 and Corollary 1 scales as . As noted in Fig. 2 the Harmonic mean, , generally scales linearly with . Consequently, the upper bound matches our worst-case lower bound. Nevertheless, it is still possible that our upper bounds for ML constrained decoder is conservative for the average case. In this experiment we attempt to show that the upper bounds are indeed tight for the average case as well.

Fig. 3: error vs. scaled derived error upper bound. This demonstrates that actual error lies within two linear scalings of the derived bound with significant probability, which in turn shows that derived bound is tight. Entries of are drawn from the uniform distribution over (left) and an alternative distribution defined in Eq. (6) (right).

These reasons motivate us to perform experiments with synthetic data to verify the tightness of the derived bound in Theorem 1. Elements of are drawn from the uniform distribution over the unit interval. We also consider realizations where the elements are i.i.d. instantiations of a suitably truncated Gaussian random variable defined in Eq. (6).

We set the dimensionality , sparsity level , sample size , signal strength , and the base rate . We fix the direction of throughout the experiment and change the scale parameter . We also fix throughout all the experiments. Therefore, the uncertainty in drawing Poisson distributed data is the only uncertainty across all the experiments. The experiment is repeated (for the same values of and ) for times. Notice that for the truncated normal distribution, we set some components of design matrix to zero.

Recorded points in all the experiments are then plotted. We use the interior-point optimization algorithm implemented in MATLAB with the termination tolerance value of on the objective function. The termination tolerance on the variables is also set to . The gradient of the objective function is also given to the optimization algorithm initially.

In each run of the experiment, the norm of the error is recorded. The error is plotted against in Fig. 3. The hypothetical dashed lines passing through the origin shows that the error scales as with high probability. This suggests that our upper bounds are tight even for the average case. The estimation error is also a function of sparsity level . We next experiment with varying levels of sparsity, , while fixing the scale . These results are consistent with our bounds. This is not surprising because both our lower and upper bounds suggest linear scaling of squared error with . This has been illustrated in Fig. 4 for the case that . We do not consider experiments in which dimensionality is changed, simply because the derived bounds do not explicitly rely on as long as RE condition is satisfied. This happens for random constructions, when is large compared to and , i.e. .

Fig. 4: error changes with when . This is consistent with the upper and lower bounds in Theorems 1 and 2.

Iv-C Gaussian vs. Poisson Model

In this section, we compare the norm error of he constrained maximum likelihood decoder, by using a Gaussian likelihood function even though the observations are generated using a Poisson model. We then compare the resulting estimation error with the estimation error obtained using a Poisson likelihood function. Elements of are generated according to the Beta distribution with and , and are sampled independently during 500 Monte Carlo runs. During each run a weight vector is chosen at random, and Poisson observations are made based on the model described in Sec. II. We set and change from to or alternatively fix and change from to . We plot the average of ratio of error in the two cases in Fig. 5. This figure shows that Poisson model outperforms the Gaussian one especially when both is large and

is small. This could be partly explained by the fact that under this condition, the variance of observations is small relative to the mean. Hence, if the background noise is small, each observation would convey more information about

, making performance of the true model significantly better than the conventional Gaussian one.

Fig. 5: errors of Maximum Likelihood estimation assuming a Gaussian observation model vs. assuming the true Poisson distribution. The ratio of errors in two cases across different signal amplitudes (left) shows that the true Poisson model performs constantly better than the Gaussian one when signal amplitude is large. The same plot across various values (right) shows that the difference between two models is significant when the background noise is small. These two plots together suggest that under high SNR values there is a significant difference in the performances of the two models.

Iv-D Rescaled LASSO vs. Regularized ML

Parameter estimation based on LASSO for the Poisson setting has been studied in [8]. The idea is to view the problem as an additive noise problem, where noise belongs to an exponential family of distributions. Alternatively, in [8] the problem is viewed as an additive Gaussian noise problem with noise variance being equal to its mean to mimic “Poisson like” behavior. This results in a rescaled version of LASSO, which is then used to estimate model parameters. This amounts to scaling the loss function associated with each observation by the corresponding mean value (or equivalently the variance).

This approach motivates us to compare the regularized ML method against rescaled LASSO for poisson distributed data. This will highlight the essence of using regularized ML instead of rescaled LASSO for our setting. In this section, we will demonstrate that our regularized ML outperforms rescaled LASSO in several regimes including low SNR, high dimensions, and moderate to low sparsity levels. We use probability of successful support recovery as a measure of performance. We use this measure because of the scale invariance property of this measure. It avoids the issue of signal strength in impacting the performance metric and obviates the need to normalize the measure with respect to , as we did in the last section.

To compare the performance of regularized ML and rescaled LASSO, we first generate a random sensing matrix where each element is an independent truncated Gaussian random variable. We then generate a random . To recover , we draw Poisson distributed samples with coefficients specified in as:

We first solve the non linear optimization where is constrained to satisfy .

For the purpose of comparison, we also compute the rescaled LASSO estimator:

We then threshold the solution by zeroing out components of and below a pre-defined small threshold . We average the estimation performance over 100 Monte Carlo runs. The performance of the two methods are compared in Fig. 6 and Fig. 7. The results are compared for different number of observations , and different sparsity levels , respectively.

In Fig. 6, we compare the performance of regularized ML estimation with that of rescaled LASSO as a function of . We fix , , , and . At each iteration, we estimate based on observations where varies from to . We compare the performance of the two approaches based on probability of successful recovery of the support set. This error is 0 if the thresholded support set of the estimation is equal to that of the ground truth and 1 otherwise. We average this measure over 100 samples of for a fixed .

Fig. 6: Probability of successful support recovery as a function of for , , , , and Monte Carlo loops. This figure illustrates twice faster convergence of probability of success with respect to number of observations for Regularized ML in comparison to Rescaled LASSO.

In Fig. 7, we compare the performance of regularized ML estimation with that of rescaled LASSO for different sparsity levels, . This time, we fix , , , and . For each , we generate 100 samples of -sparse ’s and recover them from observations. Since for all values of , we threshold each element of by , to obtain their sparse support set. We measure the performance of the two estimators based on average probability of successful recovery of the thresholded support set for each value of . Notice that the error bars in Fig. 6 and Fig. 7 indicate that the difference between the methods is indeed statistically significant.

Fig. 7: Probability of successful support recovery as a function of for , , , and Monte Carlo loops. As this figure suggests, probability of successfull recovery drops faster when the sparsity level increases for Rescaled LASSO in comparison to Regularized ML. This shows robustness of ML approach to model parameters.

In Fig. 8, we compare the result of regularized ML estimation with that of rescaled LASSO in terms of the ROC curves. In our ROC curve, we plot the average number of true detections against the average number of false alarms. True detections are indices that are common in the thresholded estimated support set and that of the Ground Truth, whereas, false alarms are the indices in the thresholded estimated support set that are not included in the support set of the Ground Truth. This time, we fix , , , , and . We fix a sensing matrix , and generate 100 random ’s. By applying different thresholds to we obtain the different points in the ROC plot. We average Probability of Detection (PD) and Probability of False alarm (PF) over Monte Carlo runs.

Fig. 8: ROC curve for Regularized ML and Rescaled LASSO for , , , , and Monte Carlo loops. This figure illustrates the superiority of Regularized ML in comparison to Rescaled LASSO for parameter estimation under Poisson models.

Iv-E Explosive Identification

In this experiment, we first measure the light intensities of different fluorophores before and after separate exposures to a unit weight of different explosives. The intensities are measured by counting the number of photons received at each photo-sensor. Each explosive has a unique quenching effect in the fluorescence property of each fluorophore , which we denote by . In the experimental setting, is the before exposure intensity for fluorophore and is estimated by averaging the before exposure photon counts from multiple experiments. Therefore, ’s can be assumed to be known. We model the after exposure intensity of -th explosive in sensor as :

In the next step, fluorophores are exposed to an unknown mixture of these explosives. The goal is to recover which and how much of each explosive is contained in that mixture.

The physics of the problem suggests that when the fluorophore is exposed to a mixture of explosives, the quenching effects are additive in the regime where the mixture weights are small [1]. Therefore, our observations are best modeled by a Poisson distribution with additive rate model for each fluorophore:

where is the amount of the explosive in the mixture. We solve this problem through Regularized ML and Rescaled LASSO and compare the results.

In this problem, matrix , the responses of fluorophores to basic explosives is given. Based on this given data and our additive model for mixtures, we generated 10 mixtures by combining up to 3 random explosives. We used Regularized ML and Rescaled LASSO to identify these mixtures through their effect on fluorescence property of our fluorophores. The result is shown in the form of a grid in Fig. 9. In this grid, rows are different mixtures and columns are different explosives. Dark squares indicate the absence (or negligible contribution), whereas lighter squares indicate higher amount of the corresponding explosive in the associated mixture.

Fig. 9: Sparse recovery results for . From left to right: Ground Truth, , ML estimate of , Rescaled LASSO estimate of . Columns: Basic explosives. Rows: Synthesized mixtures with basic explosives. Non-black squares at each row show the explosives that the corresponding mixture is composed of. Lighter colors show larger amounts.

Iv-F Internet Marketing Application

This application presents a different scenario where the ultimate objective is prediction and inference. These experiments are useful because they further our understanding of Poisson models in terms of the why/how and in what contexts Poisson models are more meaningful in comparison to Gaussian models for Poisson distributed data. While the goals here are outside the purview of the theory we have developed, it nevertheless serves the purpose of understanding sparse high-dimension Poisson models in the context of not only estimation but also for prediction.

We discuss the eMarketing application in this section, where we will illustrate Poisson model is more meaningful than the conventional Gaussian noise model. In this application, our goal is to identify the most effective advertisement websites that result in higher website traffic in the clothing market. Our assumption is that the website traffic is generated as a superposition of the traffic generated from current customers and the traffic from advertisement through backward links (links in advertisement websites that are linked to these business websites). In general, big business websites typically buy a total of 1000-1500 backward links from a number of advertisement websites. However, the hypothesis is that only a few of these advertisement websites are efficiently directing costumers. Our goal is to identify those dominant advertisement websites.

We model the number of daily visits, , by:

where models the current customers who visit the site directly and is obtained through online statistics of the website. Specifically, a long run average of traffic which is not referred by any advertisement website gives a reasonable estimate and so we can assume that ’s are actually known. This traffic could be logged and acquired through online statistics of the business websites.

Moreover, is the number of backwards link for the website in the advertisement website . Our model assumes that each of the backward links brings independent traffic to the website. Therefore, we used the Poisson distributed random variable described earlier to model the number of visits to a business website.

Our observations are the daily online visits to 50 top clothing brands. From the information provided in, we chose the top 150 advertisement websites for these brands along with the number of backward links for each website. Our goal is to recover the weight vector , where is a measure of dominance for advertisement website in clothing market. We recover via regularized ML and rescaled LASSO (weights smaller than 0.01 are theresholded to 0). The result is provided in Table 1. In this table, we illustrate the corresponding score for each popular website based on their dominance in advertising for clothing brands.

Backward link ML estimated weight
Amazon 0.32
Twitter 0.21
Pinterest 0.17
Google 0.15
Blogger 0.06
Bing 0.05
douban 0.01
tumblr 0.01
Backward link Lasso estimated weight
Amazon 0.35
Pinterest 0.17
Twitter 0.16
Google 0.16
Bing 0.13

TABLE I: Top backwardlist websites for clothing brands using Regularized ML and Rescaled LASSO

To compare the result of the two approaches mentioned above, we use the Bayes factor

[18] and predictive held-out log likelihood comparison mentioned in [19]. It should be mentioned that these tests are interpretable only when the number of parameters are comparable in the hypothesis models. In our problem, in fact, the two models have equal number of parameters.

Given a set of observed data

, and a model selection problem in which we have to choose between two models, Bayesian inference compares the plausibility of the two different models

and through a likelihood test:

When the parameters of models and are not known a priori, in Bayes factor test, we estimate them from and then use those estimations in computing the likelihood ratio. On the other hand, in predictive held-out log likelihood comparison, we divide the data into two groups. We estimate the model parameters for , and using the first group of data, and we compare the likelihoods for the second part of data given and specified by the first group.

Since Poisson is a PMF distribution on integers, to compare the two models using Bayes factor, we need to superimpose the Gaussian distribution on a histogram defined on integer valued ’s. For a Gaussian distribution characterized by , the value of the histogram at each integer valued is computed as:


It is easy to show that this histogram corresponds to a valid PMF. We denote this PMF by .

After this conversion, the Bayes factor as a function of sparsity level, , is calculated as:

where and are sparse thresholded approximations of and , respectively. The Bayes factor log curve as a function of sparsity is presented in Fig. 10.

Fig. 10: The Bayes Factor Ratio for regularized ML and rescaled LASSO. As we can see, higher Bayes Factor suggests that Poisson setting introduced earlier is a better model for this problem compared to the traditional mean squared loss.

To compute the predictive held-out log likelihood for each method, we first use 80% of the data (40 training data) to calculate an estimation of the parameters, and . We use and for each model to compute the log likelihood function for the remaining 20% of data (10 test data):

where stands for the tail probability of standard normal distribution here. Intuitively, the model that is closer to the ground truth results in higher log likelihood value. The log likelihood values for the two approaches are shown in Fig. 11. The large gap between the predictive log likelihood of the two models implies that Poisson is a better underlying model for this application.

Fig. 11: The Predictive Held-out Log Likelihood Comparison. The huge gap between the Predictive Held out Likelihood of Regularized ML and Rescaled LASSO implies that Poisson is the right model for this problem therefore, Regularized ML approach should be taken.

Iv-G Dynamics of Online Marketing

In the previous section, our results show that ML estimator and Poisson model outperforms LASSO approach for the problem of online marketing. Therefore, in this section, we apply ML method to estimate the weights, , for the advertisement websites over time.

Fig. 12: Dynamics of SEM and SMM over time for clothing market. This figure gives a quantitative comparison of the two most controversial methods of online marketing.
Backward link Estimated weights in 2013 Estimated weights in 2008 0.18 0.02 0.18 0.02 0.14 0.00 0.28 0.22 0.15 0.52 0.05 0.00 0.00 0.11

TABLE II: Top advertisement websites for clothing market in 2013 and 2008

A brief look at Table II shows how ’s have changed dramatically over time. To study this change closely, we estimated ’s for different advertisement websites from 2004 to 2013. We group the Social networks, such as,,, etc, together to study the effect of Social Media Marketing (SMM). We also group search engines, such as,,, etc, together to represent Search Engine Marketing (SEM). We add the scores of the corresponding websites in each group. Fig. 12 demonstrates the dynamics of SMM and SEM, the most controversial forms of online marketing, over time [2]. Although SEM has been thought to be the most powerful media marketing tool, recent empirical studies show the growing influence of SMM during the last couple of years [20]. The gigantic size of social media coupled with the relatively low cost per impression and the so called word of mouth have made SMM a powerful marketing tool. Our results confirm the significant influence of SMM relative to SEM since 2012.

V Conclusions

In this paper we proposed a high dimensional sparse estimation problem where the observations are governed by heterogeneous Poisson rates motivated by applications such as Online Marketing and Explosives detection. Unlike least-squares linear regression setting, we showed that the scale of the parameter has a significant effect on sample complexity. In particular we derived upper bounds by analyzing the performance of an constrained decoder and develop matching lower bounds for worst-case mean-squared error. We presented a number of synthetic and real-world experimental results to not only demonstrate the importance of scale in Poisson problems but also the superior performance of constrained ML over rescaled LASSO for a number of problems and metrics that are of practical interest.

Vi Appendix A

Vi-a Definitions

Definition 2.

Strong Convexity : A differentiable function is called Strong Convex around a point over a set of perturbations when the following holds for constants , and :

Vi-B Proof of Theorem 1

The proof technique is similar to the one in [21], with the exception that the allowed perturbation set now depends on . Let , and


For the sake of brevity, we will use to refer to this set in the rest of the paper. We first show that if for all , then . First note that for any , and , also belongs to :

But we have . Moreover all elements of are positive. Hence belongs to .

Next suppose . Note that . We will assume and show that for at least one element of , . Let . Based on the earlier argument, it is clear that . Furthermore, it belongs to . Next, as is convex, Jensen’s inequality could be used to show that :

The last part follows by noting that and as is the minimizer of .

Hence to bound the error by , we need to show that is positive all across . But, using Lemma 2, could be lower bounded as follows:

where and are defined in Lemma 2. We also have

where the last step has been derived using Lemma 4 and is defined therein. Putting all these together:


where in the last step we have used the fact that:

where . follows because for any member of , (check Lemma 1), and holds because for any vector , . Now using Eq. (9), it is easy to see if , . This completes the proof.

Lemma 1.

For any defined in Eq. (8), , where .


From the definition of in Eq. (8), we know:

Moreover, from for all , we have:


Lemma 2.

Let be the negative log likelihood of the Poisson model introduced in Sec. II, and . If satisfies , will be Strong Convex around over the perturbation set (defined in Eq. 8) for

with probability of at least , where .



and . Using some algebraic manipulations, could be written as follows:

Leting , it can be simplified to:

Assuming that satisfies RE, can be lower bounded for all using Lemma 3:

On the other hand, absolute value of could be shown to be upper bounded as:

Suppose . Then, can be upper bounded as follows:

The bound could be simplified to obtain:

Then, the first order Taylor expansion of the first term around zero yields: