1 Introduction
Variable selection is an important problem in statistics, machine learning and signal processing. One of the recent attractive topics regarding this study is compressed sensing
[3, 2, 4, 7], in which a sparse signal is reconstructed from (linear) measurements with smaller measurement number than the overall signal dimension. If we denote the measurement result as and the measurement matrix as , a naive formulation of compressed sensing can be written as(1) 
where denotes the norm giving the number of nonzero components and the sparse vector
becomes our estimator of the true signal
embedded in the data generation process. This formulation requires solving discrete optimization problems; therefore, it involves serious computational difficulty when and are large [18]. Due to this computational difficulty, certain approximations are needed to treat highdimensional datasets.Some representative approximations, such as orthogonal matching pursuit (OMP) [23, 6] and the iteratively reweighted least squares method (IRLS) [5], are available. OMP is an approximate algorithm to directly solve eq. (1) by incrementing the set of used columns of to approximate in a greedy manner, whereas IRLS first relaxes the constraint to an constraint, where , and then recursively solves the least squares problem with coefficients reweighted by the previous least squares solution; IRLS effectively solves eq. (1) with a small enough . Another more common method to find an approximate solution of eq. (1) is to relax to [25, 16, 1, 10]. This relaxation converts the problem into a convex optimization problem; hence, a unique solution can be efficiently obtained using versatile solvers. Even with this relaxation, it has been shown that a perfect reconstruction of the signal in the noiseless limit is possible in undersampling situations of a reasonable level [8, 12]. These findings have motivated using the relaxation (LASSO) for a wide range problems and inspired a search for better inference methods enabling a perfect reconstruction with fewer measurements than the case with a reasonable computational cost. A successful example of this search is based on the Bayesian framework [14]: it showed that a perfect reconstruction is possible with fewer measurements than the case with a sufficiently low computational cost when employing the socalled approximate message passing (AMP) algorithm. However, this impressive result is not perfect because prior knowledge about the generation of and is required in the Bayesian framework but is not always available in realistic settings, and also its performance is guaranteed only for a certain class of measurement matrices. Solvers have their own advantages and disadvantages, and the appropriate choice depends on the available resources and purpose of the signal processing. Therefore, it is useful to increase the options of such solvers.
Under these circumstances, some of the authors proposed a solver based on MonteCarlo (MC) sampling methods. The solver employs simulated annealing (SA) [13], which is a versatile metaheuristic for optimization, and its performance was examined numerically and analytically [22, 17, 21, 20]. The analytical result shows that a fairly wide parameter region exists in which the phase space structure is rather simple and a perfect reconstruction is possible; the existence limit of this region defines the algorithmic limit to achieve a perfect reconstruction and this limit is shown to be comparable with that of the Bayesian approach [22]. Therefore, the SAbased algorithm, which is applicable for any measurement matrices, can exhibit comparable performance with a Bayesian method even without prior knowledge of the generative processes of and , although its computational cost is higher than that of AMP.
The above result, including the larger computational cost of the SAbased algorithm and the simple structure of the phase space clarified by the analytical computation, inspires a further simplification of the MCbased algorithm, which is the goal of this study. In particular, we propose a greedy algorithm to solve eq. (1) based on MC sampling, and call it the greedy MonteCarlo search (GMC) algorithm. GMC can also be regarded as a version of the SAbased algorithm quenched to zero temperature. Below we examine the performance of GMC via numerical experiments. The result suggests that the GMC performance does not reach that of the SAbased method but is better than LASSO.
2 Formulation and algorithm
Let us suppose a data vector is generated from the measurement matrix and a –sparse signal via
(2) 
We aim to infer the signal from a given and for the underdetermined situation . We denote the estimated signal as . To solve this underdetermined problem, we assume that the number of nonzero components is smaller than or equal to , where .
The starting point of our inference framework is the least squares method. To apply this method to the undetermined situation, we explicitly specify the variables or columns of used to represent by a binary vector : if the th variable or column of is used, then is one; otherwise, it is zero. We call the sparse weight. For any –dimensional vector , the components whose corresponding sparse weights are unity (zero) are called active (inactive) components; we denote the vector of active components as , and that of inactive ones as (). Similarly, the submatrix of consisting only of the active components is denoted as . If is smaller than or equal to , the following least squares estimator is unambiguously defined:
(3a)  
(3b) 
The full dimensional form of this solution is denoted as . Using this least squares estimator, the original problem (1) can be represented as
(4)  
(5) 
Therefore, we can solve the problem (1
) by solving the combinatorial optimization problem (
5) with respect to . To solve this problem, the SAbased algorithm was proposed in [21]. Although the SAbased algorithm has been shown to be effective [22, 21, 20], herein we examine a further simplified approach. In particular, we propose a simpler algorithm that solves eq. (5) in a greedy manner using MC sampling. We call this algorithm GMC, and the details are given in the following subsection. For convenience, we introduce a meansquare error (MSE) when representing such that(6) 
and call this the output MSE. This plays the role of ‘energy’ in the MC update as shown below.
2.1 Gmc
The outline of GMC is as follows. Starting from a given initial configuration or state of , we update the state in an MC manner. We randomly generate a new state and update it as if and only if the output MSE or energy decreases. If the energy cannot be decreased by any single ‘local flip’, then the algorithm stops and returns the current state as the output.
Comprehensively, during the update, we require the number of nonzero components to remain constant. To efficiently generate a new state satisfying this requirement, we employ the ‘pair flipping’ of two sparse weights: one equals and the other equals . In particular, we randomly choose an index from and another index from and we set while keeping the other components unchanged. The pseudocode for pair flipping is given in Alg. 1.
We define one MC step (MCS) as pairflipping trials.
Pair flipping gives a concrete meaning to the ‘local flip’ mentioned above. Accordingly, we can introduce the stopping condition for GMC as follows. If the configuration of is invariant during MCSs, we examine all states accessible via a single pair flip from the current state and compute the associated energy values; if there is no lower energy state than the current one, then the algorithm stops; otherwise, the state is updated to the lowest energy state of the locally accessible ones and the usual update is continued. The pseudocode for the entire GMC procedure is given in Alg. 2.
Typically, we set in the following experiments.
If we denote the necessary MCSs until convergence as , the scaling of the total computational cost of GMC is , where the first term is for the search of locally accessible states and the last term is for the MC update; is the computational cost of the energy, which is estimated as when a naive method of matrix inversion is used but can be reduced to by using the technique mentioned in [21]. Certainly, for the scaling, the computational cost of GMC is in the same order as that of the SAbased algorithm; however, GMC is actually faster because it does not need the annealing procedure required in the SAbased algorithm, reducing the computational cost by a factor of . In the next section, we examine the actual behaviour of GMC using numerical experiments. The scaling of will also be examined.
3 Numerical experiments
Herein, the performance of the GMC algorithm is numerically examined. Both simulated and realworld datasets are used.
3.1 Simulated dataset
In this subsection, we examine the performance of GMC on simulated datasets, particularly focusing on whether a perfect reconstruction of is achieved. To more directly quantify the reconstruction accuracy of , we introduce another MSE in addition to :
(7) 
which is referred to as the input MSE hereafter.
Our simulated datasets are generated as follows. Each component of the design matrix is independent and identically distributed (i.i.d.) from ; the nonzero components of the –sparse signal are also i.i.d. from , where is the density of the nonzero components, setting the power of the signal to unity. The data vector is then obtained by eq. (2) given and . This setup is identical to that of the theoretical computation in [22]; thus, we can directly compare the results. We follow the limit of theoretical computation, where the thermodynamic limit is considered while maintaining the ratios in .
3.1.1 Result for the simulated dataset
Herein, we examine the perfect reconstruction ratio obtained by GMC when the sparsity is correctly specified . In particular, we prepare initial conditions with the correct sparsity for each sample of , run GMC from the initial conditions and compute the ratio achieving a perfect reconstruction of the true signal. We call this ratio the success rate (). The ensemble of for samples quantifies the performance of GMC and is therefore investigated below. For simplicity, we fix in the following experiments.
The average value of over is listed against the system size in Table 1.
100  200  400  800  1000  

0.56  0.54  0.52  0.67 0.07  0.75 0.04 
The examined system sizes are and the other parameters are fixed to
. The error bar is the standard error over the
samples. The result implies that a system size dependence is very weak or absent and GMC can stably find the true solution.Further, we investigate the ratio of samples exhibiting , , for a wide range of parameters . This allows us to capture the practical algorithmic limit of the perfect reconstruction by GMC. A heat map of for the overall region of at is plotted in the left panel of Fig. 1.
Herein, is computed for samples. The theoretically derived algorithmic limit is depicted by a white line, and a sharp change of the colour occurs around the line. This implies that GMC achieves the limit. To examine this point, we further investigate the system size dependence of . The result is indicated in the plot of against at in the right panel of the same figure. As grows, the curve of
becomes steeper, implying the presence of phase transition. Although the precise location of the transition point is unclear, it seems to be larger than the algorithmic limit (black dotted line) derived in
[22]. This unfortunately means that the GMC performance is worse than the SAbased method. Meanwhile, the transition point location seems to be still smaller than the reconstruction limit of the relaxation (black dashed line), implying that GMC outperforms LASSO and thus can be a reasonable solver.Finally, we list the necessary MCSs until convergence in Table 2.
Herein, the average value over is shown with the standard error. The result implies a linear increase in with respect to . Therefore, the total computational cost of GMC is scaled as . This is certainly not cheap; however, GMC can achieve a perfect reconstruction even for rather large values of , which is not possible using other greedy algorithms such as OMP. This again suggests that GMC is a reasonable choice for sparse linear regression.
3.2 Realworld dataset
Herein, we apply GMC to a dataset of Type Ia supernovae. Our dataset is a part of the data from [9, 24], which was screened using a certain criterion [26]. This dataset was recently treated using a number of sparse estimation techniques, and a set of important variables known to be empirically significant has been reproduced [20, 26, 19, 11]. In these studies, LASSO and
cases were treated and crossvalidation (CV) was employed for the hyperparameter estimation. We reanalyse this dataset using GMC. The parameters of the screened data are
and , and a standardization, i.e. centering and columns of and normalizing the columns of to be unit norm, was employed in the preprocessing.Table 2 shows the GMC result for the supernovae dataset. Herein, we changed the value of from to , and for each we examined ten different initial conditions and picked the lowest energy configuration among the tens as our final estimator . The result implies that the output MSE decreases as grows, and some variables such as “2” and “1” seem to be frequently selected.
1  2  3  4  5  

0.0312  0.0204  0.0179  0.0158  0.0139  
variables  {2}  {1,2}  {1,2,233}  {1,2,94,233}  {2,6,170,221,225} 
To scrutinize this result, we also computed the leaveoneout (LOO) crossvalidation (CV) error:
(8) 
where is the solution of eq. (3) given and “th LOO system ”: is the data vector whose th component is removed from the original one as , and is defined similarly. In our present case, the th LOO estimator is computed by running GMC for the th LOO system, and thus can be different for each LOO system.
The result of a typical single run of GMC is shown in Table 3. Here, the LOO CV error takes its minimum at , implying that the best model is obtained at this sparsity level. For further quantification of statistical correlations between variables, we count how many times each variable was selected in this LOO CV procedure, following the way of [20]. Table 4 summarizes the results for five variables from the top for –. This indicates that no variables other than “2” representing light width were chosen stably, whereas variable “1” representing color was selected with high frequencies for . Table 4 shows that the frequency of “1” being selected is significantly reduced for . This is presumably due to the strong statistical correlations between “1” and the newly added variables. In addition, the results for varied as we rerun GMC with different initial conditions. These observations mean that we could select at most only light width and color as the relevant variables. This conclusion is consistent with those of recent papers [20, 26, 19, 11] using a number of methods including the SAbased one. This provides additional evidence for the practicality of the GMC algorithm.
1  2  3  4  5  

0.0328  0.0239  0.0261  0.0302  0.0342 
K  

1  variables  2  *  *  *  * 
times selected  78  0  0  0  0  
2  variables  2  1  275  *  * 
times selected  78  77  1  0  0  
3  variables  2  1  233  14  15 
times selected  78  78  70  3  2  
4  variables  2  1  233  94  97 
times selected  78  76  74  66  2  
5  variables  2  233  1  268  36 
times selected  78  37  32  29  26 
4 Summary
In this study, inspired by the theoretical result and the SAbased algorithm of [22, 21, 20], we proposed an MCbased greedy algorithm called GMC for sparse linear regression. GMC is simpler than the SAbased algorithm but still it can achieve the perfect reconstruction in undersampling situations of a reasonable level, as shown by the numerical experiments on synthetic datasets. These experiments also suggest that GMC can outperform the relaxation which is the most commonly used method for sparse estimation. An additional experiment on a realworld dataset of supernovae also supported the practicality of GMC. These results imply that the energy landscape of the sparse linear regression problem is simple and exhibits a funnellike structure [15] in the reconstructable region. We believe that this finding inspires further algorithms of sparse linear regression as well as new models for complex systems, such as glasses and proteins, based on sparse variable selection.
Acknowledgement
This work was supported by JSPS KAKENHI Nos. 25120013, 17H00764, 18K11463 and 19H01812, JST CREST Grant Number JPMJCR1912, Japan. TO is also supported by a Grant for Basic Science Research Projects from the Sumitomo Foundation.
References
 [1] (2006) Convex optimization techniques for fitting sparse gaussian graphical models. In Proceedings of the 23rd international conference on Machine learning, pp. 89–96. Cited by: §1.
 [2] (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory 52 (2), pp. 489–509. Cited by: §1.

[3]
(2005)
Decoding by linear programming
. IEEE transactions on information theory 51 (12), pp. 4203–4215. Cited by: §1.  [4] (2006) Nearoptimal signal recovery from random projections: universal encoding strategies?. IEEE transactions on information theory 52 (12), pp. 5406–5425. Cited by: §1.
 [5] (2008) Iterative reweighted algorithms for compressive sensing. Technical report Cited by: §1.
 [6] (1994) Adaptive timefrequency decompositions. Optical engineering 33 (7), pp. 2183–2192. Cited by: §1.
 [7] (2006) Compressed sensing. IEEE Transactions on information theory 52 (4), pp. 1289–1306. Cited by: §1.
 [8] (2009) Observed universality of phase transitions in highdimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 367 (1906), pp. 4273–4293. Cited by: §1, Figure 1.
 [9] (201209) Berkeley Supernova Ia Program 2013 III. Spectra near maximum brightness improve the accuracy of derived distances to Type Ia supernovae. Monthly Notices of the Royal Astronomical Society 425 (3), pp. 1889–1916. External Links: ISSN 00358711, Document, Link, http://oup.prod.sis.lan/mnras/articlepdf/425/3/1889/3027933/42531889.pdf Cited by: §3.2.
 [10] (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), pp. 432–441. Cited by: §1.
 [11] (2016Sep.) Approximate crossvalidation formula for bayesian linear regression. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Vol. , pp. 596–600. External Links: Document, ISSN Cited by: §3.2, §3.2.
 [12] (2009) A typical reconstruction limit for compressed sensing based on norm minimization. Journal of Statistical Mechanics: Theory and Experiment 2009 (09), pp. L09003. Cited by: §1, Figure 1.
 [13] (1983) Optimization by simulated annealing. science 220 (4598), pp. 671–680. Cited by: §1.
 [14] (2012) Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices. Journal of Statistical Mechanics: Theory and Experiment 2012 (08), pp. P08009. Cited by: §1.
 [15] (1992) Protein folding funnels: a kinetic approach to the sequencestructure relationship.. Proceedings of the National Academy of Sciences 89 (18), pp. 8721–8725. Cited by: §4.
 [16] (2004) Consistent neighbourhood selection for sparse highdimensional graphs with the lasso. Cited by: §1.
 [17] (2016) Sparse approximation based on a random overcomplete basis. Journal of Statistical Mechanics: Theory and Experiment 2016 (6), pp. 063302. Cited by: §1.
 [18] (1995) Sparse approximate solutions to linear systems. SIAM journal on computing 24 (2), pp. 227–234. Cited by: §1.
 [19] (2016) Cross validation in LASSO and its acceleration. Journal of Statistical Mechanics: Theory and Experiment 2016 (5), pp. 53304–53339. Cited by: §3.2, §3.2.

[20]
(2016)
Sampling approach to sparse approximation problem: determining degrees of freedom by simulated annealing
. In 2016 24th European Signal Processing Conference (EUSIPCO), Vol. , pp. 1247–1251. External Links: Document Cited by: §1, §2, §3.2, §3.2, Table 4, §4.  [21] (2016) Sparse approximation problem: how rapid simulated annealing succeeds and fails. In Journal of Physics: Conference Series, Vol. 699, pp. 012017. Cited by: §1, §2.1, §2, §4.

[22]
(201810)
Statistical mechanical analysis of sparse linear regression as a variable selection problem.
Journal of Statistical Mechanics: Theory and Experiment 2018 (10), pp. 103401.
Note:
An algorithmic limit of compressed sensing or related variableselection problems is analytically evaluated when a design matrix is given by an overcomplete random matrix. The replica method from statistical mechanics is employed to derive the result. The analysis is conducted through evaluation of the entropy, an exponential rate of the number of combinations of variables giving a specific value of fit error to given data which is assumed to be generated from a linear process using the design matrix. This yields the typical achievable limit of the fit error when solving a representative problem and includes the presence of unfavourable phase transitions preventing local search algorithms from reaching the minimumerror configuration. The associated phase diagrams are presented. A noteworthy outcome of the phase diagrams is that there exists a wide parameter region where any phase transition is absent from the high temperature to the lowest temperature at which the minimumerror configuration or the ground state is reached. This implies that certain local search algorithms can find the ground state with moderate computational costs in that region. Another noteworthy result is the presence of the random firstorder transition in the strong noise case. The theoretical evaluation of the entropy is confirmed by extensive numerical methods using the exchange Monte Carlo and the multihistogram methods. Another numerical test based on a metaheuristic optimisation algorithm called simulated annealing is conducted, which well supports the theoretical predictions on the local search algorithms. In the successful region with no phase transition, the computational cost of the simulated annealing to reach the ground state is estimated as the third order polynomial of the model dimensionality.
External Links: Document, Link Cited by: Perfect Reconstruction of Sparse Signals via Greedy MonteCarlo Search, §1, §2, Figure 1, §3.1.1, §3.1, §4.  [23] (1993) Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Signals, Systems and Computers, 1993. 1993 Conference Record of The TwentySeventh Asilomar Conference on, pp. 40–44. Cited by: §1.
 [24] The SNDB: http://heracles.astro.berkeley.edu/sndb/info. Cited by: §3.2.
 [25] (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288. Cited by: §1.
 [26] (201506) Variable selection for modeling the absolute magnitude at maximum of Type00A0Ia supernovae. Publications of the Astronomical Society of Japan 67 (3). External Links: ISSN 00046264, Document, Link, http://oup.prod.sis.lan/pasj/articlepdf/67/3/55/17445574/psv031.pdf Cited by: §3.2, §3.2, Table 2.