In many applications, a signal is available only through observations that are samples of a transform of the signal, rather than samples of the signal itself. Examples are compressive sensing Candès (2006); Candès and Tao (2005), in which a signal undergoes a dimension-reducing linear transform; wireless communications, in which a signal undergoes a linear or nonlinear channel transform Kim and Konstantinou (2001); Jiang et al. (2015); and computational imaging, in which the acquired data result from a light field passing through a transform induced by optical devices Duarte et al. (2008); Sun et al. (2013). Recently, machine learning (ML) algorithms have demonstrated superior performance in recovering signals from observations Goodfellow et al. (2014); Mousavi and Baraniuk (2017); Kulkarni et al. (2016). To recover a signal from its observation, an ML model, such as a convolutional neural network, is trained so that the recovered signal is the output of the model when its observation is used as the input. Despite the great success of ML in recovering signals from observations and the development of ML theory in general Valiant (1984); Vapnik (2000); Blumer et al. (1989); Kearns and Vazirani (1994); Abu-Mostafa et al. (2012), there is a lack of theoretical understanding of many aspects of ML recovery.
This work addresses the question of under what condition a signal can be recovered from its observation by an ML algorithm. We develop a theoretical framework to characterize the signals that can be robustly recovered from observations. We establish a Lipschitz condition on signals and observations and show that it is both necessary and sufficient for the existence of a robust ML algorithm that recovers the signals. We compare the Lipschitz condition with the restricted isometry property (RIP) Candès (2006); Candès and Tao (2005) in the sparse signal recovery of compressive sensing, and show that the former is more general and less restrictive.
The set of signals satisfying the Lipschitz condition is not unique. Since there is no restriction on the transform of the observations, there is no expectation that a given set of signals can be recovered from their observations. Instead, what is expected is that all signals with certain structure should be recovered. In our framework, the structure of the recoverable signals is precisely defined by the Lipschitz condition: each Lipschitz set is a set of structured signals that are robustly recoverable and a different set defines a different structure. A finite number of training signals can always be used to define a Lipschitz set of signals which can be robustly recovered by a trained, robust ML model.
The significance of this work is that it not only answers the theoretical question of what signals can be robustly recovered, but also suggests a practical recovery method by using singular value decomposition (SVD) for linear observations (see Theorem 3), in which the dimension of the output space of the target function is reduced to the minimum possible.
All proofs are given in the Appendix.
2 Related Work
The term “Lipschitz learning” was previously used for classification on graphs Kyng et al. (2015), in which the target functions in graph-based semi-supervised learning are Lipschitz. In this paper, the same term is used in a different context and a broader sense: Lipschitz learning refers to the framework of recovering signals that satisfy the Lipschitz condition. Since the recovery in this work can be achieved by a Lipschitz hypothesis, this use of the term is consistent with its previous use.
In addition to Kyng et al. (2015), existing work in v. Luxburg and Bousquet (2004); Koltchinskii (2011); Lopez-Paz et al. (2015) also studies the use of Lipschitz functions as decision functions or target functions for classification. Specifically, v. Luxburg and Bousquet (2004) finds that the Lipschitz function is a generalization of decision functions for metric spaces, and shows that several well-known algorithms are special cases of the Lipschitz classifier. Koltchinskii (2011) uses the property of Lipschitz functions to derive bounds on excess risk in empirical risk minimization. In addition, Lopez-Paz et al. (2015) poses the cause-effect inference problem as a classification problem and uses Lipschitz functions in its theoretical analysis. Our work differs from the existing work v. Luxburg and Bousquet (2004); Koltchinskii (2011); Kyng et al. (2015); Lopez-Paz et al. (2015) in two aspects: 1) we utilize the Lipschitz condition for the problem of general signal recovery, whereas v. Luxburg and Bousquet (2004); Koltchinskii (2011); Kyng et al. (2015); Lopez-Paz et al. (2015) utilize Lipschitz functions for the problem of classification; 2) to the best of our knowledge, no existing work utilizes the property of a Lipschitz set, which is essential in our theory of signal recovery with Lipschitz learning.
Our framework shows that the Lipschitz condition on a set of signals is equivalent to the existence of a hypothesis for the recovery of those signals. It differs from probably approximately correct (PAC) learning Valiant (1984) and statistical learning theory Vapnik (2000), which analyze the probability of successfully finding a hypothesis with low generalization error. Our work is currently concerned with the existence of a Lipschitz hypothesis; in the future, we will address the complexity of Lipschitz learning, such as reducing the bound on the total number of training samples required, for example by using a probabilistic model in Lipschitz learning.
3 Lipschitz Learning
Problem Definition. Let $x \in \mathbb{R}^n$ be a signal, and $T$ be an operator with $T: \mathbb{R}^n \to \mathbb{R}^m$. The observation of signal $x$ under transform $T$ is $y = T \circ x$, where the symbol "$\circ$" means "operates on". The operator $T$ may be linear or nonlinear, and it may not be an injection even when $m \ge n$. The objective here is, for a given $T$, to recover the signal $x$ from its observation $y$ by a machine learning algorithm. In an ML algorithm, a hypothesis is a computable function $h: \mathbb{R}^m \to \mathbb{R}^n$. The recovered signal from the observation $y$ by the hypothesis $h$ is $\hat{x} = h(y)$, with $y = T \circ x$.
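As a minimal numeric sketch of this setup (the random Gaussian matrix standing in for $T$ and the pseudoinverse hypothesis are assumptions for illustration, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 4  # signal dimension n, observation dimension m < n

# An assumed linear operator T (a random Gaussian matrix); the framework
# also allows nonlinear T -- this matrix is only an illustrative placeholder.
A = rng.standard_normal((m, n))

def observe(x):
    """Observation y = T o x in the linear case T o x = A x."""
    return A @ x

def h(y):
    """A placeholder hypothesis h: R^m -> R^n, here the pseudoinverse A^+ y.
    It recovers x exactly only when x lies in the row space of A, which is
    why a structural condition on the signal set is needed."""
    return np.linalg.pinv(A) @ y

x = A.T @ rng.standard_normal(m)    # a signal in the row space of A
x_hat = h(observe(x))               # recovered signal: matches x
x_generic = rng.standard_normal(n)  # a generic signal: not recovered by h
```

The placeholder hypothesis recovers only a restricted class of signals exactly, which previews the central point: recoverability is a joint property of the signal set and the operator.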
Since $T$ may not be injective, there is no expectation that a signal $x$ can be uniquely recovered from a given observation $y$. Instead, we attempt to characterize a set of signals that can be robustly recovered from their observations by an ML algorithm. Such a characterization is tantamount to imposing a structure on signals to ensure the success of recovery. For example, in compressive sensing Candès (2006); Candès and Tao (2005), observations are the results of a singular linear transform, but it is possible to uniquely recover a set of sparse signals under certain conditions.
Let $S$ be a set of signals. For all signals in $S$ to be recovered from their observations, a necessary condition is that $T$ be injective on $S$:
$$T \circ x_1 \neq T \circ x_2, \quad \text{for all } x_1, x_2 \in S, \; x_1 \neq x_2. \qquad (1)$$
Furthermore, for a recovery to be robust and resilient to noise, it is required that a small difference in the observations imply a small difference in the signals:
$$\|x_1 - x_2\| \le L\, \|T \circ x_1 - T \circ x_2\|, \quad \text{for all } x_1, x_2 \in S. \qquad (2)$$
Definition 1. Given $L > 0$, a set $S$ is said to be $L$-Lipschitz if
$$\|x_1 - x_2\| \le L\, \|T \circ x_1 - T \circ x_2\|, \quad \text{for all } x_1, x_2 \in S. \qquad (3)$$
A set is said to be a Lipschitz set if there is an $L > 0$ such that it is $L$-Lipschitz. We denote an $L$-Lipschitz set by $S_L$, and call the signals in a Lipschitz set Lipschitz signals.
Note that Definition 1 does not require $T$ to be invertible on all of $\mathbb{R}^n$. However, when restricted to $S_L$, $T$ does have an inverse, and its inverse is Lipschitz. An ML algorithm is to find a hypothesis that approximates this inverse.
The Lipschitz condition in Eq. (3) is a joint condition on the signals and their observations (or the operator $T$). It may be framed in the following two ways.
1) For a given set of signals $S$, Eq. (3) is a condition on the operator $T$. It is equivalent to saying that the inverse of $T$ must exist on $S$, and the inverse is $L$-Lipschitz. Traditional signal recovery algorithms, such as $\ell_1$ minimization in compressive sensing, are within this framework, i.e., they attempt to recover all signals of a given structure under the assumption that the operator $T$ meets certain conditions.
2) For a given operator $T$, Eq. (3) is a condition on the set of signals to be recovered. For any operator $T$, there is always a set satisfying Eq. (3): any singleton set. An ML algorithm may be designed to recover those signals of interest that are recoverable for the given $T$, by properly selecting training signals to define a set of signals of interest that satisfies Eq. (3). In this context, the Lipschitz signals are the structured signals.
Example. Let operator be the continuous function defined by
The set is not Lipschitz, but , or , is an -Lipschitz set. For any , the set is not Lipschitz although is injective on it; clearly, signals in cannot be recovered reliably under noise because a small noise in the observation may cause the recovered signal to be or or . On the other hand, is a Lipschitz set for any . For example, for any , is -Lipschitz, and so is .
Property 1. If $S$ is a finite set, i.e., $S = \{x_1, \dots, x_N\}$, and $T$ is an injection on $S$, then $S$ is $L$-Lipschitz where
$$L = \max_{i \neq j} \frac{\|x_i - x_j\|}{\|T \circ x_i - T \circ x_j\|}. \qquad (5)$$
Property 2. Let $T$ be a linear operator, and $S$ be an $L$-Lipschitz set. Then any scaled and shifted set from $S$ is also an $L$-Lipschitz set. More precisely, for any $a \in \mathbb{R}$, $a \neq 0$, and $b \in \mathbb{R}^n$,
$$aS + b = \{\, a x + b : x \in S \,\}$$
is an $L$-Lipschitz set.
Property 1 shows that any finite set on which $T$ is injective is Lipschitz, and therefore such a set can be used as a starting point to build a Lipschitz set of signals of interest. For example, a finite set of training signals may be used to define a maximal set of Lipschitz signals that includes the training signals.
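The constant in Property 1 can be computed directly for a finite set; a sketch (the random operator and signals are assumptions for illustration):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))                       # assumed linear T o x = A x
signals = [rng.standard_normal(5) for _ in range(6)]  # a finite signal set S

def lipschitz_constant(signals, T):
    """Smallest L such that the finite set is L-Lipschitz (Property 1):
    the maximum over pairs of ||x_i - x_j|| / ||T(x_i) - T(x_j)||.
    Requires T to be injective on the set (all denominators nonzero)."""
    L = 0.0
    for xi, xj in combinations(signals, 2):
        den = np.linalg.norm(T(xi) - T(xj))
        if den == 0.0:
            raise ValueError("T is not injective on the given set")
        L = max(L, np.linalg.norm(xi - xj) / den)
    return L

L = lipschitz_constant(signals, lambda x: A @ x)
# Every pair then satisfies ||x_i - x_j|| <= L * ||A x_i - A x_j||.
```

The pairwise scan costs $O(N^2)$ evaluations for $N$ signals, which is feasible for the finite training sets considered here.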
Definition 2. A machine learning hypothesis $h: \mathbb{R}^m \to \mathbb{R}^n$ is said to be $L$-Lipschitz, for $L > 0$, if
$$\|h(y_1) - h(y_2)\| \le L\, \|y_1 - y_2\|, \quad \text{for all } y_1, y_2 \in \mathbb{R}^m.$$
A hypothesis is said to be Lipschitz, or robust, if there is an $L > 0$ such that it is $L$-Lipschitz.
Definition 3. A set $S$ is said to be labeled if every $x \in S$ and its observation $y = T \circ x$ are known.
4 Characterization of Signal Recovery
In this section, we show that the Lipschitz condition on a set of signals is equivalent to the existence of a robust ML hypothesis for the recovery of those signals. More precisely, we show that the Lipschitz condition Eq. (3) is both necessary and sufficient for the existence of Lipschitz hypotheses in ML signal recovery.
In the rest of this paper, we assume the observations are bounded, which is generally the case in practice. Without loss of generality, we may assume they are bounded by the unit hypercube, i.e., $T \circ x \in [0, 1]^m$ for all signals $x$.
Lemma 1. Let $S$ be a finite and labeled set, and $T$ be an injection on $S$. Then there exists an $L$-Lipschitz hypothesis $h$ such that
$$h(T \circ x) = x, \quad \text{for all } x \in S. \qquad (9)$$
Lemma 1 is an application of the McShane-Whitney extension theorem McShane (1934); Whitney (1934). It provides an explicit, constructive Lipschitz hypothesis on a finite labeled set (see proof in the Appendix). Furthermore, Eq. (9) shows that the finite set is a training set for the Lipschitz hypothesis. The training set can then be expanded to a Lipschitz set in which all signals can be recovered robustly, as seen in the next theorem, which shows that the Lipschitz condition is sufficient for the existence of a Lipschitz hypothesis.
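A sketch of such a constructive hypothesis, using a coordinate-wise McShane-type extension in the spirit of the construction in the Appendix (the random linear operator and the exact coordinate-wise form are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))  # assumed linear operator, injective on the set
X = rng.standard_normal((5, 6))  # labeled training signals x_i
Y = X @ A.T                      # their observations y_i = A x_i

# Lipschitz constant of the finite labeled set (Property 1)
num = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
den = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
off = ~np.eye(len(X), dtype=bool)
L = (num[off] / den[off]).max()

def h(y):
    """Coordinate-wise McShane extension: coordinate k of the output is
    min_i ( x_i[k] + L * ||y - y_i|| ). It agrees with the labels on the
    training set and is L-Lipschitz in each coordinate."""
    d = np.linalg.norm(Y - y, axis=1)  # ||y - y_i|| for every i
    return (X + L * d[:, None]).min(axis=0)

# h reproduces every training signal from its observation, as in Eq. (9).
```

The interpolation property holds because, for $y = y_j$, every competing term $x_i[k] + L\|y_j - y_i\|$ dominates $x_j[k]$ by the choice of $L$.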
Theorem 1. Let $S$ be an $L$-Lipschitz set. Then for any $\epsilon > 0$, there exists a finite set $S_\epsilon \subset S$. If $S_\epsilon$ is labeled, then there exists a $2L$-Lipschitz hypothesis $h$, such that
(i) $h(T \circ x) = x$, for all $x \in S_\epsilon$; (Training)
(ii) $\|h(T \circ x) - x\| \le \epsilon$, for all $x \in S$. (Recovery of all $L$-Lipschitz signals)
The factor 2 in $2L$-Lipschitz is not necessary and can be removed; it is there only to simplify the proof, which is given in the Appendix.
Theorem 1 means that if a set of signals is Lipschitz, then for any given precision, there exists a finite set of training signals so that a Lipschitz hypothesis can be trained on the finite set to recover all signals within the given precision. It guarantees that a set of Lipschitz signals can be recovered by a robust ML algorithm, to an arbitrary precision. We note that although the training set in Theorem 1 is finite in theory, it may be too large for practical purposes.
A Lipschitz hypothesis is stronger than a continuous target function. It could be argued that a continuous target function suffices for robustness of recovery, raising the question of whether the Lipschitz condition (3) is too strong. However, since a set of signals may be discrete, i.e., neither continuous nor connected, it is not possible to define "continuity" on it in the classic sense to guarantee robust recovery, as the Lipschitz condition (3) does. The Lipschitz set in Definition 1 is a sensible condition on a (possibly discrete) set of signals for robust recovery. More discussion of the Lipschitz condition is given in Section 5.
Next, we show that the Lipschitz set is necessary for the existence of a Lipschitz hypothesis.
Theorem 2. Let $S$ be a set. If there exists $L > 0$ such that for any $\epsilon > 0$ there is an $L$-Lipschitz hypothesis $h_\epsilon$, such that
$$\|h_\epsilon(T \circ x) - x\| \le \epsilon, \quad \text{for all } x \in S,$$
then $S$ is an $L$-Lipschitz set.
Theorem 2 says that if there are $L$-Lipschitz hypotheses that recover a set of signals to arbitrary precision, then the set itself must be $L$-Lipschitz. A weaker version is given below.
Corollary 1. Let $S$ be a set. If there exist an $\epsilon > 0$ and an $L$-Lipschitz hypothesis $h$ such that $\|h(T \circ x) - x\| \le \epsilon$ for all $x \in S$, then $S$ satisfies
$$\|x_1 - x_2\| \le L\, \|T \circ x_1 - T \circ x_2\| + 2\epsilon, \quad \text{for all } x_1, x_2 \in S.$$
This result says that if a set of signals can be recovered to a certain precision by a Lipschitz hypothesis, then the set of signals is approximately Lipschitz, up to the precision of the recovery.
Theorems 1 and 2 completely characterize robust ML signal recovery: a set of signals can be robustly recovered by ML algorithms if and only if the set satisfies the Lipschitz condition (3).
For linear operators, we have a stronger version of Theorem 1 as follows.
Theorem 3. Let $T$ be linear, represented by a matrix $A$ of rank $r$, and let $S$ be an $L$-Lipschitz set. Then there exists a matrix $V_1 \in \mathbb{R}^{n \times r}$, obtained from the SVD of $A$, with the following property. For any $\epsilon > 0$, there exists a finite set $S_\epsilon \subset S$. If $S_\epsilon$ is labeled, then there exists a Lipschitz hypothesis $h: \mathbb{R}^m \to \mathbb{R}^r$, such that the mapping $g$ defined by $g(y) = V_1 h(y)$ satisfies
(i) $g(T \circ x) = x$, for all $x \in S_\epsilon$; (Training)
(ii) $\|g(T \circ x) - x\| \le \epsilon$, for all $x \in S$; (Recovery of all $L$-Lipschitz signals)
(iii) $T \circ g(T \circ x) = T \circ x$, for all $x \in S$. (Recovered signals match the observations)
The significance of Theorem 3 as compared to Theorem 1 is twofold. First, the output space of the hypothesis in Theorem 3 has lower dimension than that in Theorem 1: $r$ in Theorem 3 vs. $n$ in Theorem 1. Consequently, the bound on the total number of required training signals is lower in Theorem 3 than in Theorem 1. Secondly, the recovered signals have the same observations as the original signals, as stated in (iii). In other words, even if the recovered signal $\hat{x} = g(T \circ x)$ may not equal the original signal $x$, their observations are the same: $T \circ \hat{x} = T \circ x$, i.e., the recovered signal is indistinguishable from the original signal in the observation space.
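A sketch of the SVD-based dimension reduction suggested by Theorem 3 (the rank-r operator and the oracle "model" below are assumptions for illustration; in practice the model would be a trained ML hypothesis):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, r = 6, 4, 3
# An assumed linear operator of rank r (an illustrative construction)
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

U, s, Vt = np.linalg.svd(A)
V1 = Vt[:r].T  # n x r: orthonormal basis of the row space of A

def g(y, model):
    """Recovery map in the spirit of Theorem 3: the model predicts only the
    r coordinates c = V1^T x, and the signal is rebuilt as V1 c."""
    return V1 @ model(y)

x = rng.standard_normal(n)
y = A @ x
# An oracle model (illustration only): returns the true reduced coordinates.
oracle = lambda _y: V1.T @ x
x_hat = g(y, oracle)
# x_hat is the projection of x onto the row space of A, so A @ x_hat equals y,
# i.e., the recovered signal matches the observation (Theorem 3 (iii)).
```

The design choice is that the learner's output space has dimension $r$ rather than $n$; the null-space component of $x$ is unobservable through $A$ and is therefore not learned.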
5 Comparison with Sparse Recovery in Compressive Sensing
In this section, we assume the operator $T$ is linear, i.e., $T \circ x = A x$ for a matrix $A \in \mathbb{R}^{m \times n}$.
Sparse recovery Candès (2006). Let $A \in \mathbb{R}^{m \times n}$, $I \subset \{1, \dots, n\}$, and $A_I$ be the submatrix obtained by extracting the columns of $A$ corresponding to the indices in $I$. $A$ is said to satisfy the $k$-restricted isometry property (RIP) if there exists $\delta_k \in (0, 1)$ such that
$$(1 - \delta_k)\,\|c\|^2 \le \|A_I c\|^2 \le (1 + \delta_k)\,\|c\|^2 \qquad (12)$$
for all subsets $I$ with $|I| \le k$ and coefficient sequences $c \in \mathbb{R}^{|I|}$; $\delta_k$ is said to be the $k$-restricted isometry constant. It is shown in Candès and Tao (2005) that if $A$ satisfies the RIP with a suitable bound on its restricted isometry constants, hereafter condition (13), then a $k$-sparse signal can be recovered from its observation by $\ell_1$-minimization Candès (2006).
Since the difference of two $k$-sparse signals is $2k$-sparse, the RIP (12) implies $\|x_1 - x_2\| \le (1 - \delta_{2k})^{-1/2}\, \|A x_1 - A x_2\|$ for all $k$-sparse $x_1, x_2$. Therefore, the set of $k$-sparse signals for which the RIP with (13) is satisfied is an $L$-Lipschitz set with $L = 1/\sqrt{1 - \delta_{2k}}$, and consequently, according to Theorem 1, there exists a robust ML recovery algorithm for the $k$-sparse signals if the RIP is satisfied with condition (13).
This shows that the Lipschitz condition (3) is more general and less restrictive than the RIP conditions (12) and (13). Of course, it must also be pointed out that the stronger RIP conditions (12) and (13) lead to a stronger and constructive result: $k$-sparse signals can be recovered by $\ell_1$-minimization.
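A small numeric illustration of this comparison on an assumed random matrix: if $\delta_{2k} < 1$, the $k$-sparse signals form a Lipschitz set with $L = 1/\sqrt{1 - \delta_{2k}}$. Exact RIP constants are combinatorial to compute, so the brute-force check below only works at toy scale:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
m, n, k = 6, 8, 1
A = rng.standard_normal((m, n)) / np.sqrt(m)  # a common compressive-sensing scaling

def rip_constant(A, s):
    """Restricted isometry constant of order s by brute force: the largest
    deviation of the squared singular values of any m x s column submatrix
    from 1. Feasible only for tiny problems."""
    d = 0.0
    for cols in combinations(range(A.shape[1]), s):
        sv = np.linalg.svd(A[:, list(cols)], compute_uv=False)
        d = max(d, abs(sv.max() ** 2 - 1.0), abs(sv.min() ** 2 - 1.0))
    return d

d2k = rip_constant(A, 2 * k)
if d2k < 1.0:
    # The difference of two k-sparse signals is 2k-sparse, so the lower RIP
    # bound gives ||x1 - x2|| <= L * ||A x1 - A x2|| with:
    L = 1.0 / np.sqrt(1.0 - d2k)
```

If $\delta_{2k} \ge 1$ for a sampled matrix, no Lipschitz constant follows from this bound; larger $m$ makes the condition easier to meet.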
6 Conclusion
We have developed a framework to characterize robust ML signal recovery. The theory in this framework makes the terminology "structured signals" in traditional signal recovery algorithms more precise: here, the structured signals are the Lipschitz signals. For any given transform $T$, it is always possible to define a set of Lipschitz signals, i.e., structured signals, so that they can be robustly recovered by a trained ML model.
Although we have provided a complete characterization of ML signal recovery in theory, more work is needed to render this theoretical framework practical in general. For example, the bound on the total number of training signals required to guarantee robust recovery in this framework is too high to be used in practice. However, this theoretical work does provide insights that can guide the design of practical ML signal recovery algorithms. For linear systems, Theorem 3 suggests a practical method of using the SVD to reduce the dimension of the output space of an ML model from $n$ to $r$, the minimum possible dimension on which a recovery algorithm must learn.
Appendix
Proof of Property 1. If $S$ is a finite set and $T$ is injective on $S$, then $L$ is well-defined in (5), and furthermore, for any $x_i, x_j \in S$ with $x_i \neq x_j$,
$$\|x_i - x_j\| \le L\, \|T \circ x_i - T \circ x_j\|,$$
which shows (3), i.e., $S$ is an $L$-Lipschitz set. Q.E.D.
Proof of Property 2. Let $z_1, z_2 \in aS + b$. There exist $x_1, x_2 \in S$ with $z_1 = a x_1 + b$ and $z_2 = a x_2 + b$, so
$$\|z_1 - z_2\| = |a|\, \|x_1 - x_2\| \le |a| L\, \|T \circ x_1 - T \circ x_2\| = L\, \|T \circ z_1 - T \circ z_2\|,$$
which shows that $aS + b$ is an $L$-Lipschitz set. Q.E.D.
Proof of Lemma 1. Since $S$ is finite and $T$ is injective on it, it follows from Property 1 that $S$ is $L$-Lipschitz for some $L > 0$. Following the McShane-Whitney extension theorem McShane (1934); Whitney (1934), we define $h$ coordinate-wise by
$$h_k(y) = \min_{x \in S} \left( x_k + L\, \|y - T \circ x\| \right), \quad k = 1, \dots, n. \qquad (17)$$
We show that $h$ is Lipschitz. Indeed, since $S$ is finite, for any $y_1$, there exists $x^* \in S$ such that $h_k(y_1) = x^*_k + L\|y_1 - T \circ x^*\|$. Furthermore, from definition (17), $h_k(y_2) \le x^*_k + L\|y_2 - T \circ x^*\|$, and therefore,
$$h_k(y_2) - h_k(y_1) \le L\left( \|y_2 - T \circ x^*\| - \|y_1 - T \circ x^*\| \right) \le L\, \|y_2 - y_1\|. \qquad (19)$$
By symmetry, $|h_k(y_2) - h_k(y_1)| \le L\, \|y_2 - y_1\|$. Now let $x \in S$ and $y = T \circ x$. From (19) and the $L$-Lipschitz property of $S$, $x_k \le x'_k + L\|T \circ x - T \circ x'\|$ for all $x' \in S$, so the minimum in (17) is attained at $x$, and $h(T \circ x) = x$, which proves (9). Q.E.D.
Proof of Theorem 1. By assumption, . Now define
Each in (22) is a hypercube in of length in each dimension, and therefore, we have
We now define
It is clear that is -Lipschitz and finite with
It follows from Lemma 1 that there exists an $L$-Lipschitz hypothesis $h$ with $h(T \circ x) = x$ for all $x \in S_\epsilon$, which proves (i).
Proof of Theorem 2. To show $S$ is $L$-Lipschitz, let $x_1, x_2 \in S$. For any $\epsilon > 0$, we have
$$\|x_1 - x_2\| \le \|x_1 - h_\epsilon(T \circ x_1)\| + \|h_\epsilon(T \circ x_1) - h_\epsilon(T \circ x_2)\| + \|h_\epsilon(T \circ x_2) - x_2\| \le L\, \|T \circ x_1 - T \circ x_2\| + 2\epsilon.$$
Letting $\epsilon \to 0$ yields (3), so $S$ is an $L$-Lipschitz set. Q.E.D.
Proof of Theorem 3. We start by following the same process as in the proof of Theorem 1, but change the factor in (22) to to obtain hypercubes and a finite set . Instead of defined in (22) and the bounds derived in (25) and (23), we now have
Without loss of generality, we assume $A$ is of full rank $r = m$ (if not, redundant rows of $A$ can be removed until it is). Performing the singular value decomposition (SVD) of $A$, we have
$$A = U \,[\, \Sigma \;\; 0 \,]\, V^T,$$
where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are unitary matrices, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_m)$ with $\sigma_i > 0$ for all $i$, and $0$ is the $m \times (n - m)$ matrix with all entries being 0. We further split $V$ as $V = [V_1, V_2]$, where $V_1 \in \mathbb{R}^{n \times m}$ and $V_2 \in \mathbb{R}^{n \times (n - m)}$. It is easy to show that $A V_1 = U \Sigma$ and $A V_2 = 0$.
We will show that is -Lipschitz. Indeed, for
Note that the output space of the hypothesis in (33) has dimension $m$, instead of $n$ as in Theorem 1. Now define $g: \mathbb{R}^m \to \mathbb{R}^n$ by $g(y) = V_1 h(y)$.
The following shows (i):
The following shows (iii):
To show (ii), we note that similar to Proof of Theorem 1, for , there is an , such that .
Finally, for ,
References
- Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012). Learning from data. AMLBook.
- Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36 (4), pp. 929–965.
- Candès, E. J. (2006). Compressive sampling. In Proceedings of the International Congress of Mathematicians.
- Candès, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51, pp. 4203–4215.
- Duarte, M. F. et al. (2008). Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine 25 (2), pp. 83–91.
- Goodfellow, I. et al. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems 27 (NIPS 2014).
- Jiang et al. (2015). Constrained and preconditioned stochastic gradient method. IEEE Transactions on Signal Processing 63 (10), pp. 2678–2691.
- Kearns, M. J. and Vazirani, U. V. (1994). An introduction to computational learning theory. MIT Press.
- Kim, J. and Konstantinou, K. (2001). Digital predistortion of wideband signals based on power amplifier model with memory. Electronics Letters 37 (23), pp. 1417–1418.
- Koltchinskii, V. (2011). Oracle inequalities in empirical risk minimization and sparse recovery problems. Springer, Berlin, Heidelberg.
- Kulkarni, K. et al. (2016). ReconNet: non-iterative reconstruction of images from compressively sensed measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Kyng, R., Rao, A., Sachdeva, S., and Spielman, D. A. (2015). Algorithms for Lipschitz learning on graphs. In Proceedings of The 28th Conference on Learning Theory, pp. 1190–1223.
- Lopez-Paz, D., Muandet, K., Schölkopf, B., and Tolstikhin, I. (2015). Towards a learning theory of cause-effect inference. International Conference on Machine Learning (ICML; JMLR W&CP) 37, pp. 1452–1461.
- McShane, E. J. (1934). Extension of range of functions. Bull. Amer. Math. Soc. 40 (12), pp. 837–842.
- Mousavi, A. and Baraniuk, R. G. (2017). Learning to invert: signal recovery via deep convolutional networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Sun, B. et al. (2013). 3D computational imaging with single-pixel detectors. Science 340 (6134), pp. 844–847.
- v. Luxburg, U. and Bousquet, O. (2004). Distance-based classification with Lipschitz functions. Journal of Machine Learning Research 5, pp. 669–695.
- Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM 27.
- Vapnik, V. (2000). The nature of statistical learning theory. Springer.
- Whitney, H. (1934). Analytic extensions of differentiable functions defined in closed sets. Trans. Amer. Math. Soc. 36 (1), pp. 63–89.