Inference tasks in signal processing are often characterized by the availability of reliable statistical modeling with some missing instance-specific parameters. One conventional approach uses data to estimate these missing parameters and then infers based on the estimated model. Alternatively, data can also be leveraged to directly learn the inference mapping end-to-end. These approaches for combining partially-known statistical models and data in inference are related to the notions of generative and discriminative models used in the machine learning literature [8, 4], typically considered in the context of classifiers.
The goal of this lecture note is to introduce the concepts of generative and discriminative learning for inference with a partially-known statistical model. While machine learning systems often lack the interpretability of traditional signal processing methods, we focus on a simple setting where one can interpret and compare the approaches in a tractable manner that is accessible and relevant to signal processing readers. In particular, we exemplify the approaches for the task of Bayesian signal estimation in a jointly Gaussian setting with the mean-squared error (MSE) objective, i.e., a linear estimation setting. Here, the discriminative end-to-end approach directly learns the linear minimum MSE (LMMSE) estimator, while the generative strategy yields a two-stage estimator, which first uses data to fit the linear model, and then formulates the LMMSE estimator for the fitted model. The ability to derive these estimators in closed-form facilitates their analytical comparison. It is rigorously shown that discriminative learning results in an estimate which is more robust to mismatches in the mathematical description of the setup. Generative learning, which utilizes prior knowledge on the distribution of the signals, can exploit this prior to achieve improved MSE in some settings. These analytical findings are demonstrated in a numerical study, which is available online as a Python Notebook, such that it can be presented alongside the lecture detailed in this note.
Signal processing algorithms traditionally rely on mathematical models for describing the problem at hand. These models correspond to domain knowledge obtained from, e.g., established statistical models and understanding of the underlying physics. In practice, statistical models often include parameters that are unknown in advance, such as noise levels and channel coefficients, and are estimated from data.
The fact that signal processing tasks are often carried out based on partial domain knowledge, i.e., statistical models with some missing parameters, and data, motivates inspecting which design approach is preferable: the model-oriented approach of using the data to estimate the missing parameters, or the task-oriented strategy, which leverages data to directly optimize a suitable solver in an end-to-end manner? These approaches can be related to the notions of generative learning and discriminative learning, typically considered in the machine learning literature in the context of classification tasks [12, Ch. 3]. In these lecture notes, we address the above fundamental question for an analytically tractable setting of linear Bayesian estimation, for which the approaches can be rigorously compared, connecting machine learning concepts with interpretable signal processing practices and techniques.
This lecture note is intended to be as self-contained as possible and suitable for the undergraduate level without a deep background in estimation theory and machine learning. As such, it requires only basic knowledge in probability and calculus.
IV Problem Statement
To formulate the considered problem, we first review some basic concepts in statistical inference, following [9]. Then, we elaborate on model-based and data-driven approaches for inference. Finally, we present the running example considered in the remainder of this lecture note: linear estimation in partially-known measurement models.
IV-A Statistical Inference
The term inference refers to the ability to conclude based on evidence and reasoning. While this generic definition can refer to a broad range of tasks, we focus in our description on systems that estimate or make predictions based on a set of observed measurements. In this wide family of problems, the system is required to map an input variable $\mathbf{x}$, taking values in an input space $\mathcal{X}$, into a prediction of a target variable $\mathbf{s}$, which takes values in the target space $\mathcal{S}$.
The inputs are related to the targets via some statistical probability measure, $\mathcal{P}$, referred to as the data generating distribution, which is defined over $\mathcal{X} \times \mathcal{S}$. Formally, $\mathcal{P}$ is a joint distribution over the domain of inputs and targets. One can view such a distribution as being composed of two parts: a distribution over the unlabeled input, $\mathcal{P}_{\mathbf{x}}$, which is sometimes called the marginal distribution, and the conditional distribution over the targets given the inputs, $\mathcal{P}_{\mathbf{s}|\mathbf{x}}$, also referred to as the discriminative or inverse distribution.
Inference rules can thus be expressed as mappings of the form
$$ f: \mathcal{X} \mapsto \mathcal{S}. \tag{1} $$
We write the decision variable for a given input $\mathbf{x}$ as $\hat{\mathbf{s}} = f(\mathbf{x})$. The space of all possible inference mappings of the form (1) is denoted by $\mathcal{F}$. The fidelity of an inference mapping is measured using a loss function
$$ l: \mathcal{F} \times \mathcal{X} \times \mathcal{S} \mapsto \mathbb{R}, \tag{2} $$
with $\mathbb{R}$ being the set of real numbers. We are generally interested in carrying out inference that minimizes the risk function, also known as the generalization error, given by:
$$ \mathcal{L}_{\mathcal{P}}(f) = \mathbb{E}_{(\mathbf{x},\mathbf{s}) \sim \mathcal{P}}\{ l(f, \mathbf{x}, \mathbf{s}) \}, \tag{3} $$
where $\mathbb{E}\{\cdot\}$ is the stochastic expectation. Thus, the goal is to design the inference rule $f$ to minimize the generalization error $\mathcal{L}_{\mathcal{P}}(f)$ for a given problem.
IV-B Model-Based versus Data-Driven
The risk function in (3) allows one to evaluate inference rules and to formulate the desired mapping as the one that minimizes $\mathcal{L}_{\mathcal{P}}(f)$. The main question is how to find this mapping, for which there are two main strategies: the statistical model-based strategy, referred to henceforth as model-based; and the pure machine learning approach, which relies on data, and is thus referred to as data-driven. The main difference between these strategies is what information is utilized to tune $f$.
Model-based methods, also referred to as hand-designed schemes, set their inference rule, i.e., tune $f$ in (1), to minimize the risk function $\mathcal{L}_{\mathcal{P}}(f)$, based on full domain knowledge. The term domain knowledge typically refers to prior knowledge of the underlying statistics relating the input $\mathbf{x}$ and the target $\mathbf{s}$, where full domain knowledge implies that the joint distribution $\mathcal{P}$ is known. For instance, under the squared-error loss, $l(f, \mathbf{x}, \mathbf{s}) = \|\mathbf{s} - f(\mathbf{x})\|^2$, the optimal inference rule is the minimum MSE estimator, given by the conditional expected value $f(\mathbf{x}) = \mathbb{E}\{\mathbf{s} | \mathbf{x}\}$, whose computation requires knowledge of the discriminative distribution $\mathcal{P}_{\mathbf{s}|\mathbf{x}}$.
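The optimality of the conditional mean under the squared-error loss can be sketched with the standard orthogonality argument. The derivation below is a textbook step added for illustration, not one taken from this note; it writes $\mathbf{x}$ for the input, $\mathbf{s}$ for the target, and $f$ for an arbitrary inference rule.

```latex
\begin{align*}
\mathbb{E}\{\|\mathbf{s} - f(\mathbf{x})\|^2\}
 &= \mathbb{E}\{\|\mathbf{s} - \mathbb{E}\{\mathbf{s}|\mathbf{x}\}\|^2\}
  + \mathbb{E}\{\|\mathbb{E}\{\mathbf{s}|\mathbf{x}\} - f(\mathbf{x})\|^2\} \\
 &\quad + 2\,\mathbb{E}\big\{(\mathbf{s} - \mathbb{E}\{\mathbf{s}|\mathbf{x}\})^T
   (\mathbb{E}\{\mathbf{s}|\mathbf{x}\} - f(\mathbf{x}))\big\}.
\end{align*}
% Conditioning on x, the second factor of the cross term is deterministic and
% E{ s - E{s|x} | x } = 0, so the cross term vanishes by the law of total
% expectation. The first term does not depend on f, and the second term is
% non-negative and equals zero for f(x) = E{s|x}.
```

Since only the middle term depends on $f$, the risk is minimized by $f(\mathbf{x}) = \mathbb{E}\{\mathbf{s}|\mathbf{x}\}$.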
Model-based methods are the foundations of statistical signal processing. However, in practice, accurate knowledge of the distribution that relates the observations and the desired information is often unavailable. Thus, applying such techniques commonly requires imposing some assumptions on the underlying statistics, which in some cases reflects the actual behavior, but may also constitute a crude approximation of the true dynamics. Moreover, in the presence of inaccurate model knowledge, either as a result of estimation errors or due to enforcing a model which does not fully capture the environment, the performance of model-based techniques tends to degrade. Finally, solving the coupled problem of model selection and estimation of the parameters of the selected model is a difficult task, with non-trivial inference rules [3, 6].
Data-driven methods learn their mapping from data rather than from statistical modeling. This approach builds upon the fact that while in many applications coming up with accurate and tractable statistical modeling is difficult, we are often given access to data describing the setup. In a supervised setting considered henceforth, data is comprised of a training set consisting of $n_t$ pairs of inputs and their corresponding target values, denoted by $\mathcal{D} = \{(\mathbf{x}_t, \mathbf{s}_t)\}_{t=1}^{n_t}$.
Machine learning provides various data-driven methods which form an inference rule $f$ from the data $\mathcal{D}$. Broadly speaking and following the terminology of [8], these approaches can be divided into two main classes:
Generative models that use $\mathcal{D}$ to estimate the data generating distribution $\mathcal{P}$. Given the estimated distribution, denoted by $\hat{\mathcal{P}}_{\mathcal{D}}$, one then seeks the inference rule which minimizes the risk function with respect to $\hat{\mathcal{P}}_{\mathcal{D}}$, i.e.,
$$ f^{*} = \arg\min_{f \in \mathcal{F}} \mathcal{L}_{\hat{\mathcal{P}}_{\mathcal{D}}}(f), \tag{4} $$
where $\mathcal{L}_{\mathcal{P}}(\cdot)$ is defined in (3).
Discriminative models where data is used to directly form the inference rule as a form of end-to-end learning. Without access to the true distribution $\mathcal{P}$, one cannot directly optimize the risk function (3), typically resorting to the empirical risk given by
$$ \mathcal{L}_{\mathcal{D}}(f) = \frac{1}{n_t} \sum_{t=1}^{n_t} l(f, \mathbf{x}_t, \mathbf{s}_t). \tag{5} $$
To avoid overfitting, i.e., coming up with an inference rule which minimizes (5) by memorizing the data, one has to constrain the set $\mathcal{F}$. This requires imposing a structure on the mapping, which is often dictated by a set of parameters denoted by $\boldsymbol{\theta}$, taking values in some parameter set $\Theta$, as considered henceforth. Thus, the system mapping is written as $f_{\boldsymbol{\theta}}$, which is tuned via
$$ \boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta} \in \Theta} \mathcal{L}_{\mathcal{D}}(f_{\boldsymbol{\theta}}). \tag{6} $$
One can further divide discriminative models into the following categories:
Model-based discriminative models, where one has some domain knowledge that indicates what structure the inference rule should take. This allows using inference mappings with a relatively small number of parameters that are specific to the problem at hand. For instance, in the example presented in the sequel, we consider estimation in a jointly Gaussian setting, for which a suitable inference rule is known to take a linear form.
Model-agnostic discriminative models, which use highly-parameterized inference rules that can realize a broad set of abstract mappings. This is the common practice in deep learning for inference problems, which infer using deep neural networks (DNNs) trained from massive data sets.
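The empirical risk that discriminative learning minimizes can be sketched in a few lines. The snippet below is a minimal illustration, assuming the squared-error loss and a toy data set invented for the example; the function name `empirical_risk` is hypothetical and not part of the note.

```python
import numpy as np

def empirical_risk(f, X, S):
    """Average squared-error loss of the rule f over the pairs (x_t, s_t).

    X has shape (n_t, N_x), S has shape (n_t, N_s); f maps one input vector
    to one prediction vector.
    """
    predictions = np.stack([f(x) for x in X])
    return np.mean(np.sum((S - predictions) ** 2, axis=1))

# Toy data set (hypothetical): the targets equal twice the scalar inputs.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
S = 2.0 * X

risk_zero = empirical_risk(lambda x: np.zeros(1), X, S)  # rule that always outputs 0
risk_exact = empirical_risk(lambda x: 2.0 * x, X, S)     # rule that matches the data
```

Tuning the parameters of a constrained family of rules so that this quantity is as small as possible is what the discriminative approach does.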
The different design approaches are illustrated in Fig. 1.
IV-C Inference with Partially-Known Generative Distributions
The unprecedented success of deep learning in areas such as computer vision and natural language processing [5] notably boosted the popularity of model-agnostic discriminative models that rely on abstract, purely data-driven pipelines, trained with massive data sets, for inference. Specifically, by letting $f_{\boldsymbol{\theta}}$ be a DNN with parameters $\boldsymbol{\theta}$, one can train inference rules from data via (6) that operate in scenarios where analytical models are unknown or highly complex [2]. For instance, a statistical model relating an image of a dog and the breed of the dog is likely to be intractable, and thus, inference rules which rely on full domain knowledge or on estimating $\mathcal{P}$ are likely to be inaccurate. However, the abstractness and extreme parameterization of DNNs result in them often being treated as black boxes, while the training procedure of DNNs is typically lengthy, computationally intensive, and requires massive volumes of data. Furthermore, understanding how their predictions are obtained and how reliable they are tends to be quite challenging, and thus, deep learning lacks the interpretability, flexibility, versatility, and reliability of model-based techniques.
Unlike conventional deep learning domains, such as computer vision and natural language processing, in signal processing, one often has access to some level of reliable domain knowledge. Many problems in signal processing applications are characterized by faithful modeling based on the understanding of the underlying physics, the operation of the system, and models backed by extensive measurements. Nonetheless, existing modeling often includes parameters, e.g., channel coefficients and noise energy, which are specific to a given scenario, and are typically unknown in advance, though they can be estimated from data. The key question in scenarios involving such partial domain knowledge in addition to training data is which of the following design approaches is preferable:
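The generative learning approach, which uses $\mathcal{D}$ to estimate the missing parameters of the statistical model, and then forms the inference rule based on the estimated model as in (4).
The discriminative learning approach, which formulates the inference rule for a given parameter vector, and then uses $\mathcal{D}$ to directly optimize the resulting mapping as in (6).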
We tackle this fundamental question using a simple tractable scenario of linear estimation with partial domain knowledge, where it is known that the setting is linear, yet the exact linear mapping is unknown. In this setting, which is mathematically formulated in the following section, both approaches can be explicitly derived and analytically compared.
IV-D Case Study: Linear Estimation with Partially-Known Measurement Models
To provide an analytically tractable comparison between the aforementioned approaches to jointly leverage domain knowledge and data in inference, we consider a linear estimation scenario. Such scenarios are not only simple and enable rigorous analysis, but also correspond to a broad range of statistical estimation problems that are commonly encountered in signal processing applications. Here, the input $\mathbf{x}$ and the target $\mathbf{s}$ are real-valued vectors taking values in $\mathcal{X} = \mathbb{R}^{N_x}$ and in $\mathcal{S} = \mathbb{R}^{N_s}$, respectively. The loss measure is the squared-error loss, i.e., $l(f, \mathbf{x}, \mathbf{s}) = \|\mathbf{s} - f(\mathbf{x})\|^2$.
In the considered setting, one has prior knowledge that the target $\mathbf{s}$ admits a Gaussian distribution with mean $\boldsymbol{\mu}_s$ and covariance $\mathbf{C}_{ss}$, i.e., $\mathbf{s} \sim \mathcal{N}(\boldsymbol{\mu}_s, \mathbf{C}_{ss})$, and that the measurements follow a linear model
$$ \mathbf{x} = \mathbf{H}\mathbf{s} + \mathbf{w}, \qquad \mathbf{w} \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2 \mathbf{I}), \tag{7} $$
i.e., $\mathbf{w}$ is a Gaussian noise with mean $\boldsymbol{\mu}$ and covariance matrix $\sigma^2 \mathbf{I}$, and is assumed to be independent of $\mathbf{s}$. However, the available domain knowledge is partial in the sense that the parameters $\mathbf{H}$ and $\boldsymbol{\mu}$ are unknown. Consequently, while the generative distribution $\mathcal{P}$ is known to be jointly Gaussian, its parameters are unknown. Thus, conventional Bayesian estimators, such as the minimum MSE (MMSE) and LMMSE estimators, cannot be implemented for this case. Nonetheless, we are given access to a data set $\mathcal{D} = \{(\mathbf{x}_t, \mathbf{s}_t)\}_{t=1}^{n_t}$ comprised of $n_t$ i.i.d. samples drawn from $\mathcal{P}$. An illustration of the setting is depicted in Fig. 2(a). The goal is to utilize the available domain knowledge and the data $\mathcal{D}$ in order to formulate an inference mapping that achieves the lowest generalization error, i.e., minimizes (3) with the squared-error loss.
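To make the setting concrete, the following sketch draws a training set from a linear Gaussian measurement model of the kind described above. It assumes white noise with variance `sigma2` and mean `mu`; the helper name `sample_linear_gaussian`, the dimensions, and all numerical values are hypothetical choices for illustration only.

```python
import numpy as np

def sample_linear_gaussian(n_t, H, mu, sigma2, mu_s, C_ss, rng):
    """Draw n_t i.i.d. pairs (x_t, s_t) with x = H s + w and w ~ N(mu, sigma2 * I)."""
    N_x, _ = H.shape
    S = rng.multivariate_normal(mu_s, C_ss, size=n_t)            # targets, (n_t, N_s)
    W = mu + np.sqrt(sigma2) * rng.standard_normal((n_t, N_x))   # noise, (n_t, N_x)
    X = S @ H.T + W                                              # measurements
    return X, S

rng = np.random.default_rng(0)
N_x, N_s, n_t = 4, 3, 1000          # illustrative sizes, not the note's values
H = rng.standard_normal((N_x, N_s))
mu = np.ones(N_x)
mu_s = np.zeros(N_s)
C_ss = np.eye(N_s)
X, S = sample_linear_gaussian(n_t, H, mu, 0.1, mu_s, C_ss, rng)
```

In the partially-known setting, a learner would see only `X` and `S` together with `mu_s`, `C_ss`, and `sigma2`, but not `H` or `mu`.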
V Solution: Data-Aided Linear Estimators
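In the following we exemplify a generative learning estimator (in Subsection V-A) and a discriminative learning estimator (in Subsection V-B), which both aim to recover the random signal $\mathbf{s}$ from the observed $\mathbf{x}$ for the partially known jointly-Gaussian setting described in Subsection IV-D. To this end, we use the following notations of sample means
$$ \bar{\mathbf{x}} = \frac{1}{n_t} \sum_{t=1}^{n_t} \mathbf{x}_t, \qquad \bar{\mathbf{s}} = \frac{1}{n_t} \sum_{t=1}^{n_t} \mathbf{s}_t, \tag{8} $$
and the sample covariance/cross-covariance matrices
$$ \hat{\mathbf{C}}_{sx} = \frac{1}{n_t} \sum_{t=1}^{n_t} (\mathbf{s}_t - \bar{\mathbf{s}})(\mathbf{x}_t - \bar{\mathbf{x}})^T, \tag{9a} $$
$$ \hat{\mathbf{C}}_{ss} = \frac{1}{n_t} \sum_{t=1}^{n_t} (\mathbf{s}_t - \bar{\mathbf{s}})(\mathbf{s}_t - \bar{\mathbf{s}})^T, \tag{9b} $$
$$ \hat{\mathbf{C}}_{xx} = \frac{1}{n_t} \sum_{t=1}^{n_t} (\mathbf{x}_t - \bar{\mathbf{x}})(\mathbf{x}_t - \bar{\mathbf{x}})^T. \tag{9c} $$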
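The sample means and sample (cross-)covariance matrices can be computed directly from the data matrices. The sketch below assumes the samples are stacked as rows and uses a $1/n_t$ normalization, which is an assumption of this example; the function name `sample_moments` is hypothetical.

```python
import numpy as np

def sample_moments(X, S):
    """Return sample means and sample (cross-)covariances of the rows of X and S."""
    n_t = X.shape[0]
    x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
    Xc, Sc = X - x_bar, S - s_bar          # centered data
    C_sx = Sc.T @ Xc / n_t                 # cross-covariance, shape (N_s, N_x)
    C_ss = Sc.T @ Sc / n_t                 # covariance of s, shape (N_s, N_s)
    C_xx = Xc.T @ Xc / n_t                 # covariance of x, shape (N_x, N_x)
    return x_bar, s_bar, C_sx, C_ss, C_xx

# Small random example with hypothetical sizes.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((50, 4))
S_demo = rng.standard_normal((50, 3))
x_bar, s_bar, C_sx, C_ss, C_xx = sample_moments(X_demo, S_demo)
```

With this normalization the covariance estimates agree with `np.cov(..., rowvar=False, bias=True)`.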
V-A Generative Learning Estimator
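The generative approach uses data to estimate the missing domain knowledge parameters. In order to estimate the distribution $\mathcal{P}$ based on the model in (7), we need to estimate the matrix $\mathbf{H}$ and the noise mean $\boldsymbol{\mu}$ from the training data, and then use the estimates, denoted $\hat{\mathbf{H}}$ and $\hat{\boldsymbol{\mu}}$, to form the linear estimator, as illustrated in Fig. 2(b).
The unknown $\mathbf{H}$ and $\boldsymbol{\mu}$ are fitted to the data using the maximum likelihood rule. Letting $\mathcal{P}_{\mathbf{H},\boldsymbol{\mu}}$ be the joint distribution of $\mathbf{x}$ and $\mathbf{s}$ for given $\mathbf{H}$ and $\boldsymbol{\mu}$, the log-likelihood can be written as
$$ \log \mathcal{P}_{\mathbf{H},\boldsymbol{\mu}}(\mathcal{D}) = \sum_{t=1}^{n_t} \log \mathcal{P}_{\mathbf{H},\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{s}_t) \tag{10} $$
$$ \phantom{\log \mathcal{P}_{\mathbf{H},\boldsymbol{\mu}}(\mathcal{D})} = \text{const} - \frac{1}{2\sigma^2} \sum_{t=1}^{n_t} \|\mathbf{x}_t - \mathbf{H}\mathbf{s}_t - \boldsymbol{\mu}\|^2. \tag{11} $$
In (11), const denotes a constant term, which is not a function of the unknown parameters $\mathbf{H}$ and $\boldsymbol{\mu}$.
Under the assumption that $\mathcal{D}$ is comprised of i.i.d. samples drawn from a Gaussian generative distribution, $\mathcal{P}$, the maximum likelihood estimates are given by
$$ \hat{\mathbf{H}} = \hat{\mathbf{C}}_{sx}^T \hat{\mathbf{C}}_{ss}^{-1}, \qquad \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} - \hat{\mathbf{H}} \bar{\mathbf{s}}, \tag{12} $$
where $\bar{\mathbf{x}}$ and $\bar{\mathbf{s}}$ are the sample means in (8), and $\hat{\mathbf{C}}_{sx}$ and $\hat{\mathbf{C}}_{ss}$ are the sample cross-covariance and covariance matrices in (9).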
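Under white Gaussian noise, maximizing the likelihood over the measurement matrix and the noise mean amounts to a least-squares regression of the measurements on the targets. The sketch below implements that fit and checks it on noise-free data, where it should recover the true parameters; the function name `fit_linear_model` and the test values are hypothetical.

```python
import numpy as np

def fit_linear_model(X, S):
    """Least-squares (Gaussian maximum likelihood) fit of x = H s + mu + noise."""
    n_t = X.shape[0]
    x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
    Xc, Sc = X - x_bar, S - s_bar
    C_xs = Xc.T @ Sc / n_t                      # sample cross-covariance of x and s
    C_ss = Sc.T @ Sc / n_t                      # sample covariance of s
    # H_hat = C_xs @ inv(C_ss), computed with a linear solve (C_ss is symmetric).
    H_hat = np.linalg.solve(C_ss, C_xs.T).T
    mu_hat = x_bar - H_hat @ s_bar
    return H_hat, mu_hat

# Noise-free sanity check with hypothetical parameters.
rng = np.random.default_rng(0)
H_true = rng.standard_normal((4, 3))
mu_true = rng.standard_normal(4)
S = rng.standard_normal((50, 3))
X = S @ H_true.T + mu_true
H_hat, mu_hat = fit_linear_model(X, S)
```

The fit requires the sample covariance of the targets to be non-singular, which needs more training pairs than target entries.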
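Having estimated the generative model, which is the jointly Gaussian distribution $\hat{\mathcal{P}}_{\mathcal{D}}$ induced by
$$ \mathbf{x} = \hat{\mathbf{H}}\mathbf{s} + \mathbf{w}, \qquad \mathbf{w} \sim \mathcal{N}(\hat{\boldsymbol{\mu}}, \sigma^2 \mathbf{I}), \qquad \mathbf{s} \sim \mathcal{N}(\boldsymbol{\mu}_s, \mathbf{C}_{ss}), \tag{13} $$
we proceed to finding the inference rule which minimizes the risk function with respect to $\hat{\mathcal{P}}_{\mathcal{D}}$, i.e.,
$$ f^{*} = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{(\mathbf{x},\mathbf{s}) \sim \hat{\mathcal{P}}_{\mathcal{D}}}\{ \|\mathbf{s} - f(\mathbf{x})\|^2 \}, \tag{14} $$
where we substitute the squared-error loss and (3) into (4). Since the estimated distribution, $\hat{\mathcal{P}}_{\mathcal{D}}$, is a jointly Gaussian distribution, the solution of (14) is the LMMSE estimator under the estimated distribution, which is given by
$$ \hat{\mathbf{s}}_{\rm G} \overset{(a)}{=} \boldsymbol{\mu}_s + \mathbf{C}_{ss} \hat{\mathbf{H}}^T \left( \hat{\mathbf{H}} \mathbf{C}_{ss} \hat{\mathbf{H}}^T + \sigma^2 \mathbf{I} \right)^{-1} \left( \mathbf{x} - \hat{\mathbf{H}} \boldsymbol{\mu}_s - \hat{\boldsymbol{\mu}} \right) \overset{(b)}{=} \boldsymbol{\mu}_s + \left( \hat{\mathbf{H}}^T \hat{\mathbf{H}} + \sigma^2 \mathbf{C}_{ss}^{-1} \right)^{-1} \hat{\mathbf{H}}^T \left( \mathbf{x} - \hat{\mathbf{H}} \boldsymbol{\mu}_s - \hat{\boldsymbol{\mu}} \right). \tag{15} $$
Here, (a) follows from the estimated jointly Gaussian model (13), and (b) is obtained using the matrix inversion lemma.
Applying the matrix inversion lemma requires the computation of the inverse of an $N_s \times N_s$ matrix instead of an $N_x \times N_x$ matrix as in the direct expression for the LMMSE estimate. This reduces the computational complexity when $N_s < N_x$, i.e., the target signal is of a lower dimensionality compared with the input signal. By substituting (12) in (15), one obtains the generative learning estimator as
$$ \hat{\mathbf{s}}_{\rm G} = \boldsymbol{\mu}_s + \left( \hat{\mathbf{C}}_{ss}^{-1} \hat{\mathbf{C}}_{sx} \hat{\mathbf{C}}_{sx}^T \hat{\mathbf{C}}_{ss}^{-1} + \sigma^2 \mathbf{C}_{ss}^{-1} \right)^{-1} \hat{\mathbf{C}}_{ss}^{-1} \hat{\mathbf{C}}_{sx} \left( \mathbf{x} - \bar{\mathbf{x}} - \hat{\mathbf{C}}_{sx}^T \hat{\mathbf{C}}_{ss}^{-1} (\boldsymbol{\mu}_s - \bar{\mathbf{s}}) \right). \tag{16} $$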
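The equivalence between the direct LMMSE expression and the form obtained via the matrix inversion lemma can be checked numerically. In the sketch below, `H` and `mu` play the role of the fitted measurement matrix and noise mean, `sigma2` is the known noise variance, and `mu_s`, `C_ss` are the known prior moments; all function names and numerical values are hypothetical.

```python
import numpy as np

def lmmse_direct(x, H, mu, sigma2, mu_s, C_ss):
    """LMMSE estimate that inverts an N_x x N_x matrix."""
    N_x = H.shape[0]
    gain = C_ss @ H.T @ np.linalg.inv(H @ C_ss @ H.T + sigma2 * np.eye(N_x))
    return mu_s + gain @ (x - H @ mu_s - mu)

def lmmse_inversion_lemma(x, H, mu, sigma2, mu_s, C_ss):
    """Same estimate, rewritten so that only N_s x N_s matrices are inverted."""
    gain = np.linalg.solve(H.T @ H + sigma2 * np.linalg.inv(C_ss), H.T)
    return mu_s + gain @ (x - H @ mu_s - mu)

rng = np.random.default_rng(0)
N_x, N_s, sigma2 = 6, 2, 0.3
H = rng.standard_normal((N_x, N_s))
mu = rng.standard_normal(N_x)
mu_s = rng.standard_normal(N_s)
B = rng.standard_normal((N_s, N_s))
C_ss = B @ B.T + np.eye(N_s)                 # symmetric positive definite prior covariance
x = rng.standard_normal(N_x)

s_direct = lmmse_direct(x, H, mu, sigma2, mu_s, C_ss)
s_lemma = lmmse_inversion_lemma(x, H, mu, sigma2, mu_s, C_ss)
```

The second form is the cheaper one when the target has fewer entries than the measurement vector, as in this example.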
V-B Discriminative Learning Estimator
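In this subsection, we consider a discriminative learning approach for the considered estimation problem. Here, the partial domain knowledge regarding the underlying joint Gaussianity indicates that the estimator should take a linear form, i.e.,
$$ f_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}, \qquad \boldsymbol{\theta} = \{\mathbf{A}, \mathbf{b}\}. \tag{17} $$
For the considered parametric model, the available data is used to directly identify the parameters which minimize the empirical risk, as illustrated in Fig. 2(c), i.e.,
$$ \boldsymbol{\theta}^{*} = \arg\min_{\mathbf{A}, \mathbf{b}} \frac{1}{n_t} \sum_{t=1}^{n_t} \|\mathbf{s}_t - \mathbf{A}\mathbf{x}_t - \mathbf{b}\|^2, \tag{18} $$
whose solution is given by
$$ \mathbf{A}^{*} = \hat{\mathbf{C}}_{sx} \hat{\mathbf{C}}_{xx}^{-1}, \qquad \mathbf{b}^{*} = \bar{\mathbf{s}} - \mathbf{A}^{*} \bar{\mathbf{x}}, \tag{19} $$
where the sample means and sample covariance matrices are defined in (8) and (9), and it is assumed that $\hat{\mathbf{C}}_{xx}$ is a non-singular matrix. By substituting (19) into (17), we obtain the discriminative learned estimator, which is given by
$$ \hat{\mathbf{s}}_{\rm D} = \hat{\mathbf{C}}_{sx} \hat{\mathbf{C}}_{xx}^{-1} (\mathbf{x} - \bar{\mathbf{x}}) + \bar{\mathbf{s}}. \tag{20} $$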
It can be verified that the learned estimator in (20) coincides with the sample-LMMSE estimator that is obtained by plugging the sample-mean and sample covariance matrices into the LMMSE estimator.
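The sample-LMMSE form of the discriminative estimator is the same as an ordinary least-squares fit of the targets on the inputs with an intercept. The sketch below checks this on synthetic data; the function name `fit_discriminative` and the data-generating choices are hypothetical and serve only as an illustration.

```python
import numpy as np

def fit_discriminative(X, S):
    """Affine rule s_hat = A x + b built from sample means and covariances."""
    n_t = X.shape[0]
    x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
    Xc, Sc = X - x_bar, S - s_bar
    C_sx = Sc.T @ Xc / n_t
    C_xx = Xc.T @ Xc / n_t
    A = np.linalg.solve(C_xx, C_sx.T).T      # C_sx @ inv(C_xx); C_xx is symmetric
    b = s_bar - A @ x_bar
    return A, b

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
S = X @ rng.standard_normal((4, 3)) + 0.5 * rng.standard_normal((200, 3))
A, b = fit_discriminative(X, S)

# Reference: least-squares regression of s on the augmented input [x, 1].
X_aug = np.hstack([X, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(X_aug, S, rcond=None)
A_ls, b_ls = coef[:-1].T, coef[-1]
```

Both routes minimize the same empirical squared-error risk, so they return the same affine rule whenever the sample covariance of the inputs is non-singular.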
VI Discussion and Comparison
By focusing on the simple yet common scenario of linear estimation with a partially-known measurement model, we obtained closed-form expressions for the suitable estimators attained via generative learning and via discriminative learning. This allows us to compare the resulting estimators, and thus draw insights into the general approaches of generative versus discriminative learning in the context of signal processing applications. In the following we provide a theoretical analysis of the estimators in Subsection VI-A, followed by a qualitative comparison discussed in Subsection VI-B. A simulation study is presented in Section VII.
VI-A Theoretical Comparison
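We next provide an analysis of the estimators obtained via generative learning in (16) and via discriminative learning in (20), by studying their behavior in asymptotic regimes. To this end, we first study the asymptotic setting where the number of samples is arbitrarily large, after which we inspect the case of high signal-to-noise ratio (SNR), i.e., $\sigma^2 \to 0$.
Asymptotic analysis: Let us inspect the derived estimators in the asymptotic case when $n_t \to \infty$ and the data is comprised of i.i.d. samples (i.e., an ergodic and stationary scenario). In this case, the discriminative learning estimator in (20) converges to the LMMSE estimator, i.e.,
$$ \lim_{n_t \to \infty} \hat{\mathbf{s}}_{\rm D} = \mathbf{C}_{sx} \mathbf{C}_{xx}^{-1} (\mathbf{x} - \boldsymbol{\mu}_x) + \boldsymbol{\mu}_s, \tag{21} $$
where $\mathbf{C}_{sx}$ and $\mathbf{C}_{xx}$ are the true covariance matrices, and $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_s$ are the true expected values.
Similarly, for $n_t \to \infty$ the generative learning estimator stated in (16) converges to
$$ \lim_{n_t \to \infty} \hat{\mathbf{s}}_{\rm G} = \boldsymbol{\mu}_s + \left( \mathbf{C}_{ss}^{-1} \mathbf{C}_{sx} \mathbf{C}_{sx}^T \mathbf{C}_{ss}^{-1} + \sigma^2 \mathbf{C}_{ss}^{-1} \right)^{-1} \mathbf{C}_{ss}^{-1} \mathbf{C}_{sx} (\mathbf{x} - \boldsymbol{\mu}_x). \tag{22} $$
When the linear model in (7) holds, it can be verified that
$$ \mathbf{C}_{sx} = \mathbf{C}_{ss} \mathbf{H}^T, \qquad \mathbf{C}_{xx} = \mathbf{H} \mathbf{C}_{ss} \mathbf{H}^T + \sigma^2 \mathbf{I}, \tag{23} $$
and substituting (23) into (22), combined with the matrix inversion lemma, yields
$$ \lim_{n_t \to \infty} \hat{\mathbf{s}}_{\rm G} = \boldsymbol{\mu}_s + \mathbf{C}_{sx} \mathbf{C}_{xx}^{-1} (\mathbf{x} - \boldsymbol{\mu}_x) = \lim_{n_t \to \infty} \hat{\mathbf{s}}_{\rm D}. \tag{24} $$
Thus, asymptotically, if the true generative model is linear and given by (7), then the two estimators coincide.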
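The asymptotic agreement of the two estimators under a correctly specified linear model can be illustrated with a large synthetic data set. The sketch below compares the gain matrices (the matrices multiplying the centered input) of both learned estimators with the gain of the true LMMSE estimator; the dimensions, seed, and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_x, N_s, n_t, sigma2 = 3, 2, 200_000, 0.5
H = rng.standard_normal((N_x, N_s))
mu = rng.standard_normal(N_x)
mu_s = np.zeros(N_s)
C_ss = np.array([[1.0, 0.3], [0.3, 2.0]])      # known prior covariance of s

# Large training set drawn from the linear Gaussian model.
S = rng.multivariate_normal(mu_s, C_ss, size=n_t)
X = S @ H.T + mu + np.sqrt(sigma2) * rng.standard_normal((n_t, N_x))

Xc, Sc = X - X.mean(axis=0), S - S.mean(axis=0)
C_sx_hat = Sc.T @ Xc / n_t
C_xx_hat = Xc.T @ Xc / n_t
C_ss_hat = Sc.T @ Sc / n_t

# Discriminative gain: sample-LMMSE.
A_disc = C_sx_hat @ np.linalg.inv(C_xx_hat)

# Generative gain: fit H first, then plug it into the LMMSE form with the known prior.
H_hat = C_sx_hat.T @ np.linalg.inv(C_ss_hat)
A_gen = C_ss @ H_hat.T @ np.linalg.inv(H_hat @ C_ss @ H_hat.T + sigma2 * np.eye(N_x))

# Gain of the LMMSE estimator computed from the true parameters.
A_true = C_ss @ H.T @ np.linalg.inv(H @ C_ss @ H.T + sigma2 * np.eye(N_x))

gap_disc = np.max(np.abs(A_disc - A_true))
gap_gen = np.max(np.abs(A_gen - A_true))
```

Both gaps shrink as the number of training pairs grows, in line with the asymptotic analysis.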
Asymptotic analysis under misspecified (nonlinear) model: Often in practice, the generative model is nonlinear. For instance, consider the case where the true generative model is not the linear one in (7), but is given by
$$ \mathbf{x} = \mathbf{h}(\mathbf{s}) + \mathbf{w}, \tag{25} $$
where the measurement function, $\mathbf{h}(\cdot)$, is a nonlinear function, and the statistical properties of $\mathbf{s}$ and $\mathbf{w}$ are the same as described in Subsection IV-D. Although the model is nonlinear, the estimator may be designed based on the linear model in (7), either due to mismatches, or due to intentional linearization carried out to simplify the problem.
When the true generative model is nonlinear (i.e., under a misspecified model), the two linear estimators $\hat{\mathbf{s}}_{\rm G}$ and $\hat{\mathbf{s}}_{\rm D}$ are different even asymptotically. The asymptotic estimators in this case are given by (21) and (22), but (23) does not hold, and thus, here $\hat{\mathbf{s}}_{\rm G}$ does not coincide with the LMMSE estimator, as it does in (24) for the linear generative model. In this case, the discriminative learning approach is asymptotically preferred, being the LMMSE estimator, while the generative learning approach, which is based on a mismatched generative model, yields a linear estimate that differs from the LMMSE estimator, and is thus sub-optimal.
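The effect of a misspecified measurement model can be illustrated with a toy nonlinear example. The sketch below assumes, purely for illustration, an elementwise cubic measurement function, a standard Gaussian target, and zero-mean white noise; none of these choices come from the note.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, N, sigma2 = 20_000, 20_000, 2, 0.1

def draw(n):
    """Hypothetical nonlinear model: x = s**3 + w (elementwise), s ~ N(0, I)."""
    S = rng.standard_normal((n, N))
    X = S ** 3 + np.sqrt(sigma2) * rng.standard_normal((n, N))
    return X, S

X, S = draw(n_train)
x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
Xc, Sc = X - x_bar, S - s_bar
C_sx = Sc.T @ Xc / n_train
C_xx = Xc.T @ Xc / n_train
C_ss_hat = Sc.T @ Sc / n_train

# Discriminative: sample-LMMSE, fitted end-to-end.
A_disc = C_sx @ np.linalg.inv(C_xx)
b_disc = s_bar - A_disc @ x_bar

# Generative: fit the (mismatched) linear model x = H s + mu + w, then apply
# the LMMSE form with the known prior mu_s = 0 and C_ss = I.
H_hat = C_sx.T @ np.linalg.inv(C_ss_hat)
mu_hat = x_bar - H_hat @ s_bar
A_gen = H_hat.T @ np.linalg.inv(H_hat @ H_hat.T + sigma2 * np.eye(N))
b_gen = -A_gen @ mu_hat

# Evaluate both affine estimators on fresh data.
X_test, S_test = draw(n_test)
mse_disc = np.mean(np.sum((S_test - (X_test @ A_disc.T + b_disc)) ** 2, axis=1))
mse_gen = np.mean(np.sum((S_test - (X_test @ A_gen.T + b_gen)) ** 2, axis=1))
```

Both estimators are affine in the input, but only the discriminative one is fitted to minimize the squared error directly, so its test MSE is lower in this mismatched example.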
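High SNR regime: Another setting in which one can rigorously compare the estimators is the high SNR regime, where $\sigma^2 \to 0$. To see this, we note that for $\sigma^2 \to 0$, (15) is reduced to
$$ \hat{\mathbf{s}}_{\rm G} = (\hat{\mathbf{H}}^T \hat{\mathbf{H}})^{-1} \hat{\mathbf{H}}^T (\mathbf{x} - \hat{\boldsymbol{\mu}}) \overset{(a)}{=} \bar{\mathbf{s}} + \left( \hat{\mathbf{C}}_{ss}^{-1} \hat{\mathbf{C}}_{sx} \hat{\mathbf{C}}_{sx}^T \hat{\mathbf{C}}_{ss}^{-1} \right)^{-1} \hat{\mathbf{C}}_{ss}^{-1} \hat{\mathbf{C}}_{sx} (\mathbf{x} - \bar{\mathbf{x}}), \tag{26} $$
where (a) is obtained by substituting (12).
In this regime, the measurements obey the noise-free relationship
$$ \mathbf{x} = \mathbf{H}\mathbf{s} + \boldsymbol{\mu}. \tag{27} $$
Similarly, the sample covariance matrices in (9) satisfy
$$ \hat{\mathbf{C}}_{sx} = \hat{\mathbf{C}}_{ss} \mathbf{H}^T, \tag{28} $$
$$ \hat{\mathbf{C}}_{xx} = \mathbf{H} \hat{\mathbf{C}}_{ss} \mathbf{H}^T. \tag{29} $$
By substituting these results, together with $\bar{\mathbf{x}} = \mathbf{H}\bar{\mathbf{s}} + \boldsymbol{\mu}$, in (26) and in (20), one obtains
$$ \hat{\mathbf{s}}_{\rm G} = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T (\mathbf{x} - \boldsymbol{\mu}), \tag{30} $$
$$ \hat{\mathbf{s}}_{\rm D} = \bar{\mathbf{s}} + \hat{\mathbf{C}}_{ss} \mathbf{H}^T \left( \mathbf{H} \hat{\mathbf{C}}_{ss} \mathbf{H}^T \right)^{-1} (\mathbf{x} - \bar{\mathbf{x}}). \tag{31} $$
The solution in (30) is the least squares estimator for linear regression, which is based on the pseudo-inverse of $\mathbf{H}$ (under the assumption that $\mathbf{H}$ has full column rank). This estimator ignores the prior information, since the posterior is accurate ($\sigma^2 \to 0$). Moreover, by substituting (27) in (30), it can be seen that the MSE of the estimator in this high SNR regime satisfies
$$ \mathbb{E}\{ (\hat{\mathbf{s}}_{\rm G} - \mathbf{s})(\hat{\mathbf{s}}_{\rm G} - \mathbf{s})^T \} = \mathbf{0}. \tag{32} $$
Thus, we can accurately recover the target vector from the input under the domain knowledge information that $\sigma^2 \to 0$, as long as $\mathbf{H}$ has full column rank.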
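The exact recovery obtained by the generative route in the noise-free limit can be checked numerically. The sketch below assumes a tall measurement matrix with full column rank and noise-free data; the dimensions and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_x, N_s, n_t = 5, 3, 20
H = rng.standard_normal((N_x, N_s))     # tall matrix, full column rank with probability one
mu = rng.standard_normal(N_x)

# Noise-free training data: x_t = H s_t + mu.
S = rng.standard_normal((n_t, N_s))
X = S @ H.T + mu

# Fit the linear model by least squares (the normalization by n_t cancels out).
x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
Xc, Sc = X - x_bar, S - s_bar
H_hat = (Xc.T @ Sc) @ np.linalg.inv(Sc.T @ Sc)
mu_hat = x_bar - H_hat @ s_bar

# High-SNR generative estimate: pseudo-inverse of the fitted matrix.
s_new = rng.standard_normal(N_s)
x_new = H @ s_new + mu
s_est = np.linalg.pinv(H_hat) @ (x_new - mu_hat)
```

With noise-free data the fitted matrix and offset match the true ones, so the estimate reproduces the target up to numerical precision.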
The solution in (31) is a weighted least-squares solution with $\hat{\mathbf{C}}_{ss}$ as the weight matrix, under the assumption that $\mathbf{H}$ has full row rank. However, by substituting (27) in (31), it can be verified that the MSE of $\hat{\mathbf{s}}_{\rm D}$ in this high SNR regime is a positive definite matrix, i.e.,
$$ \mathbb{E}\{ (\hat{\mathbf{s}}_{\rm D} - \mathbf{s})(\hat{\mathbf{s}}_{\rm D} - \mathbf{s})^T \} \succ \mathbf{0}, \tag{33} $$
as long as $n_t$ is finite and $\hat{\mathbf{C}}_{ss}$ is not a scalar multiple of the identity matrix. This follows since the discriminative learned estimator does not use some of its partial domain knowledge, as $\hat{\mathbf{s}}_{\rm D}$ in (20) is not a function of $\sigma^2$ (as well as of $\mathbf{C}_{ss}$). This becomes a disadvantage when this domain knowledge carries important information that can improve the overall performance, as is the case here when $\sigma^2 \to 0$.
VI-B Qualitative Comparison
The theoretical comparison in Subsection VI-A allows us to rigorously identify scenarios in which one approach is preferable over the other, e.g., that discriminative learning is more suitable for handling modeling mismatches, while generative learning yields more accurate estimators at high SNR. Another aspect in which the estimators are comparable is their sample complexity. The discriminative linear estimator in (20) requires the computation of the inverse sample covariance matrix of $\mathbf{x}$, $\hat{\mathbf{C}}_{xx}^{-1}$, from (9c). On the other hand, the generative linear estimator in (16) requires the computation of the inverse sample covariance matrix of $\mathbf{s}$, $\hat{\mathbf{C}}_{ss}^{-1}$, from (9b). The dimensions of the matrices $\hat{\mathbf{C}}_{xx}$ and $\hat{\mathbf{C}}_{ss}$ are different. Thus, for a limited dataset, it may be easier to implement one of these estimators and guarantee the stability of the inverse covariance matrix.
In particular, in settings where the sample size ($n_t$) is comparable to the observation dimension ($N_x$), the discriminative sample-LMMSE estimator exhibits severe performance degradation. This is because the sample covariance matrix is not well-conditioned in the small sample size regime, and inverting it amplifies the estimation error. Similarly, if $n_t$ is comparable to the estimated vector dimension ($N_s$), the performance of the generative learning estimator in (16) degrades. In such cases, the available information is not enough to reveal a sufficiently good model that fits the data, and it can be misleading due to the presence of noise and possible outliers.
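The ill-conditioning of a sample covariance matrix in the small sample regime can be seen by comparing condition numbers. The sketch below assumes data with an identity true covariance; the function name `sample_cov_condition`, the dimension, and the sample sizes are hypothetical.

```python
import numpy as np

def sample_cov_condition(n_t, N_x, rng):
    """Condition number of the sample covariance of n_t standard Gaussian vectors."""
    X = rng.standard_normal((n_t, N_x))      # true covariance is the identity
    Xc = X - X.mean(axis=0)
    return np.linalg.cond(Xc.T @ Xc / n_t)

rng = np.random.default_rng(0)
N_x = 20
cond_few = sample_cov_condition(N_x + 2, N_x, rng)      # sample size barely above N_x
cond_many = sample_cov_condition(100 * N_x, N_x, rng)   # sample size much larger than N_x
```

A large condition number means that inverting the matrix strongly amplifies estimation errors, which is the mechanism behind the degradation described above.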
A possible additional operational gain of generative learning stems from the fact that it estimates the underlying distribution, which can be used for other tasks. The discriminative learning approach is highly task-specific, learning only what must be learned to produce its estimate, and is thus not adaptive to a new task.
VII Numerical Comparison
In this section we evaluate the considered estimators in comparison with the oracle MMSE, which is the MMSE estimator for the model in (7) with known $\mathbf{H}$ and $\boldsymbol{\mu}$ (the numerical study is available online at https://gist.github.com/nirshlezinger1/3e92bc16d28c8f2f7feb5031e32b5618). The purpose of this study is to numerically validate the theoretical findings reported in the previous section, and to empirically compare the considered approaches to combine data with partial domain knowledge in a non-asymptotic regime.
We simulate jointly-Gaussian signals via the signal model in (7) with $N_x$ observations and $N_s$ target entries. The target has zero-mean and covariance matrix $\mathbf{C}_{ss}$ representing spatial exponential decay, whose $(i,j)$th entry decays exponentially with $|i-j|$. The measurement matrix $\mathbf{H}$ is generated randomly with i.i.d. zero-mean unit variance Gaussian entries. The results are averaged over Monte Carlo simulations.
We numerically evaluate the performance of the discriminative learning estimator of (20) and the generative learning estimator of (16). We consider two scenarios: 1) a setting in which $\mathbf{C}_{ss}$ and $\sigma^2$ are accurately known; and 2) a mismatched case in which the estimator approximates $\mathbf{C}_{ss}$ as the identity matrix. Since the discriminative estimator is invariant to the prior distribution of $\mathbf{s}$, the presence of mismatch only affects the generative approach.
The resulting MSE values versus the SNR for a fixed number of data samples are reported in Fig. 3, while Fig. 4 illustrates the MSE curves versus $n_t$ for a fixed SNR. Observing Fig. 3, we note that in the high SNR regime, all estimators achieve performance within a minor gap of the MMSE. However, at lower SNR values, the generative estimator, which fully knows $\mathbf{C}_{ss}$, outperforms the discriminative approach due to its ability to incorporate the prior knowledge of $\sigma^2$ and of the statistical moments of $\mathbf{s}$. Nonetheless, in the presence of small mismatches in $\mathbf{C}_{ss}$, the discriminative approach yields improved MSE, indicating its ability to better cope with modeling mismatches compared with generative learning. In Fig. 4 we observe that the effect of the misspecified model does not vanish when the number of samples increases, and the mismatched generative model remains within a notable gap from the MMSE, while both the discriminative learning estimator and the non-mismatched generative one approach the MMSE as $n_t$ grows. These findings are all in line with the theoretical analysis presented in Subsection VI-A.
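A small-scale version of such a comparison can be sketched as follows. This is not the authors' notebook: the dimensions, noise level, sample sizes, identity prior covariance, and all helper names are hypothetical choices made to keep the example short. It evaluates the discriminative estimator, the generative estimator with an accurate prior, and the oracle LMMSE estimator on a common test set.

```python
import numpy as np

rng = np.random.default_rng(0)
N_x, N_s, n_t, n_test, sigma2 = 4, 3, 30, 20_000, 1.0
H = rng.standard_normal((N_x, N_s))    # true measurement matrix (unknown to the learners)
mu = np.zeros(N_x)                     # true noise mean (unknown to the learners)
mu_s = np.zeros(N_s)                   # known prior mean
C_ss = np.eye(N_s)                     # known prior covariance (illustrative choice)

def draw(n):
    """Draw n pairs from x = H s + w with w ~ N(mu, sigma2 * I)."""
    S = rng.multivariate_normal(mu_s, C_ss, size=n)
    X = S @ H.T + mu + np.sqrt(sigma2) * rng.standard_normal((n, N_x))
    return X, S

def lmmse_gain(H_, C_prior):
    """LMMSE gain for measurement matrix H_ and prior covariance C_prior."""
    return C_prior @ H_.T @ np.linalg.inv(H_ @ C_prior @ H_.T + sigma2 * np.eye(N_x))

def mse(A, b, X, S):
    """Empirical MSE of the affine estimator s_hat = A x + b."""
    return np.mean(np.sum((S - (X @ A.T + b)) ** 2, axis=1))

X, S = draw(n_t)
x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
Xc, Sc = X - x_bar, S - s_bar
C_sx, C_xx, C_ss_hat = Sc.T @ Xc / n_t, Xc.T @ Xc / n_t, Sc.T @ Sc / n_t

# Discriminative learning: sample-LMMSE.
A_disc = C_sx @ np.linalg.inv(C_xx)
b_disc = s_bar - A_disc @ x_bar

# Generative learning: fit H and mu, then form the LMMSE estimator with the known prior.
H_hat = C_sx.T @ np.linalg.inv(C_ss_hat)
mu_hat = x_bar - H_hat @ s_bar
A_gen = lmmse_gain(H_hat, C_ss)
b_gen = mu_s - A_gen @ (H_hat @ mu_s + mu_hat)

# Oracle: LMMSE estimator with the true H and mu.
A_or = lmmse_gain(H, C_ss)
b_or = mu_s - A_or @ (H @ mu_s + mu)

X_test, S_test = draw(n_test)
mse_disc = mse(A_disc, b_disc, X_test, S_test)
mse_gen = mse(A_gen, b_gen, X_test, S_test)
mse_oracle = mse(A_or, b_or, X_test, S_test)
```

Sweeping `sigma2` or `n_t` in an outer loop and averaging over repeated draws would produce MSE curves versus the SNR and versus the number of training samples.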
VIII What We Have Learned
In this lecture note, we reviewed two different approaches to combine partial domain knowledge with data for forming an inference rule. The first approach is related to the machine learning notion of generative learning, which operates in two stages: it first uses data to estimate the missing components in the statistical description of the problem at hand; then, the estimated statistical model is used to form the inference rule. The second approach is task-oriented, leveraging the available domain knowledge to specify the structure of the inference rule, while using data to optimize the resulting mapping in an end-to-end fashion.
To compare the approaches in a manner that is relevant and insightful to signal processing students and researchers, we focused on a case study representing linear estimation. For such settings, we obtained a closed-form expression for both the generative learning estimator as well as the discriminative learning one. The resulting explicit derivations enabled us to rigorously compare the approaches, and draw insights into their conceptual differences and individual pros and cons.
In particular, we noted that discriminative learning, which uses the available domain knowledge only to determine the inference rule structure, is more robust to mismatches in the mathematical description of the setup. This property indicates the ability of end-to-end learning to better cope with mismatched and complex models. However, when the partial domain knowledge available is indeed accurate, it was shown that generative learning can leverage this prior to improve performance in harsh and noisy settings. These findings were not only analytically proven, but also backed by numerical evaluations, which are made publicly available as a highly accessible Python Notebook intended to be used for presenting this lecture in class.
The authors thank Arbel Yaniv and Lital Dabush, who helped during the development of the numerical example.
Nir Shlezinger is an assistant professor in the School of Electrical and Computer Engineering at Ben-Gurion University, Israel. He received his B.Sc., M.Sc., and Ph.D. degrees in 2011, 2013, and 2017, respectively, from Ben-Gurion University, Israel, all in electrical and computer engineering. From 2017 to 2019 he was a postdoctoral researcher in the Technion, and from 2019 to 2020 he was a postdoctoral researcher in Weizmann Institute of Science, where he was awarded the FGS prize for outstanding research achievements. His research interests include communications, information theory, signal processing, and machine learning.
Tirza Routtenberg (email@example.com) is an Associate Professor in the School of Electrical and Computer Engineering at Ben-Gurion University of the Negev, Israel. She was the recipient of four Best Student Paper Awards at international Conferences. She is currently serving as an associate editor of IEEE Transactions on Signal and Information Processing Over Networks and of IEEE Signal Processing Letters. She is a member of the IEEE Signal Processing Theory and Methods Technical Committee. Her research interests include statistical signal processing, estimation and detection theory, graph signal processing, and optimization and signal processing for smart grids.
[1] A. Agrawal, S. Barratt, and S. Boyd (2021) Learning convex optimization models. IEEE/CAA Journal of Automatica Sinica 8 (8), pp. 1355–1364.
[2] Y. Bengio (2009) Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (1), pp. 1–127.
[3] N. Harel and T. Routtenberg (2021) Bayesian post-model-selection estimation. IEEE Signal Processing Letters 28, pp. 175–179.
[4] T. Jebara (2012) Machine learning: discriminative and generative. Vol. 755, Springer Science & Business Media.
[5] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436.
[6] E. Meir and T. Routtenberg (2021) Cramér-Rao bound for estimation after model selection and its application to sparse vector estimation. IEEE Transactions on Signal Processing 69, pp. 2284–2301.
[7] V. Monga, Y. Li, and Y. C. Eldar (2021) Algorithm unrolling: interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine 38 (2), pp. 18–44.
[8] A. Y. Ng and M. I. Jordan (2001) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems 14.
[9] S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press.
[10] N. Shlezinger, Y. C. Eldar, and S. P. Boyd (2022) Model-based deep learning: on the intersection of deep learning and optimization. arXiv preprint arXiv:2205.02640.
[11] N. Shlezinger, J. Whang, Y. C. Eldar, and A. G. Dimakis (2020) Model-based deep learning. arXiv preprint arXiv:2012.08405.
[12] S. Theodoridis (2020) Machine learning: a Bayesian and optimization perspective. 2nd edition, Elsevier Science & Technology.