I Scope
Inference tasks in signal processing are often characterized by the availability of reliable statistical modeling with some missing instance-specific parameters. One conventional approach uses data to estimate these missing parameters and then infers based on the estimated model. Alternatively, data can also be leveraged to directly learn the inference mapping end-to-end. These approaches for combining partially-known statistical models and data in inference are related to the notions of generative and discriminative models used in the machine learning literature [8, 4], typically considered in the context of classifiers.
The goal of this lecture note is to introduce the concepts of generative and discriminative learning for inference with a partially-known statistical model. While machine learning systems often lack the interpretability of traditional signal processing methods, we focus on a simple setting where one can interpret and compare the approaches in a tractable manner that is accessible and relevant to signal processing readers. In particular, we exemplify the approaches for the task of Bayesian signal estimation in a jointly Gaussian setting with the mean-squared error (MSE) objective, i.e., a linear estimation setting. Here, the discriminative end-to-end approach directly learns the linear minimum MSE (LMMSE) estimator, while the generative strategy yields a two-stage estimator, which first uses data to fit the linear model, and then formulates the LMMSE estimator for the fitted model. The ability to derive these estimators in closed-form facilitates their analytical comparison. It is rigorously shown that discriminative learning results in an estimate which is more robust to mismatches in the mathematical description of the setup. Generative learning, which utilizes prior knowledge on the distribution of the signals, can exploit this prior to achieve improved MSE in some settings. These analytical findings are demonstrated in a numerical study, which is available online as a Python Notebook, such that it can be presented alongside the lecture detailed in this note.
II Relevance
Signal processing algorithms traditionally rely on mathematical models for describing the problem at hand. These models correspond to domain knowledge obtained from, e.g., established statistical models and understanding of the underlying physics. In practice, statistical models often include parameters that are unknown in advance, such as noise levels and channel coefficients, and are estimated from data.
Recent years have witnessed a dramatic success of machine learning, and particularly of deep learning, in domains such as computer vision and natural language processing [5]. For inference tasks, these data-driven methods typically learn the inference rule directly from data, rather than estimating missing parameters in the underlying model, and can operate without any mathematical modeling. Nonetheless, when one has access to some level of domain knowledge, it can be harnessed to design inference rules that improve upon black-box approaches in terms of performance, interpretability, robustness, complexity, and flexibility [10]. This is achieved by formulating the suitable inference rule given full domain knowledge, and then using data to optimize the resulting solver directly, with various methodologies including learned optimization [1], deep unfolding [7], and the augmentation of classic algorithms with trainable modules [11].
The fact that signal processing tasks are often carried out based on partial domain knowledge, i.e., statistical models with some missing parameters, and data, motivates inspecting which design approach is preferable: the model-oriented approach of using the data to estimate the missing parameters, or the task-oriented strategy, which leverages data to directly optimize a suitable solver in an end-to-end manner. These approaches can be related to the notions of generative learning and discriminative learning, typically considered in the machine learning literature in the context of classification tasks [12, Ch. 3]. In these lecture notes, we address the above fundamental question for an analytically tractable setting of linear Bayesian estimation, for which the approaches can be rigorously compared, connecting machine learning concepts with interpretable signal processing practices and techniques.
III Prerequisites
This lecture note is intended to be as self-contained as possible and suitable for the undergraduate level without a deep background in estimation theory and machine learning. As such, it requires only basic knowledge in probability and calculus.
IV Problem Statement
To formulate the considered problem, we first review some basic concepts in statistical inference, following [9]. Then, we elaborate on model-based and data-driven approaches for inference. Finally, we present the running example considered in the remainder of this lecture note of linear estimation in partially-known measurement models.
IV-A Statistical Inference
The term inference refers to the ability to conclude based on evidence and reasoning. While this generic definition can refer to a broad range of tasks, we focus in our description on systems that estimate or make predictions based on a set of observed measurements. In this wide family of problems, the system is required to map an input variable $\mathbf{x}$, taking values in an input space $\mathcal{X}$, into a prediction of a target variable $\mathbf{s}$, which takes values in the target space $\mathcal{S}$.
The inputs are related to the targets via some statistical probability measure, $\mathcal{P}$, referred to as the data generating distribution, which is defined over $\mathcal{X} \times \mathcal{S}$. Formally, $\mathcal{P}$ is a joint distribution over the domain of inputs and targets. One can view such a distribution as being composed of two parts: a distribution over the unlabeled input, $\mathcal{P}_{\mathbf{x}}$, which is sometimes called the marginal distribution, and the conditional distribution over the targets given the inputs, $\mathcal{P}_{\mathbf{s}|\mathbf{x}}$, also referred to as the discriminative or inverse distribution.
Inference rules can thus be expressed as mappings of the form
$$ f: \mathcal{X} \to \mathcal{S}. \tag{1} $$
We write the decision variable for a given input $\mathbf{x} \in \mathcal{X}$ as $\hat{\mathbf{s}} = f(\mathbf{x}) \in \mathcal{S}$. The space of all possible inference mappings of the form (1) is denoted by $\mathcal{F}$. The fidelity of an inference mapping is measured using a loss function
$$ l: \mathcal{F} \times \mathcal{X} \times \mathcal{S} \to \mathbb{R}, \tag{2} $$
with $\mathbb{R}$ being the set of real numbers. We are generally interested in carrying out inference that minimizes the risk function, also known as the generalization error, given by
$$ \mathcal{L}_{\mathcal{P}}(f) := \mathbb{E}_{(\mathbf{x},\mathbf{s}) \sim \mathcal{P}} \{ l(f, \mathbf{x}, \mathbf{s}) \}, \tag{3} $$
where $\mathbb{E}\{\cdot\}$ is the stochastic expectation. Thus, the goal is to design the inference rule $f(\cdot)$ to minimize the generalization error $\mathcal{L}_{\mathcal{P}}(f)$ for a given problem.
IV-B Model-Based versus Data-Driven
The risk function in (3) allows one to evaluate inference rules and to formulate the desired mapping as the one that minimizes the risk $\mathcal{L}_{\mathcal{P}}(f)$. The main question is how to find this mapping. The approaches for doing so are divided into two main strategies: the statistical model-based strategy, referred to henceforth as model-based; and the pure machine learning approach, which relies on data, and is thus referred to as data-driven. The main difference between these strategies is what information is utilized to tune $f(\cdot)$.
Model-based methods, also referred to as hand-designed schemes, set their inference rule, i.e., tune $f(\cdot)$ in (1), to minimize the risk function $\mathcal{L}_{\mathcal{P}}(f)$, based on full domain knowledge. The term domain knowledge typically refers to prior knowledge of the underlying statistics relating the input $\mathbf{x}$ and the target $\mathbf{s}$, where full domain knowledge implies that the joint distribution $\mathcal{P}$ is known. For instance, under the squared-error loss, $l(f,\mathbf{x},\mathbf{s}) = \|\mathbf{s} - f(\mathbf{x})\|^2$, the optimal inference rule is the minimum MSE estimator, given by the conditional expected value $f_{\rm MMSE}(\mathbf{x}) = \mathbb{E}\{\mathbf{s}|\mathbf{x}\}$, whose computation requires knowledge of the discriminative distribution $\mathcal{P}_{\mathbf{s}|\mathbf{x}}$.
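The optimality of the conditional mean under the squared-error loss follows from a standard decomposition, sketched here for completeness. The notation is an assumption of this sketch: $\mathbf{x}$ is the input, $\mathbf{s}$ is the target, and $f$ is any candidate inference rule. Conditioning on $\mathbf{x}$ and adding and subtracting $\mathbb{E}\{\mathbf{s}|\mathbf{x}\}$ inside the norm gives

$$ \mathbb{E}\{\|\mathbf{s}-f(\mathbf{x})\|^2 \,|\, \mathbf{x}\} = \mathbb{E}\{\|\mathbf{s}-\mathbb{E}\{\mathbf{s}|\mathbf{x}\}\|^2 \,|\, \mathbf{x}\} + \|\mathbb{E}\{\mathbf{s}|\mathbf{x}\} - f(\mathbf{x})\|^2, $$

since the cross term $2\,\mathbb{E}\{(\mathbf{s}-\mathbb{E}\{\mathbf{s}|\mathbf{x}\})^{\top} \,|\, \mathbf{x}\}\,(\mathbb{E}\{\mathbf{s}|\mathbf{x}\}-f(\mathbf{x}))$ vanishes. The first term does not depend on $f$, so the risk is minimized by choosing $f(\mathbf{x}) = \mathbb{E}\{\mathbf{s}|\mathbf{x}\}$ for every $\mathbf{x}$.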
Model-based methods are the foundations of statistical signal processing. However, in practice, accurate knowledge of the distribution that relates the observations and the desired information is often unavailable. Thus, applying such techniques commonly requires imposing some assumptions on the underlying statistics, which in some cases reflect the actual behavior, but may also constitute a crude approximation of the true dynamics. Moreover, in the presence of inaccurate model knowledge, either as a result of estimation errors or due to enforcing a model which does not fully capture the environment, the performance of model-based techniques tends to degrade. Finally, solving the coupled problem of model selection and estimation of the parameters of the selected model is a difficult task, with non-trivial inference rules [3, 6].
Data-driven methods learn their mapping from data rather than from statistical modeling. This approach builds upon the fact that while in many applications coming up with accurate and tractable statistical modeling is difficult, we are often given access to data describing the setup. In a supervised setting considered henceforth, data is comprised of a training set consisting of $T$ pairs of inputs and their corresponding target values, denoted by $\mathcal{D} = \{(\mathbf{x}_t, \mathbf{s}_t)\}_{t=1}^{T}$.
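Machine learning provides various data-driven methods which form an inference rule $f$ from the data $\mathcal{D}$. Broadly speaking and following the terminology of [8], these approaches can be divided into two main classes:

Generative models that use $\mathcal{D}$ to estimate the data generating distribution $\mathcal{P}$. Given the estimated distribution, denoted by $\hat{\mathcal{P}}_{\mathcal{D}}$, one then seeks the inference rule which minimizes the risk function with respect to $\hat{\mathcal{P}}_{\mathcal{D}}$, i.e.,
$$ f^* = \arg\min_{f \in \mathcal{F}} \mathcal{L}_{\hat{\mathcal{P}}_{\mathcal{D}}}(f), \tag{4} $$
where $\mathcal{L}_{\mathcal{P}}(\cdot)$ is defined in (3).

Discriminative models where data is used to directly form the inference rule as a form of end-to-end learning. Without access to the true distribution $\mathcal{P}$, one cannot directly optimize the risk function (3), and typically resorts to the empirical risk over the $T$ training pairs in $\mathcal{D}$, given by
$$ \mathcal{L}_{\mathcal{D}}(f) := \frac{1}{T} \sum_{t=1}^{T} l(f, \mathbf{x}_t, \mathbf{s}_t). \tag{5} $$
To avoid overfitting, i.e., coming up with an inference rule which minimizes (5) by memorizing the data, one has to constrain the set $\mathcal{F}$. This requires imposing a structure on the mapping, which is often dictated by a set of parameters denoted by $\boldsymbol{\theta}$, taking values in some parameter set $\Theta$, as considered henceforth. Thus, the system mapping is written as $f_{\boldsymbol{\theta}}$, which is tuned via
$$ \boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta} \in \Theta} \mathcal{L}_{\mathcal{D}}(f_{\boldsymbol{\theta}}) = \arg\min_{\boldsymbol{\theta} \in \Theta} \frac{1}{T} \sum_{t=1}^{T} l(f_{\boldsymbol{\theta}}, \mathbf{x}_t, \mathbf{s}_t), \tag{6} $$
where the last equality is obtained by substituting (5). In practice, the empirical risk in (6) is often combined with regularizing terms in order to facilitate training and mitigate overfitting.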
One can further divide discriminative models into the following categories:

Model-based discriminative models, where one has some domain knowledge that indicates what structure the inference rule should take. This allows using inference mappings with a relatively small number of parameters that are specific to the problem at hand. For instance, in the example presented in the sequel, we consider estimation in a jointly Gaussian setting, for which a suitable inference rule is known to take a linear form.

Model-agnostic discriminative models, which use highly-parameterized inference rules that can realize a broad set of abstract mappings. This is the common practice in deep learning for inference problems, which infer using deep neural networks (DNNs) trained from massive data sets.

The different design approaches are illustrated in Fig. 1.
IV-C Inference with Partially-Known Generative Distributions
The unprecedented success of deep learning in areas such as computer vision and natural language processing [5] notably boosted the popularity of model-agnostic discriminative models that rely on abstract, purely data-driven pipelines, trained with massive data sets, for inference. Specifically, by letting $f_{\boldsymbol{\theta}}$ be a DNN with parameters $\boldsymbol{\theta}$, one can train inference rules from data via (6) that operate in scenarios where analytical models are unknown or highly complex [2]. For instance, a statistical model relating an image of a dog and the breed of the dog is likely to be intractable, and thus, inference rules which rely on full domain knowledge or on estimating $\mathcal{P}$ are likely to be inaccurate. However, the abstractness and extreme parameterization of DNNs result in them often being treated as black boxes, while the training procedure of DNNs is typically lengthy, computationally intensive, and requires massive volumes of data. Furthermore, understanding how their predictions are obtained and how reliable they are tends to be quite challenging, and thus, deep learning lacks the interpretability, flexibility, versatility, and reliability of model-based techniques.
Unlike conventional deep learning domains, such as computer vision and natural language processing, in signal processing, one often has access to some level of reliable domain knowledge. Many problems in signal processing applications are characterized by faithful modeling based on the understanding of the underlying physics, the operation of the system, and models backed by extensive measurements. Nonetheless, existing modeling often includes parameters, e.g., channel coefficients and noise energy, which are specific to a given scenario, and are typically unknown in advance, though they can be estimated from data. The key question in scenarios involving such partial domain knowledge in addition to training data is which of the following design approaches is preferable:

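The generative learning approach, which uses the data $\mathcal{D}$ to estimate the missing parameters of the statistical model, and then forms the inference rule based on the estimated model as in (4).

The discriminative learning approach, which formulates the inference rule for a given parameter vector, and then uses $\mathcal{D}$ to directly optimize the resulting mapping as in (6).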
We tackle this fundamental question using a simple tractable scenario of linear estimation with partial domain knowledge, where it is known that the setting is linear, yet the exact linear mapping is unknown. In this setting, which is mathematically formulated in the following section, both approaches can be explicitly derived and analytically compared.
IV-D Case Study: Linear Estimation with Partially-Known Measurement Models
To provide an analytically tractable comparison between the aforementioned approaches to jointly leverage domain knowledge and data in inference, we consider a linear estimation scenario. Such scenarios are not only simple and enable rigorous analysis, but also correspond to a broad range of statistical estimation problems that are commonly encountered in signal processing applications. Here, the input $\mathbf{x}$ and the target $\mathbf{s}$ are real-valued vectors taking values in $\mathbb{R}^{N_x}$ and in $\mathbb{R}^{N_s}$, respectively. The loss measure is the squared-error loss, i.e., $l(f,\mathbf{x},\mathbf{s}) = \|\mathbf{s} - f(\mathbf{x})\|^2$.
In the considered setting, one has prior knowledge that the target $\mathbf{s}$ admits a Gaussian distribution with mean $\boldsymbol{\mu}_s$ and covariance $\mathbf{C}_{ss}$, i.e., $\mathbf{s} \sim \mathcal{N}(\boldsymbol{\mu}_s, \mathbf{C}_{ss})$, and that the measurements follow a linear model
$$ \mathbf{x} = \mathbf{H}\mathbf{s} + \mathbf{w}, \qquad \mathbf{w} \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2 \mathbf{I}), \tag{7} $$
i.e., $\mathbf{w}$ is a Gaussian noise with mean $\boldsymbol{\mu}$ and covariance matrix $\sigma^2 \mathbf{I}$, and is assumed to be independent of $\mathbf{s}$. However, the available domain knowledge is partial in the sense that the parameters $\mathbf{H}$ and $\boldsymbol{\mu}$ are unknown. Consequently, while the generative distribution $\mathcal{P}$ is known to be jointly Gaussian, its parameters are unknown. Thus, conventional Bayesian estimators, such as the minimum MSE (MMSE) and LMMSE estimators, cannot be implemented for this case. Nonetheless, we are given access to a data set $\mathcal{D} = \{(\mathbf{x}_t, \mathbf{s}_t)\}_{t=1}^{T}$ comprised of $T$ i.i.d. samples drawn from $\mathcal{P}$. An illustration of the setting is depicted in Fig. 2(a). The goal is to utilize the available domain knowledge and the data $\mathcal{D}$ in order to formulate an inference mapping that achieves the lowest generalization error, i.e., minimizes (3) with the squared-error loss.
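The setting above can be simulated in a few lines. The following is a minimal sketch rather than the authors' notebook: the dimensions, the noise variance, the prior, and the names `sample_dataset`, `H`, and `mu` are all hypothetical choices made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and parameters, chosen only for illustration.
N_s, N_x, T = 3, 5, 200          # target dimension, input dimension, samples
sigma2 = 0.1                     # noise variance (known to the designer)
mu_s = np.zeros(N_s)             # prior mean of s (known)
C_ss = np.eye(N_s)               # prior covariance of s (known)

# Quantities that are unknown to the estimator designer.
H = rng.standard_normal((N_x, N_s))   # measurement matrix
mu = rng.standard_normal(N_x)         # noise mean

def sample_dataset(T):
    """Draw T i.i.d. pairs (x_t, s_t) from the linear Gaussian model."""
    S = rng.multivariate_normal(mu_s, C_ss, size=T)           # shape (T, N_s)
    W = mu + np.sqrt(sigma2) * rng.standard_normal((T, N_x))  # shape (T, N_x)
    X = S @ H.T + W                                           # x_t = H s_t + w_t
    return X, S

X, S = sample_dataset(T)
print(X.shape, S.shape)
```

Each row of `X` and `S` holds one training pair, which is the layout assumed by the other sketches in this note.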
V Solution: Data-Aided Linear Estimators
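In the following, we exemplify a generative learning estimator (in Subsection V-A) and a discriminative learning estimator (in Subsection V-B), which both aim to recover the random signal $\mathbf{s}$ from the observed $\mathbf{x}$ for the partially-known jointly-Gaussian setting described in Subsection IV-D. To this end, we use the following notations of sample means
$$ \bar{\mathbf{x}} := \frac{1}{T}\sum_{t=1}^{T} \mathbf{x}_t, \qquad \bar{\mathbf{s}} := \frac{1}{T}\sum_{t=1}^{T} \mathbf{s}_t, \tag{8} $$
and the sample covariance/cross-covariance matrices
$$ \hat{\mathbf{C}}_{sx} := \frac{1}{T}\sum_{t=1}^{T} (\mathbf{s}_t - \bar{\mathbf{s}})(\mathbf{x}_t - \bar{\mathbf{x}})^{\top}, \tag{9a} $$
$$ \hat{\mathbf{C}}_{ss} := \frac{1}{T}\sum_{t=1}^{T} (\mathbf{s}_t - \bar{\mathbf{s}})(\mathbf{s}_t - \bar{\mathbf{s}})^{\top}, \tag{9b} $$
and
$$ \hat{\mathbf{C}}_{xx} := \frac{1}{T}\sum_{t=1}^{T} (\mathbf{x}_t - \bar{\mathbf{x}})(\mathbf{x}_t - \bar{\mathbf{x}})^{\top}. \tag{9c} $$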
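The sample statistics in (8) and (9) translate directly into array operations. The sketch below assumes the $1/T$ normalization and a row-per-sample data layout; the function name `sample_moments` is hypothetical.

```python
import numpy as np

def sample_moments(X, S):
    """Return the sample means (8) and sample (cross-)covariances (9).

    X has shape (T, N_x) and S has shape (T, N_s); each row is one sample.
    """
    T = X.shape[0]
    x_bar = X.mean(axis=0)
    s_bar = S.mean(axis=0)
    Xc = X - x_bar                     # centered inputs
    Sc = S - s_bar                     # centered targets
    C_sx = Sc.T @ Xc / T               # (9a), shape (N_s, N_x)
    C_ss = Sc.T @ Sc / T               # (9b), shape (N_s, N_s)
    C_xx = Xc.T @ Xc / T               # (9c), shape (N_x, N_x)
    return x_bar, s_bar, C_sx, C_ss, C_xx

# Toy data, used only to exercise the function.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
S = rng.standard_normal((50, 2))
x_bar, s_bar, C_sx, C_ss, C_xx = sample_moments(X, S)
print(C_sx.shape, C_ss.shape, C_xx.shape)
```

With this normalization, `C_xx` and `C_ss` agree with `np.cov(..., rowvar=False, bias=True)`. The estimators discussed in this note depend only on products such as $\hat{\mathbf{C}}_{sx}\hat{\mathbf{C}}_{xx}^{-1}$, so a $1/(T-1)$ normalization would yield the same estimates.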
V-A Generative Learning Estimator
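The generative approach uses data to estimate the missing domain knowledge parameters. In order to estimate the distribution based on the model in (7), we need to estimate the matrix $\mathbf{H}$ and the noise mean $\boldsymbol{\mu}$ from the training data, and then use the estimates, denoted $\hat{\mathbf{H}}$ and $\hat{\boldsymbol{\mu}}$, to form the linear estimator, as illustrated in Fig. 2(b).
The unknown $\mathbf{H}$ and $\boldsymbol{\mu}$ are fitted to the data using the maximum likelihood rule. Letting $\mathcal{P}_{\mathbf{H},\boldsymbol{\mu}}$ be the joint distribution of $\mathbf{x}$ and $\mathbf{s}$ for given $\mathbf{H}$ and $\boldsymbol{\mu}$, the log-likelihood can be written as
$$ \log \mathcal{P}_{\mathbf{H},\boldsymbol{\mu}}(\mathcal{D}) = {\rm const} - \frac{1}{2\sigma^2}\sum_{t=1}^{T} \|\mathbf{x}_t - \mathbf{H}\mathbf{s}_t - \boldsymbol{\mu}\|^2. \tag{10} $$
In (10), const denotes a constant term, which is not a function of the unknown parameters $\mathbf{H}$ and $\boldsymbol{\mu}$.
Under the assumption that $\mathcal{D}$ is comprised of i.i.d. samples drawn from a Gaussian generative distribution, $\mathcal{P}_{\mathbf{H},\boldsymbol{\mu}}$, the maximum likelihood estimates are given by
$$ \hat{\mathbf{H}}, \hat{\boldsymbol{\mu}} = \arg\max_{\mathbf{H},\boldsymbol{\mu}} \log \mathcal{P}_{\mathbf{H},\boldsymbol{\mu}}(\mathcal{D}) = \arg\min_{\mathbf{H},\boldsymbol{\mu}} \sum_{t=1}^{T} \|\mathbf{x}_t - \mathbf{H}\mathbf{s}_t - \boldsymbol{\mu}\|^2. \tag{11} $$
The solutions to (11) are given by [12]
$$ \hat{\mathbf{H}} = \hat{\mathbf{C}}_{xs}\hat{\mathbf{C}}_{ss}^{-1}, \qquad \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} - \hat{\mathbf{H}}\bar{\mathbf{s}}, \tag{12} $$
where $\hat{\mathbf{C}}_{xs} = \hat{\mathbf{C}}_{sx}^{\top}$, which is defined in (9a), and it is assumed that $\hat{\mathbf{C}}_{ss}$ from (9b) is a non-singular matrix. By substituting the estimators from (12) in (7), the estimated $\hat{\mathcal{P}}_{\mathcal{D}}$ is obtained from the linear model
$$ \mathbf{x} = \hat{\mathbf{H}}\mathbf{s} + \tilde{\mathbf{w}}, \qquad \tilde{\mathbf{w}} \sim \mathcal{N}(\hat{\boldsymbol{\mu}}, \sigma^2 \mathbf{I}). \tag{13} $$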
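As a sanity check on the maximum likelihood step, the closed-form estimates can be compared with a generic least-squares fit of $\mathbf{x}_t \approx \mathbf{H}\mathbf{s}_t + \boldsymbol{\mu}$. The sketch below uses hypothetical sizes and synthetic data; `H_true` and `mu_true` serve only to generate the samples and are not available to the estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
N_s, N_x, T = 3, 5, 100                      # hypothetical sizes
H_true = rng.standard_normal((N_x, N_s))     # used only to generate data
mu_true = rng.standard_normal(N_x)
S = rng.standard_normal((T, N_s))
X = S @ H_true.T + mu_true + 0.3 * rng.standard_normal((T, N_x))

# Closed-form ML estimates computed from sample moments.
x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
Xc, Sc = X - x_bar, S - s_bar
C_xs = Xc.T @ Sc / T
C_ss = Sc.T @ Sc / T
H_hat = C_xs @ np.linalg.inv(C_ss)
mu_hat = x_bar - H_hat @ s_bar

# Cross-check: least-squares regression of x on s with an intercept column.
S_aug = np.hstack([S, np.ones((T, 1))])            # shape (T, N_s + 1)
coef, *_ = np.linalg.lstsq(S_aug, X, rcond=None)   # shape (N_s + 1, N_x)
H_ls, mu_ls = coef[:-1].T, coef[-1]

print(np.allclose(H_hat, H_ls), np.allclose(mu_hat, mu_ls))
```

The agreement reflects that, for a known noise variance, maximizing the Gaussian likelihood over $\mathbf{H}$ and $\boldsymbol{\mu}$ is a least-squares regression of the inputs on the targets.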
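Having estimated the generative model, we proceed to finding the inference rule which minimizes the risk function with respect to $\hat{\mathcal{P}}_{\mathcal{D}}$, i.e.,
$$ f^* = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{(\mathbf{x},\mathbf{s}) \sim \hat{\mathcal{P}}_{\mathcal{D}}} \{ \|\mathbf{s} - f(\mathbf{x})\|^2 \}, \tag{14} $$
where we substitute the squared-error loss and (3) into (4). Since the estimated distribution, $\hat{\mathcal{P}}_{\mathcal{D}}$, is a jointly Gaussian distribution, the solution of (14) is the LMMSE estimator under the estimated distribution, which is given by
$$ \begin{aligned} \hat{\mathbf{s}}_{\rm g} &\overset{(a)}{=} \boldsymbol{\mu}_s + \mathbf{C}_{ss}\hat{\mathbf{H}}^{\top}\big(\hat{\mathbf{H}}\mathbf{C}_{ss}\hat{\mathbf{H}}^{\top} + \sigma^2\mathbf{I}\big)^{-1}\big(\mathbf{x} - \hat{\mathbf{H}}\boldsymbol{\mu}_s - \hat{\boldsymbol{\mu}}\big) \\ &\overset{(b)}{=} \boldsymbol{\mu}_s + \big(\hat{\mathbf{H}}^{\top}\hat{\mathbf{H}} + \sigma^2\mathbf{C}_{ss}^{-1}\big)^{-1}\hat{\mathbf{H}}^{\top}\big(\mathbf{x} - \hat{\mathbf{H}}\boldsymbol{\mu}_s - \hat{\boldsymbol{\mu}}\big). \end{aligned} \tag{15} $$
Here, (a) follows from the estimated jointly Gaussian model (13), and (b) is obtained using the matrix inversion lemma.
Applying the matrix inversion lemma requires the computation of the inverse of an $N_s \times N_s$ matrix instead of an $N_x \times N_x$ matrix as in the direct expression for the LMMSE estimate, where $N_s$ and $N_x$ are the dimensions of $\mathbf{s}$ and $\mathbf{x}$, respectively. This reduces the computational complexity when $N_s < N_x$, i.e., the target signal is of a lower dimensionality compared with the input signal. By substituting (12) in (15), one obtains the generative learning estimator as
$$ \hat{\mathbf{s}}_{\rm g} = \boldsymbol{\mu}_s + \big(\hat{\mathbf{C}}_{ss}^{-1}\hat{\mathbf{C}}_{sx}\hat{\mathbf{C}}_{xs}\hat{\mathbf{C}}_{ss}^{-1} + \sigma^2\mathbf{C}_{ss}^{-1}\big)^{-1}\hat{\mathbf{C}}_{ss}^{-1}\hat{\mathbf{C}}_{sx}\big(\mathbf{x} - \bar{\mathbf{x}} - \hat{\mathbf{C}}_{xs}\hat{\mathbf{C}}_{ss}^{-1}(\boldsymbol{\mu}_s - \bar{\mathbf{s}})\big). \tag{16} $$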
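A compact way to read the two-stage estimator is as an affine map $\hat{\mathbf{s}} = \mathbf{G}\mathbf{x} + \mathbf{c}$ whose gain is computed from the fitted model. The sketch below is one possible implementation under the assumptions of this section; the function name `fit_generative` and all numerical values are hypothetical. It also compares the gain with the direct LMMSE form that inverts an $N_x \times N_x$ matrix.

```python
import numpy as np

def fit_generative(X, S, mu_s, C_ss, sigma2):
    """Two-stage estimator: ML fit of (H, mu), then LMMSE for the fitted model.

    Returns (G, c, H_hat, mu_hat) such that s_hat = G @ x + c.
    """
    T = X.shape[0]
    x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
    Xc, Sc = X - x_bar, S - s_bar
    C_xs_hat = Xc.T @ Sc / T
    C_ss_hat = Sc.T @ Sc / T
    H_hat = C_xs_hat @ np.linalg.inv(C_ss_hat)     # ML estimate of H
    mu_hat = x_bar - H_hat @ s_bar                 # ML estimate of mu
    # Gain in the matrix-inversion-lemma form: only N_s x N_s systems are solved.
    G = np.linalg.solve(H_hat.T @ H_hat + sigma2 * np.linalg.inv(C_ss), H_hat.T)
    c = mu_s - G @ (H_hat @ mu_s + mu_hat)
    return G, c, H_hat, mu_hat

# Toy data with hypothetical sizes and parameters.
rng = np.random.default_rng(0)
N_s, N_x, T, sigma2 = 3, 6, 500, 0.2
mu_s, C_ss = np.ones(N_s), 2.0 * np.eye(N_s)
H, mu = rng.standard_normal((N_x, N_s)), rng.standard_normal(N_x)
S = rng.multivariate_normal(mu_s, C_ss, size=T)
X = S @ H.T + mu + np.sqrt(sigma2) * rng.standard_normal((T, N_x))

G, c, H_hat, mu_hat = fit_generative(X, S, mu_s, C_ss, sigma2)

# Direct LMMSE gain, which inverts an N_x x N_x matrix instead.
G_direct = C_ss @ H_hat.T @ np.linalg.inv(
    H_hat @ C_ss @ H_hat.T + sigma2 * np.eye(N_x))
print(np.allclose(G, G_direct))
```

Note that the prior quantities `mu_s` and `C_ss` and the noise variance `sigma2` enter the estimator explicitly, which is where the generative approach uses its domain knowledge.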
V-B Discriminative Learning Estimator
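In this subsection, we consider a discriminative learning approach for the considered estimation problem. Here, the partial domain knowledge regarding the underlying joint Gaussianity indicates that the estimator should take a linear form, i.e.,
$$ f_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}, \tag{17} $$
where $\boldsymbol{\theta} = \{\mathbf{A}, \mathbf{b}\}$. For the considered parametric model, the available data is used to directly identify the parameters which minimize the empirical risk, as illustrated in Fig. 2(c), i.e.,
$$ \boldsymbol{\theta}^* = \arg\min_{\mathbf{A},\mathbf{b}} \frac{1}{T}\sum_{t=1}^{T} \|\mathbf{s}_t - \mathbf{A}\mathbf{x}_t - \mathbf{b}\|^2. \tag{18} $$
Since the objective in (18) is a convex function of $\mathbf{A}$ and $\mathbf{b}$, the optimal solution is obtained by equating the derivatives of (18) w.r.t. $\mathbf{A}$ and $\mathbf{b}$ to zero, which results in
$$ \mathbf{A}^* = \hat{\mathbf{C}}_{sx}\hat{\mathbf{C}}_{xx}^{-1}, \qquad \mathbf{b}^* = \bar{\mathbf{s}} - \mathbf{A}^*\bar{\mathbf{x}}, \tag{19} $$
where the sample means and sample covariance matrices are defined in (8) and (9), and it is assumed that $\hat{\mathbf{C}}_{xx}$ is a non-singular matrix. By substituting (19) into (17), we obtain the discriminative learned estimator, which is given by
$$ \hat{\mathbf{s}}_{\rm d} = \hat{\mathbf{C}}_{sx}\hat{\mathbf{C}}_{xx}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) + \bar{\mathbf{s}}. \tag{20} $$
It can be verified that the learned estimator in (20) coincides with the sample-LMMSE estimator that is obtained by plugging the sample-mean and sample covariance matrices into the LMMSE estimator.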
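The discriminative estimator can be sketched in the same style. The snippet below assumes a row-per-sample data layout and uses the hypothetical name `fit_discriminative`; it checks that the closed-form solution matches an ordinary least-squares fit of the targets on the inputs with an intercept, which is what minimizing the empirical squared-error risk over affine maps amounts to.

```python
import numpy as np

def fit_discriminative(X, S):
    """Return (A, b) of the learned linear estimator s_hat = A @ x + b."""
    T = X.shape[0]
    x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
    Xc, Sc = X - x_bar, S - s_bar
    C_sx = Sc.T @ Xc / T
    C_xx = Xc.T @ Xc / T
    # A = C_sx C_xx^{-1}; C_xx is symmetric, so solve and transpose.
    A = np.linalg.solve(C_xx, C_sx.T).T
    b = s_bar - A @ x_bar
    return A, b

# Toy data with hypothetical sizes, used only to exercise the function.
rng = np.random.default_rng(1)
N_s, N_x, T = 2, 4, 300
S = rng.standard_normal((T, N_s))
X = S @ rng.standard_normal((N_s, N_x)) + 0.5 * rng.standard_normal((T, N_x)) + 1.0

A, b = fit_discriminative(X, S)

# Cross-check against ordinary least squares with an intercept column.
X_aug = np.hstack([X, np.ones((T, 1))])
coef, *_ = np.linalg.lstsq(X_aug, S, rcond=None)
A_ls, b_ls = coef[:-1].T, coef[-1]
print(np.allclose(A, A_ls), np.allclose(b, b_ls))
```

In contrast with the generative sketch, no prior mean, prior covariance, or noise variance appears among the arguments: the estimator is determined by the data alone.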
VI Discussion and comparison
By focusing on the simple yet common scenario of linear estimation with a partially-known measurement model, we obtained closed-form expressions for the suitable estimators attained via generative learning and via discriminative learning. This allows us to compare the resulting estimators, and thus draw insights into the general approaches of generative versus discriminative learning in the context of signal processing applications. In the following we provide a theoretical analysis of the estimators in Subsection VI-A, followed by a qualitative comparison discussed in Subsection VI-B. A simulation study is presented in Section VII.
VI-A Theoretical Comparison
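We next provide an analysis of the estimators obtained via generative learning in (16) and via discriminative learning in (20), by studying their behavior in asymptotic regimes. To this end, we first study the asymptotic setting where the number of samples $T$ is arbitrarily large, after which we inspect the case of high signal-to-noise ratio (SNR), i.e., $\sigma^2 \to 0$.
Asymptotic analysis: Let us inspect the derived estimators in the asymptotic case when $T \to \infty$ and the data is comprised of i.i.d. samples (i.e., an ergodic and stationary scenario). In this case, the discriminative learning estimator in (20) converges to the LMMSE estimator, i.e.,
$$ \hat{\mathbf{s}}_{\rm d} \to \mathbf{C}_{sx}\mathbf{C}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x) + \boldsymbol{\mu}_s, \tag{21} $$
where $\mathbf{C}_{sx}$ and $\mathbf{C}_{xx}$ are the true covariance matrices, and $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_s$ are the true expected values.
Similarly, for $T \to \infty$ the generative learning estimator stated in (16) converges to
$$ \hat{\mathbf{s}}_{\rm g} \to \boldsymbol{\mu}_s + \big(\mathbf{C}_{ss}^{-1}\mathbf{C}_{sx}\mathbf{C}_{xs}\mathbf{C}_{ss}^{-1} + \sigma^2\mathbf{C}_{ss}^{-1}\big)^{-1}\mathbf{C}_{ss}^{-1}\mathbf{C}_{sx}(\mathbf{x} - \boldsymbol{\mu}_x). \tag{22} $$
When the linear model in (7) holds, it can be verified that
$$ \mathbf{C}_{xs} = \mathbf{H}\mathbf{C}_{ss}, \qquad \mathbf{C}_{xx} = \mathbf{H}\mathbf{C}_{ss}\mathbf{H}^{\top} + \sigma^2\mathbf{I}. \tag{23} $$
By substituting (23) into (22) and using the matrix inversion lemma again, we obtain
$$ \hat{\mathbf{s}}_{\rm g} \to \boldsymbol{\mu}_s + \big(\mathbf{H}^{\top}\mathbf{H} + \sigma^2\mathbf{C}_{ss}^{-1}\big)^{-1}\mathbf{H}^{\top}(\mathbf{x} - \boldsymbol{\mu}_x) = \boldsymbol{\mu}_s + \mathbf{C}_{sx}\mathbf{C}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x). \tag{24} $$
Thus, asymptotically, if the true generative model is linear and given by (7), then the two estimators coincide.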
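The matrix inversion lemma step used above can be checked directly. As a sketch, let $\mathbf{H}$, $\mathbf{C}_{ss}$, and $\sigma^2$ denote the measurement matrix, the prior covariance of the target, and the noise variance (an assumed notation, with $\mathbf{C}_{ss}$ invertible and $\sigma^2 > 0$). Expanding the product on the left gives

$$ \big(\mathbf{H}^{\top}\mathbf{H} + \sigma^2\mathbf{C}_{ss}^{-1}\big)\mathbf{C}_{ss}\mathbf{H}^{\top} = \mathbf{H}^{\top}\mathbf{H}\mathbf{C}_{ss}\mathbf{H}^{\top} + \sigma^2\mathbf{H}^{\top} = \mathbf{H}^{\top}\big(\mathbf{H}\mathbf{C}_{ss}\mathbf{H}^{\top} + \sigma^2\mathbf{I}\big). $$

Multiplying from the left by $(\mathbf{H}^{\top}\mathbf{H} + \sigma^2\mathbf{C}_{ss}^{-1})^{-1}$ and from the right by $(\mathbf{H}\mathbf{C}_{ss}\mathbf{H}^{\top} + \sigma^2\mathbf{I})^{-1}$ yields

$$ \mathbf{C}_{ss}\mathbf{H}^{\top}\big(\mathbf{H}\mathbf{C}_{ss}\mathbf{H}^{\top} + \sigma^2\mathbf{I}\big)^{-1} = \big(\mathbf{H}^{\top}\mathbf{H} + \sigma^2\mathbf{C}_{ss}^{-1}\big)^{-1}\mathbf{H}^{\top}, $$

so the two linear gains, and hence the two asymptotic estimators, are identical when the linear model holds.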
Asymptotic analysis under misspecified (nonlinear) model: Often in practice, the generative model is nonlinear. For instance, consider the case where the true generative model is not the linear one in (7), but is given by
$$ \mathbf{x} = \mathbf{g}(\mathbf{s}) + \mathbf{w}, \tag{25} $$
where the measurement function $\mathbf{g}(\cdot)$ is a nonlinear function, and the statistical properties of $\mathbf{s}$ and $\mathbf{w}$ are the same as described in Subsection IV-D. Although the model is nonlinear, the estimator may be designed based on the linear model in (7), either due to mismatches, or due to intentional linearization carried out to simplify the problem.
When the true generative model is nonlinear (i.e., under a misspecified model), the two linear estimators $\hat{\mathbf{s}}_{\rm g}$ and $\hat{\mathbf{s}}_{\rm d}$ are different even asymptotically. The asymptotic estimators in this case are given by (21) and (22), but (23) does not hold, and thus, here $\hat{\mathbf{s}}_{\rm g}$ does not coincide with the LMMSE estimator, as it does in (24) for the linear generative model. In this case, the discriminative learning approach is asymptotically preferred, as it converges to the LMMSE estimator, while the generative learning approach, which is based on a mismatched generative model, yields a linear estimate that differs from the LMMSE estimator, and is thus suboptimal.
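The effect of such a mismatch can be illustrated numerically. The sketch below is a toy experiment and not the authors' study: the nonlinearity `g` (an elementwise cube), the sizes, and the noise variance are hypothetical choices. Both estimators are affine maps fitted to the same data, with the generative one still assuming the linear model.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, sigma2 = 2, 5000, 0.05              # hypothetical sizes and noise variance
mu_s, C_ss = np.zeros(N), np.eye(N)       # known prior of s
H = rng.standard_normal((N, N))

def g(S):
    """Hypothetical nonlinear measurement function: elementwise cube of H s."""
    return (S @ H.T) ** 3

def draw(T):
    S = rng.multivariate_normal(mu_s, C_ss, size=T)
    X = g(S) + np.sqrt(sigma2) * rng.standard_normal((T, N))
    return X, S

X, S = draw(T)
x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
Xc, Sc = X - x_bar, S - s_bar
C_sx, C_ss_hat, C_xx = Sc.T @ Xc / T, Sc.T @ Sc / T, Xc.T @ Xc / T

# Discriminative estimator: s_hat = A_d x + b_d.
A_d = C_sx @ np.linalg.inv(C_xx)
b_d = s_bar - A_d @ x_bar

# Generative estimator, built on the (mismatched) linear model.
H_hat = C_sx.T @ np.linalg.inv(C_ss_hat)
mu_hat = x_bar - H_hat @ s_bar
A_g = np.linalg.solve(H_hat.T @ H_hat + sigma2 * np.linalg.inv(C_ss), H_hat.T)
b_g = mu_s - A_g @ (H_hat @ mu_s + mu_hat)

def mse(A, b, X, S):
    """Average squared error of the affine estimator s_hat = A x + b."""
    return np.mean(np.sum((S - (X @ A.T + b)) ** 2, axis=1))

train_d, train_g = mse(A_d, b_d, X, S), mse(A_g, b_g, X, S)
X_te, S_te = draw(T)
print("held-out MSE, discriminative:", mse(A_d, b_d, X_te, S_te))
print("held-out MSE, generative:    ", mse(A_g, b_g, X_te, S_te))
```

On the training set, the discriminative estimator cannot do worse than the generative one, since it minimizes the empirical squared error over all affine maps. The held-out values printed at the end indicate how this carries over to fresh samples for the chosen toy configuration.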
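High SNR regime: Another setting in which one can rigorously compare the estimators is in the high SNR regime, where $\sigma^2 \to 0$. To see this, we note that for $\sigma^2 \to 0$, (15) is reduced to
$$ \begin{aligned} \hat{\mathbf{s}}_{\rm g} &= \boldsymbol{\mu}_s + \big(\hat{\mathbf{H}}^{\top}\hat{\mathbf{H}}\big)^{-1}\hat{\mathbf{H}}^{\top}\big(\mathbf{x} - \hat{\mathbf{H}}\boldsymbol{\mu}_s - \hat{\boldsymbol{\mu}}\big) \\ &\overset{(a)}{=} \big(\hat{\mathbf{C}}_{ss}^{-1}\hat{\mathbf{C}}_{sx}\hat{\mathbf{C}}_{xs}\hat{\mathbf{C}}_{ss}^{-1}\big)^{-1}\hat{\mathbf{C}}_{ss}^{-1}\hat{\mathbf{C}}_{sx}(\mathbf{x} - \bar{\mathbf{x}}) + \bar{\mathbf{s}}, \end{aligned} \tag{26} $$
where (a) is obtained by substituting (12).
In this case, the data satisfies the model from (7), i.e., $\mathbf{x}_t = \mathbf{H}\mathbf{s}_t + \boldsymbol{\mu}$. Thus, the sample mean in (8) satisfies
$$ \bar{\mathbf{x}} = \mathbf{H}\bar{\mathbf{s}} + \boldsymbol{\mu}. \tag{27} $$
Similarly, the sample covariance matrices in (9) satisfy
$$ \hat{\mathbf{C}}_{xs} = \mathbf{H}\hat{\mathbf{C}}_{ss} \tag{28} $$
and
$$ \hat{\mathbf{C}}_{xx} = \mathbf{H}\hat{\mathbf{C}}_{ss}\mathbf{H}^{\top}. \tag{29} $$
By substituting these results in (26), one obtains
$$ \hat{\mathbf{s}}_{\rm g} = \big(\mathbf{H}^{\top}\mathbf{H}\big)^{-1}\mathbf{H}^{\top}(\mathbf{x} - \bar{\mathbf{x}}) + \bar{\mathbf{s}}. \tag{30} $$
The solution in (30) is the least squares estimator for linear regression, which is based on the pseudo-inverse of $\mathbf{H}$ (under the assumption that $\mathbf{H}$ has full column rank). This estimator ignores the prior information, since the posterior is accurate ($\sigma^2 \to 0$). Moreover, by substituting (27) in (30), it can be seen that the MSE of the estimator in this high SNR regime satisfies
$$ \mathbb{E}\{(\hat{\mathbf{s}}_{\rm g} - \mathbf{s})(\hat{\mathbf{s}}_{\rm g} - \mathbf{s})^{\top}\} = \mathbf{0}. $$
Thus, we can accurately recover the target vector from the input under the domain knowledge that $\sigma^2 \to 0$, as long as $\mathbf{H}$ has full column rank.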
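The exact recovery in the noise-free limit is easy to confirm numerically. The following sketch uses hypothetical sizes with more observations than target entries, so that a randomly drawn `H` has full column rank with probability one; the data are generated without noise, i.e., with the noise equal to its mean.

```python
import numpy as np

rng = np.random.default_rng(4)
N_s, N_x, T = 3, 5, 20                   # hypothetical sizes, N_x > N_s
H = rng.standard_normal((N_x, N_s))      # full column rank with probability one
mu = rng.standard_normal(N_x)

# Noise-free training data: x_t = H s_t + mu.
S = rng.standard_normal((T, N_s))
X = S @ H.T + mu

x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
Xc, Sc = X - x_bar, S - s_bar
H_hat = (Xc.T @ Sc / T) @ np.linalg.inv(Sc.T @ Sc / T)   # ML estimate of H

# High-SNR generative estimate for a fresh noise-free observation.
s_new = rng.standard_normal(N_s)
x_new = H @ s_new + mu
s_hat_g = np.linalg.pinv(H_hat) @ (x_new - x_bar) + s_bar

print(np.allclose(H_hat, H), np.allclose(s_hat_g, s_new))
```

For this configuration the sample covariance of the inputs has rank at most $N_s < N_x$, so it is singular and the inverse required by the discriminative estimator does not exist without further assumptions on $\mathbf{H}$.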
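On the other hand, by substituting (28) and (29) in the discriminative learned estimator in (20), we obtain
$$ \hat{\mathbf{s}}_{\rm d} = \hat{\mathbf{C}}_{ss}\mathbf{H}^{\top}\big(\mathbf{H}\hat{\mathbf{C}}_{ss}\mathbf{H}^{\top}\big)^{-1}(\mathbf{x} - \bar{\mathbf{x}}) + \bar{\mathbf{s}}. \tag{31} $$
The solution in (31) is a weighted least-squares solution with $\hat{\mathbf{C}}_{ss}$ as the weight matrix, under the assumption that $\mathbf{H}$ has full row rank. However, by substituting (27) in (31), it can be verified that the MSE of $\hat{\mathbf{s}}_{\rm d}$ in this high SNR regime is a positive definite matrix, i.e.,
$$ \mathbb{E}\{(\hat{\mathbf{s}}_{\rm d} - \mathbf{s})(\hat{\mathbf{s}}_{\rm d} - \mathbf{s})^{\top}\} \succ \mathbf{0}, \tag{32} $$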
as long as $T$ is finite and
is not a scalar multiple of the identity matrix. This follows since the discriminative learned estimator does not use some of its partial domain knowledge, as
$\hat{\mathbf{s}}_{\rm d}$ in (20) is not a function of (as well as of ). This becomes a disadvantage when this domain knowledge carries important information that can improve the overall performance, as is the case here where when .
VI-B Qualitative Comparison
The theoretical comparison in Subsection VI-A allows one to rigorously identify scenarios in which one approach is preferable over the other, e.g., that discriminative learning is more suitable for handling modeling mismatches, while generative learning yields more accurate estimators in high SNRs. Another aspect in which the estimators are comparable is in their sample complexity. The discriminative linear estimator in (20) requires the computation of the inverse sample covariance matrix of $\mathbf{x}$, $\hat{\mathbf{C}}_{xx}$, from (9c). On the other hand, the generative linear estimator in (16) requires the computation of the inverse sample covariance matrix of $\mathbf{s}$, $\hat{\mathbf{C}}_{ss}$, in (9b). The dimensions of the matrices $\hat{\mathbf{C}}_{xx}$ and $\hat{\mathbf{C}}_{ss}$ are different. Thus, for a limited dataset, it may be easier to implement one of these estimators and guarantee the stability of the inverse covariance matrix.
In particular, in settings where the sample size ($T$) is comparable to the observation dimension ($N_x$), the discriminative sample-LMMSE estimator exhibits severe performance degradation. This is because the sample covariance matrix is not well-conditioned in the small sample size regime, and inverting it amplifies the estimation error. Similarly, if $T$ is comparable to the estimated vector dimension ($N_s$), the performance of the generative learning estimator in (16) degrades. In such cases, the available information is not enough to reveal a sufficiently good model that fits the data, and it can be misleading due to the presence of noise and possible outliers.
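The conditioning issue can be made concrete with a small experiment. The sketch below uses a hypothetical dimension and white Gaussian samples; it computes the centered sample covariance for several sample sizes and reports its rank and condition number.

```python
import numpy as np

rng = np.random.default_rng(5)
N_x = 10                                  # hypothetical observation dimension

def sample_cov(T):
    """Centered sample covariance of T white Gaussian vectors of size N_x."""
    X = rng.standard_normal((T, N_x))
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / T

ranks = {}
for T in (10, 12, 20, 200):
    C = sample_cov(T)
    ranks[T] = np.linalg.matrix_rank(C)
    print(T, ranks[T], f"{np.linalg.cond(C):.2e}")
```

Because the samples are centered with their own mean, the sample covariance has rank at most $T-1$. It is therefore singular whenever $T \le N_x$, and it tends to be poorly conditioned when $T$ only slightly exceeds $N_x$. The same argument applies to $\hat{\mathbf{C}}_{ss}$ with $N_s$ in place of $N_x$.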
A possible additional operational gain of generative learning stems from the fact that it estimates the underlying distribution, which can be used for other tasks. The discriminative learning approach is highly task-specific, learning only what must be learned to produce its estimate, and is thus not adaptive to a new task.
VII Numerical Comparison
In this section we evaluate the considered estimators in comparison with the oracle MMSE, which is the MMSE estimator for the model in (7) with known $\mathbf{H}$ and $\boldsymbol{\mu}$. The numerical study is available online at https://gist.github.com/nirshlezinger1/3e92bc16d28c8f2f7feb5031e32b5618. The purpose of this study is to numerically verify the theoretical findings reported in the previous section, and to empirically compare the considered approaches to combine data with partial domain knowledge in a non-asymptotic regime.
We simulate jointly-Gaussian signals via the signal model in (7) with observations and target entries. The target has zero mean and covariance matrix representing spatial exponential decay, where the th entry of is set to . The measurement matrix $\mathbf{H}$ is generated randomly with i.i.d. zero-mean unit variance Gaussian entries. The results are averaged over Monte Carlo simulations.
We numerically evaluate the performance of the discriminative learning estimator of (20) and the generative learning estimator of (16). We consider two scenarios: 1) a setting in which $\boldsymbol{\mu}_s$ and $\mathbf{C}_{ss}$ are accurately known; and 2) a mismatched case in which the estimator approximates $\mathbf{C}_{ss}$ as the identity matrix. Since the discriminative estimator is invariant to the prior distribution of $\mathbf{s}$, the presence of mismatch only affects the generative approach.
The resulting MSE values versus the SNR for data samples are reported in Fig. 3, while Fig. 4 illustrates the MSE curves versus the number of samples $T$ for . Observing Fig. 3, we note that in the high SNR regime, all estimators achieve performance within a minor gap of the MMSE. However, at lower SNR values, the generative estimator, which fully knows $\mathbf{C}_{ss}$, outperforms the discriminative approach due to its ability to incorporate the prior knowledge of $\sigma^2$ and of the statistical moments of $\mathbf{s}$. Nonetheless, in the presence of small mismatches in $\mathbf{C}_{ss}$, the discriminative approach yields improved MSE, indicating its ability to better cope with modeling mismatches compared with generative learning. In Fig. 4 we observe that the effect of the misspecified model does not vanish when the number of samples increases, and the mismatched generative model remains within a notable gap from the MMSE, while both the discriminative learning estimator and the non-mismatched generative one approach the MMSE as $T$ grows. These findings are all in line with the theoretical analysis presented in Subsection VI-A.
VIII What we have learned
In this lecture note, we reviewed two different approaches to combine partial domain knowledge with data for forming an inference rule. The first approach is related to the machine learning notion of generative learning, which operates in two stages: it first uses data to estimate the missing components in the statistical description of the problem at hand; then, the estimated statistical model is used to form the inference rule. The second approach is task-oriented, leveraging the available domain knowledge to specify the structure of the inference rule, while using data to optimize the resulting mapping in an end-to-end fashion.
To compare the approaches in a manner that is relevant and insightful to signal processing students and researchers, we focused on a case study representing linear estimation. For such settings, we obtained closed-form expressions for both the generative learning estimator and the discriminative learning one. The resulting explicit derivations enabled us to rigorously compare the approaches, and draw insights into their conceptual differences and individual pros and cons.
In particular, we noted that discriminative learning, which uses the available domain knowledge only to determine the inference rule structure, is more robust to mismatches in the mathematical description of the setup. This property indicates the ability of end-to-end learning to better cope with mismatched and complex models. However, when the partial domain knowledge available is indeed accurate, it was shown that generative learning can leverage this prior to improve performance in harsh and noisy settings. These findings were not only analytically proven, but also backed by numerical evaluations, which are made publicly available as a highly accessible Python Notebook intended to be used for presenting this lecture in class.
IX Acknowledgments
The authors thank Arbel Yaniv and Lital Dabush, who helped during the development of the numerical example.
X Authors
Nir Shlezinger is an assistant professor in the School of Electrical and Computer Engineering at Ben-Gurion University, Israel. He received his B.Sc., M.Sc., and Ph.D. degrees in 2011, 2013, and 2017, respectively, from Ben-Gurion University, Israel, all in electrical and computer engineering. From 2017 to 2019 he was a postdoctoral researcher in the Technion, and from 2019 to 2020 he was a postdoctoral researcher in Weizmann Institute of Science, where he was awarded the FGS prize for outstanding research achievements. His research interests include communications, information theory, signal processing, and machine learning.
Tirza Routtenberg (tirzar@bgu.ac.il) is an Associate Professor in the School of Electrical and Computer Engineering at Ben-Gurion University of the Negev, Israel. She was the recipient of four Best Student Paper Awards at international conferences. She is currently serving as an associate editor of IEEE Transactions on Signal and Information Processing Over Networks and of IEEE Signal Processing Letters. She is a member of the IEEE Signal Processing Theory and Methods Technical Committee. Her research interests include statistical signal processing, estimation and detection theory, graph signal processing, and optimization and signal processing for smart grids.
References
[1] (2021) Learning convex optimization models. IEEE/CAA Journal of Automatica Sinica 8 (8), pp. 1355–1364.
[2] (2009) Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (1), pp. 1–127.
[3] (2021) Bayesian post-model-selection estimation. IEEE Signal Processing Letters 28, pp. 175–179.
[4] (2012) Machine learning: discriminative and generative. Vol. 755, Springer Science & Business Media.
[5] (2015) Deep learning. Nature 521 (7553), pp. 436.
[6] (2021) Cramér-Rao bound for estimation after model selection and its application to sparse vector estimation. IEEE Transactions on Signal Processing 69, pp. 2284–2301.
[7] (2021) Algorithm unrolling: interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine 38 (2), pp. 18–44.
[8] (2001) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems 14.
[9] (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press.
[10] (2022) Model-based deep learning: on the intersection of deep learning and optimization. arXiv preprint arXiv:2205.02640.
[11] (2020) Model-based deep learning. arXiv preprint arXiv:2012.08405.
[12] (2020) Machine learning: a Bayesian and optimization perspective. 2nd edition, Elsevier Science & Technology.