Probability theory provides a foundation for describing the process of understanding and learning. A fundamental problem in the probabilistic modeling of knowledge is the question of prior knowledge, specifically the choice of prior distributions over random variables and models. The importance of prior distributions is highlighted clearly in Bayesian statistical inference, where this choice greatly affects the learning process.
The choice of prior distributions is not clearly dictated by the axioms of probability theory. Consequently, in current applications of Bayesian statistics the choice of prior differs from case to case, and many choices are justified only by experimental results. There have been substantial attempts to define and theoretically justify objective priors over random variables, such as Laplace's Principle of Indifference [jaynes2003probability, jaynes1968prior] and Jeffreys priors [jeffreys1946invariant]. The overall goal of these attempts is to pinpoint a single distribution as the prior, in order to unify and objectify the inference procedure.
In the context of model selection, a prior needs to be defined over possible hypotheses for the governing model that generates the observations. The classic methodology is to assume a prior density over the parameters of a parametric family without any systematic regard to the likelihood function. For large-scale and complex models, e.g., CNNs, the prior (regularization) is mainly justified by empirical results and intuition.
We argue that the prior on parameters needs to be determined by the likelihood functions that the parameters generate: if the likelihood function changes, the prior needs to change in order to maintain similar statistical conclusions. Although the Bayesian perspective has the potential to connect likelihood functions to priors, e.g., through conjugate priors, in this paper we follow an alternative approach to obtain priors over models.
Our approach is derived directly from the axioms of probability theory. In our approach, the prior probability of a model depends only on the prior probabilities of the observable random variables and the likelihood function the model represents. The presented perspective does not determine what constitutes an objective prior over observable random variables; however, it clarifies the understanding of priors over models.
We uncover the Maximum Probability Principle implicit in the current definitions and discuss its connection with Occam's Razor.
We present an alternative probabilistic perspective of objective functions and optimization of models.
for the probability measure and the probability distribution of some RV, respectively, while corresponds to the negated logarithms of the probability measure and the probability distribution of some RV, respectively.
2 Maximum Probability Principle
This section is divided into three parts. We start by presenting our main result and the theoretical consequences that follow in Section 2.1. The interpretation of our main result is discussed in Section 2.2. The interpretation section is independent of the rest of the paper and has no direct theoretical or practical impact, but it is important for understanding the significance of the main result.
Finally, the practical perspective and the consequences of our contribution are presented in Section 2.3.
We present our main result in the following theorem. Consider the probability space , the random variable with finite range , and the probability distribution of . For any with the conditional pdf , the following holds:
is read as the Maximum Probability of observed by and is read as the minimum information in observed by . The proof of Theorem 2.1 is exceptionally simple, yet the theorem is fundamental to understanding probabilistic models. In this view, every probability distribution over the random variable corresponds to an event whose probability is bounded using Theorem 2.1. The following corollary gives the general form of Theorem 2.1.
A random variable extends random variable iff
For any random variable that extends , the following inequality holds
For any random variable and with concatenation the following holds
Theorem 2.1 shows that the tightness of the Maximum Probability bound is relative to the complexity of the random variable. In layman's terms, random variables are tools for measuring the probabilities of events. Roughly speaking, increasing the number of states of a random variable increases the fineness of the probability measurements; therefore, the probability of an event is relative to the choice of random variable. Equivalently, in information-theoretic language, random variables with a larger number of states potentially convey more information about the underlying event.
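To make this concrete, the following sketch evaluates the bound on a small finite space under increasingly fine random variables. It assumes the bound of Theorem 2.1 takes the form MP_X(A) = min_x p_X(x)/p_X(x|A) over outcomes with positive conditional mass (our reading of the elided statement; it follows from p_X(x) >= P(A) p_X(x|A)); the space, event, and random variables are hypothetical.

```python
# Six-point uniform space; compare the MP bound under increasingly
# fine random variables. The bound form MP_X(A) = min_x p_X(x)/p_X(x|A)
# is our reading of Theorem 2.1; all sets here are made up.
omega = list(range(6))
P = {w: 1 / 6 for w in omega}
A = {0, 1, 2}                     # the event whose probability is bounded

def prob(event):
    return sum(P[w] for w in event)

def mp_bound(X):
    """min over x of p_X(x) / p_X(x | A), for x with positive mass."""
    states = set(X.values())
    def p_X(x):
        return prob({w for w in omega if X[w] == x})
    def p_X_given_A(x):
        return prob({w for w in A if X[w] == x}) / prob(A)
    return min(p_X(x) / p_X_given_A(x)
               for x in states if p_X_given_A(x) > 0)

X_coarse = {w: 0 for w in omega}      # one state: vacuous bound 1
X_parity = {w: w % 2 for w in omega}  # two states: tighter bound
X_id     = {w: w for w in omega}      # six states: exact probability

bounds = [mp_bound(X_coarse), mp_bound(X_parity), mp_bound(X_id)]
# Each refinement tightens the bound, and the finest RV recovers P(A):
assert bounds[0] >= bounds[1] >= bounds[2]
assert abs(bounds[2] - prob(A)) < 1e-12
```

Here the parity variable gives the bound 0.75 while the identity variable, which extends it, recovers P(A) = 0.5, matching the extension inequality of the corollary.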
A family of lower bounds for the maximum probability operator can be derived using approximation of max function. For the family of functions defined as
the following is true,
and the equality holds as . Also . A non-symmetric distance function for probability distributions, related to Theorem 2.1, can be extracted. Given probability mass functions over the state space and
the following holds
Given that the set of all probability distributions over state space ; denoted as , the following identity holds
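The family itself is elided above; as a hedged stand-in, one standard smooth approximation of the max is the p-norm f_p(v) = (sum_i v_i^p)^(1/p), which bounds the max from above, tightens monotonically as p grows, and converges to it as p goes to infinity. Negating and exponentiating such an approximation then yields lower bounds on the maximum probability, in the spirit of the proof in Appendix B.

```python
# p-norm approximation of the max (a hypothetical stand-in for the
# elided family): f_p(v) = (sum_i v_i^p)^(1/p) >= max(v), with
# equality in the limit p -> infinity.
def f_p(values, p):
    return sum(v ** p for v in values) ** (1.0 / p)

v = [0.10, 0.40, 0.35, 0.15]          # a probability vector
m = max(v)

approx = [f_p(v, p) for p in (1, 2, 8, 32, 128)]
assert all(a >= m for a in approx)                       # upper bounds max
assert all(a >= b for a, b in zip(approx, approx[1:]))   # tighten with p
assert abs(approx[-1] - m) < 1e-2                        # near-equality
```

Since f_p bounds the max from above, exp(-f_p(.)) <= exp(-max(.)), giving a family of lower bounds for the maximum probability operator after negation and exponentiation.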
Maximum probability events have roots in the early definitions related to random variables. In other words, maximum probability is implicitly used in obtaining probability of outcomes of random variables. We start by going over the classic definitions corresponding to random variables. The preimage of an outcome of the random variable , denoted by , is defined as
Note that since is a measurable function, then .
Probability of observing an outcome of random variable denoted as is classically defined as
where is the probability distribution of . Definition 2.2 takes the probability of an outcome of to be the probability of the largest event, in the sense of the number of elements, that is mapped to , i.e. . It is possible to give an equivalent definition in the sense of probability. The probability of an outcome of random variable in the sense of maximum probability is defined as
Definition 2.2 reveals a principle hidden in Definitions 2.2 and 2.2, while Proposition 2.2 shows the equivalence of the two definitions. The preimage of an outcome coincides with the most probable event mapped to because of the monotonicity of probability measures. Definition 2.2 highlights what we call the Maximum Probability Principle, described in the following statement.
While there are potentially many events leading to the outcome , one considers the event with the highest probability. Consequently, the probability of outcome is the probability of the most probable event that is mapped to outcome .
In the language of Information Theory [cover2012elements], the information content of an event is quantified as , which is the minimum number of bits required to describe the event . When the set shrinks in the sense of probability, it becomes more specified and therefore requires a longer description. Considering a lower probability than the maximum bound translates to more information content in an event, which is equivalent to appending assumptions to the information content of the observation. Therefore, the Maximum Probability Principle translates to a Minimum Assumption Principle in the language of information theory.
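As a quick numeric check of this correspondence, using the standard Shannon information content I(A) = -log2 P(A):

```python
import math

# Information content of an event A: I(A) = -log2 P(A) bits.
def info_bits(p_A):
    return -math.log2(p_A)

# An event of probability 1/4 needs 2 bits to describe, and halving
# the probability appends exactly one bit (one extra assumption):
assert abs(info_bits(0.25) - 2.0) < 1e-12
assert abs(info_bits(0.125) - info_bits(0.25) - 1.0) < 1e-12
```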
The importance of Theorem 2.1 lies in the calculation of the probability of sets by knowing the corresponding conditional distribution over the range of random variables. Additionally, the definition of maximum probability is implicit in the definition of probability of observing outcomes of random variables and the definition of preimage. The following corollary shows an implicit principle that is hidden in Definition 2.2
Corollary 2.2 shows that is equivalent to in the case of exact observations of . While Definition 2.2 fails to address the probability of uncertain observations of , extends to partial or uncertain observations. Let us clarify this statement with an example. Imagine a fair coin being flipped in a room. Alice asks Bob to investigate the room and tell her the outcome of the coin flip. Consider two scenarios:
Bob tells Alice that the coin is surely Head. Alice using classic definitions concludes that the probability of the event is . The conclusion is similar if Alice uses to calculate the probability.
Bob tells Alice that he believes the coin is Head with probability . Alice cannot use the classic definitions to calculate the probability of the event. However, using the Maximum Probability bound, she concludes that the probability of the evidence that Bob observed is at most .
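The belief value is elided above; the sketch below uses a hypothetical q = 0.7 for Bob's reported belief, and the bound form MP(B) = min_x p(x)/p(x|B) is our reading of Theorem 2.1.

```python
# Fair coin, uniform prior over {H, T}. Bob's evidence B induces a
# conditional distribution p(x | B); q = 0.7 is a hypothetical value,
# and the bound form is our reading of Theorem 2.1.
p = {"H": 0.5, "T": 0.5}
p_given_B = {"H": 0.7, "T": 0.3}

mp = min(p[x] / p_given_B[x] for x in p if p_given_B[x] > 0)
# The probability of Bob's evidence is at most 5/7 here.
assert abs(mp - 5 / 7) < 1e-12

# A certain observation (p(H | B) = 1) recovers the classic answer 1/2:
mp_certain = min(p[x] / q for x, q in {"H": 1.0}.items())
assert abs(mp_certain - 0.5) < 1e-12
```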
2.3 Models, Oracle and Practical Perspective
In the current trend of Bayesian statistics and machine learning, a family of distributions with parametric forms is assumed for the underlying model. Subsequently, the underlying model is estimated by using the most probable parameters given the observations or an ensemble of models with parameters being sampled from the posterior distribution of parameters.
In contrast, we define a Model as an event, , where the distribution of the random variable given a model is . Furthermore, an Oracle is assumed, from which the observed outcomes are generated. The Oracle is the underlying probability model that our model approximates. With the maximum probability operator, one can calculate the maximum probability of models. As opposed to the classical formulations, where a probability density is assumed on the parameters, in our perspective each model has a probability mass. From the information perspective, the minimum information of a model can be interpreted as the information, or assumptions, stored in the model. In this section, we assume that the true distribution of the oracle is given, in order to find an objective function in the ideal scenario. The consequences of using the empirical distribution of the oracle are beyond the scope of this paper.
Since both models are represented as sets, the similarity between models can be measured by the symmetric difference operator, . We use the probability measure of the complement of the symmetric difference as the objective function, which is explained in the following sections. Observe that this complement is itself an event, since sigma-algebras are closed under finite unions, finite intersections, and complementation. We denote the operation of symmetric difference complement as . A visual representation of the approach is depicted in Figure 1. The probability of the intersection of two sets , is bounded as follows
where for . The probability of the symmetric difference of two sets , and of its complement, is bounded as follows
The probability of the symmetric difference of two sets , is bounded as follows when is a random variable defined over
The above lemmas provide a guideline for the optimization of models. To summarize, the bounds in Lemma 2.3 can be calculated when is defined, since . We use Theorem 2.1 to calculate the upper bounds of and . In the next section, we briefly explain the practical process by considering the upper bounds of probability measures when using stochastic models with hidden states.
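On a finite space the objective can be computed directly from the definition; the sketch below uses a hypothetical model event M and oracle event O (only the use of the symmetric-difference complement is taken from the text, the sets are made up).

```python
# Model and oracle as events on an eight-point uniform space
# (hypothetical sets). The objective is P of the complement of
# the symmetric difference, i.e. P(M ∩ O) + P(M^c ∩ O^c).
omega = set(range(8))
P = {w: 1 / 8 for w in omega}
M = {0, 1, 2, 3}   # model event (made up)
O = {2, 3, 4, 5}   # oracle event (made up)

def prob(event):
    return sum(P[w] for w in event)

sym_diff = (M - O) | (O - M)
objective = prob(omega - sym_diff)
# Identical to the two-term decomposition used by the lemmas:
assert abs(objective - (prob(M & O) + prob(omega - (M | O)))) < 1e-12
```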
3 Gradient based Optimization
As a toy example, consider the following classification problem, where the set is given as the observations. Furthermore, consider the model characterized by , determining a conditional probability model, namely , where are random variables associated with the i.i.d. observations and is the internal random variable. Note that does not necessarily assume a probability distribution over . The empirical distribution of given by the available data can be used as . However, in this section we consider the general case where the model assumes a joint distribution over .
The first step toward the formulation of the problem is to assume a probability distribution over the random variables under the sample space . We use Laplace's principle of indifference to determine the prior conditional distribution. According to Laplace's principle, a uniform probability distribution over should be assumed, i.e. , where is the cardinality of the range of . We introduce the random variable to simplify the derivations. The objective function is constructed as
To optimize using gradient methods, we first need to obtain the partial derivative of the objective function with respect to the parameters of the model, . We use the chain rule to obtain
where is the index over the components of the vector. Considering (24), the first part of the total derivative is equal to
can be approximated with Monte Carlo, which makes the final optimization process noisy. A popular example of noisy optimization is Stochastic Gradient Descent, which has proven empirically successful in the machine learning community; there, the gradients take the form of expectations over the empirical distribution and are consequently approximated with Monte Carlo.
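Since equations (24)-(27) are elided here, the following stand-in uses a generic score-function (likelihood-ratio) Monte Carlo estimator on a Bernoulli model; the parameter, objective, and sample size are hypothetical.

```python
import random

random.seed(0)

# Monte Carlo approximation of a gradient written as an expectation:
# d/dθ E_{x~p_θ}[f(x)] = E[f(x) * d/dθ log p_θ(x)]  (score function).
theta = 0.3                      # Bernoulli parameter, p(x=1) = theta
f = lambda x: 4.0 * x + 1.0      # toy objective, E[f] = 4*theta + 1

def grad_log_p(x, theta):
    # d/dθ log p_θ(x) for a Bernoulli random variable
    return (x - theta) / (theta * (1.0 - theta))

n = 200_000
samples = (1 if random.random() < theta else 0 for _ in range(n))
grad_est = sum(f(x) * grad_log_p(x, theta) for x in samples) / n

# The exact gradient is 4; the Monte Carlo estimate is noisy around it.
assert abs(grad_est - 4.0) < 0.2
```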
The distribution defined by (27) is not explicitly known. We use rejection sampling to draw samples from (27): samples are drawn from and accepted with probability . The empirical distribution of the accepted samples converges to (27). We showed the general bounds for in (21); here, however, we assume without loss of generality that and are independent events conditioned on . Under this conditional independence assumption, we conclude that
Bayes rule can be used to obtain
On the other hand, the quantities in (31) and (29) are computationally tractable. Therefore, (27) can be sampled efficiently within the computational constraints imposed by rejection sampling, i.e., the probability of rejecting a sample. Since the upper bound in (31) can be calculated in a deterministic fashion, the computational graph is available during training. Consequently, the gradients with respect to are backpropagated according to the computational graph.
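The sampling procedure described above can be sketched as follows; the proposal, the per-state acceptance probabilities, and the sample size are hypothetical stand-ins for the quantities in (27), (29), and (31).

```python
import random

random.seed(1)

# Rejection sampling: draw z from a proposal q, accept with a
# state-dependent probability w[z] <= 1; accepted samples follow
# the normalized target q[z] * w[z] / Z. (Numbers are made up.)
states = [0, 1, 2, 3]
q = [0.25, 0.25, 0.25, 0.25]   # proposal distribution
w = [0.10, 0.90, 0.40, 0.60]   # acceptance probabilities

def sample():
    while True:
        z = random.choices(states, weights=q)[0]
        if random.random() < w[z]:
            return z

n = 100_000
counts = [0] * len(states)
for _ in range(n):
    counts[sample()] += 1

Z = sum(qi * wi for qi, wi in zip(q, w))
target = [qi * wi / Z for qi, wi in zip(q, w)]
emp = [c / n for c in counts]
assert all(abs(e - t) < 0.01 for e, t in zip(emp, target))
```

Here the average acceptance probability is Z = 0.5, so about half of the proposals are rejected; this rejection rate is exactly the computational constraint mentioned above.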
We introduced a new perspective on the probabilities of events when observing random variables. This perspective directly affects and simplifies the definition of probabilistic models and their probability. Classically, a prior distribution was assumed over the parameters of a model without any systematic regard to the model's functionality. In contrast to the classic perspective, we obtain the prior distribution over models by directly considering their functionality. Specifically, every model represents an event in the probability space, and the maximum probability bound presented in this paper provides an upper bound on the probability of the corresponding event.
Appendix A Classic View of Inference
To begin, we consider a popular methodology, Bayesian inference, and highlight the complications arising from assuming a prior over the parameters of models.
Consider a fairly simple example of a single coin flip, with possible outcomes of the random variable . The goal is to approximate the true probability distribution over the outcomes of a single flip represented by vector .
To determine our belief about the random vector given the outcomes of the coin flip we typically write
is a density function over the simplex, representing our belief about the coin prior to observations. Since does not depend on the evidence, it can be set to any pdf over the simplex without conflicting with any of the existing foundations of probability theory. In this example, although the observable random variable has finitely many states, the space of models contains uncountably many states, and the prior distribution over parameters is disproportionately complex compared to that of the random variables. The choice of prior here changes the outcome of inference, especially when the observations are limited. The Bayesian perspective is rooted in Kolmogorov's formulation of probability theory [jaynes2003probability]. It is insightful to visualize and understand the construction of the Bayesian formalism within the underlying probability space. The methodology consists of the following probabilistic construction. Given a probability space and an observable random variable , we introduce an auxiliary random variable (the parameters) with prior pdf and a likelihood function . The posterior distribution over is then determined by the observations and Bayes' rule. The procedure in the probability space is depicted in Figure 2. Note that the parameters in this construction are usually not observable and are numerical representatives of models.
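The coin example admits the standard conjugate treatment; the sketch below uses a Beta prior on q = P(Head) (the specific prior parameters and the data are hypothetical) to show how strongly the prior drives the conclusion when observations are limited.

```python
# Beta-Bernoulli conjugacy: with a Beta(a, b) prior on q = P(Head),
# observing h heads and t tails gives the posterior Beta(a+h, b+t).
def posterior_mean(a, b, heads, tails):
    a_post, b_post = a + heads, b + tails
    return a_post / (a_post + b_post)

heads, tails = 2, 1   # very few observations (made-up data)

m_uniform = posterior_mean(1, 1, heads, tails)    # flat Beta(1, 1) prior
m_biased  = posterior_mean(10, 2, heads, tails)   # head-biased prior

# Same data, noticeably different conclusions:
assert abs(m_uniform - 0.6) < 1e-12
assert abs(m_biased - 0.8) < 1e-12
```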
Appendix B Proofs
b.1 Theorem 2.1
and the following is true
Since the above is true for all , the following holds
b.2 Corollary 2.1
The proof follows directly from the proof of Theorem 2.1 by replacing with .
b.3 Theorem 2.1
Since extends , each is either a subset of some particular or has empty intersection with it, so the above equation reduces to
where . Similarly we know that
From the proof of Theorem 2.1 we can write
for all we can write
b.4 Remark 2.1
The proof follows from Theorem 2.1 if extends . By the definition of , for every state , for some . Therefore, every state of is either a subset of some partition induced by or has no intersection with it. Hence extends .
b.5 Proposition 2.1
The following holds
where the equality holds as . (Lemma) Let us denote
Since there exists at least one element in the set attaining the supremum of the set, for some positive constant , and is therefore positive since . Therefore (47) is greater than the supremum. In the limit case, since
since , then
We can use the fact that to prove the lemma for the infimum case. By directly using the inequality in Lemma (47), we can show that
and by negating and exponentiating the above inequality we get