## I Introduction

In sequential decision-making it is important to make fast and reliable decisions. In this regard, consider, e.g., an autonomous car which has to decide whether an obstacle is present or not on the road. Such decisions are executed by dedicated signal processing algorithms. These algorithms should use the available measurements in an optimal way such that the average time to take a decision is minimized. For practical use it is key to test if the implemented decision algorithms achieve the optimum performance, i.e., if decisions are made as fast as possible with a given reliability.

Sequential decision-making was first formulated mathematically in the seminal work of A. Wald, who introduced the sequential probability ratio test [2]. Wald’s test takes binary decisions on two hypotheses based on sequential observations of a stochastic process. For independent and identically distributed (i.i.d.) observations this test yields the minimum mean decision time for decisions with a given probability of error and a given hypothesis [3]. Wald’s test accumulates the likelihood ratio given by the sequence of observations and decides as soon as this cumulative likelihood ratio exceeds an upper threshold or falls below a lower threshold; both thresholds depend on the required reliability of the decision. A key characteristic of such a sequential test is that its termination time is a random quantity depending on the actual realization of the observation sequence. The Wald test has been applied to non-i.i.d. observation processes and to nonhomogeneous and correlated continuous-time processes, and it has been generalized to multiple hypotheses [4]. General optimality criteria for sequential probability ratio tests have been proved in the limit where the error probabilities tend to zero, see, e.g., [5, 6, 7, 8, 4].

Now we consider the decision-making device as a black box which takes as input the observation process, corresponding to one of two hypotheses, and gives as output a binary decision variable at a random decision time. Can we determine whether this decision-making device is optimal based on the statistics of the output of the device — the decisions and the decision times — and the knowledge of the true hypothesis? Indeed, in the present paper we introduce a test for optimality of sequential decision-making based on necessary conditions for optimality. Notably, this test does *not* require knowledge of the realizations or the statistics of the observation processes.

We first consider a device which takes as input the realization of a continuous stochastic process corresponding to one of the two hypotheses $H_1$ or $H_2$, and gives as output a binary decision variable $\mathsf{D} \in \{1, 2\}$ (corresponding to the hypotheses $H_1$ and $H_2$, respectively) at the random decision time $\mathsf{T}_{\mathrm{dec}}$ elapsed since the beginning of the observations. We will show that optimality of sequential probability ratio tests (in the sense that the mean decision time is minimized while fulfilling given reliability constraints) requires that the following conditions on the distribution of the decision time hold:

$$p_{\mathsf{T}_{\mathrm{dec}}}(t \mid \mathsf{H}=1, \mathsf{D}=1) = p_{\mathsf{T}_{\mathrm{dec}}}(t \mid \mathsf{H}=2, \mathsf{D}=1) \tag{1}$$

$$p_{\mathsf{T}_{\mathrm{dec}}}(t \mid \mathsf{H}=1, \mathsf{D}=2) = p_{\mathsf{T}_{\mathrm{dec}}}(t \mid \mathsf{H}=2, \mathsf{D}=2) \tag{2}$$

Here, $p_{\mathsf{T}_{\mathrm{dec}}}$ is the probability density of the decision time $\mathsf{T}_{\mathrm{dec}}$ and $\mathsf{H} \in \{1, 2\}$ (corresponding to the hypotheses $H_1$ and $H_2$) denotes the random binary hypothesis. The necessary conditions (1) and (2) for optimality imply that the distribution of the decision time given a certain outcome is independent of the actual hypothesis. Moreover, this implies that the decision time $\mathsf{T}_{\mathrm{dec}}$ of the optimal sequential test does not contain any information on which hypothesis is true beyond the decision outcome $\mathsf{D}$. As a consequence, we can quantify the optimality of a given black-box test by measuring the mutual information between the hypothesis $\mathsf{H}$ and the decision time $\mathsf{T}_{\mathrm{dec}}$ conditioned on the output of the test $\mathsf{D}$, i.e., $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D})$. In case the test is optimal it must hold that

$$I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D}) = 0. \tag{3}$$

The following example shows that (3) is a necessary but not a sufficient condition for optimality in the sense of minimizing the mean decision time given a certain reliability. Consider an optimal decision device using the Wald test, and delay all of its decisions by a constant time $t_{\mathrm{delay}}$. Then still $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} + t_{\mathrm{delay}} \mid \mathsf{D}) = 0$, with $\mathsf{T}_{\mathrm{dec}}$ being the decision time of the Wald test, although the mean decision time is no longer minimal. Hence, (3) is not a sufficient condition for the minimal mean decision time, but rather a measure for the optimal usage of information by the decision device. If $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D}) > 0$, the decision time contains additional information on the hypothesis beyond the actual decision, implying that the decision device does not exploit all the available information. Hence, $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D})$ measures the degree of divergence from optimality in the sense of optimal usage of information.

For practical purposes it is easier to test whether the black-box decision device fulfills the optimality condition in (3) than to test whether it decides with the minimum mean decision time, since the minimum mean decision time is in general not known. Furthermore, in experimental setups we can measure a decision output at a certain time $\mathsf{T}_{\mathrm{dec}} + \mathsf{T}_{\mathrm{delay}}$, with $\mathsf{T}_{\mathrm{delay}}$ a random delay time, but in general we do not know at which time the decision has been taken, as the decision device is a black box and we cannot clearly separate the actual decision-making process from nondecision processes. If the decision time $\mathsf{T}_{\mathrm{dec}}$ is statistically independent of $\mathsf{T}_{\mathrm{delay}}$ when conditioned on $\mathsf{H}$ and $\mathsf{D}$, and if additionally the delay carries no information on the hypothesis beyond the decision, i.e., $I(\mathsf{H}; \mathsf{T}_{\mathrm{delay}} \mid \mathsf{D}) = 0$, then $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} + \mathsf{T}_{\mathrm{delay}} \mid \mathsf{D}) = 0$ remains a necessary condition for optimality and can be tested on the measured times.

In this paper we derive the optimality conditions (1) - (3) and generalize them to discrete-time observation processes. We furthermore formulate tests for optimal sequential hypothesis testing (sequential decision-making) based on (1) - (3). Finally, we illustrate our results in computer experiments.

The optimality conditions (1) - (3) are of interest in different contexts. For example, these conditions make it possible to test optimality of sequential decision-making in engineered devices. One example is decision-making devices based on machine learning, such as deep neural networks. Such algorithms have the power to solve very complex tasks and adapt to specific environments by learning. The advantage of machine learning is that not all environmental situations need to be specified at design time, which for many applications like self-driving cars is not practical. However, while neural networks exhibit the best performance in comparison to other approaches, the principles which lead to their successful operation are not yet clear. It would be very useful to quantify whether decisions made by such deep learning approaches are close to optimal. In addition to engineering, the tests proposed in this paper could also help to understand whether specific biological systems use all available information optimally to make reliable decisions on the fly. In this regard, consider the example of sequential decision-making by humans in two-choice decision tasks based on perceptual stimuli, or biological cells making decisions on their fate based on extracellular cues. We discuss the application of the tests for optimality to these examples in more detail below.

##### Notation

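We denote random variables by upper case sans serif letters, e.g., $\mathsf{X}$. All random quantities are defined on the measurable space $(\Omega, \mathcal{F})$ and are governed by the probability measure $\mathbb{P}$. The probability density function of a random variable $\mathsf{X}$ given $\mathsf{Y}$ is written $p_{\mathsf{X}}(x \mid \mathsf{Y}=y)$. Moreover, for discrete random variables $P(\mathsf{X}=x \mid \mathsf{Y}=y)$ denotes the probability of $\mathsf{X}=x$ given $\mathsf{Y}=y$. The restriction of the measure $\mathbb{P}$ to a sub-$\sigma$-algebra $\mathcal{G}$ is written as $\mathbb{P}|_{\mathcal{G}}$. Finally, $\ln$ denotes the natural logarithm and $\log$ is the logarithm w.r.t. a fixed base. The mutual information and the conditional mutual information are defined by $I(\mathsf{X}; \mathsf{Y}) = \mathbb{E}\left[\log \frac{p(\mathsf{X}, \mathsf{Y})}{p(\mathsf{X})\, p(\mathsf{Y})}\right]$ and $I(\mathsf{X}; \mathsf{Y} \mid \mathsf{Z}) = \mathbb{E}\left[\log \frac{p(\mathsf{X}, \mathsf{Y} \mid \mathsf{Z})}{p(\mathsf{X} \mid \mathsf{Z})\, p(\mathsf{Y} \mid \mathsf{Z})}\right]$, respectively, where the mathematical expectation is taken with respect to the measure $\mathbb{P}$.

##### Organization of the Paper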

The paper is organized as follows. In Section II we describe the system setup in detail and give a precise problem formulation, including definitions of optimality for sequential decision-making. In Section III we derive the main theorems and corollaries describing properties of optimal sequential probability ratio tests for the case of continuous observation processes. In Section IV we extend these theorems and corollaries, under certain conditions, to the discrete-time scenario. In Section V we formulate statistical tests to decide whether a black-box decision device performs optimal sequential decision-making, based on the theorems and corollaries derived in Sections III and IV, and we also discuss how to measure the distance of the black-box decision device to optimality. We illustrate the application of these tests with numerical experiments in Section VI. In Section VII we discuss the applicability and the limitations of the provided tests for optimality. Readers who are mainly interested in the application of the statistical tests for optimality may skip Sections III and IV.

## II System Setup and Problem Formulation

### II-A System Setup

We consider a sequential binary decision problem based on an observation process with the time index either discrete, , or continuous, . The stochastic process is generated by one of two possible models corresponding to two hypotheses and . To describe the statistics of the process we consider the filtered probability space with the natural filtration generated by the observation process and the hypothesis . We consider to be a time independent random variable. The statistics of the observation process under the two hypothesis are described by the conditional probability measures given the hypothesis with corresponding to the hypothesis and , respectively [9]; here is the indicator function on the set . We also consider the filtered probability spaces with associated with the two hypotheses. We consider for continuous-time processes that the filtration is right-continuous [4], i.e., for all times .

A sequential test makes binary decisions based on sequential observations of the process and tries to guess which of the hypotheses and is true. A sequential test returns a binary output at a random time . The decision time is a stopping time, which is determined by the time when satisfies for the first time a certain criterion. This stopping rule is non-anticipating in the sense that it depends only on observations of the input sequence up to the current time, i.e., the decision causally depends on the observation process. The decision function is a map from the trajectory to , which determines the decision of the test.

We now consider the following class of sequential tests with given reliabilities

where denotes the expected termination time in case hypothesis is true and where the expectation is taken over the observation sequences . Moreover, and are the maximum allowed error probabilities of the two error types. We assume that . Notice that we restrict ourselves to tests which terminate almost surely. This assumption is fulfilled in many cases like the case of i.i.d. observation processes [10, Th. 6.2-1] and stationary observation processes. Note that the class of sequential tests given by does not consider prior knowledge on the statistics of .

We define the following optimality criterion.

###### Definition 1 (Optimality in terms of mean decision times).

An optimal test minimizes the two mean decision times corresponding to the hypothesis and for a given reliability, i.e.

(5) |

Note that in Definition 1 we assume that there exists a test for which the infimum is attained. If such a test does not exist, then we can still find a test for which the two mean decision times are arbitrarily close to their infimum values.

Sequential probability ratio tests or *Wald*-tests are optimal in the sense of Definition 1 [2]. It has been proved that for the case of time-discrete i.i.d. observation processes the Wald test is optimal in the sense of Definition 1 [3]. Furthermore, under broad conditions for the observation process it has been proved that sequential probability ratio tests are optimal in the sense of Definition 1 in the limit of small error probabilities [5, 6, 7, 8, 4].

For discrete-time and i.i.d. processes the Wald test collects observations $x_1, x_2, \ldots$ (which can be understood as samples of a corresponding continuous-time process) until the cumulated log-likelihood ratio

$$\mathsf{S}_n = \sum_{k=1}^{n} \ln \frac{p(x_k \mid \mathsf{H}=1)}{p(x_k \mid \mathsf{H}=2)} = \sum_{k=1}^{n} \Delta_k \tag{6}$$

exceeds (falls below) a prescribed threshold $L_1$ ($L_2$) for the first time. In (6), $\Delta_k$ are the increments of the log-likelihood ratio at time instant $k$, and $p(\cdot \mid \mathsf{H}=l)$ denotes the probability density function of the observations conditioned on the event $\mathsf{H}=l$. The test decides $\mathsf{D}=1$ ($\mathsf{D}=2$), i.e., for $H_1$ ($H_2$), when $\mathsf{S}_n$ first crosses $L_1$ ($L_2$). The thresholds $L_1$ and $L_2$ depend on the maximum allowed probabilities for making a wrong decision, $\alpha_1$ and $\alpha_2$: a decision with the given reliability constraints $\alpha_1$ and $\alpha_2$ can be made when the cumulative log-likelihood ratio $\mathsf{S}_n$ crosses one of the thresholds for the first time before crossing the opposite one. In general, the thresholds $L_1$ and $L_2$ are difficult to obtain. However, the optimal thresholds yielding the minimum mean decision time can be approximated by [10, p. 148]

(7) | |||||

(8) |

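The choice in (7) and (8) still guarantees that the error constraints in (4) are fulfilled. In summary, the sequential probability ratio test decides at the time

$$\mathsf{T}_{\mathrm{dec}} = \min\{ n \in \mathbb{N} : \mathsf{S}_n \notin (L_2, L_1) \} \tag{9}$$

for the decision

$$\mathsf{D} = \begin{cases} 1 & \text{if } \mathsf{S}_{\mathsf{T}_{\mathrm{dec}}} \geq L_1, \\ 2 & \text{if } \mathsf{S}_{\mathsf{T}_{\mathrm{dec}}} \leq L_2. \end{cases} \tag{10}$$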
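The stopping rule and decision rule described above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the paper: it assumes i.i.d. Gaussian observations with unit variance and mean `+mu` under the first hypothesis and `-mu` under the second, so that each log-likelihood ratio increment equals `2 * mu * x`. The function names, the Gaussian model, and the way the two error probabilities are assigned to the arguments of `wald_thresholds` are assumptions made for this example only.

```python
import math
import random


def wald_thresholds(err_given_h1, err_given_h2):
    """Wald's approximate thresholds for the cumulative log-likelihood ratio.

    err_given_h1: allowed probability of deciding 2 when H=1 is true (assumed convention)
    err_given_h2: allowed probability of deciding 1 when H=2 is true (assumed convention)
    """
    upper = math.log((1.0 - err_given_h1) / err_given_h2)
    lower = math.log(err_given_h1 / (1.0 - err_given_h2))
    return upper, lower


def sprt_gaussian(true_h, mu, upper, lower, rng):
    """Run one SPRT on i.i.d. N(+mu, 1) (H=1) versus N(-mu, 1) (H=2) samples.

    Returns (decision, number_of_samples_used).
    """
    mean = mu if true_h == 1 else -mu
    s, n = 0.0, 0
    while lower < s < upper:
        x = rng.gauss(mean, 1.0)
        s += 2.0 * mu * x  # increment ln p(x | H=1) - ln p(x | H=2)
        n += 1
    return (1 if s >= upper else 2), n


rng = random.Random(0)
upper, lower = wald_thresholds(0.05, 0.05)
runs = [sprt_gaussian(1, 0.5, upper, lower, rng) for _ in range(2000)]
error_rate = sum(d == 2 for d, _ in runs) / len(runs)
mean_n = sum(n for _, n in runs) / len(runs)
print(f"empirical error rate under H=1: {error_rate:.3f}, mean samples: {mean_n:.1f}")
```

Because the cumulative log-likelihood ratio overshoots the threshold in discrete time, the empirical error rate in such a simulation typically lies somewhat below the nominal value.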

Analogously, the Wald test for non-i.i.d. observation processes is given by (9) and (10) with the log-likelihood ratio

(11) |

where .

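The Wald test can also be formulated for continuous-time observation processes. In this case probability densities of the observation trajectories do not always exist. However, the likelihood ratio can be defined in terms of the Radon-Nikodým derivative of the restricted probability measure $\mathbb{P}_1|_{\mathcal{F}_t}$ with respect to the restricted probability measure $\mathbb{P}_2|_{\mathcal{F}_t}$:

$$\mathsf{S}_t = \ln \frac{\mathrm{d}\mathbb{P}_1|_{\mathcal{F}_t}}{\mathrm{d}\mathbb{P}_2|_{\mathcal{F}_t}} \tag{12}$$

with $t \geq 0$. Here, the process $\mathsf{S}_t$ is the cumulative log-likelihood ratio and $\mathbb{P}_1|_{\mathcal{F}_t}$ ($\mathbb{P}_2|_{\mathcal{F}_t}$) are restricted measures of $\mathbb{P}_1$ ($\mathbb{P}_2$) w.r.t. the $\sigma$-algebra $\mathcal{F}_t$. A decision with the given reliability constraints $\alpha_1$ and $\alpha_2$ can be made when the cumulative log-likelihood ratio crosses one of the thresholds for the first time before crossing the opposite one. That is, the test decides $\mathsf{D}=1$ ($\mathsf{D}=2$) in case $\mathsf{S}_t$ crosses $L_1$ ($L_2$) for the first time before crossing $L_2$ ($L_1$), where the thresholds are functions of $\alpha_1$ and $\alpha_2$. For continuous observation processes $\mathsf{S}_t$ is continuous and the thresholds are exactly given by (7) and (8), see, e.g., [10, p. 148]. Therefore, the Wald test for continuous-time observation processes is defined by

$$\mathsf{T}_{\mathrm{dec}} = \inf\{ t \geq 0 : \mathsf{S}_t \notin (L_2, L_1) \} \tag{13}$$

with the decision output given by

$$\mathsf{D} = \begin{cases} 1 & \text{if } \mathsf{S}_{\mathsf{T}_{\mathrm{dec}}} = L_1, \\ 2 & \text{if } \mathsf{S}_{\mathsf{T}_{\mathrm{dec}}} = L_2. \end{cases} \tag{14}$$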

### II-B Problem Statement

Consider now the black-box decision device as illustrated in Fig. 1 for which the stochastic observation process and the algorithm of the decision device are both unknown. Such a black-box decision device is a sequential test for which the function is unknown. We ask now the question: Is it possible to determine whether such a black-box decision device is optimal in the sense of Definition 1 based on many outcomes and of the device?

Having access to the decision outcomes and decision times it is impossible to verify optimality in terms of Definition 1. In this regard consider that the value of the minimum mean decision time is typically unknown since the observed process and its statistics are often not known. We thus introduce the following alternative definition of optimality, which is based on the idea that optimal sequential decision-making needs to exploit the available information optimally.

###### Definition 2 (Optimality in terms of information).

An optimal test minimizes the mutual information , i.e.

(15) |

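Later we will show that for continuous observation processes optimality in the sense of Definition 1 implies optimality in the sense of Definition 2, but not vice versa. In this regard, note that the criterion in (15) is insensitive to time delays in the decisions: a device that attains $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D}) = 0$ still attains it when a delay $\mathsf{T}_{\mathrm{delay}}$ is added to its decision times, provided that $\mathsf{T}_{\mathrm{delay}}$ is statistically independent of $\mathsf{T}_{\mathrm{dec}}$ conditioned on $\mathsf{H}$ and $\mathsf{D}$ and additionally satisfies $I(\mathsf{H}; \mathsf{T}_{\mathrm{delay}} \mid \mathsf{D}) = 0$. Moreover, we will show that for continuous observation processes optimal information usage implies that $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D}) = 0$, because a test achieving $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D}) = 0$ always exists and mutual information is nonnegative. For these reasons Definition 2 allows us to formulate practical tests for optimality of sequential decision-making in black-box decision devices. In general, for the discrete-time setting $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D}) > 0$, as the information on the hypothesis does not arrive continuously but in chunks, which makes it more difficult to test optimality in discrete-time settings.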
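To make the information-based criterion concrete, the conditional mutual information between hypothesis and decision time given the decision can be estimated from recorded triples of true hypothesis, decision, and decision time. The sketch below is a simple plug-in (histogram) estimator in nats. It is an illustration under stated assumptions, not the statistical test formulated later in the paper: the function name and the choice of a fixed `bin_width` for quantizing the decision times are hypothetical, and plug-in estimates are biased upward for finite sample sizes.

```python
import math
from collections import Counter


def conditional_mutual_information(samples, bin_width):
    """Plug-in estimate (in nats) of I(H; T | D) from (h, d, t) samples.

    Decision times t are quantized into bins of width bin_width (an assumed choice).
    """
    n = len(samples)
    joint = Counter((h, d, int(t // bin_width)) for h, d, t in samples)
    n_hd, n_dt, n_d = Counter(), Counter(), Counter()
    for (h, d, b), c in joint.items():
        n_hd[(h, d)] += c
        n_dt[(d, b)] += c
        n_d[d] += c
    info = 0.0
    for (h, d, b), c in joint.items():
        # p(h, t | d) / (p(h | d) p(t | d)) expressed through counts
        info += (c / n) * math.log(c * n_d[d] / (n_hd[(h, d)] * n_dt[(d, b)]))
    return info


# Toy data: decision times that do not depend on the hypothesis given the decision.
independent = [(h, d, t) for h in (1, 2) for d in (1, 2) for t in (0.5, 1.5, 2.5)]
# Toy data: the decision time reveals the hypothesis although the decision does not.
leaky = [(1, 1, 1.0), (2, 1, 2.0)] * 10

print(conditional_mutual_information(independent, 1.0))
print(conditional_mutual_information(leaky, 1.0))
```

In the first toy data set the estimate vanishes, while in the second one the decision time carries one full binary digit of information about the hypothesis beyond the decision.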

## III Optimality Conditions for Continuous Observation Processes

To understand the conditions on optimal sequential decision-making we will derive relations between decision time distributions of optimal binary sequential probability ratio tests. In this section we consider optimal sequential probability ratio tests for continuous observation processes, which are given by in (13) - (14). We call these relations decision time fluctuation relations for their reminiscence to stopping time fluctuation relations in non-equilibrium statistical physics, in particular stochastic thermodynamics [11, 12]. In order to derive these relations we use a key property of the exponential of the cumulative log-likelihood ratio defined in (12), namely that it is a positive and uniformly integrable martingale process with respect to the probability measure and the filtration generated by the observation process [9]. An -adapted and integrable process is called a martingale w.r.t. and a measure if its expected value at time equals to its value at a previous time , when the expected value is conditioned on observations up to the time . For , , and this implies that

(16) |

-almost surely and with . Integrability of implies that .
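The martingale property can be checked numerically in a simple discrete-time example. The sketch below assumes i.i.d. Gaussian observations with unit variance and mean `+mu` under the first and `-mu` under the second hypothesis, and takes the log-likelihood ratio as the logarithm of the first density over the second; both are assumptions of this illustration. Under the second hypothesis the exponential of the cumulative log-likelihood ratio should then keep its initial mean value of one at every time step.

```python
import math
import random


def mean_likelihood_ratio(n_steps, mu, n_paths, rng):
    """Monte Carlo estimate of E[exp(S_n)] when the data are drawn under H=2.

    S_n is the cumulative log-likelihood ratio ln p(x | H=1) - ln p(x | H=2)
    for N(+mu, 1) versus N(-mu, 1) observations (assumed model).
    """
    total = 0.0
    for _ in range(n_paths):
        s = 0.0
        for _ in range(n_steps):
            x = rng.gauss(-mu, 1.0)  # observation drawn under H=2
            s += 2.0 * mu * x
        total += math.exp(s)
    return total / n_paths


rng = random.Random(0)
for n_steps in (1, 5):
    estimate = mean_likelihood_ratio(n_steps, 0.3, 20000, rng)
    print(f"n = {n_steps}: estimated mean of exp(S_n) = {estimate:.3f}")
```

The estimates fluctuate around one for both horizons, although the cumulative log-likelihood ratio itself drifts towards negative values under the second hypothesis.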

### III-A Decision Time Fluctuation Relation for Optimal Decision Devices

###### Theorem 1.

We consider a binary sequential hypothesis testing problem with the hypotheses . Let and be two probability measures on the same filtered probability space corresponding to the hypothesis and , respectively. We assume that is right continuous. We consider that on , the probability measure is absolutely continuous with respect to . Furthermore, we consider that the realization of the process is almost surely continuous. Let and be as in (13) and (14) with . We also assume that has a density function. Under these assumptions the following holds

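$$p_{\mathsf{T}_{\mathrm{dec}}}(t \mid \mathsf{H}=1, \mathsf{D}=1) = p_{\mathsf{T}_{\mathrm{dec}}}(t \mid \mathsf{H}=2, \mathsf{D}=1) \tag{17}$$

$$p_{\mathsf{T}_{\mathrm{dec}}}(t \mid \mathsf{H}=1, \mathsf{D}=2) = p_{\mathsf{T}_{\mathrm{dec}}}(t \mid \mathsf{H}=2, \mathsf{D}=2) \tag{18}$$

where $p_{\mathsf{T}_{\mathrm{dec}}}(t \mid \mathsf{H}=l, \mathsf{D}=d)$ is the decision time distribution conditioned on the hypothesis $\mathsf{H}=l$ and the decision output $\mathsf{D}=d$.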

###### Proof.

Let

(19) |

be the set of trajectories for which the decision time does not exceed and the test decides for . The probability of the event with respect to the measures or is equal to the cumulative distribution of the decision time conditioned on the hypothesis or , respectively, and conditioned on the decision outcome . We find the following identity between and :

(20) | |||||

(21) | |||||

(22) | |||||

(23) | |||||

(24) |

where for (21) we have used the Radon-Nikodým theorem and the definition (12). For equality (22) we have applied *Doob’s optional sampling theorem* [9, 13] to the uniformly integrable $\mathbb{P}_2$-martingale process $e^{\mathsf{S}_t}$. For (23) we have used that $\mathsf{S}_t$ is a continuous process and achieves the value $L_1$ at time $\mathsf{T}_{\mathrm{dec}}$.

The probability density functions of can be expressed in terms of the derivatives of the cumulative distributions ()

(25) | |||||

(26) |

The ratio of the decision probabilities is

(27) |

which follows from , , Eq. (24), and from the assumption that the test terminates almost surely. Taking the derivative of the left hand side (LHS) of (20) and the right hand side (RHS) of (24), and using Eqs. (25) to (27), we prove Eq. (17). Analogously, Eq. (18) can be proved. ∎
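The statement of Theorem 1 can be probed in a simulation. The sketch below approximates a continuous observation process by an Euler discretization of a Brownian motion with drift `+mu` under the first and `-mu` under the second hypothesis, for which the cumulative log-likelihood ratio is `2 * mu * X_t`. It then compares the decision times of runs that decide for the first hypothesis, once when that hypothesis is true and once when it is false, using a two-sample Kolmogorov-Smirnov distance. The drift-diffusion model, the thresholds, the step size `dt`, and all function names are assumptions of this illustration, and the time discretization introduces a small overshoot that the theorem itself does not have.

```python
import math
import random


def decide_diffusion(true_h, mu, upper, lower, dt, rng):
    """Euler simulation of dX = +/-mu dt + dW with log-likelihood ratio S = 2*mu*X.

    Returns (decision, steps); steps * dt approximates the decision time.
    """
    drift = mu if true_h == 1 else -mu
    sqrt_dt = math.sqrt(dt)
    s, steps = 0.0, 0
    while lower < s < upper:
        dx = drift * dt + sqrt_dt * rng.gauss(0.0, 1.0)
        s += 2.0 * mu * dx
        steps += 1
    return (1 if s >= upper else 2), steps


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (largest gap between empirical CDFs)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] == v:  # advance past ties in both samples
            i += 1
        while j < len(b) and b[j] == v:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


rng = random.Random(0)
times = {1: [], 2: []}  # decision times of runs with D=1, keyed by the true hypothesis
for h in (1, 2):
    for _ in range(5000):
        d, steps = decide_diffusion(h, 1.0, 1.5, -1.0, 0.01, rng)
        if d == 1:
            times[h].append(steps)

print("runs with D=1:", len(times[1]), "(H=1 true),", len(times[2]), "(H=2 true)")
print("KS distance between the two conditional decision time samples:",
      round(ks_statistic(times[1], times[2]), 3))
```

Correct and erroneous decisions for the first hypothesis occur with very different frequencies, yet their decision time samples are expected to be statistically indistinguishable up to sampling noise and discretization effects.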

### III-B Decision Time Fluctuation Relation for Optimal Decision Devices with Unknown Hypotheses

In the following, we derive a second fluctuation relation, which we will apply to test optimality of sequential decision-making with less information than required for Theorem 1 (see Section V-B2), but holds only if the maximal allowed error probabilities are symmetric, i.e., , and the measures and on are related by a measurable involution . We consider that

(28) |

with a measurable involution, i.e., is invertible with inverse and with for all .

###### Theorem 2.

Under the same conditions as in Theorem 1, with the additional assumption that with a measurable involution, and with the additional assumption that the maximal allowed error probabilities fulfill , the following holds

(29) | |||||

(30) |

Furthermore, it holds that

(31) |

A special case of the result in Theorem 2 has been found in the context of nonequilibrium statistical physics [11, 12]: the two hypotheses correspond to a forward and a backward direction of the arrow of time, and corresponds to the time-reversal operation. The Radon-Nikodým derivative is then the stochastic entropy production, and the decision time is its two-boundary first-passage time to cross one of two given symmetric values. Moreover, in communication theory such a symmetry has been found to show that the probability of cycle slips to the positive/negative boundary in phase-locked loops used for synchronization is independent of time [14, Eq. (74)].

### III-C Information Theoretic Implications of Optimal Sequential Decision-Making

Theorem 1 and Theorem 2 express statistical dependencies of different random quantities involved in optimal sequential decision-making. Based on Theorem 1 we will now show the following.

###### Corollary 1.

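Under the same conditions as in Theorem 1, the following equality for mutual information holds

$$I(\mathsf{H}; \mathsf{D}, \mathsf{T}_{\mathrm{dec}}) = I(\mathsf{H}; \mathsf{D}), \tag{32}$$

i.e., $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D}) = 0$.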

###### Proof.

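By the chain rule for mutual information, $I(\mathsf{H}; \mathsf{D}, \mathsf{T}_{\mathrm{dec}})$ can be expressed by

$$I(\mathsf{H}; \mathsf{D}, \mathsf{T}_{\mathrm{dec}}) = I(\mathsf{H}; \mathsf{D}) + I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D}). \tag{33}$$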

The second term on the RHS of (33) is given by

(34) | |||

(35) | |||

(36) | |||

(37) | |||

(38) | |||

(39) |

Corollary 1 states that in case of optimal sequential decision-making the decision time $\mathsf{T}_{\mathrm{dec}}$ does not give any additional information on the hypothesis $\mathsf{H}$ beyond the decision outcome $\mathsf{D}$. In this regard, consider that the first term on the RHS of (33) is the mutual information the decision outcome $\mathsf{D}$ of the test gives about the actual hypothesis $\mathsf{H}$. The second term on the RHS of (33) is the additional information the termination time $\mathsf{T}_{\mathrm{dec}}$ gives on the hypothesis $\mathsf{H}$ beyond the information given by the decision $\mathsf{D}$. Thus, we have proved that for continuous observation processes optimal sequential decision-making w.r.t. Definition 2 is achievable and that $I(\mathsf{H}; \mathsf{T}_{\mathrm{dec}} \mid \mathsf{D}) = 0$ for such a test. Note that since sequential probability ratio tests have been shown to be optimal in the sense of Definition 1, Corollary 1 implies that optimality in the sense of Definition 1 also implies optimality in the sense of Definition 2.
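The decomposition used in the proof can be verified numerically on a small discrete example. The sketch below builds a joint distribution of hypothesis, decision, and a two-valued decision time in which the time depends on the decision but not on the hypothesis, and then evaluates the three mutual information terms of the chain rule in nats. The probability values and function names are arbitrary choices for this illustration.

```python
import math
from collections import defaultdict


def marginal(pmf, keep):
    """Marginalize a pmf over outcome tuples onto the coordinates listed in keep."""
    out = defaultdict(float)
    for outcome, p in pmf.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out


def chain_rule_terms(pmf):
    """pmf maps (h, d, t) to a probability. Returns I(H;D,T), I(H;D), I(H;T|D)."""
    p_h, p_d = marginal(pmf, (0,)), marginal(pmf, (1,))
    p_hd, p_dt = marginal(pmf, (0, 1)), marginal(pmf, (1, 2))
    i_h_dt = sum(p * math.log(p / (p_h[(h,)] * p_dt[(d, t)]))
                 for (h, d, t), p in pmf.items() if p > 0)
    i_h_d = sum(p * math.log(p / (p_h[(h,)] * p_d[(d,)]))
                for (h, d), p in p_hd.items() if p > 0)
    i_h_t_given_d = sum(p * math.log(p * p_d[(d,)] / (p_hd[(h, d)] * p_dt[(d, t)]))
                        for (h, d, t), p in pmf.items() if p > 0)
    return i_h_dt, i_h_d, i_h_t_given_d


# Joint law p(h, d) of hypothesis and decision, and a time law q(t | d) that
# depends on the decision only (illustrative numbers).
p_hd = {(1, 1): 0.45, (1, 2): 0.05, (2, 1): 0.05, (2, 2): 0.45}
q_t_given_d = {1: {0: 0.3, 1: 0.7}, 2: {0: 0.6, 1: 0.4}}
pmf = {(h, d, t): p * q_t_given_d[d][t] for (h, d), p in p_hd.items() for t in (0, 1)}

total, from_decision, from_time = chain_rule_terms(pmf)
print(f"I(H;D,T) = {total:.6f}, I(H;D) = {from_decision:.6f}, I(H;T|D) = {from_time:.6f}")
```

Here the decision carries all the information about the hypothesis, and the term contributed by the decision time vanishes up to floating-point rounding.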

In case the assumptions of Theorem 2 are satisfied additionally, the following two corollaries hold.

###### Corollary 2.

Under the same conditions as in Theorem 2, the following equality holds

(40) |

###### Corollary 3.

Under the same conditions as in Theorem 2, and with the additional assumption that , the following equality holds

(44) |

## IV Optimality Conditions for Discrete-Time Observation Processes

In the following, we extend the analysis on optimal information usage in sequential decision-making to the discrete-time setting. In discrete time the optimal test in the sense of Definition 1 is given by $\mathsf{T}_{\mathrm{dec}}$ and $\mathsf{D}$ defined in (9) and (10). Extending our results to a discrete-time setting is relevant for discrete-time systems. Moreover, in typical experimental setups a continuous-time system is sampled, yielding a discrete-time representation. The extension from continuous processes to discrete-time processes is not straightforward, as one key characteristic of the continuous-time setting is that the test terminates with the cumulative log-likelihood ratio exactly hitting one of the thresholds. This property does not hold in the discrete-time setting, where the cumulative log-likelihood ratio at the decision time on average slightly overshoots the thresholds.

The thresholds $L_1$ and $L_2$ depend on the maximum allowed error probabilities $\alpha_1$ and $\alpha_2$, cf. (4). Because in the discrete-time setting the trajectory of the accumulated log-likelihood ratio in (6) does not necessarily hit one of the thresholds exactly, determining the optimal thresholds $L_1$ and $L_2$ in terms of $\alpha_1$ and $\alpha_2$ is rather involved, see [2]. The thresholds $L_1$ and $L_2$ are chosen such that the allowed error probabilities given in (4) are met with equality.

In the following, we study the statistical dependencies between the hypothesis $\mathsf{H}$, the decision $\mathsf{D}$, and the number of observations $\mathsf{T}_{\mathrm{dec}}$ that the sequential probability ratio test given by (9) and (10) uses to make decisions.

The necessary condition for optimal decision devices given in Theorem 1 for the continuous-time setting does not carry over to the discrete-time settings as we will discuss in the following. This can be understood from applying the steps in the proof of Theorem 1 in (20) to (24) to the discrete-time setting. In the discrete-time case with the measure of the discrete-time version of the set in (19) can be expressed by

(45) | |||||

(46) | |||||

(47) | |||||

(48) | |||||

(49) | |||||

(50) |

where in (50) is the overshoot beyond the threshold . Since in general the distribution of the overshoot depends on the time the fluctuation relations (17) and (18) do not extend to the discrete-time case.

Taking the difference between the values of at two consecutive time instants we get

(53) | |||||

In case is time independent, we get from (51) and (52)

(54) |

where we have used the assumption that the test terminates almost surely, and we get the fluctuation relations corresponding to Theorem 1 for decision times in the discrete-time case.

The constraint that is time independent is approximately fulfilled in case the size of the thresholds and is large in comparison to the average increase of the log-likelihood ratio per observation sample, see (11). This can be seen as taking the continuum limit of the decision making process. In this regard, consider that the distribution of the overshoot is time independent if the distribution of the distance , at the time instant before a decision is taken, is time independent, and if the distribution of the increment is independent of time. The distribution of is time independent if the initial value of the cumulative log-likelihood has no significant influence anymore on the distribution of when conditioning on termination at time instant . This is satisfied in case is sufficiently large, which holds if the thresholds and are large in comparison to the average of the increments of the log-likelihood ratio . This is illustrated for an example based on numerical simulations in Section VI-A4.
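The overshoot discussed above can be made visible in a short simulation. The sketch below runs a discrete-time sequential probability ratio test on i.i.d. Gaussian samples (unit variance, mean `+mu` under the true first hypothesis and `-mu` under the alternative), records by how much the cumulative log-likelihood ratio exceeds the upper threshold at the decision time, and compares the mean overshoot of early and late decisions. The Gaussian model, the threshold values, and the split at the median decision time are assumptions of this illustration and are not taken from the numerical experiments of the paper.

```python
import random


def sprt_overshoots(mu, upper, lower, n_runs, rng):
    """Run SPRTs with H=1 true and return (decision_time, overshoot) pairs
    for all runs that terminate at the upper threshold."""
    out = []
    for _ in range(n_runs):
        s, n = 0.0, 0
        while lower < s < upper:
            s += 2.0 * mu * rng.gauss(mu, 1.0)  # log-likelihood ratio increment
            n += 1
        if s >= upper:
            out.append((n, s - upper))
    return out


rng = random.Random(0)
pairs = sprt_overshoots(0.5, 3.0, -3.0, 5000, rng)
pairs.sort(key=lambda pair: pair[0])  # order runs by decision time
half = len(pairs) // 2
early = sum(o for _, o in pairs[:half]) / half
late = sum(o for _, o in pairs[half:]) / (len(pairs) - half)
print(f"mean overshoot of early decisions: {early:.3f}, of late decisions: {late:.3f}")
```

Comparing the two averages gives a rough indication of whether the overshoot distribution depends on the decision time for the chosen thresholds; repeating the experiment with larger thresholds relative to the mean increment probes the continuum limit described in the text.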

In the following, we assume that the condition

(55) |

is fulfilled. For many practical applications this condition is approximately fulfilled, see the numerical experiments in Section VI.

The results on optimal information usage carry over from continuous time to discrete time given that (55) holds.

###### Theorem 3.

We consider a binary sequential hypothesis testing problem with the hypotheses . Let and be two sequences of probability density functions of the sequence of real valued observations in case hypothesis
