The kill chain, identifying and destroying a target, can be factorized into four steps: (1) obtaining sensor readings, (2) classifying the object from the sensor readings (including sensor fusion when multiple sensors are employed), (3) determining whether the sensor reading is reliable, and (4) making a decision to fire or to hold fire.
Over the past 50 years, tremendous work has been placed into automating the sensor-exploitation-and-fusion step; this work has produced strong capabilities in the challenging problem of automatic target recognition (ATR). A true data-to-decision framework, however, could strengthen ATR by automating some aspects of the operator review step as well. In particular, one of the tasks the operator performs is to review whether each classification decision seems reasonable in view of its context. For example, consider an operator reviewing a sensor reporting 80% confidence of Russian Bombers over (a) Kiev and (b) New York City. The operator has three options: (1) naïvely accepting these sensor readings as received (in this case, reporting an 80% chance of bombers over New York), (2) rejecting the counter-intuitive sensor readings pro forma (this defeats the purpose of using the sensor at all), or (3) manually correcting the sensor readings, for example to 90% and 0.01%, respectively. Of these, option (3) is perhaps the most agreeable; however, there is a need for a principled way to perform these corrections.
In this work we consider a general solution to this challenge. For concreteness, we consider the case of imagery datasets (such as those in Figure 1), in which context is defined as knowledge of co-occurrence patterns among object classes. Both images in Figure 1
contain two objects, one of which the sensor is able to resolve with 99% confidence (“ACME warhead” and “dog”, respectively); the other is too blurred to be resolved. For this blurry object, we postulate that the sensor reports equal probability for each of three different possible object classes. We will show how context can enhance our classification decision for the blurry object in both cases.
Ii-a Problem Formulation
We begin by formulating the problem. We have an image, I, with objects, each of which represents exactly one of object classes, and is associated with one out of at least one possible contexts, . For a military situation, could be various conflict regions or locations; in civilian photography, these could refer to the topic of a collection (sporting events, architecture, portraits, etc). We further have a set of sensor readings, , such that (these sensor readings may be decisions from a single sensor or from a multisensor inference algorithm); sensor readings further have a reported uncertainty level.
Our problem is to quantify the context, , and use the context to correct low-confidence sensor readings. We consider two models for this purpose: a Bayesian Network (BN) and a Hierarchical Bayesian Model (HBM).
Ii-B Bayesian Network
We construct the Bayesian network in a completely data-driven way. Beginning with labeled images associated with context , we construct a graph for each where each object class is a node and the edge weights between and represent the number of in which objects of class and co-appear. For convenience, we assign parent-child relationships based on the alphabetical ordering of the node names.
The result is a Bayesian Network that fully encodes the joint distribution over all object classes. Further, the network is efficient in that only nodes that co-occur are connected (further, the threshold for connectivity can be raised to reduce the number of edges).
Ii-C Hierarchical Bayesian Model
For the Hierarchical Bayesian Model , we begin again with the labeled images, , associated with context, ; for each , we calculate the (an
-dimensional vector) and(an dimensional matrix) that give the frequency of each object class and the correlation matrix among the object classes, respectively. The values for and
can be measured quantitatively from labeled training data or provided as estimates.
We now model the sensor readings as having been drawn from a hierarchy of latent probability distributions. This is depicted in plate notation in Figure2, in which the shaded plates represent observed variables. In particular, and (defined above) are used to define a
-dimensional multivariate normal distribution from which some vector,, is drawn. We then normalize to obtain according to the formula:
Each represents the expected number of times that an object class will appear in the scene (note that the value of is fully deterministic given the value of ; hence, only appears in Figure 2).
Given , the vectors are then chosen subject to the constraints that , , and ; we interpret as the probability that object belongs to class . Finally, the sensor readings are drawn from a latent multivariate normal distribution peaked at
, with a standard deviation given by the sensor’s reported uncertainty. This is a key point: if the sensor reports a high degree of certainty, the Hierarchical Bayesian Model will not attempt to second-guess its decision (this is analogous to humans making counter-intuitive classification decisions, i.e., “I know a tiger when I see one!”).
To summarize, the HBM is a generative model that generates sensor observations by performing the following steps:
Generate the scene from context via .
Normalize according to equation (1).
For each of the objects, generate the true object class distribution by randomly choosing vectors that sum to and can be interpreted as probabilities (i.e., sum to one, no values below zero or above one).
Generate sensor observations using a multivariate normal peaked around with standard deviation given by the sensor’s reported uncertainty.
We can now write the joint distribution for the probability distribution of the observations and true values in terms of the hyper-parameters and . This is given in Equation 2.
Iii-a Toy Scenario
We now define a notional missile-defense scenario under which we can leverage this model. We imagine that there are three types of warheads, which we name ACME, GLOBEX, and TRIOZAP. We further imagine that there are three empires with nuclear technology: the Kingdoms of Ohio, Iowa, and Utah. We imagine that Ohio and Utah commonly use TRIOZAP warheads whereas Iowa rarely use them; further, Ohio often launches ACME and TRIOZAP warheads together, whereas Iowa and Utah rarely launch them together. This scenario is summarized in Table 1.
|Kingdom||Triozap warheads are…|
|Iowa||Rare||Anti-correlated with ACME|
|Ohio||Common||Correlated with ACME|
|Utah||Common||Anti-correlated with ACME|
Numerically, we assume that the relative frequency of each object class () and the correlation matrix among these object classes () are given by:
We now imagine that an imaging sensor returns Figure 1(a), from which two objects are detected: an ACME warhead at 99% 1% confidence, and an unknown warhead that is consistent with ACME, GLOBEX, or TRIOZAP at 33.33% 30% confidence each.
Iii-B HBM Results
We attempt to use these hyper-parameters to solve equation 2. This integral turns out to be intractable (this is common for HBMs); however, we can sample the latent true object distribution,
, using a sampling technique such as Markov Chain Monte Carlo (MCMC); Table2 shows the results..
|1||None||Physical Sensor Alone||0.334|
|2||Utah||Physical Sensor context,||0.423|
|3||Utah||Physical Sensor full context||0.314|
|4||Iowa||Physical Sensor full context||0.155|
|5||Ohio||Physical Sensor full context||0.667|
These results are aligned with our intuition:
Line 1 recapitulates our statement that the physical sensor alone gives a 33.3% probability to each of the three warhead classes.
Line 2 takes into account only the fact that Utah commonly uses TRIOZAP missiles; P(TRIOZAP) increases accordingly.
Line 3 takes into account both that Utah commonly uses TRIOZAP missiles and that TRIOZAP missiles are anti-correlated with ACME; these results largely cancel.
Lines 4 & 5 give the corresponding results for Iowa and Ohio; the strength of the contextual expectation coupled with the very uncertain sensor reading results in a substantial correction magnitude.
We pause to take advantage of the opportunity to compare sampling methods as implemented in the Python scripting language (as of early 2017). In particular, we use the PyMC package for Python 2.7 to sample the latent true object distribution via MCMC (“MCMC2”), and the same package for Python 3.6 to sample the latent true object distribution via MCMC (“MCMC3”), the Variational Bayes (VB) technique, and the Hamiltonian Monte Carlo (HMC) as implemented via the No-U-Turn Sampler (NUTS) . In all cases we use the default parameters, as the PyMC package claims to require minimal fine-tuning. We show our results in Table 3, in which the algorithm is defined to have converged when no more than 3 of the 20 Geweke Scores are greater than 0.01; this is quite a conservative criterion.
Table 3 shows that although all four methods give similar results, the VB method is notably different and the HMC technique is notably slow. Further, the interface to the PyMC package for Python 2 is far more user-friendly. We therefore quote all results using PyMC for Python 2. Though it is unfortunate that these different sampling methods give non-trivially different results, this was a known issue at the time of this work, see .
There is one final generalization we can consider: what if we have the image, but do not know the context? In this case, we can simply take the weighted average of the contexts of which we are aware; the weighted coefficients can be added to our model. These weighted coefficients are the hyperpriors.
To test this approach, we consider an extreme case in which we have an image with three ACME warheads and three GLOBEX warheads as well as blurry object. We do not know which context (country) this comes from, but a clever human would suspect Iowa (because Iowa does not use many TRIOZAP warheads). Indeed, a flat hyperprior chooses a 49% chance of Iowa, a 26% chance of Utah, and only a 25% chance of Ohio. Using these hyperpriors, the final probability for the blurry object to be a TRIOZAP warhead is therefore changed from 33.3% to 36.6%. These hyperpriors therefore allow us to leverage context even when we do not know which context to leverage.
Iii-D Results on Microsoft’s COCO Dataset
As a real-world demonstration, we use Microsoft’s COCO dataset , which contains over 82,000 hand-labeled images in which all instances of 80 categories (e.g., person, car, bus, hot dog) are labeled. These 80 categories are grouped into 12 super-categories (e.g., outdoor, animal, sports). We now apply both the Bayesian Network and the Hierarchical Bayesian Model to this dataset, and use these learned models to resolve the blurry object shown in Figure 1(b).
We use the entire COCO dataset to construct a Bayesian Network as described previously. In particular, we connect two nodes only if their edge weight (number of images in which the object classes co-occur, out of total images) is greater than 1000. Part of the resulting Bayesian Network is depicted in Figure 3.
We then use this network to compute our contextual expectation for an image similar to Figure 1(b), in which one figure is clearly a “street” and the other figure is blurry. From the Bayesian Network, we can use standard inference techniques to calculate the probabilities for the blurry object as shown in Table 4.
|Object||Probability of co-occurrence|
If the physical sensor provides observations similar to the above (even at low confidence), then the concordance between the sensor reading and the contextual expectation can augment the operator’s confidence in the sensor output. If the sensor readings are very dissimilar from Table 4 (especially if the sensor confidence is low), then the operator should be more cautious of these sensor readings. The BN alone, however, does not provide a principled way to combine these divergent strategies.
As with the BN, we again use the COCO dataset to quantify context; in particular, we calculate and from the image co-occurrence. As an additional exercise, we use the 12 super-categories described above as different contexts, and calculate and for the 80 categories separately for each of the 12 contexts (by definition, each super-category will be dominated by objects from the categories that make up the super-category; however, other categories will also be represented due to co-occurrence).
Given these hyper-parameters, we address Figure 1(b), in which one object is clearly a dog, while we posit that the sensor tells us that the blurry object is consistent with another dog, a cat, or a pair of skis. As before, we imagine that all of these hypotheses are equally likely (according to the sensor), with confidence each. The HBM (trained on all images) substantially decreases the likelihood of skis (skis are rarely photographed at all, and even less commonly with dogs). More specific contexts give even more emphatic results, however: if the photo is evaluated under the “animal” context, the ski hypothesis is almost completely rejected; while in the “sports” context, the skis become the most likely outcome by far. This is consistent with our intuition: if the photo is labeled “sports” (e.g., if it appears in Sports Illustrated), then we would expect the image to contain something sport-relevant; this expectation would be very different if the photo appeared in Animal Planet. These numerical results are given in Table 5.
We have designed an HBM and used it to enhance low-confidence sensor readings using contextual information, eliminating the need for ad hoc adjustments by the analyst (at least for this type of context). We further showed numerical results on cases in which the context was defined as co-occurrence in imagery. We find that, in this setting, the HBM provides a natural way to numerically fuse the contextual expectation with the sensor readings and sensor uncertainties. We also considered using a Bayesian Network for this purpose; though the Bayesian Network assumes that sensor readings and context are uncorrelated, it is more computationally efficient. Though this setting defines context as co-occurrence only, considering other definitions of context would be a natural extension to this work.
This work was supported by the Missile Defense Agency under contract HQ0147-16-C-7602. Approved for Public Release 18-MDA-9664 (30 May 2018).
-  A. Gelman et al., Bayesian Data Analysis. CRC Press, 2014.
-  M. D. Hoffman and A. Gelman, “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” ArXiv e-prints, Nov. 2011.
-  Pymc-devs, “ADVI, NUTS and Metropolis produce significantly different results,” https://github.com/pymc-devs/pymc3/issues/1163, 2016.
-  T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312