I. Introduction
The kill chain, the process of identifying and destroying a target, can be factorized into four steps: (1) obtaining sensor readings, (2) classifying the object from the sensor readings (including sensor fusion when multiple sensors are employed), (3) determining whether the sensor reading is reliable, and (4) making the decision to fire or to hold fire.
Over the past 50 years, tremendous effort has gone into automating the sensor exploitation and fusion step; this work has produced strong capabilities in the challenging problem of automatic target recognition (ATR). A true data-to-decision framework, however, could strengthen ATR by automating some aspects of the operator review step as well. In particular, one of the tasks the operator performs is to review whether each classification decision seems reasonable in view of its context. For example, consider an operator reviewing a sensor reporting 80% confidence of Russian bombers over (a) Kiev and (b) New York City. The operator has three options: (1) naïvely accepting these sensor readings as received (in this case, reporting an 80% chance of bombers over New York), (2) rejecting the counterintuitive sensor readings pro forma (which defeats the purpose of using the sensor at all), or (3) manually correcting the sensor readings, for example to 90% and 0.01%, respectively. Of these, option (3) is perhaps the most agreeable; however, there is a need for a principled way to perform these corrections.
In this work we consider a general solution to this challenge. For concreteness, we consider the case of imagery datasets (such as those in Figure 1), in which context is defined as knowledge of co-occurrence patterns among object classes. Both images in Figure 1 contain two objects, one of which the sensor is able to resolve with 99% confidence (“ACME warhead” and “dog”, respectively); the other is too blurred to be resolved. For this blurry object, we postulate that the sensor reports equal probability for each of three different possible object classes. We will show how context can enhance our classification decision for the blurry object in both cases.
II. Model
II-A. Problem Formulation
We begin by formulating the problem. We have an image, I, containing M objects, each of which represents exactly one of N object classes; the image is associated with one of a set of possible contexts, c. For a military situation, the contexts could be various conflict regions or locations; in civilian photography, they could refer to the topic of a collection (sporting events, architecture, portraits, etc.). We further have a set of sensor readings, s_1, …, s_M, one per object, each assigning a probability to each of the N object classes (these sensor readings may be decisions from a single sensor or from a multi-sensor inference algorithm); sensor readings further carry a reported uncertainty level.
Our problem is to quantify the context, c, and use the context to correct low-confidence sensor readings. We consider two models for this purpose: a Bayesian Network (BN) and a Hierarchical Bayesian Model (HBM).
II-B. Bayesian Network
We construct the Bayesian network in a completely data-driven way. Beginning with labeled images associated with context c, we construct a graph for each context in which each object class is a node and the edge weight between nodes i and j represents the number of images in which objects of classes i and j co-appear. For convenience, we assign parent-child relationships based on the alphabetical ordering of the node names.
The result is a Bayesian Network that fully encodes the joint distribution over all object classes. Further, the network is efficient in that only nodes that co-occur are connected (and the threshold for connectivity can be raised to reduce the number of edges).
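The co-occurrence counting that defines these edge weights can be sketched in a few lines of Python. This is a minimal illustration only; the function name and the toy image label sets are our own, not part of the described system:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(labeled_images, threshold=0):
    """Edge weights for one context: the number of images in which each
    pair of object classes co-appears. Pairs are ordered alphabetically,
    matching the parent-child convention described above."""
    counts = Counter()
    for classes in labeled_images:
        for a, b in combinations(sorted(set(classes)), 2):
            counts[(a, b)] += 1
    # Raising the threshold prunes rare co-occurrences and sparsifies the network.
    return {edge: n for edge, n in counts.items() if n > threshold}

# Toy data: three labeled images from one hypothetical context.
images = [{"dog", "person"}, {"car", "dog", "person"}, {"bus", "car"}]
edges = cooccurrence_edges(images)
```

A larger threshold (e.g., the 1000-image cutoff used later for COCO) simply drops the weaker edges from the returned dictionary.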
II-C. Hierarchical Bayesian Model
For the Hierarchical Bayesian Model [1], we begin again with the labeled images associated with context c; for each context we calculate μ_c (an N-dimensional vector) and Σ_c (an N×N matrix) that give the frequency of each object class and the correlation matrix among the object classes, respectively. The values for μ_c and Σ_c can be measured quantitatively from labeled training data or provided as estimates.
We now model the sensor readings as having been drawn from a hierarchy of latent probability distributions. This is depicted in plate notation in Figure 2, in which the shaded plates represent observed variables. In particular, μ_c and Σ_c (defined above) are used to define an N-dimensional multivariate normal distribution from which some vector, x, is drawn. We then normalize x to obtain λ according to the formula:

λ_i = M · exp(x_i) / Σ_j exp(x_j)   (1)
Each λ_i represents the expected number of times that object class i will appear in the scene (note that the value of λ is fully deterministic given the value of x; hence, only x appears in Figure 2).
Given λ, the vectors p_1, …, p_M are then chosen subject to the constraints that Σ_m p_{m,i} = λ_i, 0 ≤ p_{m,i} ≤ 1, and Σ_i p_{m,i} = 1; we interpret p_{m,i} as the probability that object m belongs to class i. Finally, the sensor readings s_m are drawn from a latent multivariate normal distribution peaked at p_m, with a standard deviation given by the sensor’s reported uncertainty. This is a key point: if the sensor reports a high degree of certainty, the Hierarchical Bayesian Model will not attempt to second-guess its decision (this is analogous to humans making counterintuitive classification decisions, i.e., “I know a tiger when I see one!”).
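The key behavior just described (a confident sensor resists correction, while an uncertain one is pulled toward the contextual expectation) can be illustrated in one dimension by the precision-weighted fusion of two normal distributions. This is a simplified stand-in for the full HBM posterior, with illustrative numbers of our own choosing:

```python
def fuse_normal(prior_mean, prior_sd, reading, sensor_sd):
    """Posterior mean and sd when a normal contextual prior is combined
    with a normal sensor likelihood (precision-weighted average)."""
    w_prior = 1.0 / prior_sd ** 2
    w_sensor = 1.0 / sensor_sd ** 2
    mean = (w_prior * prior_mean + w_sensor * reading) / (w_prior + w_sensor)
    sd = (w_prior + w_sensor) ** -0.5
    return mean, sd

# A confident sensor (sd = 0.01) is barely moved by a conflicting context...
confident, _ = fuse_normal(0.10, 0.20, 0.80, 0.01)
# ...while an uncertain one (sd = 0.30) is pulled strongly toward the prior.
uncertain, _ = fuse_normal(0.10, 0.20, 0.80, 0.30)
```

The weights are the inverse variances, so the corrected estimate always lands between the contextual expectation and the sensor reading, closer to whichever is more certain.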
To summarize, the HBM is a generative model that produces sensor observations by performing the following steps:

1. Generate the latent scene vector x from context c via x ~ N(μ_c, Σ_c).

2. Normalize x to λ according to equation (1).

3. For each of the M objects, generate the true object class distribution p_m by randomly choosing vectors that sum (over objects) to λ and can be interpreted as probabilities (i.e., each sums to one, with no values below zero or above one).

4. Generate sensor observations s_m from a multivariate normal peaked around p_m with standard deviation given by the sensor’s reported uncertainty.
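Under simplifying assumptions of our own (a diagonal covariance, a softmax normalization for step 2, and an equal split of λ across objects in step 3, the simplest choice satisfying the constraints), the four steps above can be sketched as:

```python
import math
import random

def generate_scene(mu, sigma_diag, n_objects, sensor_sd, seed=1):
    """One draw from the generative story above (per-class independent
    Gaussians, i.e., a diagonal covariance, for brevity)."""
    rng = random.Random(seed)
    # Step 1: draw the latent scene vector x ~ N(mu, Sigma).
    x = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma_diag)]
    # Step 2: normalize to expected class counts lam, which sum to n_objects.
    z = [math.exp(v) for v in x]
    lam = [n_objects * v / sum(z) for v in z]
    # Step 3: per-object class probabilities p_m that sum (over objects) to lam;
    # here every object simply gets lam / n_objects, the simplest valid choice.
    p = [[l / n_objects for l in lam] for _ in range(n_objects)]
    # Step 4: noisy sensor readings peaked at each p_m.
    readings = [[pm_i + sensor_sd * rng.gauss(0.0, 1.0) for pm_i in pm]
                for pm in p]
    return lam, p, readings
```

A class whose latent mean is strongly negative (rare in this context) receives a small expected count, and the sensor readings scatter around the true per-object probabilities with the sensor's reported spread.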
We can now write the joint distribution for the probability of the observations and true values in terms of the hyperparameters μ_c and Σ_c. This is given in Equation (2).
P(S, {p_m}, x | μ_c, Σ_c) = N(x; μ_c, Σ_c) · P({p_m} | λ(x)) · ∏_m N(s_m; p_m, σ_m)   (2)
III. Results
III-A. Toy Scenario
We now define a notional missile-defense scenario in which we can exercise this model. We imagine that there are three types of warheads, named ACME, GLOBEX, and TRIOZAP. We further imagine that there are three kingdoms with nuclear technology: Ohio, Iowa, and Utah. Ohio and Utah commonly use TRIOZAP warheads whereas Iowa rarely does; further, Ohio often launches ACME and TRIOZAP warheads together, whereas Iowa and Utah rarely launch them together. This scenario is summarized in Table 1.
Kingdom  TRIOZAP frequency  Relationship to ACME
Iowa  Rare  Anti-correlated
Ohio  Common  Correlated
Utah  Common  Anti-correlated
Numerically, we assume that the relative frequencies of the object classes (μ_c) and the correlation matrices among these object classes (Σ_c) are given by:
(3)  
(4)  
(5)  
(6) 
We now imagine that an imaging sensor returns Figure 1(a), from which two objects are detected: an ACME warhead at 99% ± 1% confidence, and an unknown warhead that is consistent with ACME, GLOBEX, or TRIOZAP at 33.33% ± 30% confidence each.
III-B. HBM Results
We attempt to use these hyperparameters to solve equation (2). This integral turns out to be intractable (as is common for HBMs); however, we can sample the latent true object distribution, p_m, using a technique such as Markov Chain Monte Carlo (MCMC); Table 2 shows the results.

#  Context  Sensor configuration  P(TRIOZAP)
1  None  Physical sensor alone  0.334
2  Utah  Physical sensor + context (μ only)  0.423
3  Utah  Physical sensor + full context (μ and Σ)  0.314
4  Iowa  Physical sensor + full context (μ and Σ)  0.155
5  Ohio  Physical sensor + full context (μ and Σ)  0.667
These results are aligned with our intuition:

Line 1 recapitulates our statement that the physical sensor alone gives a 33.3% probability to each of the three warhead classes.

Line 2 takes into account only the fact that Utah commonly uses TRIOZAP missiles; P(TRIOZAP) increases accordingly.

Line 3 takes into account both that Utah commonly uses TRIOZAP missiles and that TRIOZAP missiles are anti-correlated with ACME; these effects largely cancel.

Lines 4 & 5 give the corresponding results for Iowa and Ohio; the strength of the contextual expectation, coupled with the very uncertain sensor reading, results in a substantial correction magnitude.
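For readers who want to experiment without PyMC, a minimal random-walk Metropolis sampler for a single latent probability captures the flavor of this computation. This is a toy stand-in for the full model; the prior and sensor numbers below are illustrative, not those of Table 2:

```python
import math
import random

def log_post(p, prior_mean, prior_sd, reading, sensor_sd):
    """Unnormalized log-posterior: Gaussian contextual prior times
    Gaussian sensor likelihood, restricted to valid probabilities."""
    if not 0.0 < p < 1.0:
        return float("-inf")
    lp = -0.5 * ((p - prior_mean) / prior_sd) ** 2
    lp -= 0.5 * ((reading - p) / sensor_sd) ** 2
    return lp

def posterior_mean(prior_mean, prior_sd, reading, sensor_sd,
                   n_iter=20000, step=0.1, seed=0):
    rng = random.Random(seed)
    p, samples = 0.5, []
    for _ in range(n_iter):
        q = p + rng.gauss(0.0, step)  # symmetric random-walk proposal
        delta = (log_post(q, prior_mean, prior_sd, reading, sensor_sd)
                 - log_post(p, prior_mean, prior_sd, reading, sensor_sd))
        if delta >= 0 or rng.random() < math.exp(delta):
            p = q                      # Metropolis accept
        samples.append(p)
    burn = n_iter // 2                 # discard the first half as burn-in
    return sum(samples[burn:]) / (n_iter - burn)
```

With an uncertain sensor the posterior mean moves substantially toward the contextual prior; with a tight sensor uncertainty it stays pinned near the raw reading, as in the discussion above.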
We pause to take advantage of the opportunity to compare sampling methods as implemented in Python (as of early 2017). In particular, we use the PyMC package for Python 2.7 to sample the latent true object distribution via MCMC (“MCMC2”), and the PyMC package for Python 3.6 to sample the latent true object distribution via MCMC (“MCMC3”), the Variational Bayes (VB) technique, and Hamiltonian Monte Carlo (HMC) as implemented via the No-U-Turn Sampler (NUTS) [2]. In all cases we use the default parameters, as the PyMC package claims to require minimal fine-tuning. We show our results in Table 3, in which the algorithm is defined to have converged when no more than 3 of the 20 Geweke scores are greater than 0.01; this is quite a conservative criterion.
Sampling method  P(TRIOZAP), Ohio  P(TRIOZAP), Utah  Time per iter.  Iter. to converge
MCMC2  0.667  0.155  158 s  20k
MCMC3  0.668  0.107  680 s  20k
HMC  0.667  0.109  23,484 s  50k
VB  0.747  0.186  236 s  30k
Table 3 shows that although the methods give broadly similar results, the VB estimates differ notably and the HMC technique is notably slow. Further, the interface to the PyMC package for Python 2 is far more user-friendly. We therefore quote all results using PyMC for Python 2. Though it is unfortunate that these different sampling methods give non-trivially different results, this was a known issue at the time of this work; see [3].
III-C. Hyperpriors
There is one final generalization we can consider: what if we have the image but do not know the context? In this case, we can simply take a weighted average over the contexts of which we are aware; the weighting coefficients can be added to our model. These weighting coefficients are the hyperpriors.
To test this approach, we consider an extreme case in which we have an image with three ACME warheads and three GLOBEX warheads, as well as a blurry object. We do not know which context (kingdom) this comes from, but a clever human would suspect Iowa (because Iowa uses few TRIOZAP warheads). Indeed, starting from a flat hyperprior, the model infers a 49% chance of Iowa, a 26% chance of Utah, and only a 25% chance of Ohio. Using these weights, the final probability for the blurry object to be a TRIOZAP warhead changes from 33.3% to 36.6%. Hyperpriors therefore allow us to leverage context even when we do not know which context to leverage.
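The reweighting of contexts by the observed scene can be sketched as a Bayes-rule update. The class frequencies below are invented for illustration and are not the μ values of the toy scenario:

```python
def context_weights(scene_counts, class_freq_by_context, prior=None):
    """Bayes-rule weights over contexts given the clearly resolved objects.

    class_freq_by_context maps context -> {class: relative frequency};
    scene_counts maps class -> number of confidently resolved instances."""
    contexts = list(class_freq_by_context)
    if prior is None:                      # flat hyperprior by default
        prior = {c: 1.0 / len(contexts) for c in contexts}
    scores = {}
    for c in contexts:
        score = prior[c]
        for cls, n in scene_counts.items():
            score *= class_freq_by_context[c].get(cls, 1e-6) ** n
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Invented frequencies: Iowa fields few TRIOZAPs, so a scene full of ACME
# and GLOBEX warheads should point toward Iowa.
FREQS = {
    "Iowa": {"ACME": 0.50, "GLOBEX": 0.45, "TRIOZAP": 0.05},
    "Ohio": {"ACME": 0.40, "GLOBEX": 0.20, "TRIOZAP": 0.40},
    "Utah": {"ACME": 0.20, "GLOBEX": 0.40, "TRIOZAP": 0.40},
}
```

With these invented numbers, a scene of three ACME and three GLOBEX warheads concentrates most of the weight on Iowa, mirroring the intuition described above.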
III-D. Results on Microsoft’s COCO Dataset
As a real-world demonstration, we use Microsoft’s COCO dataset [4], which contains over 82,000 hand-labeled images in which all instances of 80 object categories (e.g., person, car, bus, hot dog) are labeled. These 80 categories are grouped into 12 supercategories (e.g., outdoor, animal, sports). We now apply both the Bayesian Network and the Hierarchical Bayesian Model to this dataset, and use the learned models to resolve the blurry object shown in Figure 1(b).
III-D1. BN
We use the entire COCO dataset to construct a Bayesian Network as described previously. In particular, we connect two nodes only if their edge weight (the number of images in the dataset in which the two object classes co-occur) is greater than 1000. Part of the resulting Bayesian Network is depicted in Figure 3.
We then use this network to compute our contextual expectation for an image similar to Figure 1(b), in which one object is clearly a “street” and the other is blurry. From the Bayesian Network, we can use standard inference techniques to calculate the probabilities for the blurry object, as shown in Table 4.
Object  Probability of co-occurrence
Man  0.240
Road  0.151
Traffic  0.076
Sign  0.074
Sidewalk  0.051
If the physical sensor provides observations similar to the above (even at low confidence), then the concordance between the sensor reading and the contextual expectation can augment the operator’s confidence in the sensor output. If the sensor readings are very dissimilar from Table 4 (especially if the sensor confidence is low), then the operator should be more cautious of these sensor readings. The BN alone, however, does not provide a principled way to combine these divergent strategies.
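A minimal stand-in for this kind of BN query is a conditional co-occurrence probability estimated directly from counts; the toy labels below are invented and the paper's actual inference uses the full network:

```python
def cooccur_prob(labeled_images, anchor, candidate):
    """Estimate P(candidate present | anchor present) by counting the
    images that contain the anchor class (toy data, invented labels)."""
    with_anchor = [img for img in labeled_images if anchor in img]
    if not with_anchor:
        return 0.0  # anchor never observed; no basis for an estimate
    return sum(candidate in img for img in with_anchor) / len(with_anchor)

# Toy data: three street scenes and one beach scene.
scenes = [{"street", "man"}, {"street", "road"},
          {"street", "man", "sign"}, {"beach"}]
p_man = cooccur_prob(scenes, "street", "man")
```

Comparing such conditional probabilities against the sensor's own class scores gives the concordance check described above, though, as noted, the BN alone does not dictate how to combine the two when they disagree.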
III-D2. HBM
As with the BN, we again use the COCO dataset to quantify context; in particular, we calculate μ and Σ from the image co-occurrence statistics. As an additional exercise, we use the 12 supercategories described above as different contexts, and calculate μ_c and Σ_c over the 80 categories separately for each of the 12 contexts (by definition, each supercategory is dominated by objects from the categories that make it up; however, other categories are also represented due to co-occurrence).
Given these hyperparameters, we address Figure 1(b), in which one object is clearly a dog, while we posit that the sensor tells us that the blurry object is consistent with another dog, a cat, or a pair of skis. As before, we imagine that all of these hypotheses are equally likely (according to the sensor), with 33.3% confidence each. The HBM (trained on all images) substantially decreases the likelihood of skis (skis are rarely photographed at all, and even less commonly with dogs). More specific contexts give even more emphatic results: if the photo is evaluated under the “animal” context, the ski hypothesis is almost completely rejected, while in the “sports” context, the skis become by far the most likely outcome. This is consistent with our intuition: if the photo is labeled “sports” (e.g., if it appears in Sports Illustrated), then we would expect the image to contain something sport-relevant; the expectation would be very different if the photo appeared in Animal Planet. The numerical results are given in Table 5.
Context  P(Skis)
Sensor alone  33.3%
All contexts  21.8%
Animal context  0.65%
Sports context  63.67%
IV. Conclusions
We have designed an HBM and used it to enhance low-confidence sensor readings using contextual information, eliminating the need for ad hoc adjustments by the analyst (at least for this type of context). We further showed numerical results on cases in which the context was defined as co-occurrence in imagery. We find that, in this setting, the HBM provides a natural way to numerically fuse the contextual expectation with the sensor readings and sensor uncertainties. We also considered using a Bayesian Network for this purpose; though the Bayesian Network assumes that sensor readings and context are uncorrelated, it is more computationally efficient. Since this setting defines context as co-occurrence only, considering other definitions of context would be a natural extension to this work.
Acknowledgments
This work was supported by the Missile Defense Agency under contract HQ014716C7602. Approved for Public Release 18MDA9664 (30 May 2018).
References
 [1] A. Gelman et al., Bayesian Data Analysis. CRC Press, 2014.
 [2] M. D. Hoffman and A. Gelman, “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” arXiv e-prints, Nov. 2011.
 [3] pymc-devs, “ADVI, NUTS and Metropolis produce significantly different results,” https://github.com/pymc-devs/pymc3/issues/1163, 2016.
 [4] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312