1 Introduction
For many budget-constrained environmental sensing and monitoring applications in the real world, active learning/sensing is an attractive, frugal alternative to passive high-resolution (hence, prohibitively costly) sampling of the spatially varying target phenomenon of interest. In contrast to the latter, active learning aims to select and gather the most informative observations for modeling and predicting the spatially varying phenomenon, given some sampling budget constraints (e.g., quantity of deployed sensors, energy consumption, mission time).
In practice, the target phenomenon often coexists and correlates well with some auxiliary type(s) of phenomena whose measurements may be more spatially correlated, less noisy (e.g., due to higher-quality sensors), and/or less tedious to sample (e.g., due to greater availability/quantity, higher sampling rate, and/or lower sampling cost of the deployed sensors of these type(s)), and can consequently be exploited for improving its prediction. For example, to monitor soil pollution by some heavy metal (e.g., cadmium), its complex and time-consuming extraction from soil samples can be alleviated by supplementing its prediction with correlated auxiliary types of soil measurements (e.g., pH) that are easier to sample [Goovaerts1997]. Similarly, to monitor algal bloom in the coastal ocean, plankton abundance correlates well with auxiliary types of ocean measurements (e.g., chlorophyll a, temperature, and salinity) that can be sampled more readily. Other examples of real-world applications include remote sensing, traffic monitoring [Chen et al.2012, Chen, Low, and Tan2013, Chen et al.2015], monitoring of groundwater and indoor environmental quality [Xu et al.2014], and precision agriculture [Webster and Oliver2007], among others.
All of the above practical examples motivate the need to design and develop an active learning algorithm that selects not just the most informative sampling locations to be observed but also the types of measurements (i.e., target and/or auxiliary) at each selected location for minimizing the predictive uncertainty of unobserved areas of a target phenomenon given a sampling budget, which is the focus of our work here. (Our work differs from multivariate spatial sampling algorithms [Bueso et al.1999, Le, Sun, and Zidek2003] that aim to improve the prediction of all types of coexisting phenomena, for which existing active learning algorithms for sampling measurements only from the target phenomenon can be extended and applied straightforwardly, as detailed in Section 3.)
To achieve this, we model all types of coexisting phenomena (i.e., target and auxiliary) jointly as a multi-output Gaussian process (MOGP) [Álvarez and Lawrence2011, Bonilla, Chai, and Williams2008, Teh and Seeger2005], which allows the spatial correlation structure of each type of phenomenon and the cross-correlation structure between different types of phenomena to be formally characterized. More importantly, unlike non-probabilistic multivariate regression methods, the probabilistic MOGP regression model allows the predictive uncertainty of the target phenomenon (as well as of the auxiliary phenomena) to be formally quantified (e.g., based on the entropy or mutual information criterion) and consequently exploited for deriving the active learning criterion.
To the best of our knowledge, this paper is the first to present an efficient algorithm for active learning of a MOGP model. We consider utilizing the entropy criterion, which is widely used for active learning of a single-output GP model, to measure the predictive uncertainty of a target phenomenon. Unfortunately, for the MOGP model, such a criterion scales poorly in the number of candidate sampling locations of the target phenomenon (Section 3) and even more so in the number of selected observations (i.e., sampling budget) when optimized (Section 4). To resolve this scalability issue, we first exploit a structure common to a unifying framework of sparse MOGP models (Section 2) for deriving a novel active learning criterion (Section 3). Then, we define a relaxed notion of submodularity called ε-submodularity (the original notion of submodularity has been used in [Krause and Golovin2014, Krause, Singh, and Guestrin2008] to theoretically guarantee the performance of their algorithms for active learning of a single-output GP model) and exploit the ε-submodularity property of our new criterion for devising a polynomial-time approximation algorithm that guarantees a constant-factor approximation of that achieved by the optimal set of selected observations (Section 4). We empirically evaluate the performance of our proposed algorithm on three real-world datasets (Section 5).
2 Modeling Coexisting Phenomena with a Multi-Output Gaussian Process (MOGP)
Convolved MOGP (CMOGP) Regression.
A number of MOGP models such as cokriging [Webster and Oliver2007], parameter sharing [Skolidis2012], and the linear model of coregionalization (LMC) [Teh and Seeger2005, Bonilla, Chai, and Williams2008] have been proposed to handle multiple types of correlated outputs. A generalization of LMC called the convolved MOGP (CMOGP) model has been empirically demonstrated in [Álvarez and Lawrence2011] to outperform the others and will be the MOGP model of our choice due to its approximation, whose structure can be exploited for deriving our active learning criterion and in turn an efficient approximation algorithm, as detailed later.
Let multiple types of coexisting phenomena be defined to vary as a realization of a CMOGP over a domain corresponding to a set of sampling locations, such that each location is associated with a noisy realized (random) output measurement if it is observed (unobserved) for a given measurement type. Then, the measurement of each type is defined as a convolution between a smoothing kernel and a latent measurement function, corrupted by an additive noise with some noise variance. (To ease exposition, we consider a single latent function. Note, however, that multiple latent functions can be used to improve the fidelity of modeling, as shown in [Álvarez and Lawrence2011]. More importantly, our proposed algorithm and theoretical results remain valid with multiple latent functions.) As shown in [Álvarez and Lawrence2011], if the latent function is a GP, then the outputs also jointly form a GP, that is, every finite subset of measurements follows a multivariate Gaussian distribution. Such a GP is fully specified by its prior mean and covariance for all pairs of location-type tuples; the latter characterizes the spatial correlation structure of each type of phenomenon and the cross-correlation structure between different types of phenomena. Specifically, let the latent function be a GP whose prior covariance is governed by a signal variance controlling the intensity of measurements of each type and by diagonal precision matrices controlling, respectively, the degrees of correlation between latent measurements and of cross-correlation between latent and output measurements. Then,
(1)
where the Kronecker delta takes value 1 if the two tuples coincide in both type and location, and 0 otherwise.
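As a concrete illustration (not the paper's exact parameterization; all names and values below are assumptions), consider a single 1-D latent GP whose kernel is a Gaussian and per-type Gaussian smoothing kernels: the convolution then has a closed form whose width is the sum of the latent and smoothing variances, and the resulting joint covariance over all output types is positive semi-definite by construction.

```python
import numpy as np

def cmogp_cov(x1, x2, i, j, latent_var, smooth_var, signal):
    """Cross-covariance cov(f_i(x1), f_j(x2)) for output types i, j of a
    1-D convolved MOGP with one latent GP and Gaussian smoothing kernels
    (hypothetical parameter names). Convolving Gaussians sums variances:
    the kernel between types i and j is a Gaussian of variance
    latent_var + smooth_var[i] + smooth_var[j]."""
    var = latent_var + smooth_var[i] + smooth_var[j]
    d = x1[:, None] - x2[None, :]
    return signal[i] * signal[j] * np.exp(-0.5 * d**2 / var) / np.sqrt(2 * np.pi * var)

# Joint covariance over a grid for two output types.
x = np.linspace(0.0, 5.0, 20)
latent_var, smooth_var, signal = 1.0, [0.1, 0.5], [1.0, 0.8]
K = np.block([[cmogp_cov(x, x, i, j, latent_var, smooth_var, signal)
               for j in range(2)] for i in range(2)])
assert np.allclose(K, K.T)                  # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-9  # positive semi-definite
```

The off-diagonal blocks of `K` are the cross-correlation structure between the two types that the active learning criterion later exploits.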
Supposing a column vector of realized measurements is available for some set of tuples of observed locations and their corresponding measurement types, a CMOGP regression model can exploit these observations to provide a Gaussian predictive distribution of the measurements for any set of tuples of unobserved locations and their corresponding measurement types with the following posterior mean vector and covariance matrix:
(2) 
where the prior covariance terms are defined analogously for any pair of sets of tuples.
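In generic notation (our variable names here are illustrative, not the paper's), the posterior in (2) is the standard Gaussian conditioning rule and can be sketched as:

```python
import numpy as np

def gp_posterior(K_uu, K_uo, K_oo, y_o, noise_var):
    """Gaussian predictive distribution in the form of (2): posterior
    mean and covariance at unobserved tuples U given realized
    measurements y_o at observed tuples O. K_uo is the cross-covariance
    between U and O; names are illustrative."""
    A = K_oo + noise_var * np.eye(len(y_o))   # observed covariance plus noise
    # Solve linear systems instead of forming an explicit inverse.
    mean = K_uo @ np.linalg.solve(A, y_o)
    cov = K_uu - K_uo @ np.linalg.solve(A, K_uo.T)
    return mean, cov

# Tiny 1-D check with a squared-exponential kernel.
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
x_o = np.array([0.0, 1.0, 2.0])
y_o = np.sin(x_o)
mean, cov = gp_posterior(k(x_o, x_o), k(x_o, x_o), k(x_o, x_o), y_o, 1e-8)
assert np.allclose(mean, y_o, atol=1e-3)  # interpolates at observed inputs
assert cov.diagonal().max() < 1e-3        # near-zero posterior variance there
```

Note that the posterior covariance does not depend on the realized values `y_o`, which is what later lets the entropy criterion be optimized before any measurement is taken.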
Sparse CMOGP Regression.
A limitation of the CMOGP model is its poor scalability in the number of observations: Computing its Gaussian predictive distribution (2) requires inverting a covariance matrix over all observations, which incurs time cubic in their number. To improve its scalability, a unifying framework of sparse CMOGP regression models such as the deterministic training conditional, fully independent training conditional, and partially independent training conditional (PITC) approximations [Álvarez and Lawrence2011] exploits a vector of inducing measurements for some small set of inducing locations to approximate each measurement:
These approximations also share two structural properties that can be exploited for deriving our active learning criterion and in turn an efficient approximation algorithm: (P1) measurements of different types are conditionally independent given the inducing measurements, and (P2) the observed and unobserved measurements are conditionally independent given the inducing measurements. PITC will be used as the sparse CMOGP regression model in our work here since the others in the unifying framework impose further assumptions. With the above structural properties, PITC can utilize the observations to provide a Gaussian predictive distribution where
(3) 
such that for any , () is a covariance matrix with covariance components for all and ( for all ), is the transpose of , and is a block-diagonal matrix constructed from the diagonal blocks of for any , each of which is a matrix for where and . Note that computing (3) does not require the inducing locations to be observed. Also, the covariance matrix in (2) is approximated by a reduced-rank matrix summed with the resulting sparsified residual matrix. So, by using the matrix inversion lemma to invert the approximated covariance matrix and applying some algebraic manipulations, computing (3) incurs time linear in the number of observations for a fixed number of inducing locations [Álvarez and Lawrence2011], in the case of evenly distributed observations among all types.
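The matrix-inversion-lemma step can be sketched as follows (an illustrative reduced-rank plus residual inversion, not the paper's exact implementation; for clarity the block-diagonal residual is passed as a dense matrix):

```python
import numpy as np

def pitc_inverse(K_xu, K_uu, B):
    """Invert the approximated covariance Q + B via the matrix inversion
    lemma, where Q = K_xu K_uu^{-1} K_xu^T is the reduced-rank part
    (m inducing locations) and B is the sparsified residual,
    block-diagonal in general. The only dense solve is m x m, replacing
    a full n x n inversion. Illustrative sketch."""
    B_inv = np.linalg.inv(B)              # cheap when B is block-diagonal
    P = K_uu + K_xu.T @ B_inv @ K_xu      # m x m inner matrix
    return B_inv - B_inv @ K_xu @ np.linalg.solve(P, K_xu.T @ B_inv)

rng = np.random.default_rng(0)
n, m = 6, 2
R = rng.standard_normal((m, m))
K_uu = R @ R.T + m * np.eye(m)            # SPD m x m inducing covariance
K_xu = rng.standard_normal((n, m))        # cross-covariance to inducing points
B = np.diag(rng.random(n) + 1.0)          # positive diagonal residual
Q = K_xu @ np.linalg.solve(K_uu, K_xu.T)
assert np.allclose(pitc_inverse(K_xu, K_uu, B) @ (Q + B), np.eye(n), atol=1e-8)
```

With B genuinely block-diagonal, inverting it block by block is what yields the reduced overall cost cited above.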
3 Active Learning of CMOGP
Recall from Section 1 that the entropy criterion can be used to measure the predictive uncertainty of the unobserved areas of a target phenomenon. Using the CMOGP model (Section 2), the Gaussian posterior joint entropy (i.e., predictive uncertainty) of the measurements for any set of tuples of unobserved locations and their corresponding measurement types can be expressed in terms of its posterior covariance matrix (2), which is independent of the realized measurements:
Let a designated index denote the type of measurements of the target phenomenon. (Our proposed algorithm can be extended to handle multiple types of target phenomena, as demonstrated in Section 5.) Then, active learning of a CMOGP model involves selecting an optimal set of tuples of sampling locations and their corresponding measurement types to be observed, subject to the sampling budget, that minimizes the posterior joint entropy of target-type measurements at the remaining unobserved locations of the target phenomenon:
(4) 
where a finite set of tuples of candidate sampling locations of the target phenomenon and their corresponding measurement type is available to be selected for observation. However, evaluating the entropy term in (4) incurs time cubic in the number of remaining unobserved locations, which is prohibitively expensive when the target phenomenon is spanned by a large number of candidate sampling locations. If auxiliary types of phenomena are missing or ignored, then such a computational difficulty can instead be eased by solving the well-known maximum entropy sampling (MES) problem [Shewry and Wynn1987], which can be proven to be equivalent to (4) by using the chain rule for entropy and noting that the joint entropy of all candidate measurements is a constant. Evaluating the entropy term in MES incurs time cubic only in the number of selected observations, which is independent of the number of candidate sampling locations. Such an equivalence result can in fact be extended and applied to minimizing the predictive uncertainty of all types of coexisting phenomena, as exploited by multivariate spatial sampling algorithms [Bueso et al.1999, Le, Sun, and Zidek2003]: (5) where the entropy terms for the other measurement types are defined in a similar manner. This equivalence result (5) also follows from the chain rule for entropy and the fact that the corresponding joint entropy is a constant. Unfortunately, it is not straightforward to derive such an equivalence result for our active learning problem (4), in which a target phenomenon of interest coexists with auxiliary types of phenomena: maximizing the entropy of the selected observations (or its analogue over all types) is no longer equivalent to minimizing (4), as the sum of the two entropy terms is not necessarily a constant.
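The MES equivalence can be written out in generic notation (our symbols here are illustrative stand-ins for the paper's): let $\mathcal{V}$ denote the set of all candidate target-type tuples and $Z_A$ the measurements at $A \subseteq \mathcal{V}$. By the chain rule for entropy,

```latex
H(Z_{\mathcal{V}})
  \;=\; H(Z_{\mathcal{S}})
  \;+\; H\!\left(Z_{\mathcal{V}\setminus\mathcal{S}} \,\middle|\, Z_{\mathcal{S}}\right),
% and since H(Z_V) is a constant independent of the selection S:
\operatorname*{arg\,min}_{\mathcal{S}\subseteq\mathcal{V},\,|\mathcal{S}|=k}
  H\!\left(Z_{\mathcal{V}\setminus\mathcal{S}} \,\middle|\, Z_{\mathcal{S}}\right)
\;=\;
\operatorname*{arg\,max}_{\mathcal{S}\subseteq\mathcal{V},\,|\mathcal{S}|=k}
  H(Z_{\mathcal{S}}).
```

With auxiliary types present, the selection may include tuples outside $\mathcal{V}$, so the left-hand joint entropy is no longer constant in $\mathcal{S}$ and the argument breaks down, as noted above.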
Exploiting Sparse CMOGP Model Structure.
We derive a new equivalence result by considering instead a constant entropy that is conditioned on the inducing measurements used in sparse CMOGP regression models (Section 2). Then, by using the chain rule for entropy and structural property P2 shared by sparse CMOGP regression models in the unifying framework [Álvarez and Lawrence2011] described in Section 2, (4) can be proven (see Appendix A) to be equivalent to
(6) 
where
(7) 
is the conditional mutual information between the inducing measurements and the measurements at the remaining unobserved locations of the target phenomenon, given the selected observations. Our novel active learning criterion in (6) exhibits an interesting exploration-exploitation trade-off: The inducing measurements can be viewed as the latent structure of the sparse CMOGP model that induces the conditional independence properties P1 and P2. So, on one hand, maximizing the entropy term aims to select tuples of locations with the most uncertain measurements of the target phenomenon and their corresponding type to be observed given the latent model structure (i.e., exploitation). On the other hand, minimizing the conditional mutual information term (7) aims to select tuples to be observed (i.e., possibly of auxiliary measurement types) so as to rely less on measurements of the target type at the remaining unobserved locations of the target phenomenon to infer the latent model structure (i.e., exploration), since those locations will not be sampled.
Evaluating our new active learning criterion in (6) incurs, for every candidate selection, time that is independent of the number of remaining unobserved locations of the target phenomenon, plus a one-off preprocessing cost (Appendix B). In contrast, computing the original criterion in (4) requires, for every candidate selection, time that grows with the number of candidate sampling locations, which is more costly, especially when the number of selected observations is much smaller than the number of candidate sampling locations of the target phenomenon due to, for example, a tight sampling budget or a large sampling domain, as usually occurs in practice. The trick to achieving such a computational advantage is inherited by our approximation algorithm, described next.
4 Approximation Algorithm
Our novel active learning criterion in (6), when optimized, still suffers from poor scalability in the number of selected observations (i.e., the sampling budget), like the old criterion in (4), because optimizing it involves evaluating a prohibitively large number of candidate selections of sampling locations and their corresponding measurement types (i.e., exponential in the sampling budget). However, unlike with the old criterion, it is possible to devise an efficient approximation algorithm with a theoretical performance guarantee for optimizing our new criterion, which is the main contribution of our work in this paper.
The key idea of our proposed approximation algorithm is to greedily select the next tuple of sampling location and its corresponding measurement type to be observed that maximally increases our criterion in (6), and to iterate this until the budgeted number of tuples is selected for observation. Specifically, let
(8) 
denote our active learning criterion in (6) augmented by a positive constant to make it nonnegative. Such an additive constant is simply a technical necessity for proving the performance guarantee and does not affect the outcome of the optimal selection. Then, our approximation algorithm greedily selects the next tuple of sampling location and its corresponding measurement type that maximizes the marginal gain:
(9) 
where the Kronecker delta indicates whether the candidate tuple is of the target type. The derivation of (9) is in Appendix C. Our algorithm alternates the greedy selection (9) with the corresponding update until the sampling budget is depleted. The intuition for understanding (9) is that our algorithm has to choose between observing a sampling location with the most uncertain measurement of the target phenomenon vs. one of an auxiliary type inducing the largest reduction in predictive uncertainty of the measurements at the remaining unobserved locations of the target phenomenon.
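The greedy loop itself can be sketched minimally as follows, with the closed-form gain in (9) left abstract (here `gain` is a user-supplied stand-in; all names are illustrative):

```python
def greedy_select(candidates, budget, gain):
    """Greedy active-learning loop in the spirit of (9): repeatedly add
    the tuple (location, type) with the largest marginal gain of the
    criterion, until the budget is depleted."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < budget:
        best = max(pool, key=lambda c: gain(selected, c))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy run: with a modular gain, greedy just picks the top-weight tuples.
weights = {('x1', 'target'): 3.0, ('x2', 'aux'): 5.0, ('x3', 'target'): 1.0}
picked = greedy_select(weights, 2, lambda S, c: weights[c])
assert sorted(picked) == [('x1', 'target'), ('x2', 'aux')]
```

In the actual algorithm, `gain` would evaluate the variance and variance-reduction terms of (9) under the sparse CMOGP posterior rather than a fixed weight.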
It is also interesting to examine whether our approximation algorithm may avoid selecting tuples of a certain auxiliary type, and to formally analyze the conditions under which it will do so, as elucidated in the following result:
Proposition 1
Assuming the absence of suppressor variables, the greedy criterion (9) evaluated at any tuple of a given auxiliary type is upper-bounded by a quantity that vanishes as that type's signal-to-noise ratio and its cross-correlation with the target type decrease.
Its proof (Appendix D) relies on the following assumption of the absence of suppressor variables, which holds in many practical cases [Das and Kempe2008]: conditioning on additional observations does not make the measurements of the target and auxiliary types more correlated. Proposition 1 reveals that when the signal-to-noise ratio of an auxiliary type is low (e.g., poor-quality measurements due to high noise) and/or the cross-correlation (1) between measurements of the target phenomenon and that auxiliary type is small, our greedy criterion in (9) returns a small value, hence causing our algorithm to avoid selecting tuples of that auxiliary type.
Theorem 1 (Time Complexity)
Our approximation algorithm incurs time.
Its proof is in Appendix E. So, our approximation algorithm only incurs quartic time in the number of selected observations and cubic time in the number of candidate sampling locations of the target phenomenon.
Performance Guarantee.
To theoretically guarantee the performance of our approximation algorithm, we first motivate the need to define a relaxed notion of submodularity. A submodular set function exhibits a natural diminishing returns property: when adding an element to its input set, the increment in its function value decreases with a larger input set. To maximize a nondecreasing and submodular set function, the work of [Nemhauser, Wolsey, and Fisher1978] proposed a greedy algorithm guaranteeing a (1 − 1/e)-factor approximation of that achieved by the optimal input set.
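Both the diminishing returns property and the greedy guarantee can be checked numerically on a toy monotone submodular function (set cover; the sets below are illustrative, not from the paper):

```python
from itertools import combinations

# Toy monotone submodular function: coverage F(S) = size of the union.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}, 3: {1, 6}}
def F(S):
    return len(set().union(*(sets[i] for i in S))) if S else 0

# Diminishing returns: an element helps a smaller set at least as much.
assert F({0, 1}) - F({0}) >= F({0, 1, 2}) - F({0, 2})

# Greedy selection attains at least (1 - 1/e) of the optimal value.
k, selected = 2, set()
for _ in range(k):
    selected.add(max((e for e in sets if e not in selected),
                     key=lambda e: F(selected | {e}) - F(selected)))
opt = max(F(set(c)) for c in combinations(sets, k))
assert F(selected) >= (1 - 1 / 2.718281828) * opt
```

Our criterion (8) is not exactly submodular, which is why the relaxed notion below is needed.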
The main difficulty in proving the submodularity of (8) lies in its mutual information term being conditioned on the inducing measurements. Some works [Krause and Guestrin2005, Renner and Maurer2002] have shown the submodularity of such conditional mutual information by imposing conditional independence assumptions (e.g., a Markov chain). In practice, these strong assumptions severely violate the correlation structure of multiple types of coexisting phenomena and are an overkill: The correlation structure can in fact be preserved to a fair extent by relaxing these assumptions, which consequently entails a relaxed form of submodularity of (8); a performance guarantee similar to that of [Nemhauser, Wolsey, and Fisher1978] can then be derived for our approximation algorithm.
Definition 1
A set function is ε-submodular if, when adding an element, its marginal gain with respect to any input set falls no more than ε below its marginal gain with respect to any subset of that input set.
Lemma 1
Let . Given , if
(10) 
for any such pair, then the criterion (8) is ε-submodular.
Its proof is in Appendix F. Note that (10) relaxes the above conditional independence assumption to one which allows a small nonzero slack. In practice, the slack is expected to be small: Since further conditioning monotonically decreases a posterior variance [Xu et al.2014], the expectedly large set of tuples of remaining unobserved locations of the target phenomenon tends to be informative enough to make the nonnegative variance reduction terms in (10) small.
Furthermore, (10) with a given small slack can be realized by controlling the discretization of the domain of candidate sampling locations. For example, by refining the discretization (i.e., increasing the number of candidate locations), the variance reduction term in (10) decreases because it has been shown in [Das and Kempe2008] to be submodular in many practical cases. We give another example in Lemma 2 of realizing (10) by controlling the discretization such that every pair of selected observations is sufficiently far apart.
It is easy to derive that the criterion (8) is nonnegative, and the "information never hurts" bound for entropy [Cover and Thomas1991] entails that it is nondecreasing. The first inequality holds under a mild condition on the noise variance (proven in Lemma 3 in Appendix D), which is reasonable in practice due to ubiquitous noise. Combining this result with Lemma 1 yields the performance guarantee:
Theorem 2
Given the sampling budget, if (10) holds, then our approximation algorithm is guaranteed to select a set of observations whose criterion value is a constant-factor approximation of the optimum, up to an additive error that grows with the slack in (10).
Its proof (Appendix G) is similar to that of the well-known result of [Nemhauser, Wolsey, and Fisher1978], except that it exploits the ε-submodularity of our criterion in Lemma 1 instead of submodularity.
Finally, we present a discretization scheme that satisfies (10): Construct a new set of tuples of candidate sampling locations and their corresponding measurement types such that every pair of tuples is at least some multiple of the smallest discretization width apart; each candidate location thus has only one corresponding type. Such a construction constrains our algorithm to select observations sparsely across the spatial domain so that any selected tuple has sufficiently many neighboring tuples of remaining unobserved locations of the target phenomenon to keep the slack small and hence the variance reduction terms in (10) small. Our previous theoretical results still hold if this new candidate set is used instead of the original one. Lemma 2 gives the minimum separation distance needed to satisfy (10).
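The separation constraint in this construction can be sketched as a greedy thinning of the candidate locations (an illustrative stand-in, not the paper's exact procedure):

```python
import numpy as np

def thin_candidates(X, min_dist):
    """Greedy thinning: keep a subset of candidate locations in which
    every pair is at least `min_dist` apart, as in the discretization
    scheme above. Illustrative sketch."""
    kept = []
    for x in X:
        if all(np.linalg.norm(x - y) >= min_dist for y in kept):
            kept.append(x)
    return np.array(kept)

# 1-D grid with unit spacing, minimum separation 2.5.
X = np.arange(10.0).reshape(-1, 1)
assert thin_candidates(X, 2.5).ravel().tolist() == [0.0, 3.0, 6.0, 9.0]
```

Running the selection over the thinned set trades a coarser candidate pool for the slack bound that the performance guarantee requires.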
5 Experiments and Discussion
This section empirically evaluates the predictive performance of our approximation algorithm (mGreedy) on three real-world datasets: (a) the Jura dataset [Goovaerts1997] contains concentrations of heavy metals collected at locations in a Swiss Jura region; (b) the Gilgai dataset [Webster1977] contains electrical conductivity and chloride content generated from a line transect survey of locations of Gilgai territory in New South Wales, Australia; and (c) the indoor environmental quality (IEQ) dataset [Bodik et al.2004] contains temperature (°F) and light (Lux) readings taken by temperature sensors and light sensors deployed in the Intel Berkeley Research lab. The sampling locations for the Jura and IEQ datasets are shown in Appendix I.
The performance of mGreedy is compared to that of the (a) maximum variance/entropy (mVar) algorithm which greedily selects the next location and its corresponding measurement type with maximum posterior variance/entropy in each iteration; and (b) greedy maximum entropy (sVar) [Shewry and Wynn1987] and mutual information (sMI) [Krause, Singh, and Guestrin2008] sampling algorithms for gathering observations only from the target phenomenon.
For all experiments, k-means is used to select the inducing locations by clustering all possible locations available for observation such that each cluster center corresponds to an inducing location. The hyperparameters of the MOGP and single-output GP models are learned from the data via maximum likelihood estimation [Álvarez and Lawrence2011]. For each dataset, a number of observations of the target type are randomly selected to form the test set; the number of tuples of candidate sampling locations of the target type therefore becomes less than that of the auxiliary types. The root mean squared error (RMSE) metric is used to evaluate the performance of the tested algorithms. All experimental results are averaged over random test sets. For a fair comparison, the measurements of all types are normalized before being used for training, prediction, and active learning.
Jura Dataset.
Three types of correlated lgCd, Ni, and lgZn measurements are used in this experiment; we take the log of the Cd and Zn measurements to remove their strong skewness, a standard statistical practice [Webster and Oliver2007]. The measurement types with the smallest and largest signal-to-noise ratios (respectively, lgCd and Ni; see Appendix J) are each set as the target type.
Figure 1. (a–c) Results of the tested algorithms with lgCd as the target type; (d–f) results with Ni as the target type.
Figs. 1a–c and 1d–f show, respectively, the results of the tested algorithms with lgCd and Ni as the target type. It can be observed that the RMSE of mGreedy decreases more rapidly than that of mVar, especially once observations of auxiliary types begin to be selected. This is because our algorithm selects observations of auxiliary types that induce the largest reduction in predictive uncertainty of the measurements at the remaining unobserved locations of the target phenomenon (Section 4). In contrast, mVar may select observations that reduce the predictive uncertainty of the auxiliary types of phenomena, which does not directly achieve the aim of our active learning problem. With an increasing number of observations, both mGreedy and mVar reach smaller RMSEs, but mGreedy achieves this faster with far fewer observations. As shown in Figs. 1a–f, mGreedy performs much better than sVar and sMI, which shows that observations of correlated auxiliary types can indeed be used to improve the prediction of the target phenomenon. Finally, by comparing the results between Figs. 1a–c and 1d–f, the RMSE of mGreedy with Ni as the target type decreases faster than that with lgCd as the target type, especially in the beginning, due to higher-quality Ni measurements (i.e., larger signal-to-noise ratio).
Gilgai Dataset.
In this experiment, the lgCl contents at two depths are used jointly as two types of target phenomena, while the log of electrical conductivity, which is easier to measure at these depths, is used as the auxiliary type. Fig. 5a shows the results of the average RMSE over the two lgCl types. Similar to the results on the Jura dataset, with two types of target phenomena, the RMSE of mGreedy still decreases more rapidly with an increasing number of observations than that of mVar and achieves a much smaller RMSE than that of sVar and sMI; the results of sVar and sMI are also averaged over two independent single-output GP predictions of the lgCl content at the two depths.
IEQ Dataset.
Fig. 5b shows the results with light as the target type. The observations are similar to those for the Jura and Gilgai datasets: The RMSE of mGreedy decreases faster than that of the other algorithms. More importantly, with the same number of observations, mGreedy achieves much smaller RMSE than sVar and sMI, which can sample only from the target phenomenon. This is because mGreedy selects observations of the auxiliary type (i.e., temperature) that are less noisy than those of light, which demonstrates its advantage over sVar and sMI when measurements of the target type are noisy (e.g., due to poor-quality sensors).
Figure 5. (a) Results for the Gilgai dataset. (b) Results for the IEQ dataset.
6 Related Work
Existing works on active learning with multiple output measurement types are not driven by the MOGP model and have not formally characterized the cross-correlation structure between different types of phenomena: Some spatial sampling algorithms [Bueso, Angulo, and Alonso1998, Angulo and Bueso2001] have simply modeled the auxiliary phenomenon as a noisy perturbation of the target phenomenon, which is assumed to be latent; this differs from our work here. Multi-task active learning (MTAL) and active transfer learning (ATL) algorithms have considered the prediction of each type of phenomenon as one task and used the auxiliary tasks to help learn the target task. But, the MTAL algorithm of Zhang2010 requires relations between different classification tasks to be manually specified, which is highly nontrivial to achieve in practice and not applicable to MOGP regression. Some ATL and active learning algorithms [Roth and Small2006, Zhao et al.2013] have used active learning criteria (e.g., margin-based criterion) specific to their classification or recommendation tasks that cannot be readily tailored to MOGP regression.
7 Conclusion
This paper describes a novel efficient algorithm for active learning of a MOGP model. To resolve the issue of poor scalability in optimizing the conventional entropy criterion, we exploit a structure common to a unifying framework of sparse MOGP models for deriving a novel active learning criterion (6). Then, we exploit the ε-submodularity property of our new criterion (Lemma 1) for devising a polynomial-time approximation algorithm (9) that guarantees a constant-factor approximation of that achieved by the optimal set of selected observations (Theorem 2). Empirical evaluation on three real-world datasets shows that our approximation algorithm mGreedy outperforms existing algorithms for active learning of MOGP and single-output GP models, especially when measurements of the target phenomenon are noisier than those of the auxiliary types. For our future work, we plan to extend our approach by generalizing nonmyopic active learning [Cao, Low, and Dolan2013, Hoang et al.2014, Ling, Low, and Jaillet2016, Low, Dolan, and Khosla2009, Low, Dolan, and Khosla2008, Low, Dolan, and Khosla2011] of single-output GPs to that of MOGPs and by improving its scalability to big data through parallelization [Chen et al.2013, Low et al.2015], online learning [Xu et al.2014], and stochastic variational inference [Hoang, Hoang, and Low2015].
Acknowledgments. This research was carried out at the SeSaMe Centre. It is supported by Singapore NRF under its IRC@SG Funding Initiative and administered by IDMPO.
References
 [Álvarez and Lawrence2011] Álvarez, M. A., and Lawrence, N. D. 2011. Computationally efficient convolved multiple output Gaussian processes. JMLR 12:1459–1500.
 [Angulo and Bueso2001] Angulo, J. M., and Bueso, M. C. 2001. Random perturbation methods applied to multivariate spatial sampling design. Environmetrics 12(7):631–646.
 [Bodik et al.2004] Bodik, P.; Guestrin, C.; Hong, W.; Madden, S.; Paskin, M.; and Thibaux, R. 2004. http://www.select.cs.cmu.edu/data/labapp3/index.html.
 [Bonilla, Chai, and Williams2008] Bonilla, E. V.; Chai, K. M. A.; and Williams, C. K. 2008. Multitask Gaussian process prediction. In Proc. NIPS.
 [Bueso, Angulo, and Alonso1998] Bueso, M. C.; Angulo, J. M.; and Alonso, F. J. 1998. A statespace model approach to optimum spatial sampling design based on entropy. Environmental and Ecological Statistics 5(1):29–44.
 [Bueso et al.1999] Bueso, M. C.; Angulo, J. M.; CruzSanjulián, J.; and GarcíaAróstegui, J. L. 1999. Optimal spatial sampling design in a multivariate framework. Math. Geology 31(5):507–525.
 [Cao, Low, and Dolan2013] Cao, N.; Low, K. H.; and Dolan, J. M. 2013. Multirobot informative path planning for active sensing of environmental phenomena: A tale of two algorithms. In Proc. AAMAS.
 [Chen et al.2012] Chen, J.; Low, K. H.; Tan, C. K.Y.; Oran, A.; Jaillet, P.; Dolan, J. M.; and Sukhatme, G. S. 2012. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, 163–173.
 [Chen et al.2013] Chen, J.; Cao, N.; Low, K. H.; Ouyang, R.; Tan, C. K.Y.; and Jaillet, P. 2013. Parallel Gaussian process regression with lowrank covariance matrix approximations. In Proc. UAI, 152–161.
 [Chen et al.2015] Chen, J.; Low, K. H.; Jaillet, P.; and Yao, Y. 2015. Gaussian process decentralized data fusion and active sensing for spatiotemporal traffic modeling and prediction in mobilityondemand systems. IEEE Trans. Autom. Sci. Eng. 12:901–921.
 [Chen, Low, and Tan2013] Chen, J.; Low, K. H.; and Tan, C. K.Y. 2013. Gaussian processbased decentralized data fusion and active sensing for mobilityondemand system. In Proc. RSS.
 [Cover and Thomas1991] Cover, T., and Thomas, J. 1991. Elements of Information Theory. John Wiley & Sons, Inc.

 [Das and Kempe2008] Das, A., and Kempe, D. 2008. Algorithms for subset selection in linear regression. In Proc. STOC, 45–54.
 [Golub and Van Loan1996] Golub, G. H., and Van Loan, C. F. 1996. Matrix Computations. Johns Hopkins Univ. Press, 3rd edition.
 [Goovaerts1997] Goovaerts, P. 1997. Geostatistics for Natural Resources Evaluation. Oxford Univ. Press.
 [Hoang et al.2014] Hoang, T. N.; Low, K. H.; Jaillet, P.; and Kankanhalli, M. 2014. Nonmyopic Bayesoptimal active learning of Gaussian processes. In Proc. ICML, 739–747.
 [Hoang, Hoang, and Low2015] Hoang, T. N.; Hoang, Q. M.; and Low, K. H. 2015. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, 569–578.
 [Krause and Golovin2014] Krause, A., and Golovin, D. 2014. Submodular function maximization. In Bordeaux, L.; Hamadi, Y.; and Kohli, P., eds., Tractability: Practical Approaches to Hard Problems. Cambridge Univ. Press. 71–104.
 [Krause and Guestrin2005] Krause, A., and Guestrin, C. 2005. Near-optimal nonmyopic value of information in graphical models. In Proc. UAI.
 [Krause, Singh, and Guestrin2008] Krause, A.; Singh, A.; and Guestrin, C. 2008. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR 9:235–284.
 [Le, Sun, and Zidek2003] Le, N. D.; Sun, L.; and Zidek, J. V. 2003. Designing networks for monitoring multivariate environmental fields using data with monotone pattern. Technical Report #2003-5, Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC.
 [Ling, Low, and Jaillet2016] Ling, C. K.; Low, K. H.; and Jaillet, P. 2016. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI.
 [Low et al.2015] Low, K. H.; Yu, J.; Chen, J.; and Jaillet, P. 2015. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI.
 [Low, Dolan, and Khosla2008] Low, K. H.; Dolan, J. M.; and Khosla, P. 2008. Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS, 23–30.
 [Low, Dolan, and Khosla2009] Low, K. H.; Dolan, J. M.; and Khosla, P. 2009. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS.
 [Low, Dolan, and Khosla2011] Low, K. H.; Dolan, J. M.; and Khosla, P. 2011. Active Markov information-theoretic path planning for robotic environmental sensing. In Proc. AAMAS, 753–760.
 [Nemhauser, Wolsey, and Fisher1978] Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming 14(1):265–294.
 [Petersen and Pedersen2012] Petersen, K. B., and Pedersen, M. S. 2012. The Matrix Cookbook.
 [Renner and Maurer2002] Renner, R., and Maurer, U. 2002. About the mutual (conditional) information. In Proc. IEEE ISIT.
 [Roth and Small2006] Roth, D., and Small, K. 2006. Margin-based active learning for structured output spaces. In Proc. ECML, 413–424.
 [Shewry and Wynn1987] Shewry, M. C., and Wynn, H. P. 1987. Maximum entropy sampling. J. Applied Stat. 14(2):165–170.
 [Skolidis2012] Skolidis, G. 2012. Transfer Learning with Gaussian Processes. Ph.D. Dissertation, University of Edinburgh.
 [Stewart and Sun1990] Stewart, G. W., and Sun, J.-G. 1990. Matrix Perturbation Theory. Academic Press.
 [Teh and Seeger2005] Teh, Y. W., and Seeger, M. 2005. Semiparametric latent factor models. In Proc. AISTATS, 333–340.
 [Webster and Oliver2007] Webster, R., and Oliver, M. 2007. Geostatistics for Environmental Scientists. John Wiley & Sons, Inc., 2nd edition.
 [Webster1977] Webster, R. 1977. Spectral analysis of Gilgai soil. Australian Journal of Soil Research 15(3):191–204.
 [Xu et al.2014] Xu, N.; Low, K. H.; Chen, J.; Lim, K. K.; and Ozgul, E. B. 2014. GP-Localize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, 2585–2592.
 [Zhang et al.2016] Zhang, Y.; Hoang, T. N.; Low, K. H.; and Kankanhalli, M. 2016. Near-optimal active learning of multi-output Gaussian processes. In Proc. AAAI.
 [Zhang2010] Zhang, Y. 2010. Multi-task active learning with output constraints. In Proc. AAAI, 667–672.
 [Zhao et al.2013] Zhao, L.; Pan, S. J.; Xiang, E. W.; Zhong, E.; Lu, Z.; and Yang, Q. 2013. Active transfer learning for cross-system recommendation. In Proc. AAAI, 1205–1211.
Appendix A Derivation of Novel Active Learning Criterion (6)
The first equality follows from the fact that is a constant. The third equality is due to the chain rule for entropy as well as the definition of conditional mutual information and . The last equality follows from structural property P2 shared by sparse CMOGP regression models in the unifying framework [Álvarez and Lawrence2011] described in Section 2, which results in and .
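For reference, the two standard identities invoked in this step, stated here in generic form since the derivation's own symbols are application-specific, are the chain rule for entropy and the definition of conditional mutual information [Cover and Thomas1991]:

```latex
H(A, B \mid C) = H(A \mid C) + H(B \mid A, C), \qquad
I(A; B \mid C) = H(A \mid C) - H(A \mid B, C).
```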
Appendix B Time Complexity of Evaluating Active Learning Criterion in (6)
where by definition (see last paragraph of Section 2). So, evaluating incurs time for every ; this worst-case time complexity occurs when all the tuples in are of measurement type (i.e., ).
where
for any , as derived in [Álvarez and Lawrence2011]. Therefore, evaluating incurs time for every ; this worst-case time complexity occurs when all the tuples in are of one measurement type.
Let . Then, by the definition of (see last paragraph of Section 2),
Evaluating the term incurs time for every ; this worst-case time complexity occurs when all the tuples in are of one measurement type. Note that the term remains the same for every (i.e., since it is independent of ) and hence only needs to be computed once in time. Therefore, evaluating incurs time for every and a one-off cost of time. Consequently, evaluating incurs time for every and a one-off cost of time.
So, evaluating our active learning criterion in (6) incurs time for every and a one-off cost of time.
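To give a concrete feel for the kind of per-candidate cost analyzed above, the following minimal sketch performs greedy maximum-entropy selection [Shewry and Wynn1987] with an ordinary single-output GP. It is not the CMOGP criterion in (6); the squared-exponential kernel, the noise variance, and the greedy entropy objective are all illustrative assumptions. It does, however, expose the same cost structure: each candidate evaluation is dominated by a linear solve against the covariance matrix of the currently selected set.

```python
import numpy as np

def sq_exp_kernel(X1, X2, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel; an illustrative stand-in for the
    multi-output covariance used by the paper's criterion."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)

def greedy_max_entropy(X_cand, budget, noise_var=1e-2):
    """Greedily pick `budget` inputs maximizing posterior predictive
    variance (i.e., Gaussian entropy) given those already selected.
    Naively, each round solves a |S|x|S| system per remaining candidate,
    so the whole loop costs O(budget * |cand| * |S|^3)."""
    selected = []
    remaining = list(range(len(X_cand)))
    for _ in range(budget):
        best, best_var = None, -np.inf
        for i in remaining:
            x = X_cand[i:i + 1]
            prior = sq_exp_kernel(x, x)[0, 0]
            if selected:
                S = X_cand[selected]
                K_SS = sq_exp_kernel(S, S) + noise_var * np.eye(len(S))
                k_Sx = sq_exp_kernel(S, x)
                # Posterior variance of candidate x given selected set S.
                var = (prior - k_Sx.T @ np.linalg.solve(K_SS, k_Sx)).item()
            else:
                var = prior
            if var > best_var:
                best, best_var = i, var
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice the cubic per-candidate solve is avoided by updating a Cholesky factor of the selected set's covariance incrementally, which is the same flavor of bookkeeping that yields the one-off versus per-candidate cost split described above.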