Recently, there has been a surge of interest in compressive architectures for hyperspectral imaging and remote sensing. This is mainly due to the increasing amount of hyperspectral data being collected by high-resolution airborne imagers such as NASA’s AVIRIS (http://aviris.jpl.nasa.gov) and the fact that a large portion of the data is discarded during compression or during feature mining prior to learning. It has been noted that many of the proposed compressive architectures are based on the spatial mixture of pixels across each frame and correspond to physically costly or impractical operations, while most existing airborne hyperspectral imagers employ scanning methods to acquire a pixel or a line of pixels at a time. To address this issue, practical designs of compressive whisk-broom and push-broom cameras have been suggested. In this work, we tackle the problem of hyperspectral pixel classification based on compressive whisk-broom sensors; i.e., pixels are measured one at a time, each using an individual random measurement matrix. Extension of the presented analysis to compressive push-broom cameras is straightforward.
To set this work apart from existing efforts that have also focused on the problem of classification from compressive hyperspectral data, we must mention two issues with the typical indirect approach of applying the classification algorithms to the recovered data: (i) the sensed data cannot be decoded at the sender’s side (airborne device) due to the heavy computational cost of compressive recovery, making on-site classification infeasible, and (ii) the number of measurements (per pixel) may not be sufficient for a reliable signal recovery. It has been established that classification in the compressed domain can succeed with far fewer random measurements than are required for a full data recovery. However, the compressive framework in that line of work corresponds to using a fixed projection matrix for all pixels, which limits the measurement diversity that has been promoted by several recent studies for data recovery and learning [6, 7, 8].
Rather than devising new classification algorithms, this work is focused on studying the relationship between the camera’s sensing mechanism, namely the employed random measurement matrix, and the common Support Vector Machine (SVM) classifier. It must be emphasized that the general problem of classification based on compressive measurements has been addressed for the case where a fixed measurement matrix is used [9, 5]. However, our aim is to study the impact of measurement diversity on the learned classifier. In particular, we investigate two different sensing mechanisms introduced in prior work, which also gives details on the physical implementation of compressive whisk-broom sensors and illustrates conceptual schematics of whisk-broom and push-broom cameras:
FCA-based sensor: A Fixed Coded Aperture (FCA) is used to modulate the dispersed light before it is collected at the linear sensor array. This case corresponds to using a fixed measurement matrix for every pixel and is a low-cost alternative to the DMD system below.
DMD-based sensor: A Digital Micromirror Device (DMD) is used to modulate the incoming light according to an arbitrary pattern that is changed for each measurement. Unlike the previous case, DMD adds the option of sensing each pixel using a different measurement matrix. Both cases are illustrated in Figure 1.
SVM has been shown to be a suitable classifier for hyperspectral data. Specifically, we employ an efficient linear SVM classifier with the exponential loss function, which gives a smooth approximation to the hinge loss. To train the classifier in the compressed domain, we must sketch the SVM loss function using the acquired measurements, for which we employ some of the techniques developed in the literature on randomized sketching. Furthermore, given that the sketched loss function gives a close approximation to the true loss function and that the learning objective function is smooth, it is expected that the learned classifier is close to the ground-truth classifier based on the complete hyperspectral data (which is unknown). As has been discussed in the literature, recovery of the classifier is of independent importance in some applications.
II Problem Formulation and the Proposed Framework
II-A Overview of SVM for spectral pixel classification
In a supervised hyperspectral classification task, a subset of pixels is labeled by a specialist who may have access to side information about the imaged field, for example by being physically present at the field for measurement. The task of learning is then to employ the labeled samples for tuning the parameters of the classification machine to predict the pixel labels for a field with similar material compositions. Note that, for subpixel targets, an extra stage of spectral unmixing is required to separate the different signal sources involved in generating a pixel’s spectrum. For simplicity, we assume that the pixels are homogeneous (each consists of a single object).
Recall that most classifiers are inherently composed of binary decision rules. Specifically, in multi-categorical classification, multiple binary classifiers are trained according to either One-Against-All (OAA) or One-Against-One (OAO) schemes and voting techniques are employed to combine the results. In an OAA-SVM classification problem, a decision hyperplane is computed between each class and the rest of the training data, while in an OAO scheme, a hyperplane is learned between each pair of classes. As a consequence, most studies focus on the canonical binary classification. Similarly, our analysis here is presented for the binary classification problem and can be extended to multi-categorical classification.
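The OAO scheme described above can be sketched as follows. This is a minimal illustration rather than the procedure used in the experiments: the function names, the tie-breaking rule, and the toy one-dimensional pairwise scores are all assumptions made here.

```python
from collections import Counter
from itertools import combinations


def oao_pairs(classes):
    """All unordered class pairs; one binary classifier is trained per pair."""
    return list(combinations(classes, 2))


def oao_predict(x, pairwise_classifiers):
    """Majority vote over pairwise decisions.

    pairwise_classifiers maps (class_a, class_b) to a callable returning a
    real-valued score; a positive score votes for class_a, otherwise class_b.
    """
    votes = Counter()
    for (a, b), decide in pairwise_classifiers.items():
        votes[a if decide(x) > 0 else b] += 1
    # Ties are broken by the smallest class label (an arbitrary choice).
    best = max(votes.values())
    return min(c for c, v in votes.items() if v == best)


# Toy 1-D example (hypothetical data): three classes centered at 0, 5 and 10.
centers = {0: 0.0, 1: 5.0, 2: 10.0}
classifiers = {
    (a, b): (lambda x, a=a, b=b: abs(x - centers[b]) - abs(x - centers[a]))
    for a, b in oao_pairs(sorted(centers))
}
```

With K classes, OAO trains K(K-1)/2 pairwise classifiers while OAA trains K; for a scene with 9 ground-truth classes this gives 36 pairs.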
In the linear SVM classification problem, we are given a set of training data points $x_i \in \mathbb{R}^d$ (corresponding to hyperspectral pixels) for $i = 1, \dots, N$ and the associated labels $y_i \in \{-1, +1\}$. The inferred class label for a point $x$ is $\mathrm{sign}(w^\top x + b)$, which depends on the classifier $w \in \mathbb{R}^d$ and the bias term $b \in \mathbb{R}$. The classifier $w$ is the normal vector to the affine hyperplane that divides the training data in accordance with their labels. When the training classes are inseparable by an affine hyperplane, the soft-margin SVM is used, which relies on a loss function to penalize the amount of misfit. For example, a widely used loss function is $\ell(z) = \max(0, 1 - z)^p$, evaluated at the margin $z = y_i (w^\top x_i + b)$. For $p = 1$, this loss function is known as the hinge loss, and for $p = 2$, it is called the squared hinge loss or simply the quadratic loss. The soft-margin SVM then minimizes, over $w$ and $b$, the sum of a regularization term on $w$ and the loss accumulated over the training points.

Discussion: Similar results can be obtained using the dual form. Recent works have shown that advantages of the dual form can be obtained in the primal as well. It has also been noted that the primal form converges faster to the optimal parameters than the dual form. For the purposes of this work, it is more convenient to work with the primal form of SVM, although the analysis can be properly extended to the dual form.
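For concreteness, one standard way to write a soft-margin primal problem of this kind is $\min_{w, b} \; \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{N}\sum_{i=1}^{N} \ell\big(y_i (w^\top x_i + b)\big)$ with $\ell(z) = \max(0, 1 - z)^p$. The sketch below evaluates that objective in plain Python. The regularization weight `lam`, the averaging over samples, and the toy data are assumptions made here for illustration, not values taken from the text.

```python
def hinge_p(z, p=1):
    """max(0, 1 - z)^p: hinge loss for p = 1, squared hinge loss for p = 2."""
    return max(0.0, 1.0 - z) ** p


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def primal_objective(w, b, X, y, lam=0.1, p=1):
    """(lam / 2) * ||w||^2 plus the mean loss over the training set."""
    margins = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]
    data_term = sum(hinge_p(z, p) for z in margins) / len(X)
    return 0.5 * lam * dot(w, w) + data_term


# Hypothetical toy data: two positive points and one negative point.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [+1, +1, -1]
w, b = [1.0, 0.0], 0.0

hinge_value = primal_objective(w, b, X, y, p=1)
squared_value = primal_objective(w, b, X, y, p=2)
```

Only the second point lies inside the margin (its margin is 0.5), so it is the only one contributing to the data term; squaring the hinge reduces its penalty from 0.5 to 0.25.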
In this paper, we use the smooth exponential loss function, which can be used to approximate the hinge loss while retaining its margin-maximization properties; its smoothness is controlled by a single parameter.
II-B SVM in the compressed domain
Let $s_i = \Phi_i x_i \in \mathbb{R}^m$ denote the low-dimensional measurement vector for pixel $i$, where $\Phi_i \in \mathbb{R}^{m \times d}$ is the measurement matrix and $m$ is the size of the photosensor array in the compressive whisk-broom camera. A DMD architecture can be used to produce a $\Phi_i$ with bounded random entries, resulting in a sub-Gaussian measurement matrix that satisfies the isometry conditions with high probability. Recall that the measurement matrix is fixed in an FCA-based architecture while it can be distinct for each pixel in a DMD-based architecture.
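The orthogonal projection onto the row-space of $\Phi_i$ can be computed as $P_i = \Phi_i^\top (\Phi_i \Phi_i^\top)^{-1} \Phi_i$. Consequently, an (unbiased) estimator for the inner product $w^\top x_i$ (assuming a fixed $w$ and $x_i$) based on the compressive measurements would be $\frac{d}{m} w^\top \Phi_i^\top (\Phi_i \Phi_i^\top)^{-1} s_i$, where $s_i = \Phi_i x_i$ is the measurement vector. As a result, the soft-margin SVM based on the compressive measurements is obtained by replacing each inner product inside the loss with this estimator; we refer to the resulting problem as (3) (we have omitted the bias term for simplicity).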
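The unbiasedness claim can be checked numerically in the single-measurement case $m = 1$. The sketch below assumes, for illustration only, that the measurement vector is a unit-norm direction drawn uniformly from the sphere (so the projector onto its row-space is $\phi \phi^\top$ and the scaling is $d$); the function names, the seed, and the toy vectors are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def random_unit_vector(d, rng):
    """Direction drawn uniformly from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [v / norm for v in g]


def sketched_inner_product(phi, measurement, w, d):
    """Estimate <w, x> from the scalar measurement <phi, x> (case m = 1).

    For a unit-norm phi, the projector onto its row-space is phi phi^T,
    so the estimate is d * <phi, w> * <phi, x>.
    """
    return d * dot(phi, w) * measurement


rng = random.Random(0)
d = 5
x = [1.0, 0.0, 0.0, 0.0, 0.0]
w = [0.5, math.sqrt(0.75), 0.0, 0.0, 0.0]  # <w, x> = 0.5

trials = 20000
estimates = []
for _ in range(trials):
    phi = random_unit_vector(d, rng)
    estimates.append(sketched_inner_product(phi, dot(phi, x), w, d))
mean_estimate = sum(estimates) / trials  # should be close to 0.5
```

A single estimate is very noisy (its standard deviation here is close to one), which is why the estimator is only useful inside a sum over many pixels, as in the sketched SVM objective.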
We must note that the formulation in (3) is different from what has been suggested for a fixed measurement matrix. In particular, we solve for $w$ in the $d$-dimensional space. Meanwhile, the methodology for a fixed measurement matrix $\Phi$ would result in an optimization problem, referred to as (4), which solves for a classifier $v \in \mathbb{R}^m$ in the low-dimensional column-space of $\Phi$. Also note that, in the case of fixed measurement matrices, (3) and (4) correspond to the same problem with the relationship $v = \Phi w$ (because of the regularization term, which zeros the components of $w$ that lie in the null-space of $\Phi$). In other words, (3) represents a generalization of (4) for the case when the measurement matrices are not necessarily the same. This allows us to compare the two cases of (i) having a fixed measurement matrix and (ii) having a distinct measurement matrix for each pixel, which is the subject of this paper. For simplicity, assume that each $\Phi_i$ consists of a subset of rows from a random orthonormal matrix, or equivalently $\Phi_i \Phi_i^\top = I_m$; thus, $P_i = \Phi_i^\top \Phi_i$. Also assume that, in the case of DMD-based sensing, each $\Phi_i$ is generated independently of the other measurement matrices.
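The equivalence for a fixed measurement matrix can be made explicit with a short derivation. The forms of (3) and (4) written below are plausible reconstructions under assumed notation (labels $y_i$, measurements $s_i = \Phi_i x_i$, orthonormal rows, a regularization weight $\lambda$, and the same $d/m$ scaling in both problems), not necessarily the exact expressions intended.

```latex
% Assumed forms of the two problems (bias omitted, \Phi_i \Phi_i^\top = I_m):
\begin{align*}
\text{(3)}\quad & \min_{w \in \mathbb{R}^d}\;
  \frac{\lambda}{2}\|w\|_2^2
  + \frac{1}{N}\sum_{i=1}^{N}
    \ell\Big(y_i\,\tfrac{d}{m}\,(\Phi_i w)^\top s_i\Big), \\
\text{(4)}\quad & \min_{v \in \mathbb{R}^m}\;
  \frac{\lambda}{2}\|v\|_2^2
  + \frac{1}{N}\sum_{i=1}^{N}
    \ell\Big(y_i\,\tfrac{d}{m}\,v^\top s_i\Big).
\end{align*}
% Fixed matrix: \Phi_i = \Phi for all i. Decompose
%   w = \Phi^\top v + w_\perp, \quad \Phi w_\perp = 0 .
\begin{align*}
\Phi w &= \Phi\Phi^\top v + \Phi w_\perp = v, \\
\|w\|_2^2 &= \|\Phi^\top v\|_2^2 + \|w_\perp\|_2^2
           = \|v\|_2^2 + \|w_\perp\|_2^2 .
\end{align*}
% The loss in (3) depends on w only through \Phi w = v, while the
% regularizer is strictly smaller when w_\perp = 0. Hence the minimizer
% of (3) has no null-space component, and (3) reduces to (4) with v = \Phi w.
```

When the $\Phi_i$ differ across pixels, no single $v$ captures all the terms $\Phi_i w$, which is why (3) has to be solved in the full $d$-dimensional space.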
Following the recent line of work in the area of randomized optimization, we refer to the new loss $\ell\big(y_i \tfrac{d}{m} w^\top P_i x_i\big)$ as the sketch of the loss, or simply the sketched loss, to distinguish it from the true loss $\ell(y_i w^\top x_i)$. Similarly, we refer to the minimizer $\hat{w}$ of the sketched problem as the sketched classifier, as opposed to the ground-truth classifier $w^*$.
Figure 2 depicts the two cases of using a fixed measurement matrix (FCA-sensed data) and distinct measurement matrices (DMD-sensed data) for training a linear classifier. It is helpful to imagine that, in the sketched problem, each $x_i$ is multiplied with $P_i w$ (the projection of $w$ onto the column-space of $\Phi_i^\top$) since $w^\top P_i x_i = (P_i w)^\top x_i$. As shown in Figure 2 (left) with $\Phi_i = \Phi$ for all $i$, there is a possibility that $w^*$ would nearly align with the null-space of the random low-rank matrix $P = \Phi^\top \Phi$. For such $\Phi$, any vector $P w$ may not discriminate well between the two classes, ultimately resulting in classification failure. Figure 2 (right) depicts the case when a distinct measurement matrix is used for each point. When $\Phi_i$ is symmetrically distributed in the space and $N$ is large, there are always some projections $P_i w$ that nearly align with $w$, whereas others can be nearly orthogonal to $w$ or somewhere between the two extremes. This intuitive example hints at how measurement diversity pays off by making the optimization process more stable with respect to the variations in the random measurements and the separating hyperplane.
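The geometric intuition above can be illustrated with a small simulation for $m = 1$. It is a sketch under assumptions made here: unit-norm measurement directions drawn uniformly from the sphere, a hypothetical dimension `d = 10`, and a hypothetical sample count `n = 5000`. With a fixed direction, only the component of the classifier along that direction is visible; averaging the scaled projections over many independent directions recovers the classifier direction almost exactly.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def random_unit_vector(d, rng):
    """Direction drawn uniformly from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [v / norm for v in g]


def cosine(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


rng = random.Random(1)
d, n = 10, 5000
w_true = random_unit_vector(d, rng)

# Fixed measurement: every pixel sees the same phi, so the visible part of
# w_true is its projection phi phi^T w_true, which is parallel to phi.
phi_fixed = random_unit_vector(d, rng)
seen_fixed = [dot(phi_fixed, w_true) * p for p in phi_fixed]

# Diverse measurements: average d * phi_i phi_i^T w_true over independent phi_i.
seen_diverse = [0.0] * d
for _ in range(n):
    phi = random_unit_vector(d, rng)
    coeff = d * dot(phi, w_true) / n
    for j in range(d):
        seen_diverse[j] += coeff * phi[j]

cos_fixed = abs(cosine(seen_fixed, w_true))   # equals |<phi_fixed, w_true>|
cos_diverse = cosine(seen_diverse, w_true)    # close to one for large n
```

The fixed-matrix alignment depends entirely on one random draw, whereas the diverse average concentrates around the true direction as `n` grows.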
III-A Handling the bias term
It is not difficult to see that employing a distinct $\Phi_i$ for each data vector necessitates having distinct values of the bias $b_i$ (one for each $\Phi_i$). Note that in the case of a fixed measurement matrix, i.e. when $\Phi_i = \Phi$ for all $i$, the bias terms would all be the same and linear SVM works normally, as has been noted for that setting. However, using a customized bias term for each point would clearly result in overfitting, and the learned $w$ would be of no practical value. Furthermore, the classifier cannot be used for prediction since the bias is unknown for new input samples. In the following, we address these issues.
First, let $\mathcal{P} = \{\Phi^{(1)}, \dots, \Phi^{(T)}\}$ denote a set of $T$ distinct measurement matrices. Instead of using an arbitrary measurement matrix for each pixel, we draw an entry from $\mathcal{P}$ for each pixel. Given that $T < N$, each element of $\mathcal{P}$ is expected to be utilized more than once. This allows us to learn a bias for each outcome of the measurement matrix (without the overfitting issue). Note that $T$ signifies the degree of measurement diversity: $T = 1$ refers to the least diversity, i.e. using a fixed measurement matrix, and measurement diversity increases with $T$. The new optimization problem is obtained from (3) by using $\Phi^{(\pi(i))}$ as the measurement matrix and adding the bias $b_{\pi(i)}$ inside the loss for pixel $i$, where $\pi$ randomly (uniformly) maps each $i \in \{1, \dots, N\}$ to an element of $\{1, \dots, T\}$. The overfitting issue can now be restrained by tuning $T$; reducing $T$ results in less overfitting. In our simulations, $T$ is chosen to ensure that the rows of the matrices in $\mathcal{P}$ span $\mathbb{R}^d$ with a probability close to one.
For prediction, the corresponding bias term is selected from the set $\{b_1, \dots, b_T\}$.
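The pooling idea can be sketched as follows: each pixel is assigned one matrix from a pool of size T, one bias is kept per pool entry, and prediction looks up the bias belonging to the matrix that sensed the pixel. The function names, the `d / m` scaling inside `predict`, the assumption of orthonormal rows, and the toy pool are illustrative assumptions, not details confirmed by the text.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def assign_pool_indices(n_pixels, pool_size, rng):
    """Uniformly map each pixel index to one of the pool_size matrices."""
    return [rng.randrange(pool_size) for _ in range(n_pixels)]


def predict(w, biases, pool, k, measurement, d):
    """Label a pixel that was sensed with pool[k] (rows assumed orthonormal).

    score = (d / m) * <pool[k] w, measurement> + biases[k]
    """
    m = len(pool[k])
    projected_w = [dot(row, w) for row in pool[k]]
    score = (d / m) * dot(projected_w, measurement) + biases[k]
    return 1 if score >= 0 else -1


# Hypothetical pool with T = 2 single-row matrices in d = 2 dimensions.
pool = [[[1.0, 0.0]], [[0.0, 1.0]]]
biases = [0.0, -3.0]      # one learned bias per pool entry (made-up values)
w = [1.0, 1.0]

x = [0.5, 0.5]
measurement_0 = [dot(row, x) for row in pool[0]]
measurement_1 = [dot(row, x) for row in pool[1]]
label_0 = predict(w, biases, pool, 0, measurement_0, d=2)
label_1 = predict(w, biases, pool, 1, measurement_1, d=2)

indices = assign_pool_indices(100, len(pool), random.Random(0))
```

In this toy setup the same pixel receives different labels depending on which pool entry sensed it, purely because the two made-up biases differ; this shows why the index of the measurement matrix must be known at prediction time.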
The dataset used in this section is the well-known Pavia University dataset, which is available with the ground-truth labels (http://www.ehu.eus/ccwintco/). The Indian Pines dataset was not included due to the small size of the image, which is not sufficient for a large-scale cross-validation study. For each experiment, we perform a 2-fold cross-validation with training and testing samples. As discussed earlier, multi-categorical SVM classification algorithms typically rely on pair-wise or One-Against-One (OAO) classification results. Hence, we evaluate the sketched classifier on an OAO basis by reporting the pair-wise performances in a table. Finally, since the measurement operator is random and subject to variation in each experiment, we repeat each experiment multiple times and perform a worst-case analysis of the results.
Consider the case where a single measurement is made from each pixel, i.e. $m = 1$ and $\Phi_i$ is a random vector in the $d$-dimensional spectral space. Clearly, this case represents an extreme scenario where signal recovery would not be reliable and classification in the compressed domain becomes crucial, even at the receiver’s side where the computational cost is not of greatest concern. For performance evaluation, we are interested in two aspects: (i) the prediction accuracy over the test dataset, and (ii) the recovery accuracy of the classifier (with respect to the ground-truth classifier), whose importance has been discussed in the literature.
We define the classification accuracy as the minimum (worst) of the True Positive Rate (sensitivity) and the True Negative Rate (specificity). Figure 3 shows an instance of the distribution of the classification accuracy for a pair of classes over the random trials. As can be seen, in the presence of measurement diversity, classification results are more consistent (reflected in the low variance of accuracy). Due to the limited space, we only report the worst-case OAO accuracies (i.e. the minimum pair-wise accuracies among the trials) for the Pavia scene. The results for the case of one measurement per pixel ($m = 1$) are shown in Tables I and II. Similarly, the results for the case of $m = 3$ (which is equivalent to the sampling rate of a typical RGB color camera) are shown in Tables III and IV. Note that the employed SVM classifier is linear and would not result in perfect accuracy (i.e. accuracy of one) when the classes are not linearly separable. To see this, we have reported ground-truth accuracies in Table V.
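The accuracy measure defined above, and the worst-case aggregation over trials, can be written compactly. The sketch below assumes labels in {-1, +1} and at least one sample of each class; the function names and toy labels are hypothetical.

```python
def min_tpr_tnr(y_true, y_pred):
    """Minimum of the true positive rate and the true negative rate."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = sum(1 for t in y_true if t == -1)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return min(true_pos / positives, true_neg / negatives)


def worst_case(accuracies):
    """Worst (smallest) accuracy observed across repeated random trials."""
    return min(accuracies)


# Hypothetical predictions: one of four positives is missed.
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1]
accuracy = min_tpr_tnr(y_true, y_pred)
worst = worst_case([0.9, accuracy, 0.8])
```

Taking the minimum of the two rates penalizes a classifier that does well on only one of the two classes, which matters when the pair-wise class sizes are unbalanced.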
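To measure the classifier recovery accuracy, we compute the cosine similarity, or equivalently the correlation, between the sketched classifier $\hat{w}$ and the ground-truth classifier $w^*$:
$$\frac{\hat{w}^\top w^*}{\|\hat{w}\|_2 \, \|w^*\|_2}.$$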
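A minimal implementation of this recovery measure is given below; the function name and the example vectors are chosen here for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return inner / (norm_u * norm_v)


# The measure ignores scale: rescaling either classifier leaves it unchanged.
same_direction = cosine_similarity([1.0, 2.0], [3.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 2.0])
opposite = cosine_similarity([1.0, 2.0], [-2.0, -4.0])
```

Because a linear classifier's decision depends only on the direction of its normal vector (up to the bias), a scale-invariant measure is a natural way to compare a learned classifier with the ground truth.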
In the field of ensemble learning, it has been discovered that diversity among the base learners enhances the overall learning performance. Meanwhile, our aim has been to exploit the diversity that can be efficiently built into the sensing system. Both measurement schemes, pixel-invariant (measurement without diversity) and pixel-varying (measurement with diversity), have been suggested as practical designs for compressive hyperspectral cameras. The presented analysis indicates that employing a DMD would result in more accurate recovery of the classifier and a more stable classification performance compared to the case when an FCA is used. Meanwhile, for tasks that only concern class prediction (and not the recovery of the classifier), FCA is (on average) a suitable low-cost alternative to the DMD architecture.
- [1] R.M. Willett, M.F. Duarte, M.A. Davenport, and R.G. Baraniuk, “Sparsity and structure in hyperspectral imaging: Sensing, reconstruction, and target detection,” Signal Processing Magazine, IEEE, vol. 31, no. 1, pp. 116–126, Jan 2014.
- [2] J.M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N.M. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data analysis and future challenges,” Geoscience and Remote Sensing Magazine, IEEE, vol. 1, no. 2, pp. 6–36, June 2013.
- [3] J.E. Fowler, “Compressive pushbroom and whiskbroom sensing for hyperspectral remote-sensing imaging,” in Proceedings of the International Conference on Image Processing, IEEE, ICIP 2014, October 2014, pp. 684–688.
- [4] J.E. Fowler, Qian Du, Wei Zhu, and N.H. Younan, “Classification performance of random-projection-based dimensionality reduction of hyperspectral imagery,” in Geoscience and Remote Sensing Symposium, 2009 IEEE International, IGARSS 2009, July 2009, vol. 5, pp. V–76–V–79.
- [5] Robert Calderbank, Sina Jafarpour, and Robert Schapire, “Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain.”
- [6] J.E. Fowler, “Compressive-projection principal component analysis,” Image Processing, IEEE Transactions on, vol. 18, no. 10, pp. 2230–2242, Oct 2009.
- [7] M. Aghagolzadeh and H. Radha, “Adaptive dictionaries for compressive imaging,” in Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, Dec 2013, pp. 1033–1036.
- [8] A. Krishnamurthy, M. Azizyan, and A. Singh, “Subspace Learning from Extremely Compressed Measurements,” ArXiv e-prints, Apr. 2014.
- [9] M.A. Davenport, P.T. Boufounos, M.B. Wakin, and R.G. Baraniuk, “Signal processing with compressive measurements,” Selected Topics in Signal Processing, IEEE Journal of, vol. 4, no. 2, pp. 445–460, April 2010.
- [10] Lijun Zhang, M. Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu, “Random projections for classification: A recovery approach,” Information Theory, IEEE Transactions on, vol. 60, no. 11, pp. 7300–7316, Nov 2014.
- [11] Saharon Rosset, Ji Zhu, and Trevor Hastie, “Margin maximizing loss functions,” in NIPS, 2004.
- [12] M.F. Duarte, M.A. Davenport, D. Takhar, J.N. Laska, Ting Sun, K.F. Kelly, and R.G. Baraniuk, “Single-pixel imaging via compressive sampling,” Signal Processing Magazine, IEEE, vol. 25, no. 2, pp. 83–91, March 2008.
- [13] Richard Baraniuk, Mark Davenport, Ronald Devore, and Michael Wakin, “A simple proof of the restricted isometry property for random matrices,” Constr. Approx, vol. 2008, 2007.
- [14] W.K. Ma, J.M. Bioucas Dias, Tsung Han Chan, N. Gillis, P. Gader, A.J. Plaza, A. Ambikapathi, and Chong-Yung Chi, “A Signal Processing Perspective on Hyperspectral Unmixing: Insights from Remote Sensing,” Signal Processing Magazine, IEEE, vol. 31, no. 1, pp. 67–81, January 2014.
- [15] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 8, pp. 1778–1790, August 2004.
- [16] O. Chapelle, “Training a support vector machine in the primal,” Neural Computation, vol. 19, no. 5, pp. 1155–1178, 2007.
- [17] This dataset was gathered by the AVIRIS sensor over the Indian Pines test site in North-western Indiana and consists of pixels and 224 spectral reflectance bands in the wavelength range 0.4 to 2.5 micrometers.
- [18] This scene was acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy. The number of spectral bands is 103 and the spatial resolution is pixels. Ground-truth consists of 9 classes.
- [19] M. Pilanci and M.J. Wainwright, “Randomized Sketches of Convex Programs with Sharp Guarantees,” arXiv: 1404.7203 [cs.IT], April 2014.
- [20] B. Waske, S. Van Der Linden, J.A. Benediktsson, A. Rabe, and P. Hostert, “Sensitivity of support vector machines to random feature selection in classification of hyperspectral data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, pp. 2880–2889, 2010.