Compressive Learning is an emerging topic that combines signal acquisition via compressive sensing and machine learning to perform inference tasks directly on a small number of measurements. Many data modalities naturally have a multidimensional or tensorial format, with each dimension or tensor mode representing different features such as the spatial and temporal information in video sequences or the spatial and spectral information in hyperspectral images. However, in existing compressive learning frameworks, the compressive sensing component utilizes either random or learned linear projection on the vectorized signal to perform signal acquisition, thus discarding the multidimensional structure of the signals. In this paper, we propose Multilinear Compressive Learning, a framework that takes into account the tensorial nature of multidimensional signals in the acquisition step and builds the subsequent inference model on the structurally sensed measurements. Our theoretical complexity analysis shows that the proposed framework is more efficient than its vector-based counterpart in both memory and computation requirements. With extensive experiments, we also empirically show that our Multilinear Compressive Learning framework outperforms the vector-based framework in object classification and face recognition tasks, and scales favorably when the dimensionalities of the original signals increase, making it highly efficient for high-dimensional multidimensional signals.
The classical sample-based signal acquisition and manipulation approach usually involves separate steps of signal sensing, compression, storing or transmitting, and then reconstruction. This approach requires the signal to be sampled above the Nyquist rate in order to ensure high-fidelity reconstruction. Since the advent of spatial-multiplexing cameras, over the past decade, Compressive Sensing (CS) [1] has become an efficient and prominent approach for signal acquisition at sub-Nyquist rates, combining the sensing and compression steps at the hardware level. This is possible because the signal often possesses specific structures that exhibit a sparse or compressible representation in some basis, and thus can be sensed at a rate lower than the Nyquist rate while still allowing almost perfect reconstruction [2, 3]. In fact, many data modalities that we operate on are sparse or compressible. For example, smooth signals are compressible in the Fourier domain, and consecutive frames in a video are piecewise smooth, thus compressible in a wavelet domain. With efficient realizations at the hardware level, such as the popular Single Pixel Camera, CS has become an efficient signal acquisition framework; however, it makes signal manipulation an intimidating task. Indeed, since reversing the signal to its original domain is often considered a necessary step for signal manipulation, a significant amount of work over the past decade has been dedicated to signal reconstruction, giving certain insights and theoretical guarantees for the successful recovery of the signal from compressively sensed measurements [2, 1, 3].
While signal recovery plays a major role in some sensing applications such as image acquisition for visual purposes, there are many scenarios in which the primary objective is detecting certain patterns or inferring some properties of the acquired signal. For example, in many radar applications, one is often interested in anomaly patterns in the measurements rather than in signal recovery. Moreover, in certain applications [4, 5], signal reconstruction is undesirable since this step can potentially disclose private information, leading to the infringement of data protection legislation. These scenarios naturally led to the emergence of the Compressive Learning (CL) concept [6, 7, 8, 9], in which the inference system is built on top of the compressively sensed measurements without an explicit reconstruction step. While the amount of literature on CL is rather small compared to that on signal reconstruction in CS, different attempts have been made to modify the sensing component in accordance with the learning task [10, 11], to extract discriminative features [7, 12] from the randomly sensed measurements, or to jointly optimize the sensing matrix [13, 14] and the subsequent inference system. Although improvements to different components of the CL pipeline have been proposed, existing frameworks all utilize the same compressive acquisition step that performs a linear projection of the vectorized data. They therefore operate on vector-based measurements and lose the tensorial structure in the measurements of multidimensional data.
In fact, many data modalities naturally possess a tensorial format, such as color images, videos or multivariate time-series. The multidimensional representation naturally reflects the semantic differences inherent in different dimensions or tensor modes. For example, the spatial and temporal dimensions in a video or the spatial and spectral dimensions in hyperspectral images represent two different concepts with different properties. Thus, by exploiting this natural form of the signals and considering the semantic differences between different dimensions, many tensor-based signal processing and learning algorithms have shown their superiority over the vector-based approach, which simply operates on the vectorized data [15, 16, 17, 18, 19, 20, 21]. Indeed, tensor representation and its associated mathematical operations and properties have found various applications in the Machine Learning community. For example, in multivariate time-series analysis, the multilinear projection was utilized in [18, 22] to model the dependencies between data points along the feature and temporal dimensions separately. Several multilinear regression [23, 24] or discriminant models [25, 26] have been developed to replace their linear counterparts, with improved performance. In the neural network literature, multilinear techniques have been employed to compress pretrained networks [27, 28, 29], or to construct novel neural network architectures [19, 30, 22].
It is worth noting that CS plays an important role in many applications that involve high-dimensional tensor signals because the standard point-based signal acquisition is both memory-intensive and computationally intensive. Representative examples include Hyperspectral Compressive Imaging (HCI), Synthetic Aperture Radar (SAR) imaging, Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). Therefore, the tensor-based approach has also found its place in CS, where it is known as Multidimensional Compressive Sensing (MCS) [31], which replaces the linear sensing and reconstruction model with a multilinear one. Similar to vector-based CS, hereafter simply referred to as CS, the majority of efforts in MCS are dedicated to constructing multilinear models that induce sparse representation along each tensor mode with respect to a set of bases. For example, the adoption of sparse Tucker representation and the Kronecker sensing scheme in MRI allows computationally efficient signal recovery with very high Peak Signal to Noise Ratio (PSNR) [31, 32]. In addition, the availability of optical implementations of separable sensing operators such as [33] naturally enables MCS, significantly reducing the amount of data collection and the reconstruction cost.
While multilinear models have been successfully applied in Compressive Sensing and Machine Learning, to the best of our knowledge, they have not been utilized in Compressive Learning, which is the joint framework combining CS and ML. In this paper, in order to leverage the multidimensional structure in many data modalities, we propose the Multilinear Compressive Learning framework, which adopts a multilinear sensing operator and a neural network classifier that is designed to utilize the multidimensional structure-preserving compressed measurements. The contributions of this paper are as follows:
We propose Multilinear Compressive Learning (MCL), a novel CL framework that consists of a multilinear sensing module, a multilinear feature synthesis component, both taking into account the multidimensional property of the signals, and a task-specific neural network. The multilinear sensing module compressively senses along each separate mode of the original tensor signal, producing structurally encoded measurements. Similarly, the feature synthesis component performs the feature learning steps separately along each mode of the compressed measurements, producing inputs to the subsequent task-specific neural network, whose structure depends on the inference problem.
We show both theoretically and empirically that the proposed MCL framework is highly cost-effective in terms of memory and computational complexity. In addition, theoretical analysis and experimental results also indicate that our framework scales well when the dimensionalities of the original signal increase, making it highly efficient for high-dimensional tensor signals.
We conduct extensive experiments in object classification and face recognition tasks to validate the performance of our framework in comparison with its vector-based counterpart. In addition, the effects of different components and hyperparameters in the proposed framework are empirically analyzed.
We publicly provide our implementation of the experiments reported in this paper to facilitate future research. By following our detailed instructions on how to set up the software environment, all experiment results can be reproduced in just one line of code. The implementation is available at https://github.com/viebboy/MultilinearCompressiveLearningFramework
The remainder of the paper is organized as follows: in Section 2, we review the background information in Compressive Sensing, Multidimensional Compressive Sensing and Compressive Learning. In Section 3, the detailed description of the proposed Multilinear Compressive Learning framework is given. Complexity analysis and comparison with the vector-based framework are also given in Section 3. In Section 4, we provide details of our experiment protocols and quantitative analysis of different experiment configurations. Section 5 concludes our work with possible future research directions.
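In this paper, we denote scalar values by either lowercase or uppercase characters $(x, y, X, Y, \dots)$, vectors by lowercase boldface characters $(\mathbf{x}, \mathbf{y}, \dots)$, matrices by uppercase or Greek boldface characters $(\mathbf{A}, \mathbf{B}, \mathbf{\Phi}, \dots)$, and tensors by calligraphic capitals $(\mathcal{X}, \mathcal{Y}, \dots)$. A tensor with $N$ modes and dimension $I_n$ in mode $n$ is represented as $\mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_N}$. The entry at index $i_n$ in mode $n$, for $n = 1, \dots, N$, is denoted as $\mathcal{X}_{i_1, \dots, i_N}$. In addition, $\mathrm{vec}(\mathcal{X})$ denotes the vectorization operation that rearranges the elements of $\mathcal{X}$ into a vector.
The Kronecker product between two matrices $\mathbf{A} \in \mathbb{R}^{P \times Q}$ and $\mathbf{B} \in \mathbb{R}^{R \times S}$, denoted as $\mathbf{A} \otimes \mathbf{B}$ and having dimension $PR \times QS$, is defined by:
$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} A_{11}\mathbf{B} & \cdots & A_{1Q}\mathbf{B} \\ \vdots & \ddots & \vdots \\ A_{P1}\mathbf{B} & \cdots & A_{PQ}\mathbf{B} \end{bmatrix} \tag{1}$$
The mode-$n$ product between a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_N}$ and a matrix $\mathbf{W} \in \mathbb{R}^{J_n \times I_n}$ is another tensor of size $I_1 \times \dots \times J_n \times \dots \times I_N$, denoted by $\mathcal{X} \times_n \mathbf{W}$. Its elements are defined as $[\mathcal{X} \times_n \mathbf{W}]_{i_1, \dots, i_{n-1}, j_n, i_{n+1}, \dots, i_N} = \sum_{i_n = 1}^{I_n} \mathcal{X}_{i_1, \dots, i_N} W_{j_n, i_n}$.
The following relationship between the Kronecker product and mode product is the cornerstone in MCS:
$$\mathcal{Y} = \mathcal{X} \times_1 \mathbf{W}_1 \times_2 \dots \times_N \mathbf{W}_N \tag{2}$$
can be written as
$$\mathbf{y} = (\mathbf{W}_1 \otimes \dots \otimes \mathbf{W}_N)\, \mathbf{x} \tag{3}$$
where $\mathbf{x} = \mathrm{vec}(\mathcal{X})$ and $\mathbf{y} = \mathrm{vec}(\mathcal{Y})$, with the last tensor index varying fastest in $\mathrm{vec}(\cdot)$.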
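The relationship between Eq. (2) and Eq. (3) can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the helper name `mode_product`, the tensor sizes, and the row-major (last index fastest) vectorization convention are assumptions made for illustration. Under that convention the Kronecker factors appear in mode order.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode` (0-based, hypothetical helper)."""
    # tensordot contracts the matrix columns with the chosen tensor axis
    # and places the new axis first; moveaxis puts it back at `mode`.
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))            # illustrative 3-mode tensor
W = [rng.standard_normal((2, 4)),             # one matrix per mode
     rng.standard_normal((3, 5)),
     rng.standard_normal((2, 6))]

# Multilinear form: apply one mode product per mode.
Y = X
for n, W_n in enumerate(W):
    Y = mode_product(Y, W_n, n)

# Vector form: a single large Kronecker-structured matrix.
K = np.kron(np.kron(W[0], W[1]), W[2])
y = K @ X.reshape(-1)                         # row-major vectorization

assert np.allclose(y, Y.reshape(-1))
```

With a column-major vectorization the same identity holds with the Kronecker factors taken in reverse order.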
Compressive Sensing (CS) [1] is a signal acquisition and manipulation paradigm that performs simultaneous sensing and compression at the hardware level, leading to a large reduction in computation cost and in the number of measurements. The signal $\mathbf{y} \in \mathbb{R}^{I}$ working under CS is assumed to have a sparse or compressible representation $\mathbf{x} \in \mathbb{R}^{I}$ in some basis or dictionary $\mathbf{\Psi} \in \mathbb{R}^{I \times I}$, that is:
$$\mathbf{y} = \mathbf{\Psi}\mathbf{x}, \quad \|\mathbf{x}\|_0 \le K, \quad K \ll I \tag{4}$$
where $\|\mathbf{x}\|_0$ denotes the number of nonzero entries in $\mathbf{x}$. While the dictionary presented in Eq. (4) is complete, i.e., the number of columns in $\mathbf{\Psi}$ is equal to the signal dimension $I$, we should note that signal models with overcomplete dictionaries, i.e., with more columns than the signal dimension, can also work with some modifications [34].
With the assumption on the sparsity, CS performs the linear sensing step using the sensing operator $\mathbf{\Phi} \in \mathbb{R}^{M \times I}$, acquiring a small number of measurements $\mathbf{z} \in \mathbb{R}^{M}$, with $M < I$, from the analog signal $\mathbf{y}$:
$$\mathbf{z} = \mathbf{\Phi}\mathbf{y} \tag{5}$$
Eq. (5) represents both the sensing and compression steps, which can be efficiently implemented at the sensor level. Thus, what we obtain from CS sensors is a limited number of measurements $\mathbf{z}$ that is used for other processing steps. By combining Eqs. (4) and (5), the CS model is usually expressed as:
$$\mathbf{z} = \mathbf{\Phi}\mathbf{\Psi}\mathbf{x} \tag{6}$$
In some applications, we are interested in recovering the signal $\mathbf{y}$ from its measurements $\mathbf{z}$. This involves developing theoretical properties and algorithms to determine the sensing operator $\mathbf{\Phi}$, the dictionary or basis $\mathbf{\Psi}$, and the number of nonzero coefficients $K$ in order to ensure that the reconstruction is unique and of high fidelity [2, 35, 3]. The reconstruction of $\mathbf{y}$ is often posed as finding the sparsest solution of the underdetermined linear system [36], particularly:
$$\arg\min_{\mathbf{x}} \|\mathbf{x}\|_0 \quad \text{subject to} \quad \|\mathbf{z} - \mathbf{\Phi}\mathbf{\Psi}\mathbf{x}\|_2 \le \epsilon \tag{7}$$
where $\epsilon$ is a small constant specifying the amount of residual error allowed in the approximation. A large body of research has been dedicated to solving the problem in Eq. (7) and its variants with two main approaches: basis pursuit (BP), which transforms Eq. (7) to a convex problem to be solved by linear programming [37] or second-order cone programs [2], and matching pursuit (MP), a class of greedy algorithms which iteratively refine the solution towards the sparsest one [38, 39]. Both BP and MP algorithms are computationally intensive when the number of elements in $\mathbf{y}$ is large, especially in the case of multidimensional signals.
Given a multidimensional signal $\mathcal{Y} \in \mathbb{R}^{I_1 \times \dots \times I_N}$, a direct application of the sparse representation in Eq. (4) requires vectorizing $\mathcal{Y}$ and performing the calculations on $\mathbf{\Psi}$, which is a very large matrix whose number of elements scales exponentially with $N$. Instead of assuming that $\mathrm{vec}(\mathcal{Y})$ is sparse in some basis or dictionary, MCS adopts a sparse Tucker model [40] as follows:
$$\mathcal{Y} = \mathcal{X} \times_1 \mathbf{\Psi}_1 \times_2 \dots \times_N \mathbf{\Psi}_N \tag{8}$$
which assumes that the signal $\mathcal{Y}$ is sparse with respect to a set of bases or dictionaries $\{\mathbf{\Psi}_n\}$. In some cases, the sensing step can also be taken in a multilinear way, i.e., by using a set of linear operators along each mode separately, also known as separable sensing operators:
$$\mathcal{Z} = \mathcal{Y} \times_1 \mathbf{\Phi}_1 \times_2 \dots \times_N \mathbf{\Phi}_N \tag{9}$$
which allows us to obtain the measurements $\mathcal{Z}$ with retained multidimensional structure. From Eqs. (2), (3), (8) and (9), the MCS model is often expressed as:
$$\mathbf{z} = \big((\mathbf{\Phi}_1\mathbf{\Psi}_1) \otimes \dots \otimes (\mathbf{\Phi}_N\mathbf{\Psi}_N)\big)\, \mathbf{x} \tag{10}$$
where $\mathbf{z} = \mathrm{vec}(\mathcal{Z})$ and $\mathbf{x} = \mathrm{vec}(\mathcal{X})$.
Since MCS can be expressed in the vector form, the existing algorithms and theoretical bounds for vector-based CS have also been extended to MCS. Representative examples include Kronecker OMP and its tensor block-sparsity extension [42], which improves the computation significantly. It is worth noting that by adopting a multilinear structure, MCS operates with a set of smaller sensing matrices and dictionaries, requiring much lower memory and computation compared to the vectorization approach [31].
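Because the MCS model can be written in vector form, a generic greedy solver applies directly to the vectorized problem. The sketch below is a plain Orthogonal Matching Pursuit in NumPy, given as an illustration of the matching pursuit family rather than the Kronecker OMP of [42]; the function name `omp`, the problem sizes, and the random Gaussian sensing matrix are assumptions.

```python
import numpy as np

def omp(A, z, sparsity, tol=1e-10):
    """Greedy search for x with at most `sparsity` nonzeros such that A @ x is close to z."""
    residual = z.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(sparsity):
        # Select the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Refit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], z, rcond=None)
        residual = z - A[:, support] @ coef
        if np.linalg.norm(residual) <= tol:
            break
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
I, M, K = 64, 32, 3                                   # illustrative sizes
Psi = np.linalg.qr(rng.standard_normal((I, I)))[0]    # assumed orthonormal basis
Phi = rng.standard_normal((M, I)) / np.sqrt(M)        # assumed random sensing matrix
x_true = np.zeros(I)
x_true[rng.choice(I, K, replace=False)] = rng.standard_normal(K)

A = Phi @ Psi
z = A @ x_true                                        # measurements, as in Eq. (6)
x_hat = omp(A, z, sparsity=K)
```

Exact recovery of `x_true` is likely for a random draw of this size but is not guaranteed; the cost of forming and storing `A` explicitly is what the multilinear formulation avoids.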
The idea of learning directly from the compressed measurements dates back to the early work of [7], in which the authors proposed a framework termed compressive classification, which introduces the concept of smashed filters and operates directly on the compressive measurements without reconstruction as the first proxy step. The result in [7] was subsequently strengthened in [43], which showed that when a sufficiently large random sensing matrix is used, it can capture the structure of the data manifold. Later, further extensions that extract discriminative features from compressive measurements for activity recognition [44, 45] or face recognition [12] were also proposed.
The concept of CL was introduced in [6], which provides a theoretical analysis illustrating that learning machines can be built directly in the compressed domain. Particularly, given certain conditions on the sensing matrix $\mathbf{\Phi}$, the performance of a linear Support Vector Machine (SVM) trained on compressed measurements is as good as the best linear threshold classifier trained on the original signal $\mathbf{y}$. Later, for compressive learning of signals described by a Gaussian Mixture Model, the asymptotic behavior of the upper bound [9] and its extension [11] to learn the sensing matrix were also derived.
The idea of jointly optimizing the sensing matrix with the classifier was also adopted in [10], in which the authors proposed an adaptive version of the feature-specific imaging system to learn an optimal sensing matrix based on past measurements. With the advances in computing hardware and stochastic optimization techniques, an end-to-end CL system was proposed in [13], followed by several extensions and applications [46, 47, 48], indicating superior performance when simultaneously optimizing the sensing component and the classifier via task-specific data. Our work is closely related to the end-to-end CL system in [13] in that we also optimize the CL system via stochastic optimization in an end-to-end manner. Different from [13], our proposed framework efficiently utilizes the tensor structure inherent in many types of signals, thus outperforming the approach in [13] in both inference performance and computational efficiency.
In this Section, we first give our description of the proposed Multilinear Compressive Learning (MCL) framework that operates directly on the tensor representation of the signals. Then, the initialization scheme and optimization procedures of the proposed framework are discussed. Lastly, a theoretical analysis of the framework's complexity in comparison with its vector-based counterpart is provided.
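In order to model the multidimensional structure in the signal of interest, we assume that the discriminative structure in $\mathcal{Y} \in \mathbb{R}^{I_1 \times \dots \times I_N}$ can be captured in a lower-dimensional multilinear subspace $\mathcal{F} = \mathbb{R}^{M_1 \times \dots \times M_N}$ of $\mathbb{R}^{I_1 \times \dots \times I_N}$ with $M_n < I_n$ ($n = 1, \dots, N$):
$$\mathcal{Y} = \mathcal{X} \times_1 \mathbf{\Psi}_1 \times_2 \dots \times_N \mathbf{\Psi}_N \tag{11}$$
where $\mathbf{\Psi}_n \in \mathbb{R}^{I_n \times M_n}$ denotes the factor matrices and $\mathcal{X} \in \mathbb{R}^{M_1 \times \dots \times M_N}$ is the signal representation in this multilinear subspace.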
Here we should note that although Eq. (11) in our framework and Eq. (8) in MCS look similar in their mathematical form, the assumptions and motivations are different. The objective in MCS is to reconstruct the signal by assuming the existence of the set of sparsifying dictionaries or bases $\{\mathbf{\Psi}_n\}$ and optimizing them to induce the sparsest $\mathcal{X}$. Since our objective is to learn a classification or regression model, we make no assumption or constraint on the sparsity of $\mathcal{X}$ but assume that the factorization in Eq. (11) can lead to a tensor subspace in which the representation $\mathcal{X}$ is discriminative or meaningful for the learning problem.
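As mentioned in the previous Section, in some applications, the measurements can be taken in a multilinear fashion, with different linear sensing operators operating along different tensor modes, i.e., separable sensing operators. In such cases, we obtain the measurements $\mathcal{Z}$ from the following sensing equation:
$$\mathcal{Z} = \mathcal{Y} \times_1 \mathbf{\Phi}_1 \times_2 \dots \times_N \mathbf{\Phi}_N \tag{12}$$
where $\mathbf{\Phi}_n \in \mathbb{R}^{M_n \times I_n}$ ($n = 1, \dots, N$) represent the sensing matrices of those linear operators.
In cases where the measurements of the multidimensional signals are taken in a vector-based fashion, i.e., with the following sensing model:
$$\mathbf{z} = \mathbf{\Phi}\, \mathrm{vec}(\mathcal{Y}) \tag{13}$$
with a single sensing operator $\mathbf{\Phi} \in \mathbb{R}^{(M_1 \cdots M_N) \times (I_1 \cdots I_N)}$, we can still enforce a structure-preserving sensing operation similar to the multilinear sensing scheme in Eq. (12) by setting:
$$\mathbf{\Phi} = \mathbf{\Phi}_1 \otimes \dots \otimes \mathbf{\Phi}_N \tag{14}$$
By setting the sensing matrices $\mathbf{\Phi}_n$ to be the pseudo-inverse of $\mathbf{\Psi}_n$ for all $n$, we obtain the measurements $\mathcal{Z}$ that lie in the discriminative-induced tensor subspace mentioned previously.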
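The effect of choosing each sensing matrix as the pseudo-inverse of the corresponding factor matrix can be illustrated numerically. This is a minimal sketch under stated assumptions: the helper `multilinear_map`, the symbols `Psi`, `Phi`, `X`, `Y`, `Z`, and the sizes are illustrative, and the signal is constructed to lie exactly in the subspace of Eq. (11), which real data would only do approximately.

```python
import numpy as np

def multilinear_map(tensor, matrices):
    """Apply matrices[n] along axis n of `tensor` for every mode (hypothetical helper)."""
    out = tensor
    for mode, mat in enumerate(matrices):
        out = np.moveaxis(np.tensordot(mat, out, axes=(1, mode)), 0, mode)
    return out

rng = np.random.default_rng(0)
I_dims, M_dims = (8, 8, 3), (4, 4, 2)       # assumed signal and measurement sizes

# Factor matrices of size I_n x M_n and a representation in the subspace.
Psi = [rng.standard_normal((I, M)) for I, M in zip(I_dims, M_dims)]
X = rng.standard_normal(M_dims)
Y = multilinear_map(X, Psi)                 # signal built as in Eq. (11)

# Sensing matrices of size M_n x I_n chosen as pseudo-inverses.
Phi = [np.linalg.pinv(P) for P in Psi]
Z = multilinear_map(Y, Phi)                 # separable sensing as in Eq. (12)

# For full-column-rank factors, the measurements equal the subspace representation.
assert np.allclose(Z, X)
```

In this idealized case sensing recovers the subspace coordinates exactly; for signals that only approximately follow Eq. (11), the same choice yields a projection onto that subspace.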
Figure 1 illustrates our proposed MCL framework which consists of the following components:
CS component: the data acquisition step of the multidimensional signals is done via separable linear sensing operators $\{\mathbf{\Phi}_n\}$. As mentioned previously, in cases where the actual hardware implementation only allows a vector-based sensing scheme, Eq. (14) allows the simulation of this multilinear sensing step. This component produces measurements $\mathcal{Z}$ with encoded tensor structure, having the same number of tensor modes ($N$) as the original signal.
Feature Synthesis (FS) component: from $\mathcal{Z}$, this step performs feature extraction along the $N$ modes of the measurements with the set of learnable matrices $\{\mathbf{\Theta}_n\}$. Since the measurements typically have many fewer elements than the original signal $\mathcal{Y}$, the FS component expands the dimensions of $\mathcal{Z}$, allowing better separability between the sensed signals from different classes in a higher multidimensional space that is found through optimization. While the sensing step performs linear interpolations for computational efficiency, the FS component can be either a multilinear or a nonlinear transformation. A typical nonlinear transformation step is to perform zero-thresholding, i.e., ReLU, on $\mathcal{Z}$ before multiplying with $\{\mathbf{\Theta}_n\}$, i.e., $\mathcal{T} = \mathrm{ReLU}(\mathcal{Z}) \times_1 \mathbf{\Theta}_1 \times_2 \dots \times_N \mathbf{\Theta}_N$. In applications which require the transmission of $\mathcal{Z}$ to be analyzed, this simple thresholding step can, before transmission, increase the compression rate by sparsifying the encoded signal and discarding the sign bits. While nonlinearity is often considered beneficial for neural networks, adding the thresholding step as described above further restricts the information retained in a limited number of measurements, and thus adversely affects the inference system. In the Experiments Section, we provide an empirical analysis of the effect of nonlinearity on the inference tasks at different measurement rates. Here we should note that while our FS component resembles the reprojection step in the vector-based framework [13], our FS and CS components have different weights ($\mathbf{\Phi}_n$ and $\mathbf{\Theta}_n$), and the dimensionality of the tensor feature $\mathcal{T}$ produced by the FS component is task-dependent and is not constrained to that of the original signal.
Task-specific Neural Network $\mathcal{N}$: from the tensor representation $\mathcal{T}$ produced by the FS step, a neural network with task-dependent architecture is built on top to generate the regression or classification outputs. For example, when analyzing visual data, $\mathcal{N}$ can be a Convolutional Neural Network (CNN) in the case of static images or a Convolutional Recurrent Neural Network in the case of videos. In CS applications that involve distributed arrays of sensors that continuously collect data, specific architectures for time-series analysis such as the Long Short-Term Memory Network should be considered for $\mathcal{N}$. Here we should note that the size of $\mathcal{T}$ is also task-dependent and should match the neural network component. For example, in object detection and localization tasks, it is desirable to keep the spatial aspect ratio of $\mathcal{T}$ similar to that of $\mathcal{Y}$ to allow precise localization.
Table I: Memory and computational complexity of the CS and FS components in the proposed framework (Our) and in the vector-based framework (Vector [13]).
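The CS and FS components described above can be sketched as two consecutive multilinear maps, with an optional zero-thresholding step in between. This is a minimal NumPy illustration, not the released implementation: the names `mcl_front_end`, `Phi`, `Theta`, and all sizes are assumptions, and the matrices are random rather than learned.

```python
import numpy as np

def multilinear_map(tensor, matrices):
    """Apply matrices[n] along axis n of `tensor` for every mode (hypothetical helper)."""
    out = tensor
    for mode, mat in enumerate(matrices):
        out = np.moveaxis(np.tensordot(mat, out, axes=(1, mode)), 0, mode)
    return out

def mcl_front_end(Y, Phi, Theta, nonlinear=False):
    """Return the measurements and the synthesized feature tensor for one signal."""
    Z = multilinear_map(Y, Phi)             # CS component: separable sensing
    if nonlinear:
        Z = np.maximum(Z, 0.0)              # optional zero-thresholding (ReLU)
    T = multilinear_map(Z, Theta)           # FS component: expand each mode
    return Z, T

rng = np.random.default_rng(0)
I_dims, M_dims = (8, 8, 3), (4, 4, 2)       # assumed signal and measurement sizes
Phi = [rng.standard_normal((M, I)) for I, M in zip(I_dims, M_dims)]
Theta = [rng.standard_normal((I, M)) for I, M in zip(I_dims, M_dims)]

Y = rng.standard_normal(I_dims)
Z, T = mcl_front_end(Y, Phi, Theta)
Z_relu, T_relu = mcl_front_end(Y, Phi, Theta, nonlinear=True)
```

Here the feature tensor is given the same size as the input, but, as noted in the text, its size is a design choice that only has to match the downstream network.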
In our proposed MCL framework, we aim to optimize all three components, i.e., $\{\mathbf{\Phi}_n\}$, $\{\mathbf{\Theta}_n\}$ and $\mathcal{N}$, with respect to the inference task. A simple and straightforward approach is to consider all components in this framework as a single computation graph, then randomly initialize the weights according to some popular initialization scheme [49, 50] and perform stochastic gradient descent on this graph with respect to the loss function defined by the learning task. However, this approach does not take into account any existing domain knowledge that we have about each component.
As mentioned in Section III.A, with the assumption of the existence of a tensor subspace and the factorization in Eq. (11), the sensing matrix $\mathbf{\Phi}_n$ in the CS component can be initialized to the pseudo-inverse of $\mathbf{\Psi}_n$ for all $n$ to obtain initial measurements $\mathcal{Z}$ that are discriminative or meaningful. Several algorithms have been proposed to learn the factorization in Eq. (11) with respect to different criteria such as the multi-class discriminant [25], class-specific discriminant [26], max-margin [51] or Tucker Decomposition with non-negative constraint [52].
In a general setting, we propose to apply Higher Order Singular Value Decomposition (HOSVD) [40] and initialize $\mathbf{\Phi}_n$ with the left singular vectors that correspond to the largest singular values in mode $n$. The sensing matrices are then adjusted together with other components during the stochastic optimization process. This initialization scheme resembles the one proposed for the vector-based CL framework, which utilizes Principal Component Analysis (PCA). In a general case where one has no prior knowledge of the structure of $\mathcal{F}$, a transformation that retains the most energy in the signal, such as PCA or HOSVD, is a popular choice when reducing the dimensionalities of the signal. While for higher-order data HOSVD only provides a quasi-optimal condition for data reconstruction in the least-squares sense [53], since our objective is to make inferences, this initialization scheme works well, as indicated in our Experiments Section.
With the aforementioned initialization scheme of the CS component for a general setting, it is natural to also initialize $\mathbf{\Theta}_n$ in the FS component with the right singular vectors corresponding to the largest singular values in mode $n$ of the training data. With this initialization of $\mathbf{\Theta}_n$, during the initial forward steps in stochastic gradient descent, the FS component produces an approximate version of $\mathcal{Y}$, and in cases where a classifier pretrained on $\mathcal{Y}$ or its approximated version exists, the weights of the neural network $\mathcal{N}$ can be initialized with those of the pretrained classifier. It is worth noting that the reprojection step in the vector-based framework in [13] shares the weights with the sensing matrices, performing implicit signal reconstruction, while we have different sensing and feature extraction weights. Since the vector-based framework involves large sensing and reprojection matrices, from the optimization point of view, enforcing shared weights might be essential in their framework to reduce overfitting, as indicated by their empirical results.
After performing the aforementioned initialization steps, all three components in our MCL framework are optimized using the Stochastic Gradient Descent method. It is worth noting that the above initialization scheme for the CS and FS components is proposed for a generic setting and can serve as a good starting point. In cases where certain properties of the tensor subspace or the tensor feature are known to improve the learning task, one might adopt a different initialization strategy for the CS and FS components to induce such properties.
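The HOSVD-based initialization can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names `unfold` and `hosvd_init`, the stacking of the training set into one array with the sample axis first, and the choice to initialize each FS matrix as the transpose of the corresponding sensing matrix are all assumptions made here for concreteness.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode unfolding: rows index the chosen axis, columns index everything else."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd_init(train, ranks):
    """train: array of shape (num_samples, I_1, ..., I_N); ranks: (M_1, ..., M_N)."""
    Phi, Theta = [], []
    for n, M_n in enumerate(ranks):
        # Axis 0 holds the samples, so signal mode n is array axis n + 1.
        U, _, _ = np.linalg.svd(unfold(train, n + 1), full_matrices=False)
        U_n = U[:, :M_n]                # leading left singular vectors, I_n x M_n
        Phi.append(U_n.T)               # sensing matrix, M_n x I_n
        Theta.append(U_n.copy())        # assumed FS initialization, I_n x M_n
    return Phi, Theta

rng = np.random.default_rng(0)
train = rng.standard_normal((50, 8, 8, 3))      # illustrative training set
Phi, Theta = hosvd_init(train, ranks=(4, 4, 2))
```

With this choice, applying the sensing matrices followed by the FS matrices projects each signal onto the leading singular subspace of every mode, which is one way to obtain the approximate reconstruction described in the text at the start of training.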
Since the complexity of the neural network component varies with the choice of the architecture, we will estimate the theoretical complexity for the CS and FS components and make a comparison with the vector-based framework [13]. Let $I_1 \times \dots \times I_N$ and $M_1 \times \dots \times M_N$ denote the dimensionality of the original signal $\mathcal{Y}$ and its measurements $\mathcal{Z}$, respectively. In addition, to compare with the vector-based framework, we also assume that the dimensionality of the feature $\mathcal{T}$ is also $I_1 \times \dots \times I_N$. Thus, $\mathbf{\Phi}_n$ belongs to $\mathbb{R}^{M_n \times I_n}$ and $\mathbf{\Theta}_n$ belongs to $\mathbb{R}^{I_n \times M_n}$ for $n = 1, \dots, N$ in our CS and FS components, while in [13], the sensing matrix and the reconstruction matrix belong to $\mathbb{R}^{(M_1 \cdots M_N) \times (I_1 \cdots I_N)}$ and $\mathbb{R}^{(I_1 \cdots I_N) \times (M_1 \cdots M_N)}$, respectively.
It is clear that the memory complexity of CS and FS component in our MCL framework is , and that of the vector-based framework is . To see the huge difference between the two frameworks, let us consider 3D MRI image of size with the sampling ratio , i.e., , the memory complexity in our framework is while that of the vector-based framework is
Regarding computational complexity of our framework, the CS component performs having complexity of , and the FS component performs having complexity of . For the vector-based framework, the sensing step computes and reprojection step computes , resulting in total complexity of . With the same 3D MRI example as in the previous paragraph, the total computational complexity of our framework is while that of the vector-based framework is .
Table I summarizes the complexity of the two frameworks. It is worth noting that by taking into account the multidimensional structure of the signal, the proposed framework has both memory and computational complexity several orders of magnitude lower than its vector-based counterpart.
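To make the scale of the difference concrete, the parameter and multiply-accumulate counts of the two sensing schemes can be computed directly from the matrix sizes. The sketch below is an illustrative derivation, not a reproduction of Table I: the function names, the example sizes, the convention of applying mode products in mode order, and the assumption that the vector-based framework uses one dense matrix (shared between sensing and reprojection when `shared=True`) are all assumptions.

```python
from math import prod

def mcl_params(I_dims, M_dims):
    """Parameters of the sensing and feature synthesis matrices, one pair per mode."""
    return 2 * sum(i * m for i, m in zip(I_dims, M_dims))

def vector_params(I_dims, M_dims, shared=True):
    """Parameters of a dense sensing matrix, optionally with a separate reprojection."""
    dense = prod(I_dims) * prod(M_dims)
    return dense if shared else 2 * dense

def mcl_macs(I_dims, M_dims):
    """Multiply-accumulates when mode products are applied in mode order."""
    N = len(I_dims)
    # Sensing: before step n the tensor has sizes M_1..M_{n-1}, I_n..I_N.
    cs = sum(prod(M_dims[:n + 1]) * prod(I_dims[n:]) for n in range(N))
    # Synthesis: before step n the tensor has sizes I_1..I_{n-1}, M_n..M_N.
    fs = sum(prod(I_dims[:n + 1]) * prod(M_dims[n:]) for n in range(N))
    return cs + fs

def vector_macs(I_dims, M_dims):
    """Dense sensing followed by dense reprojection."""
    return 2 * prod(I_dims) * prod(M_dims)

I_dims, M_dims = (32, 32, 3), (20, 19, 2)       # hypothetical configuration
print(mcl_params(I_dims, M_dims), vector_params(I_dims, M_dims))
print(mcl_macs(I_dims, M_dims), vector_macs(I_dims, M_dims))
```

For this hypothetical configuration the multilinear scheme needs 2,508 parameters against 2,334,720 for a single dense matrix, and 169,576 multiply-accumulates against 4,669,440.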
In this section, we provide a detailed description of our empirical analysis of the proposed MCL framework. We start by describing the datasets and the experiment protocols that have been used. In the standard set of experiments, we analyze the performance of MCL in comparison with the vector-based framework proposed in [13]. We further investigate the effect of different components in our framework in the Ablation Study Subsection.
We have conducted experiments on the object classification and face recognition tasks on the following datasets:
CIFAR-10 and CIFAR-100: the CIFAR dataset [54] is a color (RGB) image dataset for evaluating the object recognition task. The dataset consists of 50,000 images for training and 10,000 images for testing with resolution 32×32 pixels. CIFAR-10 refers to the 10-class object recognition task in which each individual image has a single class label coming from 10 different categories. Likewise, CIFAR-100 refers to a more fine-grained classification task with each image having a label coming from 100 different categories. In our experiment, from the training set of CIFAR-10 and CIFAR-100, we randomly selected images for validation purpose and only trained the algorithms on images.
CelebA: CelebA [55] is a large-scale face attributes dataset with more than 200K images at different resolutions from more than 10K identities. In our experiment, we used a subset of identities in this dataset which corresponds to , , and samples for training, validation, and testing, respectively. In order to evaluate the scalability of our proposed framework, we resized the original images to different sets of resolutions, including 32×32, 48×48, 64×64, and 80×80 pixels, which are subsequently denoted as CelebA32, CelebA48, CelebA64, and CelebA80, respectively.
In our experiments, two types of network architecture have been employed for the neural network component: the AllCNN architecture [56] and the ResNet architecture [57]. AllCNN is a simple 9-layer feed-forward architecture which has no max-pooling (pooling is done via convolution with stride greater than 1) and no fully-connected layer. ResNet is a 110-layer CNN with residual connections. The exact topologies of AllCNN and ResNet in our experiment can be found in our publicly available implementation (https://github.com/viebboy/MultilinearCompressiveLearningFramework).
Since all of the datasets contain RGB images, we followed the implementation proposed in [58] for the vector-based framework, which has 3 different sensing matrices, one for each color channel, and the corresponding reprojection matrices are enforced to share weights with the sensing matrices. The sensing matrices in MCL were initialized with the HOSVD decomposition on the training sets, while the sensing matrices in the vector-based framework were initialized with PCA decomposition on the training set. Likewise, the bases obtained from HOSVD and PCA were also used to initialize the FS component in our framework and the reprojection matrices in the vector-based framework. In addition, we also trained the neural network component on uncompressed data with respect to the learning tasks and initialized the classifier in each framework with these pretrained networks' weights. After the initialization step, both frameworks were trained in an end-to-end manner.
All algorithms were trained with the ADAM optimizer [59] with the following learning rate schedule , changing at epoch and . Each algorithm was trained for epochs in total. Weight decay coefficient was set to to regularize all the trainable weights in all experiments. We performed no data preprocessing step, except scaling all the pixel values to . In addition, data augmentation was employed by random flipping on the horizontal axis and image shifting within of the spatial dimensions. In all experiments, the final model weights, which are used to measure the performance on the test sets, are obtained from the epoch which has the highest validation accuracy.
For each experiment configuration, we performed runs and the mean and standard deviation of test accuracy are reported.
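The training protocol above combines a piecewise-constant learning rate schedule, model selection by validation accuracy, and reporting of mean and standard deviation over repeated runs. A minimal sketch of these three pieces is given below; all function names and every numeric value in the example are hypothetical placeholders rather than the settings used in the experiments, and the sample standard deviation is an assumed choice.

```python
from statistics import mean, stdev

def learning_rate(epoch, rates, boundaries):
    """Piecewise-constant schedule: rates[k] applies until epoch reaches boundaries[k]."""
    for rate, boundary in zip(rates, boundaries):
        if epoch < boundary:
            return rate
    return rates[-1]

def best_epoch(val_accuracy):
    """Index of the epoch with the highest validation accuracy (first one on ties)."""
    return max(range(len(val_accuracy)), key=val_accuracy.__getitem__)

def summarize(test_accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return mean(test_accuracies), stdev(test_accuracies)

# Hypothetical values, for illustration only.
rates, boundaries = [1e-3, 1e-4, 1e-5], [10, 20]
schedule = [learning_rate(epoch, rates, boundaries) for epoch in range(30)]
chosen = best_epoch([0.61, 0.70, 0.68])
avg, spread = summarize([0.70, 0.72, 0.71])
```

The weights saved at `chosen` would then be the ones evaluated on the test set for that run.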
Table II: Measurement configurations used in the experiments. Columns: Type, Configuration, #measurements, Measurement Rate. Rows: the vector-based framework [13] and two MCL (ours) configurations at each of the three measurement rates.
In order to compare with the vector-based framework in [13], we performed experiments on 3 datasets: CIFAR-10, CIFAR-100, and CelebA32. To compare the performances at different measurement rates, we employed three different measurement values for the vector-based framework: , , and . Here indicates that the vector-based framework has different sensing matrices for each color channel. Since we cannot always select the size of the measurements in MCL to match the number of measurements in the vector-based framework, we try to find the configurations of that closely match the vector-based ones. In addition, with a target number of measurements, there can be more than one configuration of that yields a similar number of measurements. For each measurement value () in the vector-based framework, we evaluated two different values of , particularly, the following sizes of were used: , , , , and . The measurement configurations are summarized in Table II.
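Finding multilinear measurement sizes whose total number of measurements is close to a vector-based target can be done by simple enumeration. The sketch below is an assumed procedure for illustration, not the selection method used in the experiments; the function name `candidate_configs`, the tolerance parameter, and the example target are hypothetical.

```python
from itertools import product

def candidate_configs(input_shape, target, tolerance=0.02):
    """List (config, count) pairs with each size at most the input size and
    a total measurement count within `tolerance` (relative) of `target`."""
    ranges = [range(1, dim + 1) for dim in input_shape]
    matches = []
    for config in product(*ranges):
        count = 1
        for size in config:
            count *= size
        if abs(count - target) <= tolerance * target:
            matches.append((config, count))
    # Closest totals first.
    matches.sort(key=lambda item: abs(item[1] - target))
    return matches

# Hypothetical example: a 32 x 32 x 3 input and a target of 768 measurements.
configs = candidate_configs((32, 32, 3), target=768)
```

Several configurations typically match a given target, which is why more than one multilinear configuration can be evaluated per measurement rate.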
In order to effectively compare the CS and FS components in MCL with those in [13], two different neural network architectures with different capacities were used. Tables III and IV show the accuracy on the test set with the AllCNN and ResNet architectures, respectively. The second row of each table shows the performance of the base classifier on the uncompressed data, which we term Oracle.
It is clear that our proposed framework outperforms the vector-based framework at all compression rates and on all datasets with both the AllCNN and ResNet architectures, except for the CIFAR100 dataset at the lowest measurement rate (). The performance gaps between the proposed MCL framework and the vector-based one are large, with more than differences for the CIFAR datasets at measurement rates and . In the case of the CelebA32 dataset at measurement rate (configuration ), the inference systems learned by our proposed framework even slightly outperform the Oracle setting for both the AllCNN and ResNet architectures.
Although the capacities of the AllCNN and ResNet architectures are different, their performances on the uncompressed data are roughly similar. Regarding the effect of the two different base classifiers in the two Compressive Learning pipelines, it is clear that the optimal configurations of our framework at each measurement rate are consistent between the two classifiers, i.e., the bold patterns in Tables III and IV are similar. When switching from AllCNN to ResNet, the vector-based framework suffers a performance drop at the highest measurement rate (), but improves at lower rates ( and ). For our framework, when switching from AllCNN to ResNet, the test accuracies stay approximately the same or improve.
Table V shows the empirical complexity of both frameworks with respect to different measurement configurations, excluding the base classifiers. Since all three datasets employed in this experiment have the same input size, and the size of the feature tensor in MCL was set equal to the original input size, the complexities of the CS and FS components are the same for all three datasets. It is clear that our proposed MCL framework has much lower memory and computational complexity compared to the vector-based counterpart. In our proposed framework, even when operating at the highest measurement rate , the CS and FS components require only parameters and FLOPs, which are approximately times fewer than those of the vector-based framework operating at the lowest measurement rate . Interestingly, the optimal configuration at each measurement rate obtained in our framework also has lower complexity than, or similar complexity to, the other configuration.
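The gap in memory complexity can be made concrete with a rough parameter count. The sketch below rests on assumptions stated in the text (one sensing matrix per color channel with weight-sharing reprojection for the vector-based framework, and a feature tensor of the same size as the input for MCL); the function names and the example sizes in the usage note are hypothetical.

```python
def vector_params(height, width, channels, m_per_channel, shared_reprojection=True):
    """Parameters of per-channel dense sensing matrices of shape (m, H * W).

    With shared weights the reprojection matrices add no extra parameters.
    """
    sensing = channels * height * width * m_per_channel
    return sensing if shared_reprojection else 2 * sensing

def mcl_params(signal_shape, measure_shape, feature_shape=None):
    """Parameters of mode-wise CS matrices (Mk x Ik) plus FS matrices (Fk x Mk)."""
    feature_shape = feature_shape or signal_shape
    cs = sum(i * m for i, m in zip(signal_shape, measure_shape))
    fs = sum(m * f for m, f in zip(measure_shape, feature_shape))
    return cs + fs
```

With illustrative sizes, `vector_params(32, 32, 3, 256)` evaluates to 786432, while `mcl_params((32, 32, 3), (16, 16, 1))` evaluates to 2054, because the multilinear count grows with the sum of the mode sizes rather than their product.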
In Figure 2, we visualize the features obtained from the reprojection step in the vector-based framework and from the FS component in the proposed framework, respectively. It is worth noting that the sensing matrices and the reprojection matrices (in the case of the vector-based framework) or (in the FS component of the MCL framework) were initialized with PCA and HOSVD. In addition, the base network classifiers were also initialized with the ones trained on the original data. Thus, it is intuitive to expect the features obtained from both frameworks to be visually interpretable to humans, even though no explicit reconstruction objective was incorporated during the training phase. Indeed, from Figure 2, we can see that with the highest number of measurements, the feature images obtained from both frameworks look very similar to the original images. In particular, the ones synthesized by the vector-based framework look visually closer to the original images than those obtained from our MCL framework. Since the sensing and reprojection steps in the vector-based framework share the same weight matrices during the optimization procedure, the whole pipeline is more constrained to reconstruct the images at the reprojection step.
When the number of measurements drops to approximately of the original signal, the reverse scenario happens: the feature images (in configuration , ) obtained from our framework retain more facial features compared to those from the vector-based framework (), especially in the configuration. This is due to the fact that most of the information in facial images in particular, and in natural images in general, lies in the spatial dimensions, i.e., height and width. Besides, when the dimension of the third mode of the measurement is set to (as in configuration , ), after the optimization procedure, our proposed framework effectively discards the color information, which is less relevant to the facial recognition task, and retains more lightness details, and thus performs better than the configurations with the mode dimension set to (in configuration , ).
From the above observations, it is clear that the structure-preserving Compressive Sensing and Feature Synthesis components in our proposed MCL framework capture the essential information inherent in the multidimensional signal for the learning tasks better than the vector-based framework.
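Table: Test accuracy on CIFAR10, CIFAR100, and CelebA32 by configuration. Rows: Oracle, followed for each of the three measurement levels by the vector-based configuration [13] and two MCL configurations (our).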
In this subsection, we provide an empirical analysis of the effect of different components in the MCL framework. These factors include the effect of the popular nonlinear thresholding step discussed in Section III.B; the choice of having shared or separate weights in the CS and FS components; the initialization step discussed in Section III.C; and the scalability of the proposed framework when the original dimensionalities of the signal increase. Since the total number of experiment settings when combining all of the aforementioned factors is huge, and results involving multiple factors are difficult to interpret, we analyze these factors in a progressive manner.
Firstly, the choice of linearity or nonlinearity and the choice of shared or separate weights in the CS and FS components are analyzed together, since the two factors are closely related. In this setting, the CS and FS components are initialized by HOSVD decomposition as described in Section III.C. The neural network classifier has the AllCNN architecture with the weights initialized from the corresponding pretrained network on the original data. Table VI shows the test accuracies on CIFAR10, CIFAR100 and CelebA32 at different numbers of measurements. It is clear that most of the highest test accuracies are obtained without the thresholding step and with separate weights in the CS and FS components, i.e., most boldface numbers appear in the lower quarter on the left side of Table VI. Comparing the linearity and nonlinearity options, it is clear that the nonlinearity effect of adversely affects the performance, especially when the number of measurements decreases. The reason might be that applying to the compressed measurements restricts the information to be represented in the positive subspace only, thus further reducing the representation power of the compressed measurements when only a limited number of measurements is allowed.
In the linearity setting, while the performance differences between shared and separate weights in some configurations are small, we should note that allowing non-shared weights can be beneficial in cases where we know that certain features should be synthesized in the FS component in order to make inferences.
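The design choices compared above can be illustrated with a small sketch of multilinear sensing and feature synthesis via mode products. It is only a sketch under assumptions: the thresholding function is taken to be a ReLU (the text only states that it restricts measurements to the positive subspace), and the names `mode_product`, `compress`, and `synthesize` are hypothetical.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along axis `mode` by `matrix` of shape (out_dim, in_dim)."""
    moved = np.moveaxis(tensor, mode, -1)      # bring the mode to the last axis
    out = moved @ matrix.T                     # (..., in_dim) -> (..., out_dim)
    return np.moveaxis(out, -1, mode)

def compress(signal, cs_mats, threshold=False):
    """CS step: apply one sensing matrix per mode; optionally threshold (assumed ReLU)."""
    y = signal
    for k, w in enumerate(cs_mats):
        y = mode_product(y, w, k)
    return np.maximum(y, 0.0) if threshold else y

def synthesize(measurement, fs_mats):
    """FS step: map the measurement tensor to a feature tensor, mode by mode."""
    f = measurement
    for k, w in enumerate(fs_mats):
        f = mode_product(f, w, k)
    return f
```

In this sketch, shared weights correspond to `fs_mats = [w.T for w in cs_mats]`, whereas separate weights simply use an independent list of matrices with shapes `(Fk, Mk)`.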
Building on the above analysis of the effect of linearity and separate weights, we investigated the effect of the initialization step discussed in Section III.C. All setups were trained with a multilinear FS component having weights separate from the CS component. From Table VII, we can observe that initializing the CS and FS components with HOSVD increases the performance of the learning systems significantly. When the CS and FS components are initialized with HOSVD, utilizing a pretrained network further improves the inference performance of the systems, especially in the low measurement rate regime. Thus, the initialization strategy proposed in Section III.C is beneficial in a general setting for the learning tasks.
Table VI: Test accuracy on CIFAR10, CIFAR100, and CelebA32 per configuration under LINEARITY and NONLINEARITY, with rows grouped by SHARED and SEPARATE weights.
Table VII: Test accuracy on CIFAR10, CIFAR100, and CelebA32 per configuration under PRECOMPUTE CS & FS and RANDOM CS & FS.
Finally, the scalability of the proposed framework is validated at different resolutions of the CelebA dataset. All of the previous experiments were conducted with the CelebA32 dataset, for which we assume that there are only elements in the original signal. To investigate the scalability, we pose the following question: if the original dimensions of the signal are higher than , with the same numbers of measurements presented in Table II, can we still learn to recognize facial images at feasible cost? To answer this question, we trained our framework on CelebA32, CelebA48, CelebA64 and CelebA80 and recorded the test accuracies, the number of parameters and the number of FLOPs at different numbers of measurements, which are shown in Table VIII. It is clear that at each measurement configuration, when the original signal resolution increases, the measurement rate drops at a similar rate, without any adverse effect on the inference performance. In particular, if we look at the last column of Table VIII, with a sampling rate of only , the proposed framework achieves accuracy, which is only lower compared to that of the base classifier trained on the original data. Here we should note that most of the images in the CelebA dataset have a higher resolution than pixels; therefore, the 4 different versions of CelebA (, , , , ) in our experiments indeed contain increasing levels of data fidelity. From the performance statistics, we can observe that the performance of our framework is characterized by the number of measurements, rather than the measurement rates or compression rates.
Due to the memory limitation when training the vector-based framework at higher resolutions, we could not perform the same set of experiments for the vector-based framework. However, to compare the scalability in terms of computation and memory between the two frameworks, we measured the number of FLOPs and parameters in the vector-based framework, excluding the base classifier, and visualize the results in Figure 3. It is worth noting that the y-axis is on a log scale: as the dimensions of the original signal increase, the complexity of the vector-based framework increases by an order of magnitude, while our proposed MCL framework scales favorably in both memory and computation.
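A rough operation count shows how the two sensing schemes behave as the resolution grows. The sketch below counts multiply-accumulate operations (MACs) rather than the FLOPs reported in the paper, and it assumes that mode products are applied in the order of the modes and that the vector-based framework applies one dense matrix per color channel for sensing and one for reprojection. All names and the measurement sizes in the usage note are hypothetical.

```python
from math import prod

def mode_product_macs(in_shape, out_shape):
    """MACs of successive mode products mapping in_shape to out_shape."""
    shape, total = list(in_shape), 0
    for k, out_dim in enumerate(out_shape):
        others = prod(shape) // shape[k]       # elements in all other modes
        total += out_dim * shape[k] * others   # one (out_dim x Ik) product per fiber
        shape[k] = out_dim                     # mode k now has the output size
    return total

def mcl_macs(signal_shape, measure_shape):
    """CS followed by FS, with the feature tensor as large as the input."""
    return (mode_product_macs(signal_shape, measure_shape)
            + mode_product_macs(measure_shape, signal_shape))

def vector_macs(resolution, channels, m_per_channel):
    """Per-channel dense sensing plus reprojection on the vectorized image."""
    return 2 * channels * m_per_channel * resolution * resolution

def scaling_table(resolutions, measure_shape, m_per_channel, channels=3):
    """List of (resolution, MCL MACs, vector-based MACs) tuples."""
    return [(r, mcl_macs((r, r, channels), measure_shape),
             vector_macs(r, channels, m_per_channel)) for r in resolutions]
```

Calling `scaling_table((32, 48, 64, 80), (16, 16, 1), 256)` with these illustrative sizes lists the counts side by side; in this sketch the vector-based count is larger at every resolution.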
Table VIII: Per configuration, ACCURACY and MEASUREMENT RATE (with an Oracle row), and #FLOP and #PARAMETER, each reported for CelebA32, CelebA48, CelebA64, and CelebA80.
In this paper, we proposed Multilinear Compressive Learning, an efficient framework to tackle the Compressive Learning task on multidimensional signals. The proposed framework takes into account the tensorial nature of multidimensional signals and performs the compressive sensing as well as the feature extraction step along different modes of the original data, and is thus able to retain and synthesize essential information in a multilinear subspace for the learning task. We showed theoretically and empirically that the proposed framework outperforms its vector-based counterpart in both inference performance and computational efficiency. An extensive ablation study was conducted to investigate the effect of different components in the proposed framework, giving insights into the importance of different design choices.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–24, 2015.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of International Conference on Computer Vision (ICCV), 2015.
K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.