I. Introduction
Compressive Sensing (CS) [11] is an efficient signal acquisition paradigm that measures the signal at sub-Nyquist rates by sensing and linearly combining samples at the sensor level. Because the compression step is performed before signal registration during the sampling phase, CS significantly lowers the temporary storage and bandwidth requirements of sensing devices. This paradigm therefore plays an important role in many applications involving high-dimensional signals, where the signal collection process can be very computationally demanding, time-consuming, or even prohibitive when using the traditional point-based sensing-then-compressing approach
[8]. For example, Hyperspectral Compressive Imaging (HCI) [3] and Synthetic Aperture Radar (SAR) imaging [34] often perform remote sensing of large volumes of data under constraints on electrical power, storage capacity, and transmission bandwidth, while in 3D Magnetic Resonance Imaging (MRI) [23], long scanning times can make patients uncomfortable, prompting body movements and thus degrading image quality.

Although the classical sampling theorem requires the signal to be sampled at or above the Nyquist rate to ensure perfect reconstruction, in CS the undersampled signal can still be reconstructed almost perfectly if the sparsity assumption holds and the sensing operators possess certain properties [10, 17]. The sparsity prior assumes the existence of a basis or a dictionary in which the signal of interest has a sparse representation, i.e., few nonzero coefficients. Sparsity manifests in many classes of signals that we operate on since they are often smooth or piecewise smooth, thus having sparse representations in the Fourier or wavelet domain. This form of prior knowledge incorporated into the model allows us to acquire signals at a much lower cost.
In fact, over the past decade, a large body of work in CS has been dedicated to finding relevant priors for the class of signals of interest and incorporating them into the model to achieve better compression and reconstruction [33]. A typical form of prior that has been widely used in CS is the existence of a signal similar to the original signal of interest. Such information is available in many scenarios, for example in video acquisition [39, 24] and estimation problems [13], where signals acquired in succession are very similar. Similarity also exists between signals captured by nearby sensors in sensor networks [6] or between images from spatially close cameras in multiview camera systems [12, 43]. Other examples of prior information include an available estimate of the support of the signal [44, 37], weighted sparsity [25], or block sparsity [40, 18]. Finding prior information often relies on available knowledge about the signal, e.g., the spatial arrangement of a sensor network, while the incorporation of such information is often expressed as another term to be minimized, such as the difference between the current frame and the last frame when reconstructing MRI images. For a detailed discussion of prior information in CS and its theoretical and empirical benefits, we refer the reader to [33].

While signal reconstruction is a necessary processing step in some applications, there exist many scenarios in which the primary purpose is to detect certain patterns or to infer some properties from the acquired signal, rather than to recover the entire signal. Learning from compressed measurements, also known as Compressive Learning (CL) [9, 15, 14, 35, 29, 1, 41], focuses on the task of inference directly in the compressive domain without explicit signal reconstruction. That is, CL models are often formed under the assumption that only certain aspects or representative features of the signal need to be captured to make inferences, even when perfect signal recovery might be impossible from the retained information. It is worth noting that in many scenarios, such as security and surveillance applications [45, 32], signal recovery might disclose private information and is thus undesirable.
Since the main objective of CL is to extract information relevant to the learning problem, the CL literature mainly concerns the selection of the sensing operators [36, 4] or the optimization of the learning model [15] that operates in the compressive domain. In early works, the selection or design of sensing matrices was decoupled from the optimization of the inference model. Nowadays, with developments at both the hardware and algorithmic levels, stochastic optimization has become increasingly efficient, making end-to-end learning systems applicable and appealing to a wide range of applications. Initiated by the work of [30, 1], recent CL systems [22, 16, 49, 41] have adopted an end-to-end approach which jointly optimizes the sensing matrices and the inference operators, leaving the task of discovering relevant features to stochastic gradient descent.
The majority of CL systems employ linear sensing operators which operate on vectorized input, regardless of the natural representation of the signal. Recently, the authors in [41] proposed the Multilinear Compressive Learning (MCL) framework, which takes into consideration the natural structure of multidimensional signals by adopting multilinear sensing and feature extraction components. MCL has been shown both theoretically and empirically to be more efficient than its vector-based counterpart. The assumption made in MCL is that the original signals can be linearly projected along each tensor mode to a tensor subspace which retains information relevant to the learning problem, and from which discriminative features can be synthesized. In a general setting, the authors in [41] propose to initialize the sensing matrices with the left singular vectors obtained from Higher Order Singular Value Decomposition (HOSVD), a kind of weak prior that initially guides the model towards an energy-preserving tensor subspace during stochastic optimization.

As we have seen from the CS literature, finding good priors and successfully incorporating them into the signal model leads to better solutions. While prior information exists in many forms for the reconstruction task, it is not straightforward to define what information is relevant for a learning problem. Indeed, representation learning [7] has been an active research area that aims to extract meaningful representations from raw data, ideally without any human-given labels such as object categories. From the perspective of human perception, in certain cases we do have an idea of what kind of information is relevant or irrelevant to the learning task, e.g., color cues might be irrelevant for face recognition but relevant for traffic light detection. From the optimization point of view, however, it is hard to make such a claim before any trial. Therefore, we often want to strike a balance between hand-crafted features and end-to-end solutions.
In this paper, we aim to tackle the following problems: how to find good priors in Compressive Learning, and how to incorporate such priors into an end-to-end Compressive Learning system without hard-coding them. We rely on two observations:

- Although the sensing operators in CL are limited to a linear/multilinear form, the feature synthesis component that operates on the compressed measurements can adopt nonlinear transformations.

- Nonlinear sensing operators have better capacity to discover representative tensor subspaces.

Based on these observations, we propose a novel approach that utilizes nonlinear compressive learning models to discover the structures of the compressive domain and the informative features that should be synthesized from this domain for the learning task. This knowledge is subsequently transferred to the compressive learning models via progressive Knowledge Distillation [21]. Our contributions can be summarized as follows:

- We propose a novel methodology to discover and incorporate prior knowledge into existing end-to-end Compressive Learning systems. Although we limit our investigation of the proposed method to the Multilinear Compressive Learning framework, the approach described in this paper can be applied to any end-to-end Compressive Learning system to improve its performance.

- The proposed approach naturally leads to a semi-supervised extension in which the availability of unlabeled signals coming from similar distributions can benefit Compressive Learning systems via prior knowledge incorporation.

- We carefully designed and conducted extensive experiments on object classification and face recognition tasks to investigate the effectiveness of the proposed approach in learning and incorporating prior information. In addition, to facilitate future research and reproducibility, we publicly provide our implementation of all experiments reported in this paper at https://github.com/viebboy/MultilinearCompressiveLearningWithPrior.
The remainder of our paper is organized as follows: in Section II, we review the related works in Compressive Learning and Knowledge Distillation, and give a brief description of the Multilinear Compressive Learning framework. Section III provides a detailed description of our approach to learning and incorporating prior knowledge into MCL models in both supervised and semi-supervised settings. Section IV details our empirical analysis, including experiment protocols, experiment designs, and quantitative and qualitative evaluation. Section V concludes our work.
Abbreviations and Nomenclature

CS: Compressive Sensing
CL: Compressive Learning
MCL: Multilinear Compressive Learning
MCLwP: Multilinear Compressive Learning with Priors
HOSVD: Higher Order Singular Value Decomposition
KD: Knowledge Distillation
FS: Feature Synthesis
$\mathcal{X}$: Multidimensional analog signal
$\mathbf{y}$: Compressed measurements (vector) in CL
$\mathcal{Y}$: Compressed measurements (tensor) in MCL
$\mathcal{T}$: Tensor features, output of the FS component in MCL
$F_s$: CS component in MCL
$\mathbf{W}_s$: Parameters of $F_s$
$F_f$: FS component in MCL
$\mathbf{W}_f$: Parameters of $F_f$
$F_c$: Task-specific neural network in MCL
$\mathbf{W}_c$: Parameters of $F_c$
$\bar{F}$: Prior-generating model
$\bar{\mathcal{Y}}$: Compressed measurements in $\bar{F}$
$\bar{\mathcal{T}}$: Tensor features, output of the FS component in $\bar{F}$
$\bar{F}_s$: Nonlinear sensing component in $\bar{F}$
$\bar{\mathbf{W}}_s$: Parameters of $\bar{F}_s$
$\bar{F}_f$: FS component in $\bar{F}$
$\bar{\mathbf{W}}_f$: Parameters of $\bar{F}_f$
$\bar{F}_c$: Task-specific neural network in $\bar{F}$
$\bar{\mathbf{W}}_c$: Parameters of $\bar{F}_c$
$\mathcal{X}_i$: $i$-th data sample
$c_i$: Label of the $i$-th sample
$L$: Number of labeled samples
$M$: Total number of samples, including labeled and unlabeled samples
$\mathbb{D}_L$: The set of labeled data
$\mathbb{D}_U$: The set of unlabeled data
$\mathbb{D}_E$: Enlarged labeled set
$L_I$: Inference loss function
$L_D$: Distillation loss function
II. Related Work
II-A. Background
Throughout the paper, we denote scalar values by either lowercase or uppercase characters ($x$, $X$), vectors by lowercase boldface characters ($\mathbf{x}$), matrices by uppercase or Greek boldface characters ($\mathbf{X}$, $\mathbf{\Phi}$), and tensors by calligraphic capitals ($\mathcal{X}$). A tensor with $N$ modes and dimension $I_n$ in the $n$-th mode is represented as $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. The entry at index $(i_1, \dots, i_N)$ is denoted as $\mathcal{X}_{i_1, \dots, i_N}$. In addition, $\mathrm{vec}(\mathcal{X})$ denotes the vectorization operation that rearranges the elements of $\mathcal{X}$ into a vector.

The mode-$n$ product between a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a matrix $\mathbf{\Phi} \in \mathbb{R}^{J_n \times I_n}$ is another tensor of size $I_1 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N$, denoted by $\mathcal{X} \times_n \mathbf{\Phi}$. Its elements are defined as $(\mathcal{X} \times_n \mathbf{\Phi})_{i_1, \dots, i_{n-1}, j_n, i_{n+1}, \dots, i_N} = \sum_{i_n=1}^{I_n} \mathcal{X}_{i_1, \dots, i_n, \dots, i_N} \mathbf{\Phi}_{j_n, i_n}$.
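To make the mode-$n$ product concrete, the following minimal sketch (our own numpy illustration; the function name and the array shapes are assumptions and not part of the paper's implementation) computes $\mathcal{X} \times_n \mathbf{\Phi}$ by unfolding the tensor along mode $n$, applying the matrix, and folding the result back.

```python
import numpy as np

def mode_n_product(X, Phi, n):
    """Compute the mode-n product X x_n Phi for a matrix Phi of shape (J_n, I_n)."""
    # Unfold: bring mode n to the front and flatten the remaining modes into columns.
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    # Apply the linear map along mode n, then fold back to tensor form.
    Yn = Phi @ Xn
    new_shape = (Phi.shape[0],) + tuple(np.delete(X.shape, n))
    return np.moveaxis(Yn.reshape(new_shape), 0, n)

# Example: compress the first mode of a 32 x 32 x 3 tensor down to 9.
X = np.random.randn(32, 32, 3)
Phi1 = np.random.randn(9, 32)
print(mode_n_product(X, Phi1, 0).shape)  # (9, 32, 3)
```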
In CS, given the original analog signal $\mathbf{x} \in \mathbb{R}^{I}$, the signal acquisition model is described via the following equation:

$\mathbf{y} = \mathbf{\Phi} \mathbf{x}$ (1)

where $\mathbf{y} \in \mathbb{R}^{M}$ ($M \ll I$) denotes the measurements obtained from the CS device and $\mathbf{\Phi} \in \mathbb{R}^{M \times I}$ denotes the sensing operator or sensing matrix that linearly combines the samples of $\mathbf{x}$.
For a multidimensional signal $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, we have the Multilinear Compressive Sensing (MCS) model:

$\mathcal{Y} = \mathcal{X} \times_1 \mathbf{\Phi}_1 \times_2 \mathbf{\Phi}_2 \cdots \times_N \mathbf{\Phi}_N$ (2)

where $\mathcal{Y} \in \mathbb{R}^{M_1 \times \cdots \times M_N}$ is the compressed measurement having a tensor form, and $\mathbf{\Phi}_n \in \mathbb{R}^{M_n \times I_n}$ ($n = 1, \dots, N$) denote separable sensing operators that perform a linear transformation along each mode of the original signal $\mathcal{X}$.

Since Eq. (2) can also be written as

$\mathrm{vec}(\mathcal{Y}) = (\mathbf{\Phi}_N \otimes \cdots \otimes \mathbf{\Phi}_1)\, \mathrm{vec}(\mathcal{X})$ (3)

where $\mathrm{vec}(\cdot)$ denotes vectorization and $\otimes$ denotes the Kronecker product, we can always express MCS in the vector-based fashion of Eq. (1), but not vice versa.
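The sketch below (again our own illustration with assumed shapes, not the released code) numerically checks the equivalence between the multilinear model of Eq. (2) and its vectorized Kronecker form of Eq. (3), using a column-major (mode-1-fastest) vectorization.

```python
import numpy as np

def mode_n_product(X, Phi, n):
    # Contract the second index of Phi with mode n of X, then move the new axis back to position n.
    return np.moveaxis(np.tensordot(Phi, X, axes=(1, n)), 0, n)

X = np.random.randn(28, 28, 3)
Phis = [np.random.randn(7, 28), np.random.randn(7, 28), np.random.randn(2, 3)]

# Eq. (2): Y = X x_1 Phi_1 x_2 Phi_2 x_3 Phi_3
Y = X
for n, Phi in enumerate(Phis):
    Y = mode_n_product(Y, Phi, n)

# Eq. (3): vec(Y) = (Phi_3 kron Phi_2 kron Phi_1) vec(X)
Phi_kron = np.kron(np.kron(Phis[2], Phis[1]), Phis[0])
y_vec = Phi_kron @ X.flatten(order="F")

print(np.allclose(y_vec, Y.flatten(order="F")))  # True
```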
II-B. Compressive Learning
The objective of Compressive Learning is to create learning models that generate predictions directly from the compressed measurements $\mathbf{y}$ or $\mathcal{Y}$. This idea was first proposed in the early work of [15], where the authors train a classifier directly on the compressed measurements without a signal reconstruction step. It was shown in this work that when the number of measurements approximates the dimensionality of the data manifold, accurate classifiers can be trained directly in the compressive domain. An extension of [15] later showed that, to achieve good classification performance, the number of measurements only depends on the intrinsic dimensionality of the data manifold [5].

A similar theoretical result was derived later in [9], which proves that the performance of a linear Support Vector Machine classifier trained on compressed measurements is equivalent to that of the best linear threshold classifier operating in the original signal domain, given that the Distance-Preserving Property holds for the sensing matrices. Asymptotic behavior and sensing operator optimization in CL models of signals described by a Gaussian Mixture Model have also been studied in [35, 36].

The idea of using past measurements as a prior to jointly learn better sensing matrices and the classifier was proposed in [4]. With the advancement of stochastic optimization, linear classifiers and shallow learning models have been replaced with deep neural networks [29, 30], and the separate optimization of the sensing operators and the classifiers has been replaced by the end-to-end training paradigm, which has been shown to significantly outperform predefined sensing operators when the compression rates are high [1, 22, 16, 49, 41].
II-C. Multilinear Compressive Learning
While previous works on learning from compressed measurements adopt the signal acquisition model in Eq. (1), the authors in [41] recently proposed Multilinear Compressive Learning (MCL), a framework that utilizes Eq. (2) as the signal acquisition model in order to retain the multidimensional structure of the original signal.
MCL consists of three components: the CS component described by Eq. (2); the Feature Synthesis (FS) component that transforms the measurements $\mathcal{Y}$ to the tensor feature $\mathcal{T}$ via

$\mathcal{T} = \mathcal{Y} \times_1 \mathbf{\Theta}_1 \times_2 \mathbf{\Theta}_2 \cdots \times_N \mathbf{\Theta}_N$ (4)

and a task-specific neural network $F_c$ that operates on $\mathcal{T}$ to generate the prediction, i.e., the prediction is $F_c(\mathcal{T})$.
Since the transformation in Eq. (2) preserves the tensor structure of $\mathcal{X}$, MCL assumes the existence of a tensor subspace which captures the essential information in $\mathcal{X}$ through $\mathcal{Y}$, and from which discriminative features $\mathcal{T}$ can be synthesized.

Similar to the end-to-end vector-based frameworks [1, 49], MCL incorporates a kind of weak prior by initializing $\mathbf{\Phi}_n$ and $\mathbf{\Theta}_n$ with energy-preserving values, i.e., the singular vectors corresponding to the largest singular values in mode $n$ obtained from HOSVD of the training data. The task-specific network is initialized with weights trained on uncompressed signals. The whole system is then optimized via stochastic gradient descent. As pointed out in the empirical analysis in [41], although simple, this initialization scheme contributes significantly to the performance of the system compared to the de facto random initialization schemes often utilized in deep neural networks [20].
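As an illustration of this HOSVD-based initialization, the sketch below (our own, assuming an array layout where the first axis indexes training samples; it is not the authors' released code) takes the leading left singular vectors of each mode-$n$ unfolding of the training data as the initial sensing matrices $\mathbf{\Phi}_n$.

```python
import numpy as np

def hosvd_init(train_data, measurement_shape):
    """train_data: array of shape (num_samples, I_1, ..., I_N);
    measurement_shape: target tensor dimensions (M_1, ..., M_N).
    Returns a list of sensing matrices Phi_n of shape (M_n, I_n)."""
    Phis = []
    for n, M_n in enumerate(measurement_shape, start=1):
        # Mode-n unfolding of the whole training set (axis 0 is the sample axis).
        unfolded = np.moveaxis(train_data, n, 0).reshape(train_data.shape[n], -1)
        # Leading left singular vectors = energy-preserving directions of mode n.
        U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        Phis.append(U[:, :M_n].T)
    return Phis

train_data = np.random.randn(1000, 32, 32, 3)   # stand-in for a set of training images
Phis = hosvd_init(train_data, (9, 9, 2))
print([P.shape for P in Phis])                  # [(9, 32), (9, 32), (2, 3)]
```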
II-D. Knowledge Distillation
The term Knowledge Distillation (KD) was coined in [21], in which the authors proposed a neural network training technique that utilizes the predictions of a pre-trained high-capacity network (the teacher network) to provide supervisory signals to a smaller network (the student network), along with the labeled data. The intuition behind this training technique is that, with higher capacity, it is easier for the teacher network to discover better data representations, and the form of knowledge provided via the teacher network's predictions helps guide the student network to better solutions.
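For reference, a minimal sketch of the original KD objective of [21] is given below (our own PyTorch illustration with assumed tensor shapes, not an excerpt from [21] or from our released implementation): the student is trained on a weighted sum of the hard-label loss and a soft-label term computed from temperature-softened teacher predictions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard KD objective: hard-label cross-entropy plus a softened KL term."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep a comparable magnitude across temperatures
    return alpha * hard + (1.0 - alpha) * soft
```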
There have been several works investigating variants of this technique. For example, KD that involves multiple student and teacher networks [48] has been shown to be more robust than a single teacher-student pair. In other works [38, 46, 47], in addition to the predicted probabilities, knowledge coming from intermediate layers of the teacher network has also been proven useful for the student. While KD has often been considered in the context of model compression, i.e., to train low-capacity models with better performance, this paradigm has also been successfully applied to distributed training with student and teacher networks having the same architecture [2].

In our work, we propose to use KD as a method to progressively incorporate prior information into CL models. While in a general setting it is unclear how to select the intermediate source and target layers for transferring knowledge from the teacher to the student, we will show next that in the context of CL, and especially in the MCL framework, pairing the intermediate teachers and students is easy since the learning system is modularized into different components with different functionalities.
III. Proposed Methods
III-A. Finding Prior Knowledge
A common constraint which exists in all CL models is that the sensing operator is linear or multilinear. This limits the capacity of CL models and the amount of information that can be captured from the original signal. Thus, in CL in general and MCL in particular, given a fixed structure of the Feature Synthesis (FS) component $F_f$ and the task-specific neural network $F_c$, there is always an upper bound on the capacity of the whole learning system, which makes the problem of learning from compressed measurements generally harder than other, unconstrained learning tasks.
As mentioned in the Introduction, finding and incorporating prior knowledge is a recurring theme in the CS community since good priors have been shown both theoretically and empirically to improve signal recovery and compression efficiency [33]. In the case of CL, prior knowledge is less obvious. However, we may still formulate the following question: "What kind of prior knowledge do we know about the measurements and the types of features that are representative for the learning task?"
Although we might not have direct knowledge of the optimal measurements or features, we know as prior knowledge that, given sufficient training data, a higher-capacity neural network is expected to perform better than a less complex one when both networks are structurally similar. By structure here we mean that both networks have similar connectivity patterns, and by higher capacity we mean a higher number of parameters and/or a higher number of layers. The structural similarity ensures a feasible capacity comparison between networks, since one cannot make such a claim about the learning ability, e.g., between a feedforward architecture having a higher parameter count and a residual architecture having a lower parameter count. Therefore, we know almost for certain that, given an MCL model with upper-bounded learning capacity, one can construct and train other learning systems with a similar FS component and task-specific network, but with nonlinear sensing operators, to achieve better learning performance, and thus better compressed representations and synthesized features. The representations produced by such higher-capacity, nonlinear compressive learning models become our direct prior knowledge of the compressive domain and the feature space.
The advantage of the above approach to defining prior knowledge in MCL is twofold:

- Since the priors are generated from another learning system trained on the same learning task, we incorporate completely data-dependent, task-specific priors into our model instead of hand-crafted priors.

- By sharing the same structure of the FS component and task-specific neural network between the MCL model and the prior-generating model, the weights obtained from the latter provide a good initialization when optimizing the former.
The proposed approach reduces the problem of finding priors to the task of designing and training a nonlinear compressive learning model; hereupon we refer to it as the prior-generating model $\bar{F}$, which has no structural constraint. To construct and optimize $\bar{F}$, we propose to mirror the strategy in MCL [41] (a training sketch is given after this list):

- The nonlinear sensing component of $\bar{F}$, denoted as $\bar{F}_s$ with parameters $\bar{\mathbf{W}}_s$, has several layers that gradually reduce the dimensionality of the original signal. $\bar{F}_s$ transforms the input signal $\mathcal{X}$ to a compressed representation $\bar{\mathcal{Y}} = \bar{F}_s(\mathcal{X})$, which has the same size as the compressed measurements $\mathcal{Y}$ in MCL.

- Similarly, the nonlinear FS component of $\bar{F}$, denoted as $\bar{F}_f$ with parameters $\bar{\mathbf{W}}_f$, has several layers that gradually increase the dimensionality of $\bar{\mathcal{Y}}$, producing features $\bar{\mathcal{T}} = \bar{F}_f(\bar{\mathcal{Y}})$, which have the same dimensionality as the original signal. $\bar{F}_s$ and $\bar{F}_f$ together resemble the encoder and decoder of a nonlinear autoencoder.

- Denote by $\mathcal{X}_i$ and $c_i$ ($i = 1, \dots, L$) the $i$-th training sample and its label. In order to initialize the weights $\bar{\mathbf{W}}_s$ and $\bar{\mathbf{W}}_f$, we update $\bar{F}_s$ and $\bar{F}_f$ to minimize the reconstruction error $\sum_i \| \bar{F}_f(\bar{F}_s(\mathcal{X}_i)) - \mathcal{X}_i \|_F^2$ via stochastic gradient descent. The weights $\bar{\mathbf{W}}_c$ of the task-specific network $\bar{F}_c$ are initialized with the weights minimizing $\sum_i L_I(\bar{F}_c(\mathcal{X}_i), c_i)$, where $L_I$ is the inference loss.

- After the initialization step, $\bar{\mathbf{W}}_s$, $\bar{\mathbf{W}}_f$, and $\bar{\mathbf{W}}_c$ are updated altogether via stochastic gradient descent to minimize the inference loss $\sum_i L_I(\bar{F}_c(\bar{F}_f(\bar{F}_s(\mathcal{X}_i))), c_i)$.
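The following sketch summarizes the construction above in PyTorch (the module names Fs_bar, Ff_bar, Fc_bar, the data loader interface, and all hyperparameter values are our own assumptions for illustration, not the released implementation): autoencoder-style initialization of the sensing and FS components, initialization of the task network on uncompressed signals, and joint end-to-end training with respect to the inference loss.

```python
import torch
import torch.nn.functional as F

def train_prior_model(Fs_bar, Ff_bar, Fc_bar, loader, epochs=10, lr=1e-3):
    # Step 1: initialize the nonlinear sensing/FS pair as an autoencoder (reconstruction error).
    opt = torch.optim.Adam(list(Fs_bar.parameters()) + list(Ff_bar.parameters()), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            loss = torch.mean((Ff_bar(Fs_bar(x)) - x) ** 2)
            opt.zero_grad(); loss.backward(); opt.step()

    # Step 2: initialize the task-specific network on uncompressed signals (inference loss L_I).
    opt = torch.optim.Adam(Fc_bar.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(Fc_bar(x), y)
            opt.zero_grad(); loss.backward(); opt.step()

    # Step 3: joint end-to-end optimization of all components w.r.t. the inference loss.
    params = list(Fs_bar.parameters()) + list(Ff_bar.parameters()) + list(Fc_bar.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(Fc_bar(Ff_bar(Fs_bar(x))), y)
            opt.zero_grad(); loss.backward(); opt.step()
```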
III-B. Incorporating Prior Knowledge
Let us denote the CS, FS, and task-specific neural network components of the MCL model as $F_s$, $F_f$, and $F_c$, having learnable parameters $\mathbf{W}_s$, $\mathbf{W}_f$, and $\mathbf{W}_c$, respectively. It should be noted that $F_f$ and $F_c$ have the same functional form as $\bar{F}_f$ and $\bar{F}_c$, while $F_s$ possesses a multilinear form as in Eq. (2).

After training the prior-generating model $\bar{F}$, we perform progressive KD to transfer the knowledge of $\bar{F}$ to the MCL model. During this phase, the parameters of $\bar{F}$, i.e., $\bar{\mathbf{W}}_s$, $\bar{\mathbf{W}}_f$, and $\bar{\mathbf{W}}_c$, are fixed. Since the teacher ($\bar{F}$) and the student (MCL) model share the same modular structure with three distinct components, training MCL with the prior knowledge given by $\bar{F}$ consists of three stages:

- Stage 1 (transferring the sensing knowledge from $\bar{F}_s$ to $F_s$): during this stage, the weights of the sensing component in MCL are obtained by optimizing the following criterion:

$\min_{\mathbf{W}_s} \sum_{i} \big\| F_s(\mathcal{X}_i; \mathbf{W}_s) - \bar{F}_s(\mathcal{X}_i; \bar{\mathbf{W}}_s) \big\|_F^2$ (5)

The purpose of this stage is to enforce the sensing component in MCL to mimic that of the prior-generating model $\bar{F}$.

- Stage 2 (transferring the feature synthesis knowledge from $\bar{F}_f$ to $F_f$): during this stage, we aim to optimize $F_s$ and $F_f$ in MCL to synthesize features similar to those produced by the prior-generating model. Before the optimization process, the weights of the sensing component ($\mathbf{W}_s$) are initialized with the values obtained from the first stage, and the weights of the feature synthesis component ($\mathbf{W}_f$) are initialized with the values of $\bar{\mathbf{W}}_f$, since $F_f$ in MCL and $\bar{F}_f$ in $\bar{F}$ have the same functional form. After the initialization step, both the sensing ($F_s$) and feature synthesis ($F_f$) components in MCL are updated together to minimize the following criterion:

$\min_{\mathbf{W}_s, \mathbf{W}_f} \sum_{i} \big\| F_f(F_s(\mathcal{X}_i; \mathbf{W}_s); \mathbf{W}_f) - \bar{F}_f(\bar{F}_s(\mathcal{X}_i; \bar{\mathbf{W}}_s); \bar{\mathbf{W}}_f) \big\|_F^2$ (6)
- Stage 3 (transferring the inference knowledge from $\bar{F}_c$ to $F_c$ and discriminative training): in this final stage, the MCL model is trained to minimize the inference loss as well as the difference between its prediction and the prediction produced by the prior-generating model $\bar{F}$. That is, the minimization objective in this stage is:

$\min \sum_{i} \Big[ L_I\big(F_c(F_f(F_s(\mathcal{X}_i))), c_i\big) + \lambda\, L_D\big(F_c(F_f(F_s(\mathcal{X}_i))), \bar{F}_c(\bar{F}_f(\bar{F}_s(\mathcal{X}_i)))\big) \Big]$ (7)

where $L_I$ and $L_D$ denote the inference loss and distillation loss, respectively, and $\lambda$ is a hyperparameter that adjusts the contribution of the distillation loss. The specific forms of $L_I$ and $L_D$ are chosen depending on the inference problem. For example, in classification problems, we can select the cross-entropy function for $L_I$ and the Kullback-Leibler divergence for $L_D$, while in regression problems, $L_I$ and $L_D$ can be the mean-squared-error or mean-absolute-error function. To avoid lengthy notation, Eq. (7) omits the parameters of each component. Here we should note that before optimizing Eq. (7), the parameters $\mathbf{W}_s$ and $\mathbf{W}_f$ of the sensing and synthesis components in MCL are initialized with the values obtained from the previous stage, while the parameters $\mathbf{W}_c$ of the task-specific neural network in MCL are initialized with the values of $\bar{\mathbf{W}}_c$ from the prior-generating model $\bar{F}$.
The pseudocode and an illustration of our proposed algorithm to train the MCL model via progressive knowledge transfer are presented in Algorithm 1 and Figure 1, respectively.
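As a complement to Algorithm 1, the sketch below condenses the three stages into PyTorch-style code (module and loader names, learning rates, and epoch counts are assumptions for illustration; the distillation term here uses a plain KL divergence, whereas our experiments use the symmetric KL described in Section IV-B):

```python
import torch
import torch.nn.functional as F

def progressive_transfer(Fs, Ff, Fc, Fs_bar, Ff_bar, Fc_bar, loader, lam=1.0, epochs=10, lr=1e-3):
    mse = torch.nn.MSELoss()

    # Stage 1: match the MCL sensing component to the teacher's sensing component, Eq. (5).
    opt = torch.optim.Adam(Fs.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            loss = mse(Fs(x), Fs_bar(x).detach())
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: match the synthesized features to the teacher's features, Eq. (6).
    # (Ff is assumed to be initialized with the teacher's FS weights before this stage.)
    opt = torch.optim.Adam(list(Fs.parameters()) + list(Ff.parameters()), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            loss = mse(Ff(Fs(x)), Ff_bar(Fs_bar(x)).detach())
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 3: inference loss plus distillation of the teacher's predictions, Eq. (7).
    # (Fc is assumed to be initialized with the teacher's task-network weights.)
    params = list(Fs.parameters()) + list(Ff.parameters()) + list(Fc.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            s_logits = Fc(Ff(Fs(x)))
            t_logits = Fc_bar(Ff_bar(Fs_bar(x))).detach()
            distill = F.kl_div(F.log_softmax(s_logits, dim=1),
                               F.softmax(t_logits, dim=1), reduction="batchmean")
            loss = F.cross_entropy(s_logits, y) + lam * distill
            opt.zero_grad(); loss.backward(); opt.step()
```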
A simpler and more straightforward approach to using KD and the prior-generating model $\bar{F}$ is to directly distill the predictions of $\bar{F}$ to the MCL model as in the third stage described above, skipping the first and second stages. However, by gradually transferring knowledge from each component of the prior-generating model to the MCL model, we argue that the proposed training scheme directly incorporates the knowledge of a good compressive domain and feature space, facilitating the MCL model to mimic the internal representations of its teacher, which is better than simply imitating the teacher's predictions. To validate our argument, we provide an empirical analysis of the effects of each training stage in Section IV-F.
The aforementioned progressive knowledge transfer scheme also allows us to directly use the predictions produced by the teacher model in the third stage, instead of the "softened" predictions as in the original work [21], eliminating the need to select the associated temperature hyperparameter used to soften the teacher's predictions.
III-C. Semi-supervised Learning Extension
It should be noted that without sufficient labeled data, training a high-capacity prior-generating model as proposed in Section III-A can lead to an overfitted teacher, which might provide misleading supervisory information to the MCL model. In addition, a limited amount of data might also prevent effective knowledge transfer from the teacher to the student model. In certain scenarios, labeled data is scarce; however, we can easily obtain a large amount of unlabeled data coming from the same distribution. Semi-supervised learning refers to the learning paradigm that takes advantage of unlabeled data, usually available in abundance, along with a limited amount of labeled data. Here we also describe a semi-supervised adaptation of the above MCL training technique to remedy possible overfitting and improve the generalization of both the prior-generating and MCL models in classification tasks. To this end, unlabeled data is utilized to both initialize and then optimize the weights of the prior-generating model via an incremental self-labeling procedure. Subsequently, when transferring knowledge to the MCL model, the class predictions of $\bar{F}$ on unlabeled data are used as hard labels in the inference loss term, while its probability predictions on unlabeled data are used as soft labels in the distillation loss term.

Let us denote by $\mathbb{D}_L$ the labeled training set and by $\mathbb{D}_U$ the unlabeled training set. To take advantage of $\mathbb{D}_U$, we propose the following modifications to the training procedure of the prior-generating model $\bar{F}$:

- Initialization of $\bar{F}_s$ and $\bar{F}_f$: the weights of the sensing ($\bar{F}_s$) and feature synthesis ($\bar{F}_f$) components are initialized with the values obtained from minimizing the reconstruction error on $\mathbb{D}_L \cup \mathbb{D}_U$, i.e., $\sum_{\mathcal{X}_i \in \mathbb{D}_L \cup \mathbb{D}_U} \| \bar{F}_f(\bar{F}_s(\mathcal{X}_i)) - \mathcal{X}_i \|_F^2$.

- Incremental optimization of $\bar{F}$ via self-labeling: after the initialization step, all parameters of the prior-generating model are optimized with respect to the inference loss, which is calculated on the enlarged labeled set $\mathbb{D}_E$:

$\sum_{(\mathcal{X}_i, c_i) \in \mathbb{D}_E} L_I\big(\bar{F}_c(\bar{F}_f(\bar{F}_s(\mathcal{X}_i))), c_i\big)$ (8)

Initially, the enlarged labeled set is formed from the labeled data, i.e., $\mathbb{D}_E = \mathbb{D}_L$. Periodically, after a fixed number of backpropagation epochs, $\mathbb{D}_E$ is augmented with those data instances (with their predicted labels) in $\mathbb{D}_U$ that have the most confident predictions from the current $\bar{F}$, given a probability threshold $\rho$:

$\mathbb{D}_E \leftarrow \mathbb{D}_E \cup \big\{ (\mathcal{X}_i, \hat{c}_i) \,\big|\, \mathcal{X}_i \in \mathbb{D}_U,\ \max_{c} p(c \mid \mathcal{X}_i; \bar{F}) \geq \rho \big\}$ (9)

where $\hat{c}_i$ denotes the class label predicted by $\bar{F}$ for $\mathcal{X}_i$. After the enlargement of $\mathbb{D}_E$ with the most confident instances, they are removed from the unlabeled set, i.e., $\mathbb{D}_U \leftarrow \mathbb{D}_U \setminus \{\mathcal{X}_i\}$. The training terminates when the enlargement of $\mathbb{D}_E$ stops, i.e., when no instance in $\mathbb{D}_U$ passes the confidence threshold.
Self-labeling is a popular technique in semi-supervised algorithms. While there are many sophisticated variants of this technique [42], the simple modifications proposed above work well, as illustrated in our empirical study in Section IV-G.
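A minimal sketch of the incremental self-labeling loop described above is shown below (our own illustration with an assumed prediction interface and in-memory arrays; the released implementation may organize this differently):

```python
import numpy as np

def confident_predictions(predict_proba, X_unlabeled, rho):
    """Return indices and predicted labels of unlabeled samples whose top class
    probability exceeds the confidence threshold rho."""
    proba = predict_proba(X_unlabeled)          # shape (num_samples, num_classes)
    conf = proba.max(axis=1)
    idx = np.where(conf >= rho)[0]
    return idx, proba[idx].argmax(axis=1)

def enlarge_labeled_set(train_some_epochs, predict_proba, X_L, y_L, X_U, rho):
    """Grow the enlarged labeled set D_E from D_L by self-labeling confident samples in D_U."""
    X_E, y_E = X_L, y_L                          # initially D_E = D_L
    while len(X_U) > 0:
        train_some_epochs(X_E, y_E)              # optimize the prior-generating model on D_E
        idx, y_hat = confident_predictions(predict_proba, X_U, rho)
        if len(idx) == 0:
            break                                # enlargement stopped: terminate training
        X_E = np.concatenate([X_E, X_U[idx]])
        y_E = np.concatenate([y_E, y_hat])
        X_U = np.delete(X_U, idx, axis=0)        # remove newly labeled samples from D_U
    return X_E, y_E
```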
Given a prior-generating model trained on $\mathbb{D}_L \cup \mathbb{D}_U$, in order to adapt the progressive knowledge transfer algorithm proposed in Section III-B to the semi-supervised setting, we propose to replace the objectives in Eq. (5), (6), and (7) with the following objectives, respectively:

$\min_{\mathbf{W}_s} \sum_{\mathcal{X}_i \in \mathbb{D}_L \cup \mathbb{D}_U} \big\| F_s(\mathcal{X}_i) - \bar{F}_s(\mathcal{X}_i) \big\|_F^2$ (10)

and

$\min_{\mathbf{W}_s, \mathbf{W}_f} \sum_{\mathcal{X}_i \in \mathbb{D}_L \cup \mathbb{D}_U} \big\| F_f(F_s(\mathcal{X}_i)) - \bar{F}_f(\bar{F}_s(\mathcal{X}_i)) \big\|_F^2$ (11)

and

$\min \sum_{\mathcal{X}_i \in \mathbb{D}_L \cup \mathbb{D}_U} \Big[ L_I\big(F_c(F_f(F_s(\mathcal{X}_i))), \tilde{c}_i\big) + \lambda\, L_D\big(F_c(F_f(F_s(\mathcal{X}_i))), \bar{F}_c(\bar{F}_f(\bar{F}_s(\mathcal{X}_i)))\big) \Big]$ (12)

where in (12), $\tilde{c}_i$ denotes the true label $c_i$ if $\mathcal{X}_i \in \mathbb{D}_L$, and the class label predicted by the prior-generating model $\bar{F}$ if $\mathcal{X}_i \in \mathbb{D}_U$.
A summary of our proposed algorithm in the semi-supervised setting is presented in Algorithm 2.
IV. Experiments
In this Section, we provide a detailed description and the results of our empirical analysis, which demonstrates the advantages of the proposed algorithms that incorporate data-dependent prior knowledge into the training procedure of MCL models, compared to [41]. For this purpose, we provide various comparisons between the proposed algorithm and the original algorithm proposed in [41] in the supervised learning setting. The experiments are designed to benchmark the learning performance and the quality of the signal representation in the compressive domain produced by the two competing algorithms, with increasing difficulty of the learning tasks. In addition, we also study the significance of the prior-generating model and of the different knowledge transfer stages via ablation experiments. Lastly, we demonstrate the necessity of the proposed modifications presented in Section III-C to our algorithm in the semi-supervised learning setting, i.e., when the databases are large, but only limited amounts of labeled data exist.
IV-A. Datasets
We conducted experiments on the object classification and face recognition tasks of the following datasets:

- CIFAR [27] is a color (RGB) image dataset used for evaluating object recognition methods. The dataset consists of 50,000 images for training and 10,000 images for testing, with a resolution of 32x32 pixels. In our work, CIFAR-10 refers to the object recognition task in which each image has a single label from 10 categories. A more fine-grained and difficult classification task also exists for CIFAR, with each image having a label from 100 categories, which we denote as CIFAR-100. Since there is no validation set in the original database, in our experiments we randomly selected a subset of images from the training sets of CIFAR-10 and CIFAR-100 for validation and trained the algorithms on the remaining images.

- CelebA [28] is a large-scale human face image dataset with more than 200,000 images at different resolutions from more than 10,000 identities. In our experiments, we created three versions of CelebA with increasing difficulty by increasing the set of identities to be recognized: CelebA-100, CelebA-200, and CelebA-500, having 100, 200, and 500 identities, respectively. Here we note that CelebA-100 is a subset of CelebA-200, and CelebA-200 is a subset of CelebA-500.

- To study the performance of the proposed algorithm in semi-supervised settings, we also created CIFAR-10-S, CIFAR-100-S, and CelebA-500-S, which have the same number of training instances as CIFAR-10, CIFAR-100, and CelebA-500, respectively, but with only 20% of them labeled. The test sets in CIFAR-10-S, CIFAR-100-S, and CelebA-500-S are the same as those in CIFAR-10, CIFAR-100, and CelebA-500.
The information of all datasets used in our experiments is summarized in Table I.
Table I: For each dataset (CIFAR-10, CIFAR-100, CelebA-100, CelebA-200, CelebA-500, CIFAR-10-S, CIFAR-100-S, CelebA-500-S), the input dimension, output dimension, and the numbers of labeled training, unlabeled training, validation, and test samples.
IV-B. Experiment Protocols
In this Section, to differentiate between MCL models trained by different algorithms, we use the abbreviation MCL to refer to the original algorithm proposed in [41], MCLwP to refer to our proposed algorithm that trains MCL models with a prior-generating model using labeled data only, and MCLwP-S to refer to our proposed algorithm that can additionally take advantage of unlabeled data.
Regarding the architectural choices in MCL, MCLwP, and MCLwP-S, we adopted the AllCNN architecture [41] for the task-specific neural network component in all algorithms. In MCL, the CS and FS components both perform multilinear transformations, while in MCLwP and MCLwP-S, the CS component performs a multilinear transformation and the FS component performs a nonlinear transformation consisting of both convolution and multilinear projection. The exact network architecture used for each algorithm can be found in our publicly available implementation at https://github.com/viebboy/MultilinearCompressiveLearningWithPrior.
To compare the learning performances at different measurement rates, we conducted experiments at four measurement rates, corresponding to four different configurations of the compressed measurement dimensions.
Since all learning tasks are classification tasks, we represented the labels with one-hot encoding and used the cross-entropy function and the symmetric Kullback-Leibler divergence for the inference loss $L_I$ and the distillation loss $L_D$, respectively. For each experiment configuration, we performed three runs, and the median values of accuracy on the test set are reported. The hyperparameter $\lambda$, which controls the amount of distillation loss in Eq. (7), was set to a fixed value in all experiments. The confidence probability threshold $\rho$ in the semi-supervised learning experiments was selected from a small candidate set.

Regarding the stochastic optimization protocol, we used the ADAM optimizer [26] with a learning rate schedule that decreases the learning rate at two fixed epochs, and each stochastic optimization procedure was conducted for a fixed number of epochs in total. A max-norm constraint was used to regularize the parameters in all networks. No data preprocessing was conducted except scaling of the pixel values. During stochastic optimization, we performed random flipping on the horizontal axis and image shifting within a small fraction of the spatial dimensions to augment the training set. In all experiments, the final model weights, which are used to measure the performance on the test sets, are obtained from the epoch with the highest validation accuracy.
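For clarity, the symmetric Kullback-Leibler divergence used as the distillation loss can be written as a small function (our own PyTorch sketch with assumed logit inputs; not an excerpt from the released code):

```python
import torch
import torch.nn.functional as F

def symmetric_kl(student_logits, teacher_logits, eps=1e-8):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between class probability vectors."""
    p = F.softmax(student_logits, dim=1).clamp_min(eps)
    q = F.softmax(teacher_logits, dim=1).clamp_min(eps)
    kl_pq = torch.sum(p * (p.log() - q.log()), dim=1)
    kl_qp = torch.sum(q * (q.log() - p.log()), dim=1)
    return torch.mean(kl_pq + kl_qp)
```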
IV-C. Comparisons with MCL [41]
Table II: Test accuracy of MCL [41] and the proposed MCLwP on CIFAR-10, CIFAR-100, CelebA-100, CelebA-200, and CelebA-500 at the four measurement configurations, together with the accuracy differences (MCLwP - MCL) and the accuracy of the corresponding prior-generating models.
Table II shows the performance of MCL as proposed in [41] and of our proposed algorithm MCLwP. The last four rows of Table II present the learning performance of the corresponding prior-generating models. It is clear that, with the presence of prior knowledge, our proposed algorithm outperforms MCL for most configurations in all five datasets.
For a simpler problem such as recognizing objects in the CIFAR-10 dataset, the performance gaps between MCLwP and MCL are relatively small. However, when the complexity of the learning problem increases, i.e., when the number of object classes or facial identities increases, the differences in performance between MCLwP and MCL become significant. This can be observed by looking at the last row of each measurement configuration, moving from left to right (the direction of increasing difficulty of the learning tasks): at a fixed measurement configuration, the improvement grows when moving from CIFAR-10 to CIFAR-100, and likewise when moving from CelebA-100 to CelebA-200 and then to CelebA-500.
The only setting in which MCLwP performs slightly worse than MCL is the CIFAR-10 dataset at the lowest measurement rate. By inspecting the prior-generating models' performance on CIFAR-10, we can see that the prior-generating model at this measurement configuration has very high performance, even far better than its counterpart at a higher measurement rate. Thus, the reason we see the degradation in learning performance might be that there is a large gap between the teacher's and the student's learning capacity, which makes the student model unable to learn effectively. This phenomenon in KD has been observed previously in [31].
One might assume that lower numbers of measurements are always associated with lower learning performance. This, however, is not necessarily true for our prior-generating models, whose nonlinear CS components have different numbers of downsampling layers: the number of max-pooling layers used to reduce the spatial dimensions differs between measurement configurations. Although better performance for each measurement configuration could be achieved by carefully adjusting the corresponding teacher's capacity as in [31], it is sufficient for us to use a simple autoencoder design pattern in order to demonstrate the effectiveness of the proposed MCLwP.
IV-D. Effects of Learning Capacity in Prior-Generating Models
Table III: Test accuracy of MCLwP and MCLwP* on CIFAR-10, CIFAR-100, CelebA-100, CelebA-200, and CelebA-500 at the four measurement configurations, together with the accuracy of their respective prior-generating models (the original teachers and the higher-capacity teachers).
While individual tweaking of each prior-generating model's topology requires elaborate experimentation and is out of the scope of this work, we still conducted a simple set of experiments to study the overall effects of changing the prior-generating models' capacity. In particular, we increased the capacity of the teacher models ($\bar{F}$) in Section IV-C by adding more convolutional layers to the CS and FS components. The set of higher-capacity teachers is denoted as $\bar{F}^*$, and the resulting student models are denoted as MCLwP*.
Table III shows the learning performance of MCLwP in comparison with MCLwP*, and of $\bar{F}$ in comparison with $\bar{F}^*$. While there are clear improvements in the learning performance of the teachers when we switch from $\bar{F}$ to $\bar{F}^*$, we observe mixed behaviors in the corresponding student models. Here we should note that this phenomenon is expected, since different measurement configurations in different datasets would require different adjustments (increase or decrease) of the teacher's capacity to ensure the most efficient knowledge distillation. As observed in [31], the distribution of a student model's performance with respect to different teacher capacities has a bell shape. Thus, Table III can act as a guideline on whether to increase or decrease the capacity of the prior-generating model in each configuration to maximize the learning performance of its student.
For example, in the CIFAR-10 and CIFAR-100 datasets at certain measurement configurations, increasing the teacher's capacity from $\bar{F}$ to $\bar{F}^*$ leads to further degradation in the student's performance; thus we should lower the capacity of the teacher to move toward the bell curve's peak. On the other hand, in other configurations of CIFAR-100 and CelebA-200, we should further increase the capacity of $\bar{F}^*$ to possibly obtain better-performing student models. As mentioned previously, this type of empirical hill climbing for each measurement configuration requires extensive experiments, but might be necessary for certain practical applications.
IV-E. Quality of Compressive Domain
Table IV: K-Nearest-Neighbor classification accuracy (for k = 5 and k = 20 neighbors) in the compressive domain produced by MCL [41] and MCLwP on the five datasets at the two lowest measurement configurations.
Although both MCL and MCLwP optimize the compressive learning models in an end-to-end manner, and there is no explicit loss term that regulates the compressive domain, it is still intuitive to expect models with better learning performance to possess better compressed representations at the same measurement rate. In order to quantify the quality of the representations produced in the compressive domain by the competing algorithms, we performed K-Nearest-Neighbor classification using the compressed representations obtained at the two lowest measurement configurations after training MCL and MCLwP. Table IV shows the resulting accuracy for two different numbers of neighbors (k = 5 and k = 20).

It is clear that MCLwP outperforms MCL in the majority of configurations. The performance gaps between MCLwP and MCL are more significant in the facial image recognition tasks than in the object recognition tasks. Here we should note that the Euclidean distance was used to measure the similarity between data points, which might not be an optimal metric that entirely reflects semantic similarity and the quality of the compressive domain, especially since our compressive domains possess tensor forms that also encode spatial information. Besides, we cannot compare the performance of K-Nearest-Neighbor on a dataset across different measurement configurations, since measurements having larger spatial dimensions potentially retain more spatial variance, and the Euclidean distance becomes less effective when measuring similarity. For example, we can observe a decrease in performance on the CIFAR-10 and CIFAR-100 datasets even though the number of measurements increases between the two configurations. For this reason, we did not study the performance of K-Nearest-Neighbor at higher measurement rates.

IV-F. Effects of Prior Knowledge
Table V: Test accuracy of MCLw/oP (trained without prior knowledge) and MCLwP on the five datasets at the four measurement configurations.
In order to study the effects of exploiting prior knowledge, we conducted two sets of experiments. While the MCL and MCLwP models share the same structures of the CS and task-specific neural network components, there are architectural differences in the FS components: in MCL, the FS component performs a multilinear transformation, while in MCLwP, the synthesis step consists of both convolution and multilinear operations. Thus, in the first set of experiments, we aim to remove the architectural differences between the MCL and MCLwP models in the FS component that could potentially affect the quantification of prior knowledge. This is done by training the architectures used in MCLwP in a manner similar to the one proposed in [41], i.e., without prior knowledge: we first initialized the CS and FS components by minimizing the reconstruction error via stochastic optimization, then trained the whole model with respect to the inference loss. This set of models is denoted as MCLw/oP, and the comparisons between MCLwP and MCLw/oP are presented in Table V. Although we observe mixed behaviors at some measurement configurations, it is obvious that the presence of prior knowledge leads to improved learning performance without any architectural differences.
In the second set of experiments, we studied the effect of the different knowledge transfer stages in MCLwP by either skipping or performing each particular knowledge transfer stage. This setup leads to different variants of the training procedure, which are presented in Table VI. Here we should note that when skipping the last transfer stage, we only discarded the distillation loss term and still trained the models with respect to the inference loss, instead of discarding the entire third stage.
The first row of each dataset shows the results when all knowledge transfer activities are skipped. In other words, the architectures are initialized with a standard neural network initialization technique [19] and trained only with the inference loss. It is clear that this training procedure produces the worst performing models.
The next three rows for each dataset represent the cases where only one of the knowledge transfer stages is performed. In this scenario, conducting only the second stage yields noticeably better results than the other two cases, indicating the importance of the second knowledge transfer stage.
Regarding the cases where one of the stages is skipped, we also observe a consistent phenomenon across different measurement configurations and datasets: skipping the first stage leads to the least degradation. In fact, with this training procedure, we can obtain performances relatively close to the original MCLwP. During the second knowledge transfer stage, the FS component ($F_f$) is updated in conjunction with the CS component ($F_s$) to mimic the features synthesized by the prior-generating model, which might imply an implicit knowledge transfer from $\bar{F}_s$ to $F_s$ and might thus explain why skipping the first transfer stage only leads to minor degradations in performance. Overall, performing all three knowledge transfer stages yields the best results.




Table VI: Test accuracy obtained with different combinations of the three knowledge transfer stages (no stage, a single stage, pairs of stages, and all three stages) on CIFAR-10, CIFAR-100, CelebA-100, CelebA-200, and CelebA-500.
IV-G. Semi-supervised Learning
Table VII: Test accuracy of MCL [41], MCLwP, and MCLwP-S on CIFAR-10-S, CIFAR-100-S, and CelebA-500-S at the four measurement configurations, together with the accuracy of the corresponding prior-generating models.
Finally, we present the empirical results in the semi-supervised setting. Here we should note that although the total amount of data is large, only a small fraction of it has labels for training. The competing algorithms include MCL [41] and MCLwP, both of which were trained with the labeled data only, and MCLwP-S, the semi-supervised extension of MCLwP which takes advantage of the unlabeled data in addition to the labeled set. We denote by $\bar{F}$ and $\bar{F}_S$ the corresponding prior-generating models of MCLwP and MCLwP-S, respectively. The results are shown in Table VII.
It is clear that without sufficient labeled data, the prior-generating models ($\bar{F}$) cannot effectively transfer their knowledge to the student models (MCLwP), as can be seen from the inferior performance of MCLwP compared to MCL in half of the cases. However, when unlabeled data is utilized during the training process as proposed in our MCLwP-S algorithm, it is obvious that not only do the prior-generating models ($\bar{F}_S$) improve in performance, but so do their student models (MCLwP-S). The enhancement of the knowledge transfer process (via additional data) and of the prior-generating models (via the self-labeling procedure) results in MCLwP-S having the best performance among the competing algorithms.
V. Conclusions
In this work, we proposed a novel methodology to find and incorporate data-dependent prior knowledge into the training process of MCL models. In addition to the traditional supervised learning setting, we also proposed a semi-supervised adaptation that enables our methodology to take advantage of unlabeled data coming from distributions similar to that of the signal of interest. Although we limited our investigation to the MCL framework, the proposals presented in this work are sufficiently generic to be applicable to any CL system. With extensive sets of experiments, we demonstrated the effectiveness of our algorithms for training MCL models in comparison with the previously proposed algorithm, and provided insights into different aspects of the proposed methodologies.
VI. Acknowledgement
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR). This publication reflects the authors’ views only. The European Commission is not responsible for any use that may be made of the information it contains.
References
 [1] (2016) Compressed learning: a deep neural network approach. arXiv preprint arXiv:1610.09615. Cited by: §I, §I, §IIB, §IIC.
 [2] (2018) Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235. Cited by: §IID.
 [3] (2013) Compressive hyperspectral imaging by random separable projections in both the spatial and the spectral domains. Applied optics 52 (10), pp. D46–D54. Cited by: §I.
 [4] (2008) Adaptive featurespecific imaging: a face recognition example. Applied optics 47 (10), pp. B21–B31. Cited by: §I, §IIB.
 [5] (2009) Random projections of smooth manifolds. Foundations of computational mathematics 9 (1), pp. 51–77. Cited by: §IIB.
 [6] (2005) Distributed compressed sensing. Cited by: §I.
 [7] (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §I.
 [8] (2013) Multidimensional compressed sensing and their applications. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3 (6), pp. 355–380. Cited by: §I.
 [9] (2012) Finding needles in compressed haystacks. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3441–3444. Cited by: §I, §IIB.
 [10] (2006) Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences 59 (8), pp. 1207–1223. Cited by: §I.
 [11] (2008) An introduction to compressive sampling [a sensing/sampling paradigm that goes against the common knowledge in data acquisition]. IEEE signal processing magazine 25 (2), pp. 21–30. Cited by: §I.

 [12] (2008) Compressive sensing for background subtraction. In European Conference on Computer Vision, pp. 155–168. Cited by: §I.
 [13] (2011) Sparsity penalties in dynamical system estimation. In 2011 45th Annual Conference on Information Sciences and Systems, pp. 1–6. Cited by: §I.
 [14] (2010) Signal processing with compressive measurements.. J. Sel. Topics Signal Processing 4 (2), pp. 445–460. Cited by: §I.
 [15] (2007) The smashed filter for compressive classification and target recognition. In Computational Imaging V, Vol. 6498, pp. 64980H. Cited by: §I, §I, §IIB.
 [16] (2018) Compressively sensed image recognition. In 2018 7th European Workshop on Visual Information Processing (EUVIP), pp. 1–6. Cited by: §I, §IIB.
 [17] (2006) Compressed sensing. IEEE Transactions on information theory 52 (4), pp. 1289–1306. Cited by: §I.
 [18] (2009) Robust recovery of signals from a structured union of subspaces. IEEE Transactions on Information Theory 55 (11), pp. 5302–5316. Cited by: §I.

 [19] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §IVF.
 [20] (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Cited by: §IIC.
 [21] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §I, §IID, §IIIB.
 [22] (2018) Compressed learning for tactile object recognition. IEEE Robotics and Automation Letters 3 (3), pp. 1616–1623. Cited by: §I, §IIB.
 [23] (1995) Fast three dimensional magnetic resonance imaging. Magnetic resonance in medicine 33 (5), pp. 656–662. Cited by: §I.
 [24] (2009) Distributed compressive video sensing. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1169–1172. Cited by: §I.
 [25] (201105) Analyzing weighted minimization for sparse recovery with nonuniform sparse models. IEEE Transactions on Signal Processing 59 (5), pp. 1985–2001. External Links: Document, ISSN 1053587X Cited by: §I.
 [26] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IVB.
 [27] (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: 1st item.
 [28] (201512) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: 2nd item.

 [29] (2015) Reconstruction-free inference on compressive measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–24. Cited by: §I, §IIB.
 [30] (2016) Direct inference on compressive measurements using convolutional neural networks. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 1913–1917. Cited by: §I, §IIB.
 [31] (2019) Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393. Cited by: §IVC, §IVD.
 [32] (2017) Secureml: a system for scalable privacypreserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 19–38. Cited by: §I.
 [33] (2017) Compressed sensing with prior information: strategies, geometry, and bounds. IEEE Transactions on Information Theory 63 (7), pp. 4472–4496. Cited by: §I, §IIIA.
 [34] (2010) Compressed synthetic aperture radar. IEEE Journal of selected topics in signal processing 4 (2), pp. 244–254. Cited by: §I.
 [35] (2013) Compressive classification. In 2013 IEEE International Symposium on Information Theory, pp. 674–678. Cited by: §I, §IIB.
 [36] (2013) Projections designs for compressive classification. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 1029–1032. Cited by: §I, §IIB.
 [37] (2014) Reconstruction of signals drawn from a gaussian mixture via noisy compressive measurements. IEEE Transactions on Signal Processing 62 (9), pp. 2265–2277. Cited by: §I.
 [38] (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §IID.
 [39] (2009) Compressive image sampling with side information. In 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 3037–3040. Cited by: §I.
 [40] (2009) On the reconstruction of blocksparse signals with an optimal number of measurements. IEEE Transactions on Signal Processing 57 (8), pp. 3075–3085. Cited by: §I.
 [41] (2019) Multilinear compressive learning. arXiv preprint arXiv:1905.07481. Cited by: §I, §I, §I, §IIB, §IIC, §IIC, §IIIA, §IVB, §IVB, §IVC, §IVC, §IVF, §IVG, TABLE II, TABLE IV, TABLE VII, §IV.
 [42] (2015) Selflabeled techniques for semisupervised learning: taxonomy, software and empirical study. Knowledge and Information systems 42 (2), pp. 245–284. Cited by: §IIIC.
 [43] (2010) Disparitycompensated compressedsensing reconstruction for multiview images. In 2010 IEEE International Conference on Multimedia and Expo, pp. 1225–1229. Cited by: §I.
 [44] (2010) Modifiedcs: modifying compressive sensing for problems with partially known support. IEEE Transactions on Signal Processing 58 (9), pp. 4595–4607. Cited by: §I.
 [45] (2019) Reversible privacy preservation using multilevel encryption and compressive sensing. arXiv preprint arXiv:1906.08713. Cited by: §I.

 [46] (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141. Cited by: §IID.
 [47] (2017) Visual relationship detection with internal and external linguistic knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1974–1982. Cited by: §IID.
 [48] (2018) Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4320–4328. Cited by: §IID.
 [49] (2018) Compressed learning for image classification: a deep neural network approach. Processing, Analyzing and Learning of Images, Shapes, and Forms 19, pp. 1. Cited by: §I, §IIB, §IIC.