1 Introduction
Reinforcement learning in high-dimensional problems can be very time-consuming when using traditional algorithms such as Q-learning or SARSA. Learning from an "expert" is thus desirable. This is the field of inverse reinforcement learning (IRL), which has seen many recent developments. In particular, a lot of effort has been directed at making IRL sample-efficient. In many realistic scenarios, an expert is not always available and/or demonstrations might be costly, which motivates techniques for efficient imitation from very limited observations. One real-world example is trading, where expert companies invest heavily in making their trades invisible, so that we can observe only a limited set of their trades. Other examples include training robots to perform various activities and training autonomous vehicles, where we hope to minimize the need for a human expert to demonstrate.
Furthermore, reconstructing the reward function is often an ill-posed problem with many degenerate solutions. Many remedies have been proposed, such as maximum entropy reinforcement learning. However, once the reward function is determined, one still has to iterate through the training process to derive a policy from it, which is inefficient if the problem is high-dimensional.
To address these shortcomings, we propose a novel algorithm that accelerates an agent's Q-learning using compressed sensing. It allows us to directly reconstruct the value function or action-value function, from which the policy follows trivially.
1.1 Compressed Sensing Formalism
In a typical signal reconstruction problem, it is well known (the Nyquist sampling theorem) that one must sample a signal at a rate of at least twice the highest frequency in the signal in order to reconstruct it accurately without aliasing. Compressed sensing is a regime of signal reconstruction where the number of samples is far below what the Nyquist criterion requires (Candes and Wakin, 2008; Donoho, 2006). In order to obtain high-fidelity signal reconstruction in this regime, a few constraints can be introduced to exploit prior knowledge. The most important of these is the notion that the signal has a "sparse" representation in some domain. For example, if we consider a single sinusoid in time, we know that it is a Dirac delta function in the frequency domain. As a result, with just a few random time-domain samples of this signal, even in the presence of noise, one can in principle reconstruct it with arbitrarily small error.
The technical process of compressed sensing typically involves a minimization problem that exploits the sparsity of the signal in some domain under a linear transformation (which can be generalized to nonlinear ones). Specifically, given a set of observations $y$ (in the dense domain) and a transformation $A$ that takes us from the sparse domain to the dense domain, we want to reconstruct the signal $x$ so as to minimize its $\ell_0$ norm (i.e. maintain sparsity) subject to the constraint that it accurately reproduces all the observed samples:
$$\min_{x} \|x\|_0 \quad \text{subject to} \quad y = A x \qquad (1)$$
This is applicable when the signal has a low-dimensional structure, so that perfect reconstruction is feasible with very few samples. The measurement (sampling) matrix has to be incoherent and satisfy the restricted isometry property. In practice, we may relax the constraint to a bound on the residual, $\|y - Ax\|_2 \le \epsilon$ (with $y$ the observations, $A$ the transformation and $x$ the signal), or move it into the minimization objective as a penalty term $\lambda \|y - Ax\|_2^2$. To make the optimization tractable, the $\ell_1$ norm is often used in place of the $\ell_0$ norm, which is effectively the LASSO method. Apart from gradient-based optimization, this problem can also be solved with iterative methods, e.g. iterative hard/soft shrinkage.
Many imitation learning problems are in the regime where the model's degrees of freedom outnumber the expert's demonstrations by orders of magnitude, while there exists a concise representation of the expert's policy (which makes imitation possible). This setting naturally motivates the application of compressed sensing.
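The penalized $\ell_1$ relaxation and the iterative soft shrinkage mentioned above can be sketched in a few lines of NumPy. This is only an illustration under assumed settings: the Gaussian sensing matrix, the problem sizes, the penalty weight `lam` and the iteration count are hypothetical choices, not values from this work.

```python
import numpy as np

def soft_threshold(v, t):
    """Soft shrinkage: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.02, n_iter=10000):
    """Minimize 0.5 * ||A x - y||_2^2 + lam * ||x||_1 by iterative soft shrinkage."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                # gradient of the quadratic term
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Hypothetical test problem: a 5-sparse signal of length 200, 60 random measurements.
rng = np.random.default_rng(0)
n, m, k = 200, 60, 5
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k) * rng.uniform(1.0, 2.0, size=k)
A = rng.standard_normal((m, n)) / np.sqrt(m)    # assumed Gaussian sensing matrix
y = A @ x_true

x_hat = ista(A, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative reconstruction error: {rel_err:.3e}")
```

The small residual error that remains comes from the $\ell_1$ penalty itself, which shrinks the recovered coefficients slightly toward zero.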
1.2 Imitation Learning: Related Work
Imitation learning aims to transfer an "expert's" knowledge of a particular task to another agent. This knowledge can come in the form of the expert's policy (behavior cloning, assuming that the agent and the expert follow the same set of dynamics) or, more often (for robustness), in the form of a reward function inferred from the observed set of expert trajectories (inverse reinforcement learning).
As one of the simplest and most direct methods of knowledge transfer, behavior cloning (Bratko et al., 1995) involves the supervised learning task of fitting a function that most accurately maps a set of observed expert states to their respective actions. This function then effectively serves as the agent's policy map. However, it can fail to generalize if the model is given inputs outside the distribution it was trained on (i.e. overfitting).
DAGGER, or dataset aggregation, overcomes the potential overfitting or lack of generalization of behavior cloning by providing an "online" platform whereby the agent can continuously query the expert for more labels to update the model (Ross et al., 2011). However, this clearly violates one of our starting premises, namely that data is sparse. In fact, both BC and DAGGER can require a significant number of training examples in order to work, which we hope to circumvent using compressed sensing.
Other related work includes apprenticeship learning (Abbeel and Ng, 2004), which attempts to infer the reward function from the expert's demonstrations. This method requires two steps: determining an approximation to the reward function, and then using this reward function to evaluate the action-value function. In our work, we do not require an inference of the reward function. In fact, we do not even require any knowledge of the reward function that the expert is operating with, only the expert's demonstrations of states and actions.
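To make the supervised-learning view of behavior cloning concrete, here is a minimal sketch for a two-action problem. The choice of logistic regression, the synthetic linear expert `w_expert`, and all sizes are assumptions made purely for illustration.

```python
import numpy as np

def fit_behavior_cloning(features, actions, lr=0.5, n_steps=2000):
    """Logistic regression by gradient descent: P(a = 1 | s) = sigmoid(w . phi(s))."""
    w = np.zeros(features.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w)))            # predicted P(a = 1 | s)
        w -= lr * features.T @ (p - actions) / len(actions)   # mean log-loss gradient
    return w

# Hypothetical expert: picks action 1 whenever a fixed linear score is positive.
rng = np.random.default_rng(7)
w_expert = np.array([1.0, -2.0, 0.5])
train_states = rng.standard_normal((200, 3))
train_actions = (train_states @ w_expert > 0).astype(float)

w_bc = fit_behavior_cloning(train_states, train_actions)
train_acc = np.mean((train_states @ w_bc > 0) == (train_actions == 1.0))
print(f"training accuracy of the cloned policy: {train_acc:.3f}")
```

The fitted map is only as good as the coverage of the training states, which is the generalization issue noted above.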
2 Approach: Accelerating Imitation Learning with Compressed Sensing
We now show that compressed sensing can provide a way to maximally transfer limited observations of an expert's state-action trajectories during Q-learning, both for a linear function approximator and for a nonlinear function approximator (neural net).
2.1 Level 1: Expert exposes Q(s,a)
First, we must understand how we can formulate compressed sensing for inverse reinforcement learning. For simplicity, consider the action-value function with a linear function approximation:
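$$Q(s,a) = w_a^\top \phi(s) \qquad (2)$$
where $\phi(s)$ is the feature vector for the state and $w_a$ is the weight vector for action $a$. In analogy with Eq. 1, the expert's values $Q(s,a)$ play the role of the observations, the feature vectors $\phi(s)$ form the rows of the transformation, and the weights play the role of the sparse signal. Thus, if our weight matrix is sufficiently "sparse", Eq. 1 directly applies to the reconstruction of the weights from a limited set of actions. Similarly, this is applicable for a linear function approximation for (which is more practical as we can recover the policy from it). The convex optimization problem is: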
(3)  
subject to 
We note that this problem can be solved either online or offline in the batch setting. In the former case, if a subsample of the expert's actions does not yield an adequate reconstruction, the procedure can be rerun to obtain a better result. In the latter case, one must hope that the batch of data provided is diverse and contains sufficient information to provide a solid reconstruction.
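As an illustration of the batch setting, the sketch below recovers a sparse weight vector for a single action from a handful of exposed Q values by $\ell_1$ minimization written as a linear program. The equality-constrained form, the synthetic Gaussian features, and the sizes (500 features, 60 demonstrations, 5 nonzero weights) are assumptions for this example, not the configuration used in the experiments.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, q):
    """Solve min ||w||_1 s.t. Phi w = q as an LP, using w = u - v with u, v >= 0."""
    n = Phi.shape[1]
    cost = np.ones(2 * n)                       # sum(u) + sum(v) equals ||w||_1 at the optimum
    A_eq = np.hstack([Phi, -Phi])               # Phi (u - v) = q
    res = linprog(cost, A_eq=A_eq, b_eq=q, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

# Hypothetical expert with sparse weights for one action.
rng = np.random.default_rng(1)
n_features, n_demos, k = 500, 60, 5
w_expert = np.zeros(n_features)
w_expert[rng.choice(n_features, size=k, replace=False)] = rng.standard_normal(k)

Phi = rng.standard_normal((n_demos, n_features))   # one feature vector per demonstrated state
q_expert = Phi @ w_expert                          # Q values exposed by the expert

w_hat = basis_pursuit(Phi, q_expert)
max_err = np.max(np.abs(w_hat - w_expert))
print(f"largest weight error: {max_err:.2e}")
```

In the online setting one could call `basis_pursuit` again after appending newly observed rows to `Phi` and `q_expert`.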
2.2 Level 2: Sampling only Expert’s states and actions
Of course, one downside of the formulation presented in the previous section is that, in practice, an expert's value function is not directly observable. We only observe the expert's states, actions and rewards (to reconstruct either the value function or the action-value function, we would need further knowledge of the reward function at a minimum). However, the formalism fits very nicely into the framework of CS and only requires that we can demonstrate that the weights are sparse and that the demonstrations satisfy the restricted isometry property (Candes and Wakin, 2008).
However, we can also construct a CS problem using just the observed information from the expert's states and actions. For every state-action pair (s,a) observed in the demonstration:
(4)  
subject to 
More specifically, in the cartpole example, our action space has cardinality 2, so we construct a feature vector for each state demonstrated by the expert where action $a$ is taken:
(5) 
With the weight vectors concatenated into a single vector, Eq. 11 can be brought to a form similar to Eq. 1, with a linear inequality constraint on the concatenated weights in place of the equality.
In practice, and as will be discussed later, exact optimization of the $\ell_0$ norm is an NP-hard problem and quickly becomes intractable even for modest problem sizes. As a result, we consider an L1-norm problem:
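One way to realize such a construction for two actions is sketched below. The specific layout of each row (the state features with a sign that depends on the demonstrated action) is an assumption chosen so that a row times the concatenated weights equals the Q-value gap in favor of the demonstrated action; the names `demo_row` and `constraint_matrix` are hypothetical.

```python
import numpy as np

def demo_row(phi_s, action):
    """Row x(s, a) with x(s, a) @ [w_0; w_1] = Q(s, a) - Q(s, other action)."""
    sign = 1.0 if action == 0 else -1.0
    return np.concatenate([sign * phi_s, -sign * phi_s])

def constraint_matrix(features, actions):
    """Stack one row per demonstrated state-action pair."""
    return np.stack([demo_row(p, a) for p, a in zip(features, actions)])

rng = np.random.default_rng(2)
features = rng.standard_normal((4, 3))                  # 4 demonstrated states, 3 features each
w0, w1 = rng.standard_normal(3), rng.standard_normal(3)
q = np.stack([features @ w0, features @ w1], axis=1)    # Q(s, a) for a in {0, 1}
actions = q.argmax(axis=1)                              # a greedy (hypothetical) expert

X = constraint_matrix(features, actions)
margins = X @ np.concatenate([w0, w1])                  # gap in favor of the expert's action
print(margins)
```

With this layout, "the expert's action has the larger Q value in every demonstrated state" becomes the single vector inequality `X @ w >= 0`.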
(6)  
subject to 
The regularization term is now added because of the nature of the constraint in Eq. 6. The zero weight vector trivially satisfies this constraint, so optimizing the L1 norm alone would trivially force the weights to zero. For the linear case, the choice of a random target weight is sufficient to prevent this trivial solution from being selected.
The second hyperparameter is a slack that relaxes the condition that the demonstrated action has a higher action value than all other actions. This hyperparameter is useful when the expert is noisy or is not an "optimal" expert. While the presence of two hyperparameters that we have to scan may seem to pose a problem in a data-sparse setting, it does not, because tuning them requires only data that the agent collects, and we may assume that the agent can easily collect data from the environment.
One interesting observation is that, because only a finite number of samples are provided by the expert, the weights that are reconstructed may not be unique. Intrinsically, there are simply multiple possible weight vectors that could reproduce the expert's demonstrated states and actions.
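The sketch below shows one way to pose a Level 2 style problem as a linear program. It is not the formulation of Eq. 6: instead of a random target weight, it rules out the trivial zero solution by demanding a unit margin on every demonstrated action, which is a substitute technique assumed here for simplicity. The synthetic expert, the feature dimension and the helper name `l1_policy_fit` are likewise hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def l1_policy_fit(X, margin=1.0):
    """Solve min ||w||_1 s.t. X w >= margin as an LP, using w = u - v with u, v >= 0."""
    m, n = X.shape
    res = linprog(np.ones(2 * n),
                  A_ub=np.hstack([-X, X]),        # -X (u - v) <= -margin
                  b_ub=-margin * np.ones(m),
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

# Hypothetical two-action expert with sparse concatenated weights [w_0; w_1].
rng = np.random.default_rng(3)
d, n_demos, k = 100, 21, 4
w_expert = np.zeros(2 * d)
w_expert[rng.choice(2 * d, size=k, replace=False)] = rng.standard_normal(k)

phi = rng.standard_normal((n_demos, d))                  # features of demonstrated states
q = np.stack([phi @ w_expert[:d], phi @ w_expert[d:]], axis=1)
actions = q.argmax(axis=1)                               # only these actions are observed

sign = np.where(actions == 0, 1.0, -1.0)[:, None]
X = np.hstack([sign * phi, -sign * phi])                 # row @ w = Q(s, a) - Q(s, a')

w_hat = l1_policy_fit(X)
agreement = np.mean(X @ w_hat > 0)                       # fraction of demonstrations reproduced
print(f"demonstrations reproduced: {agreement:.2f}, "
      f"nonzero weights: {np.sum(np.abs(w_hat) > 1e-8)}")
```

The recovered `w_hat` generally differs from `w_expert` while still reproducing every demonstrated action, which illustrates the non-uniqueness noted above.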
2.3 Level 3: Application to Deep Q Networks
Function approximation without the use of a deep neural network is difficult to tune in practice, so it is desirable to extend the method to imitation learning on a Deep Q Network (DQN) or deep policy network. One challenge there is the use of nonlinear measurements and nonlinear constraints, which inherently conflicts with the CS methodology. Remedies typically relax the constraint into regularization terms, which blurs the line between compressed imitation learning and imitation learning with a clever regularization (that captures the same prior).
A Deep Q Network (DQN) can be viewed as a combination of preparing a highly nonlinear feature vector (up to the second-to-last layer of the DQN) and applying the correct weights to these features for the final output (the final layer of the DQN), which allows us to apply compressed sensing to the last layer of the network in a linear fashion. Our network architecture employs an extremely wide last layer, which we train with dropout so that CS applies.
We denote the DQN as a function:
(7) 
The compressed sensing problem is:
(8)  
subject to  
To interpret this properly,
(9) 
Essentially, in the DQN we treat the initial layers of the network as a special "featurizer" of the state. The final layer is simply a linear layer that maps these features to the action values. In more complex problems, this procedure can be interleaved with the normal training process of the DQN, where the normal training gradually improves the whole network (especially the featurizer) and CS boosts the final-layer weights once in a while.
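The split into a featurizer and a linear last layer can be sketched with a toy network in NumPy. The layer sizes, the ReLU activations, the random weights and the absence of biases are all assumptions for illustration; they are not the architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(4)
state_dim, hidden, wide, n_actions = 4, 32, 512, 2    # hypothetical layer sizes

# Stand-ins for weights obtained from ordinary DQN pretraining.
W1 = rng.standard_normal((state_dim, hidden)) * 0.5
W2 = rng.standard_normal((hidden, wide)) * 0.1
W_last = rng.standard_normal((wide, n_actions)) * 0.1

def featurizer(states):
    """All layers except the last: a nonlinear map from states to wide features."""
    h = np.maximum(states @ W1, 0.0)
    return np.maximum(h @ W2, 0.0)

def q_values(states, last_layer):
    """The last layer is linear, so Q is linear in the featurizer output."""
    return featurizer(states) @ last_layer

states = rng.standard_normal((21, state_dim))          # states from expert demonstrations
F = featurizer(states)                                 # plays the role of the sensing matrix
Q = q_values(states, W_last)
print(F.shape, Q.shape)
```

Once `F` is fixed, re-fitting only `W_last` from the expert's demonstrations is a linear problem of the same kind as in the linear function approximation case.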
3 Theory
In general, solving Eq. 1 is an NP-hard problem due to the $\ell_0$ norm. There are a few relaxations which can still guarantee good solutions, as discussed in Section 1.1.
One of the key requirements of compressed sensing is the Restricted Isometry Property (RIP), which is a quantitative realization of the conditions that the measurement and representation bases are incoherent.
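$$(1-\delta_k)\,\|x\|_2^2 \;\le\; \|Ax\|_2^2 \;\le\; (1+\delta_k)\,\|x\|_2^2 \quad \text{for all } k\text{-sparse } x \qquad (10)$$
Eq. 10 requires that the projection by the sensing matrix $A$ (the reconstruction matrix) approximately preserves the distance between sparse signals. An equivalent statement of the RIP is that, for every submatrix $A_S$ formed from $k$ columns of $A$, the eigenvalues of $A_S^\top A_S$ lie in the interval $[1-\delta_k, 1+\delta_k]$. It has been proven that the RIP is satisfied with high probability by random matrices (Baraniuk et al., 2008).
In the case of Q-learning, the sensing matrix consists of the feature vectors $\phi(s)$, which means that the burden of determining whether CS will work lies in analyzing the features for a given RL problem. For us, the sensing matrix is a stack of high-dimensional feature vectors $\phi(s)$ for all states $s$ sampled from the expert, which is approximately a random matrix, despite some correlation in the feature vectors that slightly increases the difficulty but gradually diminishes as the dimension increases.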
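The eigenvalue statement of the RIP can be probed numerically. The sketch below samples random sets of $k$ columns of a Gaussian matrix and records how far the eigenvalues of the corresponding Gram matrices stray from 1. Because only some column subsets are sampled, the result is a lower bound on the true restricted isometry constant rather than a certificate; the matrix sizes and the number of trials are arbitrary assumptions.

```python
import numpy as np

def empirical_rip_constant(A, k, n_trials=200, seed=0):
    """Largest |eigenvalue - 1| of A_S^T A_S over randomly sampled k-column subsets S."""
    rng = np.random.default_rng(seed)
    delta = 0.0
    for _ in range(n_trials):
        cols = rng.choice(A.shape[1], size=k, replace=False)
        gram = A[:, cols].T @ A[:, cols]
        eig = np.linalg.eigvalsh(gram)          # eigenvalues in ascending order
        delta = max(delta, abs(eig[0] - 1.0), abs(eig[-1] - 1.0))
    return delta

rng = np.random.default_rng(5)
m, n, k = 200, 1000, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)    # columns have unit norm in expectation
delta = empirical_rip_constant(A, k)
print(f"empirical lower bound on delta_k: {delta:.3f}")
```

The same function could be applied to a stack of feature vectors from expert demonstrations (after column normalization) to gauge how close it is to the random-matrix case.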
In general, there are several important considerations that make the application of the RIP and similar properties used in CS more nuanced in the RL setting. The biggest one is the fact that the transformation from the sparse basis to the measurement basis is not "unique". The sensing matrix can be different (or even inherently stochastic) depending on what demonstrations the expert provides. Moreover, it also means that there is not a unique weight vector that would solve the problem.
Additionally, for Q-learning, the fidelity with which the reconstruction must match the expert's values is not the same as in a typical CS problem. Since, in the reinforcement learning problem, the performance of the agent is predicated on selecting the correct action rather than predicting the right value, we care more that the resultant policy matches the expert's. Namely, the exact value of the reconstruction does not matter as much as its sign.
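For a two-action problem, "the sign matters more than the value" can be made concrete with a small helper that compares the sign of the Q-value difference between the two actions for the expert and for the agent. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def action_agreement(q_diff_expert, q_diff_agent):
    """Fraction of states where both Q differences have the same sign (same greedy action)."""
    return float(np.mean(np.sign(q_diff_expert) == np.sign(q_diff_agent)))

# Q(s, a_0) - Q(s, a_1) on four states, for the expert and for a reconstructed agent.
expert_diff = np.array([0.8, -0.3, 1.5, -2.0])
agent_diff = np.array([0.1, -0.9, 0.4, 0.2])

print(action_agreement(expert_diff, agent_diff))  # 0.75
```

The agent's values are far from the expert's in every state, yet the two policies disagree only on the last one.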
4 Results
4.1 Problem and Data Setup
Our first benchmarks use the OpenAI gym's Cartpole. In Cartpole, the state of the system is given by a four-dimensional vector which contains the cart position, cart velocity, pole angular position, and pole angular velocity respectively. As this state space is relatively small, we artificially expand it using a union of five different nonlinear Gaussian kernels. The action space for the cartpole is discrete, consisting of two actions corresponding to moving the cart left or right. We denote these two actions with and .
We use sklearn's radial basis function (RBF) sampler, which generates a feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform. To fit the featurizer, we generate a set of 20000 random observation samples from the environment's state space. With five different kernels of 100 components each, the resultant feature dimension for this problem is 500 instead of just 4. In doing so, we also guarantee that the weight vector for this problem will have some degree of "sparsity".
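The feature expansion can be sketched without any external dependency by writing out the random Fourier feature construction directly. This is a NumPy stand-in for the sklearn sampler, not the code used here: the kernel widths in `gammas`, the standardization step, and the uniform stand-in for environment observations are all assumptions.

```python
import numpy as np

class RandomFourierFeatures:
    """Monte Carlo feature map approximating the RBF kernel exp(-gamma * ||x - y||^2)."""

    def __init__(self, gamma, n_components, dim, rng):
        self.W = rng.standard_normal((dim, n_components)) * np.sqrt(2.0 * gamma)
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
        self.scale = np.sqrt(2.0 / n_components)

    def transform(self, X):
        return self.scale * np.cos(X @ self.W + self.b)

rng = np.random.default_rng(6)
state_dim = 4
gammas = [0.5, 1.0, 2.0, 5.0, 10.0]                        # hypothetical kernel widths
samples = rng.uniform(-1.0, 1.0, size=(20000, state_dim))  # stand-in for sampled observations
mean, std = samples.mean(axis=0), samples.std(axis=0)      # assumed standardization

maps = [RandomFourierFeatures(g, 100, state_dim, rng) for g in gammas]

def featurize(states):
    """Standardize the states, then concatenate the five 100-dimensional feature maps."""
    z = (np.atleast_2d(states) - mean) / std
    return np.hstack([m.transform(z) for m in maps])

features = featurize(samples[:3])
print(features.shape)  # (3, 500)
```

Each of the five maps contributes 100 components, giving the 500-dimensional feature vector described above.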
In order to train an "expert" on this problem, we use standard Q-learning. The baseline Q-learning used to train our agent takes around 10000 or more episodes to yield perfect performance. With each episode containing up to 200 different state-action pairs, this amounts to almost 1 million different state-action samples needed to converge (recall that we are expanding the state dimension, so this increases the amount of data needed).
(11)  
subject to 
4.2 Level 1: Expert exposes Q(s,a)
With our expert generated, we then provide a limited set of examples from which to learn. In the level 1 case, we again assume that the expert actually provides its Q-function values, which is heavily contrived but serves to illustrate that the compressed sensing works. In the following results, we use only 21 examples, a number at least 2 orders of magnitude smaller than what we trained on. First, using the expert's Q values, we show the reconstruction performance.
While a side-by-side comparison of the original weights and the reconstructed weights does not appear to show a match, looking at the difference is much more instructive. The average absolute value of the error in the reconstructed weights is just 10%. Even with these relatively sizeable errors in the weights, however, we can see that the reconstructed agent's performance is still quite good.
The reason why the reconstruction performs relatively well despite the error becomes clearer once we recall that what matters is whether the action chosen by the reconstructed agent matches the expert's. In Fig. 7, we plot the difference in Q values between the two actions for the expert versus the agent. This plot tells us that the more correlated the differences are, the more our reconstructed agent tends to pick the same action as the original expert. And indeed we do see a strong correlation.
4.3 Level 2: Expert exposes only state-action pairs
Now we consider the problem of reconstruction when only the state-action pairs provided by the expert are visible, as opposed to the Q(s,a) values. This kind of expert exposure is reasonable.
First, we need to identify an appropriate target weight vector in Eq. 6. Empirically, we find that picking a random target is sufficient to generate good reconstructions.
In Fig. 6, we see that the reconstructed weights are actually k-sparse, in contrast to level 1. The orange lines indicate the random target used. Being able to achieve k-sparsity here is critical, as the agent is able to effectively identify the few features in the 500-dimensional space that are actually useful. What is truly remarkable, however, is that with this formulation the test performance of the reconstructed agent is nearly perfect.
Below, we show the performance metric analogous to that in Fig. 2. As expected for perfect performance, the agreement between the expert's Q difference and the agent's Q difference is much tighter than in level 1.
In effect, with only 21 demonstrations from the expert, compressed sensing is already sufficient to give the agent an almost perfect image of the expert's policy for the cartpole.
Moreover, we can evaluate how well the agent performs compared to the expert across the entire state space (as opposed to the restricted set of states where the cartpole is mostly balanced). This is shown in Fig. 8.
4.4 Level 3: DQN
Finally, we demonstrate the success of our formalism using a nonlinear function approximator. We start with a "student" DQN. The architecture of the DQN is like any typical neural net, but we expand the last layer to be very wide and train it with dropout. We "pretrain" this net for a relatively small number of iterations, which is insufficient to get good performance, as shown by the blue line in Fig. 10. At that point, we expose some expert state-action pairs. These state-action pairs are used to "sparsify" the weights of the last layer. Then we reevaluate the performance of our "student" network to see if there is improvement. The purpose of the pretraining is to acquire a usable featurizer.
In Fig. 9, we can see again that, just as in the linear case, the reconstruction process substantially sparsifies the weights in the last layer.
In Fig. 10, the blue line represents the average test performance of the student network after 1000 iterations. The orange line shows the test performance after doing CS on the last layer with 20 expert examples. We also show our baseline expert's performance (a DQN fully trained for 20000 epochs), which is near-perfect.
5 Benchmarks: Behavior Cloning
While our results above are already significant, as they provide a way to vastly accelerate Q-learning by directly integrating the CS step into the Q-learning algorithm, we also compare our technique to behavior cloning to quantify how much of an advantage we gain in the number of training examples.
Clearly, with behavior cloning, one needs well over 500 examples to get consistently perfect test performance, which gives direct CS at least an order of magnitude improvement over behavior cloning.
6 Conclusion
We have demonstrated a novel procedure which allows us to apply compressed sensing directly to existing reinforcement learning algorithms. For the appropriate problem domains where the feature space is high-dimensional, our technique can substantially accelerate the imitation process. Moreover, it requires strictly only that the expert demonstrates what their action is for a given state. This is empirically demonstrated on three different levels with increasing generality: the linear QFA case where the expert exposes Q, the linear QFA case where the expert only shows state-action pairs, and the DQN case where the expert only shows state-action pairs. Our method gives the agent a significant boost in performance with very few expert samples in all three cases.
Future work includes:

Generalize the restricted isometry property to the scenario where the reconstruction goal is relaxed (as we only care about the relative state-action values).

Extend the scheme to deep policy networks (DPN). This is challenging since the expert's probability of actions cannot be directly observed. Aiming to best reconstruct the expert's demonstrations will therefore have a variational flavor, where we try to maximize the likelihood of the demonstrated actions, which is incompatible with the one-shot nature of compressed sensing.

Apply the method to more complex problems. The method's advantages will be amplified in high-dimensional problems, but the training time will also be longer and exceeds the scope of this project.

Generalize the prior to be encoded by a generative network. Here we require the weight vector to be sparse in its own domain, whereas in principle it only needs to have a concise/compact representation. This suggests the possibility of encoding the weight vector with a generative model and putting the sparsity constraint on the input to the generative model.
References
Abbeel and Ng (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004, pp. 1–8. ISBN 1581138385.
Baraniuk et al. (2008). A Simple Proof of the Restricted Isometry Property for Random Matrices. Constructive Approximation 28 (3), pp. 253–263. ISSN 1432-0940.
Bratko et al. (1995). Behavioural Cloning: Phenomena, Results and Problems. IFAC Proceedings Volumes 28 (21), pp. 143–149. ISSN 1474-6670.
Candes and Wakin (2008). An introduction to compressive sampling: A sensing/sampling paradigm that goes against the common knowledge in data acquisition. IEEE Signal Processing Magazine 25 (2), pp. 21–30. ISSN 1053-5888.
Donoho (2006). Compressed sensing. IEEE Transactions on Information Theory 52 (4), pp. 1289–1306. ISSN 1557-9654.
Ross et al. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In Journal of Machine Learning Research, Vol. 15, pp. 627–635. arXiv: 1011.0686. ISSN 1532-4435.