Reinforcement learning (RL) is composed of four elements: the states the agent can be in, the actions the agent can take, the reward observed as a result of an action, and the successor state after the action is taken. In real-world tasks, the states an agent can be in are often composed of visual input, such as RGB images. Traditionally, to reduce the dimensionality of the input, relevant features are first extracted before being used as input states to the RL algorithm. This induces two major disadvantages: 1) every raw input must be preprocessed; 2) each task needs its own tailored feature extraction algorithm. The more general approach is to use the visual percept directly as the input state, as in many OpenAI Atari environments. Visually simple and fully observable environments, such as those in Atari, are known to be solvable with well-known RL algorithms. However, complex real-world tasks, such as having a robot navigate a procedurally generated environment, are more involved. The three major challenges that arise from navigation in a first-person perspective are: 1) the substantial visual complexity arising from independently moving objects in the environment; 2) the sparsity of rewards coupled with the constantly changing view of the agent; 3) the procedurally generated environment differs on every episode, requiring the agent to explore and rapidly adapt to each new environment. This paper explores combinations of RL algorithms and techniques that best solve a navigation task from a first-person perspective in a visually complex, dynamic environment.
2.1 Proposed Model
Due to the complexity of the problem, only deep models are considered. We use vanilla DQN and Dueling DQN with regular experience replay as baselines for the tasks. We do not expect decent performance from DQN with experience replay alone, especially on the Maze Navigation task, because of the challenges mentioned above: the extremely sparse reward, the partially observable state, and the changing start/goal locations in each episode. For the network to learn the latent structure of the complex pixel inputs, the baseline DQN needs at least a convolutional neural network rather than simply stacked fully-connected layers. One drawback of a CNN is that it is not rotation invariant, which may pose a problem in dynamic environments where both the agent's viewing angle and the objects themselves change constantly. In this paper, we aim to explore a few more advanced reinforcement learning algorithms and techniques that could enhance performance.
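The distinguishing piece of the Dueling DQN baseline is how its two output streams are recombined. A minimal sketch of that aggregation step (the toy shapes and values are ours, not from any experiment):

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine the value and advantage streams of a Dueling DQN head.

    Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)); subtracting the mean
    makes the value/advantage decomposition identifiable.
    """
    return value + (advantages - advantages.mean(axis=-1, keepdims=True))

# Toy example: one state, four discrete actions.
v = np.array([1.0])                    # V(s) from the value stream
a = np.array([0.5, -0.5, 1.5, -1.5])   # A(s, a) from the advantage stream
q = dueling_q_values(v, a)             # Q-values used for action selection
```

In a full agent, `value` and `advantages` would come from two heads of the shared CNN trunk rather than being hand-set.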
2.2 Hindsight Experience Replay
Hindsight Experience Replay (HER) accelerates learning in sparse-reward environments that traditional DQN algorithms cannot learn. The idea of HER is to replay episodes with a goal different from the one the agent was originally trying to achieve, thereby obtaining a learning signal even when the episode never reaches the original goal state.
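The relabeling idea can be sketched in a few lines. This is an illustrative implementation of HER's "final" strategy under our own simplified transition format (states double as achieved goals), not the authors' code:

```python
import numpy as np

def her_relabel(episode, reward_fn):
    """Relabel an episode with the goal the agent actually achieved.

    episode: list of (state, action, next_state) tuples. Returns extra
    transitions whose goal is the final achieved state, so at least the
    last step receives a non-negative learning signal.
    """
    new_goal = episode[-1][2]                     # final achieved state
    relabeled = []
    for state, action, next_state in episode:
        reward = reward_fn(next_state, new_goal)  # recomputed in hindsight
        relabeled.append((state, action, next_state, new_goal, reward))
    return relabeled

# Sparse reward: 0 on reaching the goal, -1 otherwise.
sparse_reward = lambda achieved, goal: 0.0 if np.array_equal(achieved, goal) else -1.0

episode = [(0, 1, 1), (1, 1, 2), (2, 1, 3)]       # original goal never reached
extra = her_relabel(episode, sparse_reward)        # last transition now rewarded
```

Both the original and the relabeled transitions would be stored in the replay buffer, so the agent learns something even from failed episodes.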
2.3 Actor-Critic Network
The actor-critic architecture for policy gradient is an essential element of many advanced algorithms, such as Hierarchical Actor-Critic and Deep Deterministic Policy Gradient. An actor-critic method maintains two separate neural networks: an actor, which learns a parameterized target policy (deterministic in the case of DDPG), and a critic, which uses an approximation architecture and simulation to learn an action-value function that evaluates the actor's policy. Actor-critic methods have more desirable convergence properties than critic-only methods, and converge faster than actor-only methods.
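The interplay between the two networks is easiest to see in a one-step update. A minimal sketch with a tabular softmax actor and a linear critic (the dimensions, learning rates, and function names are illustrative assumptions, not from the paper):

```python
import numpy as np

n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax policy parameters
w = np.zeros(n_states)                    # critic: state-value estimates
gamma, lr_actor, lr_critic = 0.99, 0.1, 0.1

def policy(s):
    """Softmax over the actor's logits for state s."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def actor_critic_step(s, a, r, s_next):
    """One-step update: the critic's TD error scores the actor's action."""
    td_error = r + gamma * w[s_next] - w[s]        # critic's evaluation
    w[s] += lr_critic * td_error                   # critic: TD(0) update
    grad_log_pi = -policy(s)                       # d log pi / d logits
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * td_error * grad_log_pi  # actor: policy gradient
    return td_error

delta = actor_critic_step(0, 1, 1.0, 1)  # rewarded action becomes more likely
```

The critic-only and actor-only extremes correspond to dropping the `theta` or `w` update, respectively; keeping both is what yields the convergence/speed trade-off described above.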
2.4 Deep Deterministic Policy Gradient
DDPG is an off-policy actor-critic algorithm with experience replay, adapted to perform well in high-dimensional, continuous action spaces. More importantly for our purpose, DDPG has proved capable of learning directly from unprocessed pixels in simple environments, and in many cases its pixel-input policies even outperformed a planning algorithm.
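One ingredient DDPG relies on for stable off-policy learning is slowly-moving target networks, updated by Polyak averaging. A sketch of that soft update, with the networks reduced to plain dicts of weight arrays for illustration:

```python
import numpy as np

def soft_update(target, source, tau=0.005):
    """DDPG-style target network update:
    theta_target <- tau * theta + (1 - tau) * theta_target."""
    for k in target:
        target[k] = tau * source[k] + (1.0 - tau) * target[k]
    return target

# Toy 'networks' standing in for the actor/critic parameter sets.
source = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
target = soft_update(target, source, tau=0.1)  # target drifts toward source
```

With a small `tau`, the targets used in the critic's bootstrapped loss change slowly, which is what keeps the Q-learning updates from chasing a moving target.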
2.5 Hierarchical Actor-Critic
The Hierarchical Actor-Critic (HAC) algorithm enables the agent to explore at a high level while learning short policies at each level of the hierarchy. HAC has shown potential for accelerating learning in tasks with long time horizons, even in the presence of sparse rewards, which makes it well suited to the complex environment presented above. HAC is an actor-critic algorithm based on deep deterministic policy gradient (DDPG) that uses Hindsight Experience Replay (HER) and Universal Value Function Approximators (UVFA). UVFA extends action-value functions to incorporate multiple goals, so that each goal corresponds to its own reward function. As a result, the Q-function depends on both a state-action pair and a goal.
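The UVFA idea of a goal-dependent Q-function amounts to conditioning the function approximator on the goal, typically by concatenating it to the state. A sketch under our own toy dimensions and a hypothetical sparse per-goal reward:

```python
import numpy as np

def goal_conditioned_input(state, goal):
    """UVFA-style input: concatenate state and goal so a single
    approximator Q(s, a, g) can generalize across goals."""
    return np.concatenate([state, goal])

def sparse_goal_reward(achieved, goal, tol=1e-3):
    """Each goal induces its own reward function: 0 near the goal, -1 otherwise."""
    return 0.0 if np.linalg.norm(achieved - goal) < tol else -1.0

s = np.array([0.2, 0.7])               # agent's (toy) state
g = np.array([1.0, 1.0])               # goal set by a higher HAC level
x = goal_conditioned_input(s, g)       # fed to the critic alongside the action
```

In HAC, each level's critic consumes inputs of this form, with the goal supplied by the level above it.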
3 Preliminary Results [TODO]
To simulate a complex environment, we use DeepMind Lab's 3D learning environment, based on the Quake III Arena engine. DeepMind Lab is a first-person 3D game platform created by Beattie et al. in 2016, designed to study how autonomous artificial agents learn in large, partially observed, and visually diverse worlds. It contains a suite of tasks at three difficulty levels: the easiest is collecting fruits (rewards) in a maze, the medium is navigating a maze, and the most challenging is laser tag while bouncing through space. The observation available to the agent consists of three parts: an RGBD (depth-included) image of the current field of view, a reward signal, and optional velocity information. Due to the complexity of training a pixels-to-actions autonomous agent that navigates in 3D space, this paper focuses mostly on the easy and medium tasks.
Easy task: "Stairway to Melon" and "Seek Avoid Arena"
There are two versions of this task, both of which require the agent to gather fruits in a static map. The goal is to gather positive-reward fruits while avoiding negative-reward ones.
Medium task: "Maze Navigation"
This task comes in two flavors: a static map and a procedurally generated random map. The agent is expected to find its way from a random start location to the goal location (which is fixed in the static map and random in the random map).
3.2 Initial Results
There are numerous environments available in DeepMind Lab. For the sake of simplicity, we chose to focus on the easiest environment, "seekavoid_arena_01".
The first policy we tried is a random policy, with equal probability for all possible actions at a given state. Unsurprisingly, the random policy did poorly in every environment we tried, receiving 0 reward in 1000 steps most of the time.
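This baseline is trivial but worth pinning down, since every later model is measured against it. A minimal sketch (the action count of 4 is an illustrative assumption, not DeepMind Lab's actual action spec):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_policy(n_actions):
    """Uniform random baseline: every action equally likely in every state."""
    return int(rng.integers(n_actions))

# One 1000-step episode's worth of uniformly sampled actions.
actions = [random_policy(4) for _ in range(1000)]
```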
The next model is a linear model that takes the state as input and outputs Q-values, as in the homework 2 writeup. We trained the model online for 1000 epochs; the results follow.
Unfortunately, the linear model does not work as well as it did on CartPole or MountainCar. In retrospect, the most probable explanation is that the inputs in CartPole/MountainCar are drastically different from the inputs in seekavoid_arena_01. In the former two environments, the inputs are low-dimensional physical attributes such as position and velocity; in seekavoid_arena_01, the inputs are raw pixels, which are far more complex.
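The model in question is just a single affine map from the flattened observation to per-action Q-values. A sketch under assumed dimensions (an 84x84 RGBD frame and 4 actions are our illustrative choices, not the environment's actual spec):

```python
import numpy as np

def linear_q(state, W, b):
    """Linear Q-function: one Q-value per action from a flattened observation."""
    return state @ W + b

# Flattened 84x84x4 (RGBD) frame mapped to Q-values for 4 actions.
obs_dim, n_actions = 84 * 84 * 4, 4
W = np.zeros((obs_dim, n_actions))
b = np.zeros(n_actions)
q = linear_q(np.zeros(obs_dim), W, b)
```

With ~28,000 raw pixel inputs and no spatial structure in the model, it is unsurprising that this fails where it succeeded on CartPole's 4-dimensional state.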
There are two possible ways to improve the linear model.
Manual feature extraction.
We could construct a feature extraction model based on human expertise. In other words, we pre-process the raw input data, extract features with a human-defined model, and then feed those features to the linear model. However, this approach does not generalize to similar navigation tasks.
Learned feature extraction.
We can train a convolutional neural network to extract useful features from the video feed. Combined with the techniques listed in Section 2, we expect a significant performance improvement.
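The core operation such a network would learn is convolution over the pixel grid. A naive single-channel sketch, shown on a hand-built edge filter to illustrate what a learned feature detector does (the image and kernel are toy examples of ours):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D convolution ('valid' padding): the building block a CNN
    stacks and learns to extract features from raw pixels."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter responding to the boundary in a half-dark image.
img = np.zeros((6, 6))
img[:, 3:] = 1.0
edge_kernel = np.array([[-1.0, 1.0]])
features = conv2d_valid(img, edge_kernel)  # fires only at the dark/bright edge
```

In a real agent the kernels are learned end-to-end from the TD loss rather than hand-designed, which is exactly what makes this approach generalizable where manual feature extraction is not.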
References

Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep reinforcement learning with double Q-learning." AAAI. Vol. 16. 2016.
Beattie, Charles, et al. "DeepMind Lab." arXiv preprint arXiv:1612.03801 (2016).
Andrychowicz, Marcin, et al. "Hindsight experience replay." Advances in Neural Information Processing Systems. 2017.
Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in Neural Information Processing Systems. 2000.
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
Levy, Andrew, Robert Platt, and Kate Saenko. "Hierarchical actor-critic." arXiv preprint arXiv:1712.00948 (2017).