Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels

by Tianxin Tao, et al.

Vision Transformers (ViT) have recently demonstrated the significant potential of transformer architectures for computer vision. To what extent can image-based deep reinforcement learning also benefit from ViT architectures, as compared to standard convolutional neural network (CNN) architectures? To answer this question, we evaluate ViT training methods for image-based reinforcement learning (RL) control tasks and compare the results to those of RAD, a leading CNN-based method. For training the ViT encoder, we consider several recently proposed self-supervised losses that are treated as auxiliary tasks, as well as a baseline with no additional loss terms. We find that the CNN architectures trained using RAD still generally provide superior performance. For the ViT methods, all three types of auxiliary tasks that we consider provide a benefit over plain ViT training. Furthermore, ViT masking-based tasks are found to significantly outperform ViT contrastive learning.
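
To make the idea of a masking-based auxiliary task more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It splits an image into non-overlapping patches (the tokens a ViT encoder would consume), hides a random subset of them, and combines an RL loss with a weighted auxiliary loss. The function names `patchify`, `random_mask`, and `total_loss`, the patch size, the mask ratio, and the weight `aux_weight` are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import random


def patchify(image, patch):
    """Split a 2D image (list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, as a ViT-style tokenizer would do.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def random_mask(num_tokens, mask_ratio, rng):
    """Return the sorted indices of the tokens to hide from the encoder."""
    num_masked = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def total_loss(rl_loss, aux_loss, aux_weight):
    """Combine the RL objective with an auxiliary loss (hypothetical weighting)."""
    return rl_loss + aux_weight * aux_loss


# Toy 8x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = patchify(image, patch=4)  # four 4x4 patches
masked = random_mask(len(tiles), mask_ratio=0.5, rng=random.Random(0))
visible = [t for i, t in enumerate(tiles) if i not in masked]
```

In a real masking-based task, the encoder would see only the `visible` patches and the auxiliary loss would measure how well the hidden ones are reconstructed or predicted; the plain-ViT baseline described above corresponds to setting `aux_weight` to zero.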

