
Game State Learning via Game Scene Augmentation

Having access to accurate game state information is of utmost importance for any game artificial intelligence task, including game-playing, testing, player modeling, and procedural content generation. Self-Supervised Learning (SSL) techniques have been shown to be capable of inferring accurate game state information from the high-dimensional pixel input of a game's rendering and encoding it into compressed latent representations. Contrastive Learning is one popular SSL paradigm in which the visual understanding of the game's images comes from contrasting dissimilar and similar game states defined by simple image augmentation methods. In this study, we introduce a new game scene augmentation technique, named GameCLR, that takes advantage of the game engine to define and synthesize specific, highly-controlled renderings of different game states, thereby boosting contrastive learning performance. We test our GameCLR contrastive learning technique on images of the CARLA driving simulator environment and compare it against the popular SimCLR baseline SSL method. Our results suggest that GameCLR can infer the game's state information from game footage more accurately than the baseline. The introduced approach allows us to conduct game artificial intelligence research by directly utilizing screen pixels as input.


1. Introduction

Extensive work (yannakakis2018artificial; barthet2021go; berner2019dota) across diverse domains of AI and games, such as player experience modeling, general game-playing, and content generation, makes use of the internal state of the game (anand2019unsupervised; nelson2021estimates) obtained from the game engine. Using computer vision to obtain such state information from on-screen game footage, instead of directly from the game engine, remains challenging (stooke2021decoupling). Recent computer vision advancements with contrastive learning (jaiswal2020survey), however, show promise in tackling these challenges.

Contrastive learning belongs to the family of self-supervised representation learning methods in computer vision that use a “pairwise-comparison” approach which operates by contrasting semantically similar and dissimilar images. The pairwise mechanism helps the vision model to identify critical visual features that define the semantics of these images. Recent work (trivedi2022representations) has applied this technique to the domain of learning state representations in games from pixel input. Such methods, however, rely on simple image augmentation techniques such as image flipping, rotation, and brightness change, to define and create semantically similar pairs of images. In this work, we investigate whether having access to a game engine can help us synthesize highly-controlled image augmentations that are better suited for learning such vision models. In particular, we use the game engine to construct better similar and dissimilar pairings via the proposed game scene augmentation technique, named GameCLR, for the task of game state representation learning.

Figure 1. The GameCLR Contrastive Learning Framework.

GameCLR synthesizes images that represent similar game states but are highly dissimilar in the pixel-space (synthetic positives) and images that represent different game states but are very similar in the pixel-space (synthetic negatives). Within the context of a car racing game, Fig. 1 visualizes an example of such a representation space containing images that act as synthetic positives and negatives to a reference image called the anchor. Our hypothesis is that by including such images in the contrastive learning process, our model will be better equipped to learn the important visual features that define any particular game state. Moreover, as we define the positive and negative pairings generated by the game engine, we also have a level of control over the learning process: we can guide the model towards the distinguishing factors that are important to us (e.g. traffic information) and away from those that can be considered invariant for learning (e.g. rainy weather, color of the car). We test how training a vision model using GameCLR compares against a baseline SSL method, SimCLR, on the CARLA driving simulator (dosovitskiy2017carla). Our findings suggest that synthesizing specific images from the game engine can boost the performance of contrastive learning methods for learning critical game state features from images.

Image-based Augmentations:
 Flipping, noise, brightness change, rotation, etc.
Game Scene-preserving Augmentations:
 Change weather (clear, cloudy, windy, wet, rainy)
 Change time of day (noon, sunset, midnight)
 Change ego-vehicle color (5 color options)
Game Scene-altering Augmentations:
 Add one, two, or three vehicles (one per lane)
Table 1. Image and Game Scene Augmentations used here.

2. Methodology

For all experiments reported in this study, we use the CARLA (dosovitskiy2017carla) urban driving simulator (see Fig. 1), which provides access to its Unreal engine via a Python API. A scene in CARLA is defined as the current game state of its Unreal engine, which, when put through the game's graphic renderer, yields the pixel output shown to the user on the screen. We take advantage of this game engine to generate a dataset for testing two contrastive learning methods: (1) a baseline SSL method, SimCLR (chen2020simple), which uses simple image augmentations; and (2) our proposed GameCLR method, which uses the CARLA game engine to first apply game scene augmentation before rendering and then applies the regular image augmentations. All the augmentation techniques used across both methods are described in Table 1.

2.1. SimCLR

In 2020, Chen et al. (chen2020simple) proposed SimCLR, a simple framework for contrastive learning of visual representations. A contrastive approach between similar and dissimilar images is used to learn image representations based on the content present in the images. Its pipeline has four major components: an image augmentation function $T(\cdot)$ using simple image augmentations, a convolutional neural network encoder function $f(\cdot)$, a small fully-connected network $g(\cdot)$ called the projection head that maps representations to an embedding space, and a contrastive loss that is applied on these embeddings. Under this framework, simple image augmentations (e.g. rotation, brightness, addition of noise, etc.) are used to create two different views of the same image that are semantically similar, referred to as a positive pair. Similarly, any two views coming from distinct images are defined as negative pairs due to semantic dissimilarity. For a given embedding $z_a$ of a reference image called the anchor, its positive pair's embedding $z_p$, and multiple negative pairs' embeddings in a set $N$, the contrastive probability can be calculated as per Eq. (1):

$$p(z_a, z_p) = \frac{\exp\left(\mathrm{sim}(z_a, z_p)/\tau\right)}{\exp\left(\mathrm{sim}(z_a, z_p)/\tau\right) + \sum_{z_n \in N} \exp\left(\mathrm{sim}(z_a, z_n)/\tau\right)} \qquad (1)$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity and $\tau$ is the temperature hyper-parameter. Thus, the contrastive loss in SimCLR with respect to the anchor $z_a$ and all its associated positive pairs in a set $P$ can be defined as per Eq. (2):

$$L(z_a) = -\frac{1}{|P|} \sum_{z_p \in P} \log p(z_a, z_p) \qquad (2)$$
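As a rough illustration of the contrastive probability and loss described above, the following NumPy sketch evaluates both for a single anchor. It assumes cosine similarity and the usual SimCLR-style softmax in which the positive term also appears in the denominator. The function names, the toy embeddings, and the temperature value of 0.2 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def cosine_sim(anchor, others):
    """Cosine similarity between one vector (d,) and each row of others (n, d)."""
    anchor = anchor / np.linalg.norm(anchor)
    others = others / np.linalg.norm(others, axis=1, keepdims=True)
    return others @ anchor

def contrastive_prob(anchor, positive, negatives, tau=0.2):
    """Probability of matching the anchor to its positive rather than a negative."""
    pos = np.exp(cosine_sim(anchor, positive[None, :])[0] / tau)
    neg = np.exp(cosine_sim(anchor, negatives) / tau).sum()
    return pos / (pos + neg)

def contrastive_loss(anchor, positives, negatives, tau=0.2):
    """Mean negative log-probability over all positives of one anchor."""
    probs = np.array([contrastive_prob(anchor, p, negatives, tau) for p in positives])
    return float(-np.log(probs).mean())

# Toy embeddings (assumed data, for illustration only).
rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positives = anchor + 0.05 * rng.normal(size=(3, 8))      # views close to the anchor
negatives = rng.normal(size=(16, 8))                     # unrelated embeddings
hard_negatives = anchor + 0.5 * rng.normal(size=(4, 8))  # near the anchor, yet negative

loss_regular = contrastive_loss(anchor, positives, negatives)
loss_with_hard = contrastive_loss(anchor, positives, np.vstack([negatives, hard_negatives]))
print(loss_regular, loss_with_hard)
```

Adding negatives that lie close to the anchor enlarges the denominator and therefore raises the loss, which is one way to see why hard negatives carry a stronger training signal.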


We implement this SimCLR training method on our CARLA game dataset using the solo-learn framework (da2022solo). We spawn the ego-vehicle at random locations and place the camera behind it. We also randomize the time of day, weather, color of the ego-vehicle, and traffic around it through the game scene-preserving and game scene-altering augmentations described in Table 1. Through this process, we collect 50,000 anchor images to train a ResNet18 encoder (da2022solo) over 20 epochs.

Figure 2. Average cosine similarity of the anchor image and its positive and negative pairings in a training batch.

2.2. GameCLR (Our Approach)

Our work follows the literature (kalantidis2020hard) on synthesizing hard negatives, which can provide more information to the SimCLR loss than the regular negatives that occur through image augmentation. Our approach, however, exploits access to a game engine and thereby our ability to generate relevant images for learning meaningful representations. Our assumption in this paper is that we can accurately describe the traffic around the ego-vehicle regardless of changes in game aesthetics, such as car color, and of changes in lighting conditions arising from the weather and the time of day.

Towards this end, we first render an anchor image by spawning the ego-vehicle at a random location. Then, we change the weather, the time of day, or the color of the ego-vehicle while the ego-vehicle remains in the same state, using the Game Scene-preserving Augmentations listed in Table 1. We define all such images as the set of synthetic positives with respect to the anchor image. Similarly, we synthesize negatives by spawning random vehicles around our ego-vehicle, using the Game Scene-altering Augmentations listed in Table 1. Figure 1 provides a few examples of the synthetic and regular images for a given anchor image. Note that all these images in GameCLR also undergo simple image augmentations during training, similar to SimCLR. Thus, we can compute the GameCLR loss following the loss formulation of SimCLR in Eq. (2). This framework is showcased in Fig. 1 and we name it GameCLR, owing to the game-engine-guided contrastive learning of visual representations. Our experiments for GameCLR use training hyper-parameters similar to those used for SimCLR, as described in Section 2.1.
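The way synthetic positives and negatives are derived from one anchor scene can be sketched in plain Python. This is only a stand-in. The `Scene` class, the `scene_preserving` and `scene_altering` helpers, the lane names, and the color labels are hypothetical, and no CARLA calls are made. In the actual pipeline each resulting scene would still be rendered by the engine and passed through the regular image augmentations. The sketch keeps the anchor's appearance when adding vehicles, which is an assumption rather than something the text specifies.

```python
import random
from dataclasses import dataclass, replace

# Option lists follow Table 1; the color names are placeholders.
APPEARANCE = {
    "weather": ["clear", "cloudy", "windy", "wet", "rainy"],
    "time_of_day": ["noon", "sunset", "midnight"],
    "ego_color": ["color_0", "color_1", "color_2", "color_3", "color_4"],
}
LANES = ["left", "front", "right"]

@dataclass(frozen=True)
class Scene:
    location: int                      # hypothetical spawn-point index
    weather: str
    time_of_day: str
    ego_color: str
    traffic: frozenset = frozenset()   # lanes occupied by other vehicles

def scene_preserving(scene, rng):
    """Change one appearance attribute; location and traffic are untouched."""
    field = rng.choice(sorted(APPEARANCE))
    options = [v for v in APPEARANCE[field] if v != getattr(scene, field)]
    return replace(scene, **{field: rng.choice(options)})

def scene_altering(scene, rng):
    """Add one or more vehicles, one per free lane, which changes the traffic state."""
    free = [lane for lane in LANES if lane not in scene.traffic]
    if not free:
        raise ValueError("no free lane left to add a vehicle")
    added = rng.sample(free, rng.randint(1, len(free)))
    return replace(scene, traffic=scene.traffic | frozenset(added))

rng = random.Random(0)
anchor = Scene(location=7, weather="clear", time_of_day="noon", ego_color="color_0")
synthetic_positives = [scene_preserving(anchor, rng) for _ in range(3)]
synthetic_negatives = [scene_altering(anchor, rng) for _ in range(3)]
print(synthetic_positives[0])
print(synthetic_negatives[0])
```

Keeping the two augmentation families as separate functions mirrors the control described in the text: whatever is varied by the preserving function is treated as invariant, and whatever is varied by the altering function is treated as a distinguishing factor.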

3. Results

We present the results of our experiments with both augmentation approaches as a two-part assessment. First, we analyze in Section 3.1 how the representations of game states change throughout the learning process, especially focusing on the behavior of the synthetic images used in GameCLR. Next, in Section 3.2 we focus on highlighting the benefits of using such image representations for applications to game research that require extracting game state information from the game’s images.

3.1. Analyzing the Training Process

To investigate the role of different images encountered in a training batch during the contrastive learning process in SimCLR, we start by measuring the average cosine similarities between the positive and negative pairs of image embeddings with respect to the anchor images in a given training batch. Figure 2 (left) showcases those measurements during the training process across 20 epochs. At the beginning of training, both sets of positive and negative images have similar cosine similarity to the anchor images. This implies that the employed model before training cannot discriminate images that are semantically similar to an anchor image from images that are semantically different. As the training goes on, however, we notice that the images that belong to positive sets are more closely embedded to the anchor images compared to the images that belong to the negative sets. This behavior indicates that the model learns semantic similarities between images and embeds those semantics into the produced high-level image representations.
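The per-batch measurement described above can be sketched with NumPy as follows. The embeddings are random toy data, and the array shapes and the helper name `mean_cosine_to_anchor` are assumptions made for illustration. In practice the inputs would be the encoder's embeddings for one training batch.

```python
import numpy as np

def mean_cosine_to_anchor(anchors, paired):
    """Mean cosine similarity between anchors (B, d) and their paired sets (B, k, d)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = paired / np.linalg.norm(paired, axis=2, keepdims=True)
    return float(np.einsum("bd,bkd->bk", a, p).mean())

rng = np.random.default_rng(0)
anchors = rng.normal(size=(32, 64))                                   # toy anchor embeddings
positives = anchors[:, None, :] + 0.1 * rng.normal(size=(32, 2, 64))  # nearby views
negatives = rng.normal(size=(32, 10, 64))                             # unrelated embeddings

pos_sim = mean_cosine_to_anchor(anchors, positives)
neg_sim = mean_cosine_to_anchor(anchors, negatives)
print(f"positives: {pos_sim:.3f}, negatives: {neg_sim:.3f}")
```

Logging these two averages once per epoch would produce curves of the kind shown in Fig. 2, with a widening gap indicating that the encoder separates positives from negatives.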

For GameCLR, to evaluate the degree to which game scene augmentation impacts the learning process, we measure the changes in cosine similarities between the synthetic positives, synthetic negatives and regular negatives with respect to the anchor throughout training (see Fig. 2). We observe that all sets of images start at a similar level of cosine similarity with the anchor, but as training advances, the synthetic negatives prove harder to contrast than regular negatives. Interestingly, by the time the algorithm converges, the model learns to distinguish the synthetic negatives at a similar level as that of the regular negatives. This indicates that after convergence the model is easily able to distinguish between the distinct game states including the synthetic hard negatives.

This analysis suggests that the synthetic images, which are obtained by modifying the game scene through the game engine rather than only the pixels of the rendered image, afford a superior learning capability. In order to quantify this benefit in terms of applicability to games research, we compare the models trained by these two approaches based on post-training evaluation, described in the following section.

| Traffic variables | Untrained | SimCLR | GameCLR |
|---|---|---|---|
| Dist. (left vehicle) | 0.31 ± 0.006 | 0.50 ± 0.006 | **0.60 ± 0.010** |
| Dir. (left vehicle) | 0.35 ± 0.013 | 0.53 ± 0.009 | **0.56 ± 0.008** |
| Dist. (front vehicle) | 0.33 ± 0.013 | 0.58 ± 0.010 | **0.70 ± 0.014** |
| Dir. (front vehicle) | 0.37 ± 0.016 | 0.53 ± 0.010 | **0.57 ± 0.007** |
| Dist. (right vehicle) | 0.39 ± 0.008 | 0.65 ± 0.010 | **0.69 ± 0.005** |
| Dir. (right vehicle) | 0.40 ± 0.015 | 0.60 ± 0.014 | **0.65 ± 0.009** |

Table 2. Average correlations between trained ResNet18 vectors and internal game state variables, averaged over 5 runs and shown along with 95% confidence intervals. Highest average values for each variable are highlighted in bold.

3.2. Post-Training Evaluation

As proposed by Anand et al. (anand2019unsupervised), we evaluate how well the learned representations have captured information relevant to the game state through linear probing. Linear probing involves freezing the weights of the ResNet encoder after the self-supervised training is over (i.e. the contrastive loss has converged). Then, we train linear regression models with the learned representations acting as the predictor variables (input) and certain variables describing the game state acting as the response variables (output). We measure the performance of these regression models with the correlation metric, where higher correlation values suggest that the model has better learned to identify the game state variables from the images.
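A minimal sketch of linear probing on frozen features is given below, using NumPy least squares. The features are random stand-ins for encoder outputs and the target is a synthetic "traffic variable", so the numbers carry no meaning beyond illustrating the procedure. The exact correlation metric used in the paper is not recoverable from this text, so Pearson correlation between predictions and ground truth is used here purely as an assumption.

```python
import numpy as np

def fit_linear_probe(features, targets):
    """Least-squares linear regression with a bias term on frozen features."""
    design = np.hstack([features, np.ones((len(features), 1))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights

def predict(weights, features):
    """Apply the fitted probe to new frozen features."""
    design = np.hstack([features, np.ones((len(features), 1))])
    return design @ weights

def correlation(pred, target):
    """Pearson correlation between predictions and ground truth (assumed metric)."""
    return float(np.corrcoef(pred, target)[0, 1])

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))                       # stand-in for frozen encoder outputs
true_w = rng.normal(size=16)
distance = features @ true_w + 0.1 * rng.normal(size=200)   # toy "traffic variable"

train, test = slice(0, 150), slice(150, 200)
weights = fit_linear_probe(features[train], distance[train])
score = correlation(predict(weights, features[test]), distance[test])
print(f"held-out correlation: {score:.3f}")
```

Because only the linear layer is trained, a high held-out score indicates that the information is already linearly accessible in the frozen representation, which is the property the probing protocol is meant to test.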

In our study, we wish to test whether the derived representations of our models can describe the traffic around the ego-vehicle irrespective of weather and lighting conditions. Therefore, we prepare an evaluation dataset in CARLA by spawning an ego-vehicle at a random location with a camera and collecting RGB images, and at the same time collecting information about the coordinates and motion direction of the vehicles surrounding this ego-vehicle, similar to (trivedi2022representations). We refer to these as traffic variables, as they can describe the state of traffic around the ego-vehicle. For each frame in our dataset, we collect a total of 6 synchronized traffic variables: Distance (left vehicle), Direction (left vehicle), Distance (front vehicle), Direction (front vehicle), Distance (right vehicle), Direction (right vehicle). Note that we are able to find this ground truth of the traffic variables due to direct access to the game engine of CARLA. Let us stress that the traffic variables are not used during the training of our contrastive models; they are only used as desired output for linear probing after training is completed.

In Table 2, we present the average correlation values observed for each of the 6 game state variables present in our evaluation dataset. First, we observe that both methods (SimCLR and GameCLR) improve upon the baseline of a randomly initialized ResNet18 model, verifying that contrastive learning is an effective solution for learning to differentiate between distinct game states. Across the six in-game variables, SimCLR reaches about 157% of the untrained baseline's correlation on average (a relative improvement of roughly 57%), whereas GameCLR reaches about 174% on average (a relative improvement of roughly 74%).

Since contrastive learning is guided by engine-specific hard negatives in GameCLR, the representations obtained by this method outperform SimCLR by approximately 11% on average on the linear probing task while using the same number of images and training steps. This suggests that the ResNet18 encoder trained using the GameCLR approach extracts more meaningful representations that better capture traffic information in the game image as compared to SimCLR. All values for the different in-game variables in GameCLR are significantly higher than in SimCLR, with the highest improvement achieved for the distance to the left vehicle (20% improvement over SimCLR) and, surprisingly, the smallest improvement for the direction to the left vehicle (5.5% improvement over SimCLR).
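The relative gains quoted above can be re-derived from the means reported in Table 2. The short sketch below does this arithmetic. The values are copied from Table 2, which lists rounded means, so the computed percentages may differ slightly from those reported in the text. The helper name `relative_gain` is illustrative.

```python
# (untrained, SimCLR, GameCLR) mean correlations copied from Table 2.
table2 = {
    "Dist. (left vehicle)":  (0.31, 0.50, 0.60),
    "Dir. (left vehicle)":   (0.35, 0.53, 0.56),
    "Dist. (front vehicle)": (0.33, 0.58, 0.70),
    "Dir. (front vehicle)":  (0.37, 0.53, 0.57),
    "Dist. (right vehicle)": (0.39, 0.65, 0.69),
    "Dir. (right vehicle)":  (0.40, 0.60, 0.65),
}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, e.g. 0.20 means +20%."""
    return (new - old) / old

# Per-variable gain of GameCLR over SimCLR, and its average.
gain_over_simclr = {
    name: relative_gain(game, sim) for name, (_, sim, game) in table2.items()
}
mean_gain_over_simclr = sum(gain_over_simclr.values()) / len(gain_over_simclr)

for name, gain in gain_over_simclr.items():
    print(f"{name}: {gain:+.1%}")
print(f"mean gain of GameCLR over SimCLR: {mean_gain_over_simclr:.1%}")
```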

4. Conclusion

In this paper we introduced GameCLR, a contrastive learning technique for learning game state representations. The main contribution of this technique is the use of game engines to synthesize training images and thereby enrich data augmentation. We find that by synthesizing hard positives and negatives for each associated anchor image, we can better guide the contrastive learning process. Our results in the driving game environment CARLA suggest an 11% average improvement (in terms of correlation) when extracting critical traffic-related game state features from the RGB images of this game with our GameCLR approach over another comparable approach (SimCLR). The benefit of our proposed method is two-fold. Firstly, it enables the user to control which visual features of a game the SSL method learns from the input RGB images by specifying which engine variables (that impact rendering) produce synthetic positives and which produce synthetic negatives. Secondly, it improves performance over the standard contrastive learning approach SimCLR, which uses simple image-based augmentation methods and does not exploit the game engine, as is traditionally the case when such computer vision methods are applied to games. Our proposed method is directly applicable to any downstream application within AI and games, such as game-playing, procedural content generation, or player modeling, by enabling the use of the game's images as input instead of explicit state information obtained from the game engine.


This work was supported by the European Union’s H2020 research and innovation programme [Grant Nos. 951911, 101003397].