Testing a broad range of computer vision algorithms, e.g., SfM (structure from motion) [1], visual SLAM (simultaneous localisation and mapping) [2], and camera pose estimation [3], requires datasets with accurate 6DOF groundtruth poses. Furthermore, to thoroughly assess the performance of the algorithms under realistic operating conditions, it is vital to use datasets captured under varying conditions. Indeed, recent studies [4, 5] show that the efficacy of local feature detectors, which form the first input to many computer vision algorithms, reduces significantly when presented with the same scenes under different environmental conditions (weather, time of day, traffic density, etc.).
Unfortunately, collecting the necessary image datasets under varying conditions is extremely costly and time-consuming. This has been a persistent obstacle to developing and testing computer vision algorithms that are robust and reliable under varying operating conditions.
To contribute towards solving this problem, we present G2D, an image simulator software that exploits the detailed virtual environment in GTA V. G2D allows users to collect hyper-realistic computer-generated imagery of an urban scene, under controlled 6DOF camera poses and varying environmental conditions (weather, season, time of day, traffic density, etc.). Users directly interact with G2D while playing the game; specifically, users can manipulate conditions of the virtual environment on the fly, while the gameplay camera is set to automatically retrace a predetermined 6DOF camera pose trajectory within the game coordinate system. Concurrently, automatic screen capture is executed while the virtual environment is being explored.
The output of G2D is a set of images with 6DOF groundtruth camera pose, captured under varying conditions; see Figure 7.
An intuitive summary of the functionality of G2D is given in Figure 8. Overall, the sparse trajectory is defined by the user with the assistance of the trajectory tool. From a sparse trajectory, a dense trajectory can be constructed in two ways: with automatically defined or with user-defined camera orientations. The user-defined option lets users control the appearance of the images according to their preference. Users can then retrace the dense trajectory and simultaneously collect the image set together with its camera pose set. Separately from the trajectory tool, the condition tool changes the conditions of the game, i.e., weather, time and traffic density. Because the condition tool and the trajectory tool work independently, several image sets with a consistent camera pose set can be collected under different conditions. More details are provided in the rest of this document, as well as on the project page: https://goo.gl/SS7fS6.
2 Related Works
There are several existing works that use virtual worlds for generating image datasets.
CARLA [6] provides a virtual world for autonomous driving systems. Europilot [7] leverages Euro Truck Simulator 2 to simulate every aspect of a driving system, e.g., wheel, brake, paddle, etc., in addition to rendering the scene. It has been suggested that data simulated using CARLA and Europilot be used to conduct experiments on autonomous-driving algorithms.
For semantic segmentation, SYNTHIA [8] is a large-scale synthetic image generator based on a virtual world constructed via the Unity platform [9]. In that work, the synthetic images are used along with real images for training a deep network to improve semantic segmentation accuracy. Similarly, UnrealCV [10] leverages the power of Unreal Engine [11], a well-known game development platform, to create synthetic images for various computer vision tasks, based on virtual worlds [12] originally adapted from the Unreal marketplace [13].
Another paradigm for simulating images is to exploit ready-made virtual worlds in computer games. A primary target has been GTA V, which is set in a hyper-realistic urban environment. In fact, its excellent graphics are one of the reasons why GTA V is considered the most profitable game of all time [14].
Richter et al. [15, 16] proposed a method to generate images from GTA V. Specifically, they take advantage of the communication mechanism between the game and the graphics hardware by injecting a middleware between those layers, with the middleware collecting the desired game information, e.g., geometric meshes, texture maps and shaders. Based on those resources, they can then construct the groundtruth for a variety of computer vision problems, including visual odometry. In their approach, the camera poses are recovered from the recorded meshes and their transformation matrices.
Different from Richter et al., our software G2D accesses the native functions of GTA V. This enables us to directly attain the camera pose in every frame of the game. Additionally, users can control the environmental conditions through functions developed in G2D.
3.1 Scripthook V
G2D is based on Scripthook V [17], a library developed by Alexander Blade that provides access to the native functions of GTA V (the use of Scripthook V distinguishes G2D from Richter et al.). The original aim of Scripthook V is to provide a framework for constructing modifications (“mods”) to the game. Currently, a wide range of fascinating mods are available [18], e.g., Invisibility Cloak [19], which can make the protagonist invisible. The list of native functions supported by Scripthook V can be found at [20].
3.2 Constructing Trajectory
G2D defines two types of camera trajectories: sparse and dense trajectories. A sparse trajectory consists of a set of vertices (a set of positions on the “top down” 2D map of the virtual environment), and an order in which to visit the vertices. Users specify sparse trajectories. Then, given a sparse trajectory, a dense trajectory is generated automatically by G2D. Basically, the tool traces a continuous path along the dense trajectory and captures the scene as observed from the gameplay camera at 60 frames per second, along with the 6DOF camera pose at each frame. (This performance was measured on a workstation with an Intel(R) Core(TM) i7-6700 @ 3.40GHz, 16GB RAM, and an NVIDIA GeForce GTX 1080 Ti, at the maximum graphical configuration for GTA V.) In other words, G2D preserves the normal operation of the game, and the collected dataset is at standard video rate.
Figure 9 shows an example, while more details are available in the following.
3.2.1 Sparse Trajectory (user-defined)
The vertices of a sparse trajectory are defined by their coordinates on a 2D map. The order of visitation is specified using an index
where are integers indicating the order in which vertex is visited in the desired sequence (similarly for vertex ). For the example in Figure 9, the index would be
In G2D, the vertices and the order of visitation are specified in the files vertex.txt and vertex_order.txt, respectively.
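A minimal sketch of how a sparse trajectory could be assembled from those two files. The file layouts used here are assumptions, not taken from G2D: one `x y` pair per line in vertex.txt, and whitespace-separated 1-based vertex indices in vertex_order.txt. All function names are hypothetical.

```python
from math import hypot

def parse_vertices(text):
    """Parse one 'x y' pair per non-empty line (assumed layout of vertex.txt)."""
    return [tuple(float(v) for v in line.split())
            for line in text.splitlines() if line.strip()]

def parse_order(text):
    """Parse whitespace-separated 1-based indices (assumed layout of vertex_order.txt)."""
    return [int(tok) for tok in text.split()]

def ordered_waypoints(vertices, order):
    """Return the vertices in visiting order; a vertex may be visited more than once."""
    for i in order:
        if not 1 <= i <= len(vertices):
            raise ValueError(f"vertex index {i} is out of range")
    return [vertices[i - 1] for i in order]

def path_length(waypoints):
    """Total 2D length of the polyline through the waypoints, in map units."""
    return sum(hypot(b[0] - a[0], b[1] - a[1])
               for a, b in zip(waypoints, waypoints[1:]))

# Illustrative data only: three vertices visited in a closed loop.
vertices = parse_vertices("0 0\n3 0\n3 4\n")
order = parse_order("1 2 3 1")
route = ordered_waypoints(vertices, order)
total = path_length(route)
```

Keeping the vertices and the visiting order separate, as the two files do, lets the same vertex set be reused for several routes.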
3.2.2 Dense Trajectory (generated automatically)
Given a user-specified sparse trajectory, G2D moves the protagonist of the game to automatically follow the trajectory. The orientation (rotation) of the camera while the movement is being executed can be specified in two modes:
First-person view mode: G2D attaches the gameplay camera to the “eyes” of the protagonist, and the viewing direction always points forward, so the user does not need to handle the camera orientation.
Third-person view mode: While the protagonist automatically moves along the trajectory, the user can use the mouse to control the orientation of the camera.
While the environment is being explored, G2D calls the relevant native functions and performs the necessary computations to obtain the 6DOF pose of the gameplay camera at each frame. Every 6DOF pose is stored, one per line, in the file trajectory_dense.txt. Each line within trajectory_dense.txt has the following format:
<protagonist position XYZ> <camera position XYZ> <camera rotation XYZ>
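The line format above can be read with a few lines of code. This is a sketch that assumes the nine values are whitespace-separated decimal numbers; the type and function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple

Vec3 = Tuple[float, float, float]

class DensePose(NamedTuple):
    protagonist_position: Vec3
    camera_position: Vec3
    camera_rotation: Vec3

def parse_dense_line(line: str) -> DensePose:
    """Split one line into the three XYZ triples described in the text."""
    values = [float(tok) for tok in line.split()]
    if len(values) != 9:
        raise ValueError(f"expected 9 numbers, got {len(values)}")
    return DensePose(tuple(values[0:3]), tuple(values[3:6]), tuple(values[6:9]))

def parse_dense_trajectory(text: str) -> List[DensePose]:
    """Parse every non-empty line of a trajectory_dense.txt-style string."""
    return [parse_dense_line(line) for line in text.splitlines() if line.strip()]

# Illustrative values only.
sample = "1.0 2.0 3.0 1.5 2.5 3.5 0.0 0.0 90.0\n"
poses = parse_dense_trajectory(sample)
```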
It is worth noting that, because the dense trajectory is simply an editable text file, users can open it and make manual modifications to create a noisy version of the trajectory, and thereby generate a more challenging dataset.
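Such a noisy trajectory could also be produced by a script rather than by hand. The sketch below is one possible approach, not part of G2D: it adds Gaussian noise to the camera position and rotation columns and leaves the protagonist position untouched. The noise model, the parameter names and the fixed six-decimal output are all assumptions.

```python
import random

def perturb_line(line, rng, pos_sigma, rot_sigma):
    """Add Gaussian noise to the camera position and rotation of one line."""
    values = [float(tok) for tok in line.split()]
    if len(values) != 9:
        raise ValueError(f"expected 9 numbers, got {len(values)}")
    noisy = values[:3]  # protagonist position is kept unchanged
    noisy += [v + rng.gauss(0.0, pos_sigma) for v in values[3:6]]
    noisy += [v + rng.gauss(0.0, rot_sigma) for v in values[6:9]]
    return " ".join(f"{v:.6f}" for v in noisy)

def perturb_trajectory(lines, pos_sigma, rot_sigma, seed=0):
    """Perturb every non-empty line; a fixed seed makes the result repeatable."""
    rng = random.Random(seed)
    return [perturb_line(line, rng, pos_sigma, rot_sigma)
            for line in lines if line.strip()]

# Illustrative values only.
lines = ["1.0 2.0 3.0 1.5 2.5 3.5 0.0 0.0 90.0"]
noisy = perturb_trajectory(lines, pos_sigma=0.05, rot_sigma=1.0, seed=42)
```

The units of `pos_sigma` and `rot_sigma` follow whatever units the trajectory file uses.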
3.3 Retracing a dense trajectory
To retrace a dense trajectory, G2D opens the file trajectory_dense.txt, sequentially loads each line of that file and applies the 6DOF pose to the camera object. For each 6DOF pose, G2D captures a screenshot of the frame rendered by the camera object.
G2D stores the list of image files along with their corresponding 6DOF poses in 6dpose_list.txt, in the following format:
<image file name 1> <camera position XYZ> <camera rotation XYZ>
<image file name 2> <camera position XYZ> <camera rotation XYZ>
. . .
<image file name N> <camera position XYZ> <camera rotation XYZ>
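A sketch of a reader for this pose list, under the assumption that each line holds an image file name followed by six whitespace-separated numbers. Splitting from the right keeps file names that contain spaces intact. The function name is hypothetical.

```python
def parse_pose_list(text):
    """Map each image file name to its (camera position, camera rotation) pair."""
    poses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The last six fields are numbers; everything before them is the file name.
        parts = line.strip().rsplit(None, 6)
        if len(parts) != 7:
            raise ValueError(f"malformed line: {line!r}")
        numbers = [float(p) for p in parts[1:]]
        poses[parts[0]] = (tuple(numbers[:3]), tuple(numbers[3:]))
    return poses

# Illustrative file names and values only.
sample = (
    "img_0001.png 1.5 2.5 3.5 0.0 0.0 90.0\n"
    "img_0002.png 1.6 2.5 3.5 0.0 0.0 91.0\n"
)
pose_by_image = parse_pose_list(sample)
```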
All the images and their 6DOF poses are generated fully automatically from the predetermined dense trajectory, as explained in Section 3.2.2. Therefore, before retracing the dense trajectory, users can change the environmental conditions (i.e., weather, time and traffic density) according to their preference to obtain the desired dataset.
3.4 Changing the environmental conditions
G2D provides functions that let users change the environmental conditions. There are three settings:
Weather: G2D allows users to select between clear, rain and snow.
Time: G2D implements day time (at 12:00) and night time (at 23:00).
Traffic density: G2D lets users increase or decrease two types of traffic density, i.e., vehicle and pedestrian. The density value ranges from a minimum representing no pedestrians/vehicles to a maximum representing the normal number of pedestrians/vehicles on the road.
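Since the conditions can be set independently of the trajectory, one retrace can be planned per combination of weather and time. The sketch below simply enumerates those combinations. The dictionary keys and the function name are hypothetical, and the traffic densities are passed through as caller-supplied values without range checking.

```python
from itertools import product

WEATHERS = ("clear", "rain", "snow")             # weather options named in the text
TIMES = (("day", "12:00"), ("night", "23:00"))   # the two implemented times

def condition_grid(vehicle_density, pedestrian_density):
    """List one condition setting per weather/time combination."""
    grid = []
    for weather, (label, clock) in product(WEATHERS, TIMES):
        grid.append({
            "weather": weather,
            "time": label,
            "clock": clock,
            "vehicle_density": vehicle_density,
            "pedestrian_density": pedestrian_density,
        })
    return grid

runs = condition_grid(vehicle_density=0, pedestrian_density=0)
```

Retracing the same dense trajectory once per entry would give image sets that share one camera pose set.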
3.5 Unit Conversion
When testing some algorithms, it may be useful to convert distances on the 2D map of the game to metric units. To this end, we use the following procedure:
Make the game protagonist walk in the environment, and record the position of the protagonist (in 3D game coordinates) every 2-3 steps.
Calculate the average walking stride length of the protagonist in unit-distances of the 3D game coordinate system (roughly 0.9 units in the 3D game coordinates that we used).
Compare this with the average walking stride length of a male adult in meters, as reported in [21]; from this comparison, 1 unit-distance in the 3D game coordinate system is equal to about 0.85 meters.
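The procedure above can be sketched as follows. The sample positions are made up for illustration and the function names are hypothetical; only the 0.9-unit stride and the 0.85 meters-per-unit factor come from the text. Taken together, those two figures imply a real-world stride of about 0.9 × 0.85 ≈ 0.765 meters.

```python
from math import dist

METERS_PER_UNIT = 0.85  # conversion factor stated in the text

def average_stride_units(positions, steps_per_segment):
    """Average stride in game units: total distance walked / total steps taken."""
    if len(positions) != len(steps_per_segment) + 1:
        raise ValueError("need one step count per pair of consecutive positions")
    total_distance = sum(dist(a, b) for a, b in zip(positions, positions[1:]))
    return total_distance / sum(steps_per_segment)

def meters_per_unit(stride_meters, stride_units):
    """Conversion factor obtained by matching a real stride to the game stride."""
    return stride_meters / stride_units

def units_to_meters(distance_units, factor=METERS_PER_UNIT):
    """Convert a distance in game units to meters."""
    return distance_units * factor

# Illustrative recording: positions sampled after 2 steps and then 3 more steps.
recorded = [(0.0, 0.0, 0.0), (1.8, 0.0, 0.0), (4.5, 0.0, 0.0)]
stride = average_stride_units(recorded, steps_per_segment=[2, 3])
implied_stride_m = units_to_meters(stride)
```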
4 Sample application—testing SfM pipelines
[Table columns: Name | # images | # images for]
In this section, we use the datasets collected by G2D for structure from motion (SfM), one of the fundamental problems in geometric computer vision. To leverage the camera poses in game coordinates as groundtruth, we employ the registration method of [22] inside RANSAC to align the camera positions reconstructed by SfM to the game coordinate system. Our experimental evaluation is quite similar to [23], but they use the camera positions reconstructed by Bundler [1] as groundtruth, whereas we use the exact camera poses in game coordinates. The structure from motion framework that we use is COLMAP [24, 25].
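The alignment step can be illustrated with a short NumPy sketch. It assumes that the registration referred to is the least-squares similarity estimation of Umeyama listed in the references, and it wraps that estimate in a basic RANSAC loop over three-point samples. The threshold, the iteration count, the function names and the synthetic data are illustrative choices, not the authors' settings.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity (s, R, t) such that dst ~ s * R @ src + t."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # avoid returning a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((xs ** 2).sum() / len(src))
    t = mu_d - s * R @ mu_s
    return s, R, t

def residuals(src, dst, s, R, t):
    """Per-point distance between the transformed src and dst."""
    return np.linalg.norm(dst - (s * src @ R.T + t), axis=1)

def ransac_align(src, dst, threshold, iterations=200, seed=0):
    """Fit on random 3-point samples, keep the largest inlier set, then refit."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iterations):
        idx = rng.choice(len(src), size=3, replace=False)
        s, R, t = umeyama(src[idx], dst[idx])
        inliers = residuals(src, dst, s, R, t) < threshold
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("RANSAC found fewer than 3 inliers")
    s, R, t = umeyama(src[best], dst[best])
    return s, R, t, residuals(src, dst, s, R, t)

# Synthetic check: known similarity transform plus three corrupted SfM positions.
rng = np.random.default_rng(1)
sfm_positions = rng.uniform(-10.0, 10.0, size=(20, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
game_positions = 2.0 * sfm_positions @ R_true.T + np.array([1.0, 2.0, 3.0])
sfm_positions[:3] += 50.0  # simulate badly reconstructed cameras
scale, R, t, errors = ransac_align(sfm_positions, game_positions, threshold=1e-6)
```

Once the alignment is fixed, summary statistics such as `errors.mean()` and `np.median(errors)` give average and median position errors in game units.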
[Table columns: Name | size | # registered | # points | average error | median error]
We collect three different datasets from three blocks in the Little Seoul region. Figure 11 shows the location of those blocks and Table 11 shows the number of collected images. Only a subset of those images is used for the reconstruction. Figure 24 shows sample images from the three datasets. In terms of game conditions, the time, weather and traffic density are day (12:00), clear and 0, respectively.
This document presents G2D, a tool for collecting datasets in Grand Theft Auto V. Because it can access the native functions of the game, G2D allows users to control various environmental conditions within the game, i.e., weather, time and traffic density. In addition, G2D samples a set of images with their corresponding camera poses in game coordinates. G2D is also open source, so users can modify it to collect the dataset they need. The sample application section demonstrates the use of those camera poses as groundtruth for structure from motion.
- [1] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: Exploring photo collections in 3D,” in TOG, 2006.
- [2] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single camera SLAM,” TPAMI, 2007.
- [3] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” IJCV, 2009.
- [4] J. L. Schönberger, H. Hardmeier, T. Sattler, and M. Pollefeys, “Comparative evaluation of hand-crafted and learned local features,” in CVPR, 2017.
- [5] H. Zhou, T. Sattler, and D. W. Jacobs, “Evaluating local features for day-night matching,” in ECCV, 2016.
- [6] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, 2017.
- [7] https://github.com/marsauto/europilot.
- [8] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in CVPR, 2016.
- [9] “Unity homepage,” https://unity3d.com/.
- [10] W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, Y. Wang, and A. Yuille, “UnrealCV: Virtual worlds for computer vision,” ACM Multimedia Open Source Software Competition, 2017.
- [11] “Unreal engine homepage,” https://www.unrealengine.com.
- [12] “UnrealCV model zoo,” http://docs.unrealcv.org/en/master/reference/model_zoo.html.
- [13] “Unreal marketplace,” https://www.unrealengine.com/marketplace.
- [14] https://www.pcgamer.com/gta-5-estimated-to-be-the-most-profitable-entertainment-product-of-all-time/.
- [15] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in ECCV, 2016.
- [16] S. R. Richter, Z. Hayder, and V. Koltun, “Playing for benchmarks,” in ICCV, 2017.
- [17] http://www.dev-c.com/gtav/scripthookv/.
- [18] https://www.gta5-mods.com/.
- [19] https://www.gta5-mods.com/scripts/invisibility-cloak.
- [20] http://www.dev-c.com/nativedb/.
- [21] http://livehealthy.chron.com/average-walking-stride-length-7494.html.
- [22] S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” TPAMI, 1991.
- [23] K. Wilson and N. Snavely, “Robust global translations with 1DSfM,” in ECCV, 2014.
- [24] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in CVPR, 2016.
- [25] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, “Pixelwise view selection for unstructured multi-view stereo,” in ECCV, 2016.