Trained Trajectory based Automated Parking System using Visual SLAM

by   Nivedita Tripathi, et al.

Automated parking is becoming a standard feature in modern vehicles. Existing parking systems build a local map in order to plan a maneuver towards a detected slot. Next-generation parking systems have a use case in which they build a persistent map of an environment where the car is frequently parked, for example a home or office parking area. The pre-built map helps to re-localize the vehicle more accurately the next time it parks there. This is achieved by augmenting the parking system with a Visual SLAM pipeline, and the feature is called trained trajectory parking. In this paper, we discuss the use cases, design and implementation of a trained trajectory automated parking system. To encourage further research, we release a dataset of 50 video sequences comprising over 100,000 images with the associated ground truth as a companion to our WoodScape dataset [18]. To the best of the authors' knowledge, this is the first public dataset for trained trajectory parking system scenarios.



I Introduction

Broadly, Autonomous Driving (AD) use cases are split into three scenarios according to the speed of operation, namely high speed highway driving, medium speed urban driving and low speed parking maneuvers [5]. High speed use cases are relatively well defined and structured, and hence features like highway lane keep assist are the most mature and are already deployed in the market. Urban driving use cases correspond to medium speed; they are highly unstructured and the most challenging. Parking is a low speed use case that lies somewhere in the middle in terms of structuredness. The driving rules of parking and its associated road infrastructure (road markings and traffic signs) are less well defined, but parking is relatively easier to handle because it involves low speed manoeuvring. Parking requires near-field sensing instead of the typical far-field sensing provided by front cameras [4]. This is typically achieved by four fisheye cameras which provide full 360° coverage of the near field around the car (Figure 1).

Fig. 1: Images from the surround-view camera network showing near-field sensing and wide field of view. Four fisheye cameras (marked green) provide a 360° surround view.

In particular, it is quite common to park repeatedly in the same areas, e.g. the owner's home (either a garage or a space in front of the house) and the office. An accurate map of the region helps the automated maneuver to park more efficiently. This can be achieved by means of a Visual SLAM pipeline which builds a map of the parking area that can be used later for re-localization. Typically these parking areas are private and cannot be mapped by commercial mapping companies like TomTom, HERE, etc. Thus the vehicle itself must have the intelligence to map frequently used parking areas and then re-localize within them. In this paper, we describe our system, which provides this feature using a commercial automotive grade camera and embedded system.

Visual Simultaneous Localization And Mapping (VSLAM) is a well studied problem in robotics and autonomous driving. There are primarily three types of approaches, namely (1) feature based methods, (2) direct SLAM methods and (3) CNN approaches. Feature based methods make use of descriptive image features for tracking and depth estimation [8], which results in sparse maps. MonoSLAM [2], Parallel Tracking and Mapping (PTAM) [7] and ORB-SLAM [10] are seminal algorithms of this type. Direct SLAM methods work on the entire image instead of sparse features, which aids in building a dense map. Dense Tracking and Mapping (DTAM) [11] and Large-Scale Direct Monocular SLAM (LSD-SLAM) [3] are the popular direct methods, and both are based on minimization of photometric error. CNN based approaches are relatively less mature for Visual SLAM problems and are discussed in detail in [9].

The rest of the paper is structured as follows. Section II provides an overview of trained trajectory parking system use cases. Section III details the system architecture of a trained trajectory parking system and its components. Section IV discusses Visual SLAM pipeline in detail and its challenges. Finally, Section V summarizes the paper and provides potential future directions.

II Trained Trajectory Parking System

Trained trajectory parking works in two phases: a training phase and a replay phase. In the training phase, a human driver drives the vehicle to park wherever needed (e.g. carport, garage, etc.). The trajectory and information about its surroundings are stored for automated replication at a later time. In the replay phase, the trained trajectory is loaded into the vehicle, and the software recognizes the vehicle's current location with respect to the learned trajectory throughout the path. This is illustrated in Figure 2.

Fusing odometry and/or ultrasonic sensor information during the training phase gives a more accurate trajectory to replay. Vehicle Control Planning then uses the calculated position of the vehicle to plan a route back to the parking location, and controls the steering and acceleration so that the vehicle drives itself there.
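As a rough illustration of the two phases, the sketch below stores a trained trajectory as a list of planar key poses and, during replay, looks up the trained pose closest to the current estimate. The names `Pose2D`, `nearest_pose_index` and `remaining_path`, and the planar pose representation itself, are assumptions made for illustration and are not the data structures of the system described here.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose2D:
    """Hypothetical planar key pose stored during the training phase."""
    x: float        # metres, in the map frame
    y: float        # metres, in the map frame
    heading: float  # radians


def nearest_pose_index(trajectory, current):
    """Return the index of the trained pose closest to the current pose."""
    if not trajectory:
        raise ValueError("trajectory is empty")
    return min(
        range(len(trajectory)),
        key=lambda i: math.hypot(trajectory[i].x - current.x,
                                 trajectory[i].y - current.y),
    )


def remaining_path(trajectory, current):
    """Trained poses still to be driven, starting from the nearest one."""
    return trajectory[nearest_pose_index(trajectory, current):]


# Training phase: key poses recorded while a human drives (made-up values).
trained = [Pose2D(0.0, 0.0, 0.0), Pose2D(1.0, 0.0, 0.0),
           Pose2D(2.0, 0.5, 0.2), Pose2D(3.0, 1.0, 0.4)]

# Replay phase: the localized vehicle pose is matched against the trajectory.
current = Pose2D(1.2, 0.1, 0.0)
path = remaining_path(trained, current)
```

A planner would then track `path`; a real system would also have to handle headings, loops in the trajectory and localization uncertainty, which this sketch ignores.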

Fig. 2: Illustration of Trained Parking and Relocalization: The white dotted path is the trained trajectory (with features in red from surrounding objects). Yellow blob with arrow shows the current vehicle (with detected features in blue) moving in direction of arrow, following the trained path.
Fig. 3: Trained Parking System Architecture

A Visual SLAM algorithm is used in both the training and replay phases to compute and recognize the trained trajectory and the vehicle position. These phases of trained trajectory parking are used in different use cases, as described below.

Home Parking: A driver frequently parks the car in their home area, and the idea is to learn the home region in order to automate the parking maneuver. A home parking system uses computer vision techniques to localize the ego-vehicle within an already stored trajectory, so that the vehicle can drive fully autonomously into the home parking slot along that trajectory. In such applications, the system stores landmarks within the scene, and the sensors work on detecting those landmarks. The driver trains the system on these landmarks, and the system then uses them for localization.

Automated Reverse Parkout: This lets the driver reverse any maneuver (e.g. driving into a dead end, parking in). Usually several trained trajectories are stored in a buffer in persistent memory, and the user can then choose the preferred trajectory for automated park-in or park-out based on the vehicle's current location. The trajectory for automated replay of park-out is recorded continuously, generally without any manual trigger.

Valet Parking: Valet Parking is the most advanced form of parking which requires Level 4 automation. This system has to autonomously perform navigation to find parking slots, select the optimal one and then park itself. It is quite challenging to accomplish this in an unknown environment. Thus there are efforts to create infrastructure maps with artificial landmarks (special QR code like markers) which can be leveraged by vehicles to build an efficient local map.
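The continuous recording described for automated reverse park-out can be pictured as a bounded buffer of recent poses that is read back in reverse order. The following is a minimal sketch under that assumption; the class name `ParkOutRecorder`, the buffer size and the tuple pose format are hypothetical and not taken from the system described here.

```python
from collections import deque


class ParkOutRecorder:
    """Keeps only the most recent poses, so recording needs no manual trigger."""

    def __init__(self, max_poses=1000):
        # A deque with maxlen silently drops the oldest pose when full.
        self._buffer = deque(maxlen=max_poses)

    def record(self, pose):
        """Append the latest vehicle pose, e.g. an (x, y, heading) tuple."""
        self._buffer.append(pose)

    def reverse_trajectory(self):
        """Poses ordered from the current position back towards the start."""
        return list(reversed(self._buffer))


recorder = ParkOutRecorder(max_poses=3)
for pose in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.5, 0.1)]:
    recorder.record(pose)

way_back = recorder.reverse_trajectory()
```

With `max_poses=3`, the first pose has already been discarded by the time the trajectory is reversed, which mirrors the trade-off between memory use and how far back a maneuver can be undone.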

III Parking System Architecture

The block diagram of our system is illustrated in Figure 3 and described in this section. The necessary computer vision modules required for a parking system are discussed in Section III-B. Visual SLAM needed for trained trajectory parking is discussed in more detail in Section IV.

III-A Platform Overview

Sensors: The car setup comprises commercially deployed automotive grade sensors, as shown in Figure 1. The primary sensors required for a parking system are fisheye cameras (for providing trajectory information) and ultrasonic sensors (for proximal obstacle detection on the way to parking). There are four fisheye cameras (marked green in the figure) with 1 megapixel resolution and a wide horizontal field of view (FOV). The four cameras together cover the entire FOV around the car. These cameras are designed to provide optimal near-field sensing up to 10 metres and slightly reduced perception up to 25 metres. There is also an array of 12 ultrasonic sensors (marked gray in the figure) covering the front and rear regions. They provide a safety net around the car to avoid collisions, which is necessary for a robust system. Each is typically a single membrane sensor that uses modulated pulses for transmission and reception at its center frequency. The sensor provides raw data from the piezoelectric element, which is followed by an analog filter bank for signal conditioning before digital conversion. Further processing steps to detect and localize objects are done on the ECU. Some higher end systems also have LiDAR, which can be useful for localization, but it is not the main sensor.

SOCs: Although autonomous driving prototypes are demonstrated on large PCs, production systems have to be deployed on low-power and low-cost embedded systems. In spite of the rapid growth in computational power of automotive embedded systems, it is still quite challenging to deploy computer vision algorithms on them [1] [6]. Figure 3 shows a typical automotive embedded system, called an Electronic Control Unit (ECU), in the top left region. Typical SOC families for automotive use include the Texas Instruments TDAx, Renesas V3H and Nvidia Xavier platforms. The majority of these SOCs provide custom computer vision hardware accelerators for dense optical flow, stereo disparity and deep learning. A typical SOC provides computing power measured in TOPS within a tightly limited power budget.

Fig. 4: Semantic Segmentation on the fisheye camera
Fig. 5: Depth estimation via motion stereo

Software Architecture: Typical pre-processing steps applied before the vision algorithms include fisheye distortion correction, contrast enhancement and de-noising. Objects are detected using computer vision algorithms (discussed in Section III-B). They are then fed into the map used to plan the maneuver for automated parking. The objects in the image coordinates of the four cameras are converted to a centralized world coordinate system. Depth estimation provides localization of objects like pedestrians and vehicles detected by semantic segmentation. Similarly, road objects like lanes and curbs are extracted using a connected component algorithm and mapped to world coordinates. Objects can also be extracted from other sensors like ultrasonics and LiDAR, if available, and are then fused into the same map.
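The conversion from per-camera coordinates to a centralized coordinate system can be sketched as a rigid transform per camera. The example below is deliberately simplified: it assumes points already lie on the ground plane in each camera's own metric frame, and the mounting yaw and offset values are made up. A real system would use full 3D extrinsics together with the fisheye camera model.

```python
import math


def camera_to_vehicle(point_cam, yaw, translation):
    """Map a ground-plane point (x, y) from a camera frame to the vehicle frame.

    yaw:         mounting yaw of the camera relative to the vehicle (radians)
    translation: camera position (tx, ty) in the vehicle frame (metres)
    """
    x, y = point_cam
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty = translation
    # Rotate into the vehicle's orientation, then shift by the mounting offset.
    return (c * x - s * y + tx, s * x + c * y + ty)


# Hypothetical rear camera: looks backwards (yaw = pi), mounted 1 m behind
# the vehicle origin. A point 2 m in front of that camera is 3 m behind
# the vehicle origin.
rear_point = camera_to_vehicle((2.0, 0.0), yaw=math.pi, translation=(-1.0, 0.0))
```

Applying the same function with each camera's own `yaw` and `translation` places detections from all four cameras in one common frame, where they can be fused with other sensors.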

The Vehicle Control and Planning unit uses the map and the current position to plan a route back to the parking location, and controls the steering and acceleration so that the vehicle drives itself there. GPS can be used to provide a coarse localization of the vehicle at the start of the trajectory. It is also important for the system-level software to perceive an obstacle or pedestrian on the driving path and to change the course of the trajectory accordingly. The system should be able to use Auto Emergency Braking functionality to apply emergency braking under a set of conditions.

III-B Standard Computer Vision Modules in Parking

In addition to traditional feature matching, a modern VSLAM system uses semantic information for robustness in re-localization. A common modern practice is to recognize the dynamic and movable objects in the scene and to give zero or very low weight to the features carried by these objects.

Semantic Segmentation: The main objects of interest are roadway objects like freespace, road markings, curbs, etc. and dynamic objects like vehicles, pedestrians and cyclists. They can all be detected by a unified semantic segmentation network [15] in real time [16]. Figure 4 shows the output of the ENet network [13] on a wide-angle fisheye image. In general, these objects are detected for navigation and obstacle detection in automated driving. Specifically for our application, detecting dynamic objects helps to eliminate their feature points from the map, as they may not be in the same location during re-localization, whereas static entities like lane and road markings provide valid trajectories which can be traversed during the maneuver.
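A minimal sketch of how segmentation labels could be used to down-weight features on dynamic objects is given below. The class names, the `label_at` lookup and the default weight of zero are illustrative assumptions rather than the configuration of the system described here.

```python
# Classes treated as dynamic or movable (assumed list, for illustration only).
DYNAMIC_CLASSES = {"vehicle", "pedestrian", "cyclist"}


def feature_weights(keypoints, label_at, dynamic_weight=0.0):
    """Assign a weight to each keypoint based on its semantic class.

    keypoints:      list of (u, v) pixel coordinates
    label_at:       callable (u, v) -> class name from the segmentation output
    dynamic_weight: weight given to features on dynamic objects (0 drops them)
    """
    return [
        dynamic_weight if label_at(u, v) in DYNAMIC_CLASSES else 1.0
        for u, v in keypoints
    ]


# Toy 3x2 segmentation map indexed as labels[v][u].
labels = [["road", "road", "vehicle"],
          ["curb", "road", "vehicle"]]
keypoints = [(0, 0), (2, 0), (0, 1), (2, 1)]

weights = feature_weights(keypoints, lambda u, v: labels[v][u])
```

Features with weight zero would be left out of the stored map, while a small non-zero `dynamic_weight` keeps them available for tracking without letting them dominate re-localization.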

Generic Obstacle Detection: In order to obtain a robust system, it is essential to detect objects using cues other than appearance. Training an appearance based semantic segmentation for all possible objects is quite challenging in practice, as there are rare object classes such as kangaroos or construction trucks. Motion and depth are cues of this kind which are very useful in automotive scenes. Typically, depth is used to detect static objects and motion is used to detect dynamic objects. As mentioned before, most automotive SOCs provide dense optical flow and stereo hardware accelerators which can be leveraged. The stereo accelerator can be used for motion stereo with our monocular cameras. Figure 5 illustrates depth computed by a motion stereo algorithm. In this case, the fisheye distortion manifold consists of piece-wise planar surfaces, which are visualized below the point cloud. Alternatively, depth and motion can also be computed using an efficient multi-task network [17].
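To give a feel for motion stereo, the sketch below applies the standard rectified-stereo relation Z = f · B / d, where the baseline B is the distance the camera travelled between two frames. It assumes the two frames have already been rectified so that the motion acts as a sideways baseline (for example a side-facing camera on a vehicle moving forward); the fisheye-specific handling in the actual system is not shown, and all numbers are made up.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres from the rectified-stereo relation Z = f * B / d.

    focal_px:     focal length of the rectified view, in pixels
    baseline_m:   distance travelled between the two frames, in metres
                  (for motion stereo this would come from odometry)
    disparity_px: horizontal shift of the feature between frames, in pixels
    """
    if disparity_px <= 0:
        return None  # no parallax, depth cannot be recovered
    return focal_px * baseline_m / disparity_px


# A feature that shifts by 12 px while the vehicle moves 0.2 m, seen with a
# 300 px focal length, lies roughly 5 m away.
depth = depth_from_disparity(focal_px=300.0, baseline_m=0.2, disparity_px=12.0)
```

The relation also shows why depth from motion stereo degrades for far objects and for slow motion: both lead to small disparities, where a one-pixel error changes the estimate substantially.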

Fig. 6: Example of High Definition (HD) map from TomTom RoadDNA (Reproduced with permission of the copyright owner)
Fig. 7: Bird’s-eye view of point cloud and camera pose of a trajectory generated by Visual SLAM pipeline
Fig. 8: Block diagram of VSLAM showing two parallel pipelines for training and replay phases

IV Visual SLAM for Parking

IV-A Mapping Overview

Mapping is one of the key pillars of autonomous driving. Many of the first successful demonstrations of autonomous driving (e.g. by Google) relied primarily on localization to pre-mapped areas. Figure 6 illustrates a commercial HD map service for autonomous driving provided by TomTom RoadDNA [12]. It provides a highly dense semantic 3D point cloud map and localization service for the majority of European cities with a typical localization accuracy of 10 cm. When accurate localization is available, HD maps can be treated as a dominant cue, as a strong prior semantic segmentation is already available and can be refined by an online segmentation algorithm [14]. However, this service is expensive as it requires regular maintenance and upgrades of various regions in the world. Due to privacy laws and accessibility, such a commercial service cannot be used in many situations, and a mapping mechanism has to be built within the vehicle's embedded system. For example, a private residential area cannot be legally mapped in many countries, such as Germany. Figure 7 shows a point cloud generated by our system. It is quite sparse compared to the dense HD map due to the limited computational power available in a vehicle.

IV-B Basic Pipeline of VSLAM

Visual Simultaneous Localization And Mapping (VSLAM) is an algorithm that simultaneously builds a map of the environment surrounding the car and estimates the current location of the car within that environment. The input is wide angle images from any one or a combination of the four cameras mounted on the car. The vehicle's surroundings are then mapped and the vehicle is tracked against the map; together these steps constitute the basic VSLAM pipeline visualized in Figure 8.

Mapping is the process of generating, from the tracked sensor data, a map which consists of a trained trajectory and landmarks. A trained trajectory is a set of key poses, surrounded by landmarks, that spans the path from the vehicle's origin to its destination. These landmarks are represented using robust image features that are unique in the captured images. After reviewing state of the art Visual SLAM pipelines in terms of their advantages and disadvantages, we concluded that a feature based approach is more suitable than direct methods, as it requires less memory and is less sensitive to dynamic objects and structural changes in the scene. A distinct feature in an image could be a region of pixels where the intensity changes in a particular way, or an edge or a corner. In order to estimate landmarks in the world, tracking is performed, wherein two or more views of the same features are matched. Once the vehicle has moved a sufficient amount, VSLAM takes another image and extracts features. The corresponding features are reconstructed to obtain their coordinates and poses in the real world.
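Two of the steps above, taking a new image only after sufficient motion and matching features between views, can be sketched as follows. The travel threshold, the use of binary descriptors stored as Python integers (in the spirit of ORB) and the greedy matcher are all assumptions for illustration; the paper does not specify which descriptor or matching strategy its pipeline uses.

```python
import math


def select_keyframes(positions, min_travel_m=0.5):
    """Indices of frames kept as key frames.

    A frame is kept once the vehicle has moved at least min_travel_m
    since the previously kept frame.
    """
    if not positions:
        return []
    keep = [0]
    for i in range(1, len(positions)):
        if math.dist(positions[i], positions[keep[-1]]) >= min_travel_m:
            keep.append(i)
    return keep


def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_descriptors(desc_a, desc_b, max_distance=2):
    """Greedy nearest-neighbour matching of binary descriptors."""
    matches = []
    if not desc_b:
        return matches
    for i, da in enumerate(desc_a):
        j, d = min(((j, hamming(da, db)) for j, db in enumerate(desc_b)),
                   key=lambda item: item[1])
        if d <= max_distance:
            matches.append((i, j))
    return matches


# Vehicle positions (metres) for five consecutive frames.
positions = [(0.0, 0.0), (0.2, 0.0), (0.6, 0.0), (0.9, 0.0), (1.2, 0.0)]
keyframes = select_keyframes(positions)

# Toy 8-bit descriptors from two key frames.
frame_a = [0b10110010, 0b01001101]
frame_b = [0b01001111, 0b10110011, 0b11111111]
matches = match_descriptors(frame_a, frame_b)
```

Matched pairs such as those in `matches` are what a reconstruction step would triangulate into 3D landmarks; production matchers typically add a ratio test or cross-check to reject ambiguous matches.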

Frame-to-frame 3D reconstruction and visual odometry can drift and need to be corrected globally. This is achieved by a bundle adjustment step which jointly optimizes 3D points and camera positions. It is a very computationally intensive step, as high reprojection errors of 3D points increase the number of iterations needed to reduce the cost, and thus it cannot be performed for every frame. It is typically performed once every N frames and is called windowed bundle adjustment. At the end of training, full (global) bundle adjustment is also performed, wherein all the key frames (not every frame over the trajectory) are optimized to ensure global consistency of the internal VSLAM map.
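For reference, the cost that bundle adjustment reduces is commonly written as a sum of reprojection errors. The formulation below is the standard textbook form in notation chosen here for illustration; the paper does not state its exact objective, camera model or robust loss.

```latex
\min_{\{T_i\},\,\{X_j\}} \;
\sum_{(i,j)\in\mathcal{O}}
\rho\!\left( \left\lVert x_{ij} - \pi\!\left(T_i, X_j\right) \right\rVert^{2} \right)
```

Here T_i is the camera pose of key frame i, X_j is the 3D position of landmark j, x_ij is the observed image location of landmark j in key frame i, π is the camera projection function (a fisheye model for the cameras used here), ρ is an optional robust loss, and O is the set of (key frame, landmark) observations. Windowed bundle adjustment restricts the sum to recent key frames, while global bundle adjustment at the end of training uses all key frames.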

The final optimized trajectory is saved in persistent memory as a map and is used by the algorithm to relocalize the vehicle pose for automated maneuvering. During replay, the live camera images are searched for features, which are matched to frames from the trained map. If features from the live images are matched to the map, the optimization module (bundle adjustment) can estimate the current position of the vehicle relative to where it was during training of the trajectory.
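The idea of recovering the vehicle pose from matched landmarks can be illustrated with a much simpler stand-in for the optimization module: a closed-form least-squares rigid alignment in 2D. It assumes that matched landmarks are available both in the map frame and in the current vehicle frame as planar points, which is not how the actual pipeline works (it optimizes reprojection error on images); the function name and the data are hypothetical.

```python
import math


def estimate_pose_2d(points_vehicle, points_map):
    """Least-squares rigid alignment: find (tx, ty, theta) with map = R * veh + t.

    points_vehicle: matched landmarks expressed in the current vehicle frame
    points_map:     the same landmarks expressed in the trained map frame
    """
    n = len(points_vehicle)
    if n < 2 or n != len(points_map):
        raise ValueError("need at least two matched point pairs")

    # Centroids of both point sets.
    pcx = sum(p[0] for p in points_vehicle) / n
    pcy = sum(p[1] for p in points_vehicle) / n
    qcx = sum(q[0] for q in points_map) / n
    qcy = sum(q[1] for q in points_map) / n

    # Accumulate dot and cross terms of the centred points.
    s_dot = s_cross = 0.0
    for (px, py), (qx, qy) in zip(points_vehicle, points_map):
        ax, ay = px - pcx, py - pcy
        bx, by = qx - qcx, qy - qcy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx

    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = qcx - (c * pcx - s * pcy)
    ty = qcy - (s * pcx + c * pcy)
    return tx, ty, theta


# Synthetic check: landmarks seen from a vehicle at (2, -1) with heading 0.3 rad.
true_tx, true_ty, true_theta = 2.0, -1.0, 0.3
seen = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
c, s = math.cos(true_theta), math.sin(true_theta)
in_map = [(c * x - s * y + true_tx, s * x + c * y + true_ty) for x, y in seen]

tx, ty, theta = estimate_pose_2d(seen, in_map)
```

The returned `(tx, ty, theta)` is the vehicle pose in the map frame. With noisy or partly wrong matches, an outlier-rejection step such as RANSAC would normally precede this kind of estimate.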

Qualitative results on our validation dataset are shared in an accompanying video. The video shows our automated parking solution replaying a trained trajectory. Live images come from the front (right) and reverse (left) view cameras. The small video played in the middle shows the trained trajectory map (vehicle poses shown as white dots) with sparse features surrounding it. The moving yellow arrow shows the live movement of the vehicle according to the localized positions calculated by the VSLAM algorithm. In the majority of scenarios, the proposed algorithm is able to re-localize on a trained trajectory with a tolerance of 2° in orientation and 0.05 m in position.
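A check against the stated tolerance of 2° in orientation and 0.05 m in position could look like the sketch below. The pose format `(x, y, heading_deg)` and the function names are assumptions; the only values taken from the text are the two tolerances.

```python
import math


def angle_diff_deg(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def within_tolerance(estimated, reference, pos_tol_m=0.05, ang_tol_deg=2.0):
    """True if an estimated pose (x, y, heading_deg) is close enough to a reference."""
    pos_err = math.hypot(estimated[0] - reference[0], estimated[1] - reference[1])
    ang_err = abs(angle_diff_deg(estimated[2], reference[2]))
    return pos_err <= pos_tol_m and ang_err <= ang_tol_deg


# 3 cm off in position and 1.5 degrees off in heading (across the 0/360 wrap).
ok = within_tolerance((1.03, 2.0, 359.0), (1.0, 2.0, 0.5))

# 10 cm off in position: outside the position tolerance.
too_far = within_tolerance((1.10, 2.0, 0.0), (1.0, 2.0, 0.0))
```

Wrapping the heading difference matters in practice: without it, headings of 359° and 0.5° would appear to differ by 358.5° rather than 1.5°.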

Table I presents the results for a few selected challenging scenes in our WoodScape dataset. These scenes vary in both time of day/date and lateral/angular offsets, causing illumination and structural changes in the video sequences. The relocalization rate is affected by the amount of variation between training and replay scenes. The first three columns in the table give the scene number and the training and replay recordings, identified by the time at which they were recorded (yyyymmdd_hhmmss). The fourth and fifth columns give the difference in time (in days) between the capture of the training and replay scenes, and the difference in distance (in meters) between the starting points of the training and replay scenes. The next two columns give the average offsets in position and angle over the length of the trajectory. The last column is the average relocalization percentage obtained for the respective combination of training and replay trajectories. Scene6 is the most challenging scenario due to large variations in both illumination and lateral offset, and thus has a relatively worse relocalization rate.

| S.No. | Training scene | Replay scene | Time difference (days) | Distance difference (m) | Average position offset (m) | Average angle offset (degrees) | Relocalization rate |
|---|---|---|---|---|---|---|---|
| Scene1 | 20161208_121008 | 20161208_121405 | 0.003 | 4.723 | 0.468 | 4.704 | 81.00% |
| Scene2 | 20161208_125225 | 20161208_125945 | 0.005 | 2.483 | 0.355 | 5.366 | 98.60% |
| Scene3 | 20161208_125225 | 20161208_130048 | 0.006 | 2.692 | 0.3 | 5.149 | 94.30% |
| Scene4 | 20161208_125225 | 20170607_163643 | 181.156 | 2.49 | 1.085 | 8.162 | 74.90% |
| Scene5 | 20161208_125319 | 20170607_163643 | 181.155 | 0.066 | 0.903 | 9.498 | 86.90% |
| Scene6 | 20161208_125319 | 20170607_163529 | 181.154 | 4.96 | 0.896 | 10.751 | 42.80% |

TABLE I: Quantitative results of relocalization rate on selected WoodScape dataset scenes (the higher the relocalization rate, the better the performance)
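The time-difference column of Table I can be reproduced from the recording identifiers, as the short sketch below shows. The `relocalization_rate` helper assumes the rate is the percentage of replay frames that were successfully relocalized; the paper describes it only as an average relocalization percentage, so that definition is an assumption.

```python
from datetime import datetime

SCENE_ID_FORMAT = "%Y%m%d_%H%M%S"  # e.g. 20161208_125225


def days_between(training_id, replay_id):
    """Time difference in days between two recording identifiers."""
    start = datetime.strptime(training_id, SCENE_ID_FORMAT)
    end = datetime.strptime(replay_id, SCENE_ID_FORMAT)
    return (end - start).total_seconds() / 86400.0


def relocalization_rate(relocalized_flags):
    """Percentage of frames flagged as relocalized (assumed definition)."""
    if not relocalized_flags:
        raise ValueError("no frames given")
    return 100.0 * sum(relocalized_flags) / len(relocalized_flags)


# Scene4 and Scene1 of Table I.
scene4_days = round(days_between("20161208_125225", "20170607_163643"), 3)
scene1_days = round(days_between("20161208_121008", "20161208_121405"), 3)

# Toy example: three of four frames relocalized.
rate = relocalization_rate([True, True, False, True])
```

Running this gives 181.156 days for Scene4 and 0.003 days for Scene1, which matches the values listed in the table.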

IV-C Technical Challenges

We briefly list the practical challenges involved in deploying this system based on our experience.

  • Illumination or weather condition changes can cause the scene to appear visually different. For example, if mapping and localization are done under different conditions (day vs. night, summer vs. winter, etc.), the algorithm can degrade significantly as there will be fewer feature correspondences.

  • Residential areas can contain similar structures, which makes it difficult to match unique features. Thus the system needs to be augmented with more specialized features or higher level semantics.

  • The majority of current generation cars do not have access to cloud infrastructure, and thus the mapping has to be done on the car's embedded system. As a result, at the end of the trajectory the driver has to wait while global bundle adjustment of the map completes.

  • The SLAM pipeline requires a good initialization so that the features along the trajectory can be matched effectively. This is typically provided by a noisy GPS signal, which may cause unreliable relocalization.

  • Structural changes in the scene are quite common due to the movement of objects, and the map has to be dynamically updated to incorporate these changes.

  • Automotive cameras typically have a rolling shutter, which has to be compensated for, especially at relatively higher speeds.

  • Scale ambiguity is resolved by leveraging the metric distance between multiple cameras, but scale drift is still possible due to noise in the estimation.
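The last point, resolving scale from the known distance between cameras, can be sketched as a single scale factor applied to the map. This is a simplified illustration under stated assumptions: the camera positions are assumed to be estimated in arbitrary SLAM units, the known baseline of 2.0 m is a made-up value, and the actual system may resolve scale inside the optimization rather than as a post-processing step.

```python
import math


def metric_scale(cam_a_est, cam_b_est, known_baseline_m):
    """Factor that converts SLAM units to metres.

    cam_a_est, cam_b_est: estimated 3D positions of two rigidly mounted cameras
    known_baseline_m:     their true separation on the vehicle, in metres
    """
    estimated = math.dist(cam_a_est, cam_b_est)
    if estimated == 0.0:
        raise ValueError("estimated camera positions coincide")
    return known_baseline_m / estimated


def rescale(points, scale):
    """Apply the metric scale to a list of 3D map points."""
    return [tuple(scale * c for c in p) for p in points]


# Cameras estimated 0.5 SLAM units apart but mounted 2.0 m apart on the car.
scale = metric_scale((0.0, 0.0, 0.0), (0.0, 0.5, 0.0), known_baseline_m=2.0)
metric_points = rescale([(1.0, 2.0, 3.0)], scale)
```

Because the estimated camera separation is itself noisy, the recovered factor fluctuates over time, which is one way to picture the residual scale drift mentioned above.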

V Conclusion

In this paper, we provided an overview of an industrial trained trajectory automated parking system. We discussed the trained trajectory parking use cases and demonstrated how to extend current parking systems using a Visual SLAM pipeline. We described the Visual SLAM pipeline in detail and listed the practical challenges encountered in commercial deployment. To encourage further research, we plan to release an accompanying image dataset. In future work, we plan to explore a unified multi-task network to perform visual SLAM and other object detection modules.


  • [1] A. Chavan and S. K. Yogamani. Real-time dsp implementation of pedestrian detection algorithm using hog features. In 2012 12th International Conference on ITS Telecommunications, pages 352–355. IEEE, 2012.
  • [2] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. Monoslam: Real-time single camera slam. IEEE Trans. Pattern Anal. Mach. Intell., 29(6):1052–1067, June 2007.
  • [3] J. Engel, T. Schöps, and D. Cremers. Lsd-slam: Large-scale direct monocular slam. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
  • [4] M. Heimberger, J. Horgan, C. Hughes, J. McDonald, and S. Yogamani. Computer vision in automated parking systems: Design, implementation and challenges. Image and Vision Computing, 68:88–101, 2017.
  • [5] J. Horgan, C. Hughes, J. McDonald, and S. Yogamani. Vision-based driver assistance systems: Survey, taxonomy and advances. In 2015 IEEE 18th International Conference on Intelligent Transportation Systems, pages 2032–2039. IEEE, 2015.
  • [6] B. R. Kiran, K. Anoop, and Y. S. Kumar. Parallelizing connectivity-based image processing operators in a multi-core environment. In 2011 International Conference on Communications and Signal Processing, pages 221–223. IEEE, 2011.
  • [7] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR’07), Nara, Japan, 2007.
  • [8] V. R. Kumar, S. Milz, C. Witt, M. Simon, K. Amende, J. Petzold, S. Yogamani, and T. Pech. Monocular fisheye camera depth estimation using sparse lidar supervision. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018.
  • [9] S. Milz, G. Arbeiter, C. Witt, B. Abdallah, and S. Yogamani. Visual slam for automated driving: Exploring the applications of deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 247–257, 2018.
  • [10] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
  • [11] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. Dtam: Dense tracking and mapping in real-time. In Proceedings of the 2011 International Conference on Computer Vision, ICCV ’11, pages 2320–2327, Washington, DC, USA, 2011. IEEE Computer Society.
  • [12] TomTom N.V. TomTom RoadDNA. [Online: 09/2019].
  • [13] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
  • [14] B. Ravi Kiran, L. Roldao, B. Irastorza, R. Verastegui, S. Suss, S. Yogamani, V. Talpaert, A. Lepoutre, and G. Trehard. Real-time dynamic object detection for autonomous driving using prior 3d-maps. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
  • [15] M. Siam, S. Elkerdawy, M. Jagersand, and S. Yogamani. Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pages 1–8. IEEE, 2017.
  • [16] M. Siam, M. Gamal, M. Abdel-Razek, S. Yogamani, and M. Jagersand. Rtseg: Real-time semantic segmentation comparative study. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1603–1607. IEEE, 2018.
  • [17] G. Sistu, I. Leang, S. Chennupati, S. Milz, S. Yogamani, and S. Rawashdeh. NeurAll: Towards a unified model for visual perception in automated driving. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 67–72. IEEE, 2019.
  • [18] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, D. O’Dea, M. Uricar, S. Milz, M. Simon, K. Amende, C. Witt, H. Rashed, S. Chennupati, S. Nayak, S. Mansoor, X. Perrotton, and P. Perez. Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving. In The IEEE International Conference on Computer Vision (ICCV), October 2019.