In recent years, electronic systems have replaced manual toll collection to eliminate delays and traffic congestion on toll roads. At first, self-service toll booths were installed, where the driver pays the toll with cash or credit card. In this scenario the driver still needs to stop and process the payment at a machine, so the delay and congestion problem remains unresolved. With automated toll collection systems, vehicles are only required to slow down to a certain speed and pass the toll section.
The collection is either done via a pay-by-plate system where a computer vision system recognizes the license plate for billing or a transponder system where the billing is initiated after passing a gating system. These gating systems are built along the toll road and the vehicle passes them without further impact on the driving behavior.
Along with the automation of toll collection it has become more difficult to control the payments made to the operator, as the human controlling factor has been mostly removed from the toll collection system. An electronic toll device might deliberately report a false vehicle type to the toll system and thus pay lower fees.
Therefore, different measures try to prevent fraud or inaccurate reporting of toll obligations.
One of them is computer vision aided control, where a camera system checks whether the reported toll information is correct or not. With some imagery at hand, a classification algorithm decides which category a seen vehicle belongs to. Designing such computer vision systems is a challenging task. To train complex classifiers like a Convolutional Neural Network (CNN), thousands of images need to be collected and labeled. Since typically only parts of the vehicle are visible, the available 2D information might be insufficient for classification due to ambiguities. To overcome these problems, a 3D reconstruction of objects can be used for classification. Such reconstructions can be obtained by applying Structure-from-Motion (SfM) on a sequence of many images. In real world applications like ours, often only very few images are available. This results in reconstructions of 3D models that are very limited in completeness and density, even after additional post-processing. While there has been progress with 3D convolutional neural networks recently, classifiers operating on such sparse 3D models often do not perform well enough in terms of accuracy to be employed in real-world applications. In addition, required hardware resources are mostly not available on site.
To overcome these limitations, our approach utilizes 2D and 3D information in an efficient way. We propose to use the sparse depth data as an auxiliary loss to improve the classification accuracy of a CNN. To this end, we obtain a sparse point cloud from an SfM pipeline and project these 3D points into the camera views. We also use 3D lines for the projections to capture vehicle structure. This yields a 2.5D representation that we feed as a sparse depth prior along with the recorded images into a CNN for classification. See Fig. 1 for an overview of our CNN model structure.
The main benefit of our method is that we are able to efficiently leverage the 3D vehicle structure information in addition to the 2D appearance information. As we show in our experiments, using depth as an auxiliary loss significantly improves the accuracy of a CNN. Further, since we do not need the depth map at test time, we do not have to run a computationally expensive SfM pipeline. Consequently, our approach can run on embedded hardware in a portable toll control system.
II Related Work
We focus our summary of related work on the topics our method relates to. First, we introduce the SfM methods we use, then we give a short overview of 2D and 3D classification algorithms based on CNNs.
II-A Structure-from-Motion (SfM)
Structure-from-Motion is a technique to reconstruct a 3D model from 2D images. In most cases, thousands of images are required to obtain a good representation of the object. A typical workflow consists of the following steps: First, keypoints are detected at image locations that are distinguishable by their gradients (e.g. corners). The regions around those keypoints are described with SIFT or SURF features such that each point is represented by a vector of the same length and is thus comparable with a distance metric. Matching keypoints between pairs of images are then found based on the feature vector distance. From these matches, the five-point algorithm estimates and verifies the relative motion between image pairs. In a final optimization step called bundle adjustment, the camera poses are refined such that the triangulation error of the 3D points is minimized. The final result is a 3D point cloud, where every point can be seen from at least two images of the dataset. In this work we use the algorithm of  to obtain the sparse point cloud with oriented camera poses.
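The matching step of this workflow can be illustrated with a small sketch. It assumes descriptors are plain fixed-length tuples of floats and uses a brute-force nearest-neighbour search with a hypothetical `max_distance` cut-off; neither the function name nor the threshold comes from the SfM pipeline used in this work.

```python
import math


def match_descriptors(desc_a, desc_b, max_distance):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b.

    Returns a list of (index_a, index_b, distance) tuples for all pairs whose
    Euclidean distance does not exceed max_distance (an assumed threshold).
    """
    matches = []
    for i, da in enumerate(desc_a):
        best_j, best_d = None, float("inf")
        for j, db in enumerate(desc_b):
            d = math.dist(da, db)  # Euclidean distance between feature vectors
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None and best_d <= max_distance:
            matches.append((i, best_j, best_d))
    return matches
```

Real pipelines typically add further checks (for example geometric verification with the estimated relative motion), which this sketch leaves out.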
In our work, complementary to point clouds we also use 3D reconstructions consisting of lines. We use the method of , where 2D line segments are detected and then matched in 3D using geometric constraints. These constraints are defined by the camera poses generated from the SfM pipeline.
II-B 2D CNN Classifiers
Since the seminal work of , Convolutional Neural Networks (CNN) have become the standard tool for image classification due to their impressive performance. While researchers had been working on modeling the visual cortex with convolutional networks for some time before [9, 10], the computing capabilities to train networks with millions of parameters have become available only in recent years. Since then, significant improvements have been made through, e.g., ReLU, dropout, and batch normalization [14]. In our method we use a popular network called VGG-16 that was proposed in  due to its widely acknowledged representation capabilities. Pretrained weights are available online, so we use them as the starting point for our experiments.
II-C 3D CNN Classifiers
More recently, 3D CNN classifiers were proposed that operate on volumetric data instead of image data. These general deep networks are designed to work on arbitrary types of objects [16, 17]. However, they are limited in model size and complexity. Other works propose to change the underlying data structure, e.g.  use octrees to reduce the amount of memory and computing power needed for every convolutional layer. However, their maximum network input size is limited, mainly due to memory constraints. This input size is not suitable for our sparse representations, where we want to focus on fine-grained differences between vehicle reconstructions. There are also works which operate directly on point clouds instead of volumetric renderings [19, 20]. These works rely on dense point clouds, while in our case only sparse 2.5D information is available.
A different approach is taken in , where the authors train an ensemble of CNNs, each of which learns a view-specific classifier from renderings of a 3D shape model at surrounding poses. While the idea of projecting 3D information is somewhat related to our approach, in practical applications it is often not possible to render multiple views of the same object. In our setting we are additionally limited to a single viewing angle.
3D CNN classifiers can also be found in the area of medical imaging, where several 2D recordings are usually registered and stored as a 3D volume. These CNNs solve specific tasks related to certain organs or diseases and are strictly limited to this use case.
III 3D Reconstruction
The aim of 3D reconstruction is to recover the three-dimensional structure of an object or scene from 2D images captured by one or more cameras. One prominent method is SfM, where camera poses are estimated and 3D points are triangulated from multiple camera views. We apply SfM to reconstruct vehicles for the task of vehicle classification on highways with a mobile vision system that is equipped with a single camera. This camera is mounted at a certain angle to the roadway and records the passing vehicles. In such a typical highway scene, as shown in Fig. 2, only parts of the street and of the vehicle are visible within one image frame. Therefore we integrate the 3D structure obtained by reconstruction into the classifier to capture vehicle properties that are not available in single images.
For an optimal setting, we first calibrate the camera (Sec. III-A). In contrast to standard SfM approaches, where the camera moves within the scene, our camera remains fixed and the vehicles we want to reconstruct pass the camera. For the reconstruction process this makes no difference, as the algorithm assumes the object to be static and calculates the camera position relative to the object position. Every picture taken by the camera looks like a new camera to the SfM pipeline. As the vehicle passes the camera in a constant direction, this generates a camera trajectory where the displacement between the virtual cameras is constant.
Due to the high velocity of the passing vehicles, the number of frames available for reconstruction is very limited (typically less than or equal to ). This poses some challenges, especially for the matching process, as many matches are found on the background. We exploit our knowledge of the scene to reduce the number of background matches (Sec. III-B). After reconstructing all of the recorded vehicles (Sec. III-C), we transform the point clouds into a common coordinate system by scaling them with a known scene element (Sec. III-D).
III-A Camera Calibration
Most 3D reconstruction algorithms require a calibrated camera system, especially when multiple cameras are used. As we reconstruct the 3D models of the passing vehicles from only one camera, the calibration is limited to the lens of this camera (intrinsic parameters). We use the toolbox provided by the authors of  to calibrate the intrinsics. While it could be beneficial for specific tasks like metric vehicle volume estimation, we refrain from a metric calibration with the real world scene. Thus, our system can be easily set up at different positions without cumbersome manual calibration by a human operator. However, if the metric size of any scene part is known, this calibration can easily be added at any time afterwards. For example, instead of taking a section of the middle line as reference as in Sec. III-D, one could place a marker stick of known length on the road and take one picture. With this known length, the reconstruction could be mapped into a metric coordinate system.
III-B Exploiting Context
While the practical setting we face in this work comes with some disadvantages (few images, fixed view), we exploit knowledge about two scene properties to improve the reconstruction results: the static camera and the driving direction.
Static camera. As the camera is static, we remove false matches between two images by setting a threshold that determines the minimum distance in pixels a matching keypoint must have moved between two frames. A value below this threshold leads to removal of this match. In our experiments, we set this threshold to .
Automatic estimation of driving direction. Vehicles pass the camera driving in a certain direction. Thus, the movement of correctly matched keypoints must also correlate with this direction. To automatically determine the moving direction, we extract lines from the captured scene images and apply Hough transform to discover the most prominent angle that corresponds to the main driving direction. We allow some deviation to make sure no correct matches are excluded. In our experiments the valid angles range from to , where is the horizontal line.
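The two context filters can be sketched as a single pass over the keypoint matches. This is a minimal illustration, not the implementation used here: the function name, the parameter names, and the convention of measuring the motion angle with `atan2` against the horizontal image axis are assumptions, and the threshold values have to be supplied by the caller.

```python
import math


def filter_matches(matches, min_displacement, angle_min_deg, angle_max_deg):
    """Filter keypoint matches using the static-camera and direction constraints.

    matches: list of ((x1, y1), (x2, y2)) pixel positions of one keypoint in
    two frames. A match is kept only if the keypoint moved at least
    min_displacement pixels and its motion angle (degrees, relative to the
    horizontal image axis) lies within [angle_min_deg, angle_max_deg].
    """
    kept = []
    for (x1, y1), (x2, y2) in matches:
        dx, dy = x2 - x1, y2 - y1
        if math.hypot(dx, dy) < min_displacement:
            continue  # barely moved: most likely static background
        angle = math.degrees(math.atan2(dy, dx))
        if angle_min_deg <= angle <= angle_max_deg:
            kept.append(((x1, y1), (x2, y2)))
    return kept
```

Note that the sign of the angle depends on whether the image y-axis points up or down, so the valid range has to be chosen with the same convention as the estimated driving direction.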
III-C Reconstructing Points and Lines
We use three types of 3D input data in our experiments. The first is a 3D point cloud that results from a standard SfM pipeline. As we deal with vehicles, the objects we want to classify are rigid and of cuboid shape, so it seems natural to use 3D lines to describe them. We use , which takes the camera poses from the SfM and generates a 3D line model. We use this as the second type of input. As a third input type, we merge lines and points into one model, which is straightforward as both lie within the same coordinate system. While the points tend to capture a denser model of the vehicle, the lines better represent the vehicle structure. In our experiments, we found that a combination of both yields the best results. See Fig. 3 for a comparison.
III-D Aligning the Reconstructions within one World Coordinate System
The SfM pipeline outputs camera poses that are equidistantly distributed along a trajectory. As the vehicles pass the camera at different speeds, the scale of the reconstructions differs and the depth information is not directly comparable. To resolve this issue, the reconstructed vehicles are mapped into a common world coordinate system. We translate the model to move the first view of the camera trajectory to the origin and choose a line on the street that is parallel to the driving direction and consequently also parallel to the camera trajectory in 3D space. We set the line length to in our 3D world coordinate system and require the camera distance to this line to be the same for all reconstructions. We then recover the 3D position of the points and use them to calculate the scale of the current reconstruction. With this scale we can transform the 3D model such that all models are equally scaled and thus comparable.
To be more specific, the points and spanning the line are visible in the first camera. and lie on the rays and cast from the camera center at distances and . The camera trajectory direction is parallel to the 3D line at a distance of . The distance from the camera center to the projection of onto the camera trajectory is denoted with . Consequently, the distance from the projection of onto the trajectory is , as we set the line length to . The angle describes the angle between the camera trajectory and the ray from camera center to point and corresponds to the angle between camera trajectory and ray to point . With these preliminaries we can recover the scale by deploying trigonometric functions. To get the angles and , we first resolve the following equations:
We can then use the angles to calculate by solving the set of equations
for a. This results in
Now we can calculate the distance for one of the points by inserting into either
We then calculate the distances from the points to the camera center with
Finally, the scale is defined by
This way we use the distance from camera to as reference length for scaling. Fig. 4 visualizes the procedure.
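As a rough illustration of the trigonometry behind this procedure, the following sketch recovers the distances to the two line end points from the two ray angles. All names are assumptions made for this example and are not taken from the original equations: `a` is the offset of the first point's projection along the trajectory, `h` is the distance between trajectory and street line, and `d1`, `d2` are the distances from the camera center to the two points. The sketch further assumes that both points lie ahead of the camera and that the second point is farther along the trajectory than the first, so that tan(alpha) = h / a and tan(beta) = h / (a + line_length). How the resulting distance is turned into the final scale factor is left to the caller.

```python
import math


def _angle_between(u, v):
    """Angle in radians between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_angle)


def line_point_distances(traj_dir, ray1, ray2, line_length):
    """Recover (a, h, d1, d2) from the trajectory direction and two rays.

    traj_dir: direction of the camera trajectory.
    ray1, ray2: directions of the rays from the camera center to the two
    end points of the reference line (need not be normalized).
    line_length: length assigned to the reference line in world units.
    """
    tan_alpha = math.tan(_angle_between(traj_dir, ray1))
    tan_beta = math.tan(_angle_between(traj_dir, ray2))
    # From a * tan(alpha) = (a + line_length) * tan(beta):
    a = line_length * tan_beta / (tan_alpha - tan_beta)
    h = a * tan_alpha
    d1 = math.hypot(a, h)
    d2 = math.hypot(a + line_length, h)
    return a, h, d1, d2
```

The sketch divides by `tan_alpha - tan_beta`, so it fails if both rays enclose the same angle with the trajectory, that is, if the two points coincide in direction.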
IV Vehicle Classification With Depth Prior
Our classification method is based on the combination of 2D and 3D information. We reflect this in the design of our CNN that we employ for classification. The input images have two channels, of which the first contains the grayscale image captured by the camera and the second contains a depth reprojection from the 3D point cloud. We alter the CNN structure to incorporate an auxiliary branch that helps to classify the vehicles.
IV-A Depth Reprojection
One CNN input is a reprojection map including points and/or lines projected from the 3D model into the 2D camera view. Any reprojected pixel value represents the depth measured from the origin of the world coordinate system, in our case the first camera. A projection example can be seen in Fig. 3. To capture as much vehicle structure as possible, we set the parameters in a non-strict fashion to allow an imperfect reconstruction from the limited number of available images. This results in some errors that cause entirely wrong reprojected depth values. At first, this poses a problem for the optimization task, where we optimize on a very small number of values. However, our network is able to deal with this through a suitably chosen loss function, as described in the next section.
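A sparse reprojection map of this kind can be sketched with a plain pinhole projection. The function below is illustrative only: its name, the intrinsics `fx, fy, cx, cy`, the world-to-camera pose `(R, t)`, the use of 0 for unknown pixels, and the rule of keeping the smallest depth when several points fall on one pixel are all assumptions, not details given in this work.

```python
import math


def reproject_depth(points, R, t, fx, fy, cx, cy, width, height):
    """Project 3D world points into one view and store their depth.

    points: iterable of (x, y, z) in world coordinates, whose origin is
    assumed to be the first camera. R (3x3 nested list) and t (length 3)
    map world to camera coordinates. Returns (depth, mask): depth holds the
    distance of each projected point from the world origin, mask marks the
    pixels with known depth.
    """
    depth = [[0.0] * width for _ in range(height)]
    mask = [[False] * width for _ in range(height)]
    for p in points:
        # World -> camera coordinates of the target view.
        xc = [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]
        if xc[2] <= 0:
            continue  # point lies behind the camera
        u = int(round(fx * xc[0] / xc[2] + cx))
        v = int(round(fy * xc[1] / xc[2] + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image
        d = math.sqrt(sum(c * c for c in p))  # distance from the world origin
        if not mask[v][u] or d < depth[v][u]:
            depth[v][u] = d
            mask[v][u] = True
    return depth, mask
```

3D lines could be handled with the same function by sampling points along each line segment before projection.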
IV-B CNN Model Structure
As the base network structure, we use a VGG-16 model pretrained on ImageNet and finetune it on our data. As the pretrained model expects RGB images, we replicate the grayscale image across the input channels. We modify the network structure and add an auxiliary branch, which tries to estimate the depth from the 2D input image. We use the activation maps after the conv5 layer as inputs to this branch. We then add a convolution to reduce the number of activation maps from to . On top of this layer, we add a deconvolution layer that upscales the activations to the input size. We then calculate an auxiliary loss between these upscaled activations and the input reprojection image. We also set the number of outputs of the softmax layer according to the number of vehicle classes. Fig. 1 shows the changes made to the original network structure.
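To make the upscaling step concrete, the following sketch checks which deconvolution (transposed convolution) settings restore the input resolution. The numbers are assumptions chosen for illustration: a 224-pixel input, conv5 activations taken before the fifth pooling layer (four 2x2 poolings, so a factor of 16), and a kernel of 32 with stride 16 and padding 8. None of these values is stated in this work.

```python
def conv5_feature_size(input_size):
    """Spatial size of VGG-16 conv5 activations before the fifth pooling.

    Four 2x2 max-pooling layers precede conv5, each halving the resolution.
    """
    return input_size // 16


def conv_transpose_output_size(in_size, kernel, stride, padding):
    """Output size of a transposed convolution (no dilation, no output padding)."""
    return (in_size - 1) * stride - 2 * padding + kernel


# Example with the assumed values: 224 -> 14 -> 224.
feature_size = conv5_feature_size(224)
restored_size = conv_transpose_output_size(feature_size, kernel=32, stride=16, padding=8)
```

With these settings `restored_size` equals the assumed input size, so the predicted depth map can be compared pixel by pixel with the reprojection image.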
Auxiliary Loss. Due to the imperfect reconstruction, the reprojection map contains a limited number of wrong depth values. These wrong values could heavily impact the optimization during training if a standard loss were used. To avoid instability during optimization, we use a Huber loss to compensate for errors in the reprojection map. The Huber loss has a linear part for absolute values larger than $\delta$ and is defined as
$$L_\delta(x) = \begin{cases} \frac{1}{2} x^2 & \text{if } |x| \le \delta, \\ \delta \left( |x| - \frac{1}{2} \delta \right) & \text{otherwise.} \end{cases}$$
In our case, $x$ is the difference between the depth reprojection and the CNN depth prediction. The reprojection map is very sparse, therefore we evaluate the loss only for pixels with known depth. In our experiments, we set . We weight the auxiliary loss $L_{aux}$ with a parameter $\lambda$ and add it to the classification loss $L_{cls}$, to train our network with the loss
$$L = L_{cls} + \lambda L_{aux}.$$
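The masked auxiliary loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: it uses the standard Huber definition with a threshold called `delta`, averages over the pixels with known depth (the normalization is an assumption), and combines the terms with a weight called `weight`. The function names are hypothetical and no particular deep learning framework is implied.

```python
def huber(x, delta):
    """Standard Huber loss: quadratic near zero, linear beyond delta."""
    if abs(x) <= delta:
        return 0.5 * x * x
    return delta * (abs(x) - 0.5 * delta)


def masked_huber_loss(prediction, target, mask, delta):
    """Mean Huber loss over the pixels where mask is True.

    prediction, target, mask: 2D nested lists of identical shape.
    Returns 0.0 if no pixel has known depth.
    """
    total, count = 0.0, 0
    for pred_row, target_row, mask_row in zip(prediction, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            if m:
                total += huber(t - p, delta)
                count += 1
    return total / count if count else 0.0


def total_loss(classification_loss, auxiliary_loss, weight):
    """Classification loss plus the weighted auxiliary loss."""
    return classification_loss + weight * auxiliary_loss
```

Because the loss grows only linearly for large residuals, a few grossly wrong reprojected depth values contribute bounded gradients instead of dominating the update.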
V Experiments
To show the efficacy of our method, we run multiple experiments on a dataset of over vehicles. To set a baseline, we deactivate the auxiliary branch and train the network without depth information. We compare the baseline to results with our three input variations (points, lines, both) and report the classification accuracy on a test set for single images and sequence-wise. We train our network on an NVIDIA TITAN Xp GPU with a batch size of until convergence and employ a stochastic gradient descent (SGD) optimizer with a learning rate of and momentum set to .
Dataset. Our dataset consists of sequences comprising images. The vehicles are labeled into different categories: special transport ( images), car (), camper (), van (), truck () and semitrailer (). For our experiments, we split the dataset for training and for testing.
Results. Table I shows the experimental results of our method for all three input variants (points, lines and both). We report the accuracy on the test set image-wise and sequence-wise. For the latter case, we first classify all images of a sequence (typically 3 to 10) and count it as correct if more than half of the images are correctly classified. Our method improves over the baseline without auxiliary branch for all input types. The combination of both, points and lines, yields the highest accuracy.
|Input variant||Image Accuracy||Sequence Accuracy|
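The two reported metrics can be computed as in the following sketch, which applies the stated rule that a sequence counts as correct if more than half of its images are classified correctly. The function names and the input layout (one list of predicted labels per sequence plus the true label) are assumptions made for this example.

```python
def sequence_is_correct(predictions, label):
    """True if more than half of the per-image predictions match the label."""
    correct = sum(1 for p in predictions if p == label)
    return correct > len(predictions) / 2


def image_and_sequence_accuracy(sequences):
    """Compute (image accuracy, sequence accuracy).

    sequences: non-empty list of (predictions, label) pairs, where
    predictions is the list of predicted labels for the images of one
    vehicle sequence.
    """
    n_images = sum(len(preds) for preds, _ in sequences)
    n_correct_images = sum(
        sum(1 for p in preds if p == label) for preds, label in sequences
    )
    n_correct_sequences = sum(
        1 for preds, label in sequences if sequence_is_correct(preds, label)
    )
    return n_correct_images / n_images, n_correct_sequences / len(sequences)
```

Under this rule a tie, for example two of four images correct, counts as a misclassified sequence.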
VI Conclusion
In this paper, we present a method that exploits 3D information to improve 2D classification accuracy. We reconstruct 3D models of vehicles passing a static camera and encode the depth in a 2D reprojection. These reprojections are used in an auxiliary branch of our CNN, in which the network reconstructs depth values and which acts as a regularizer. We show that our method improves classification accuracy for all three input variants (points, lines, both) over the baseline without 3D information on a real world dataset. At test time, our method does not need 3D information and can thus be employed on mobile vision systems.
References
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
- D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision (IJCV), 2004.
- H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding (CVIU), 2008.
- D. Nistér, “An efficient solution to the five-point relative pose problem,” Pattern Analysis and Machine Intelligence (PAMI), 2004.
- B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment - a modern synthesis,” International Workshop on Vision Algorithms (IWVA), 1999.
- A. Irschara, M. Rumpler, P. Meixner, T. Pock, and H. Bischof, “Efficient and globally optimal multi view dense matching for aerial images,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Annals), 2012.
- M. Hofer, M. Maurer, and H. Bischof, “Efficient 3D scene abstraction using line segments,” Computer Vision and Image Understanding (CVIU), 2016.
- A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems (NIPS), 2012.
- K. Fukushima and S. Miyake, “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition,” Competition and Cooperation in Neural Nets, 1982.
- Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, et al., “Hand-written digit recognition with a back-propagation network,” Advances in Neural Information Processing Systems (NIPS), 1990.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research (JMLR), 2014.
- S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” arXiv:1502.03167, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- http://www.meshlab.net/
- D. Maturana and S. Scherer, “Voxnet: A 3D convolutional neural network for real-time object recognition,” Intelligent Robots and Systems (IROS), 2015.
- C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view CNNs for object classification on 3D data,” Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- G. Riegler, A. O. Ulusoy, and A. Geiger, “Octnet: Learning deep 3D representations at high resolutions,” International Conference on Computer Vision (ICCV), 2017.
- C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3D classification and segmentation,” Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in Neural Information Processing Systems (NIPS), 2017.
- H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3D shape recognition,” International Conference on Computer Vision (ICCV), 2015.
- D. Nie, H. Zhang, E. Adeli, L. Liu, and D. Shen, “3D deep learning for multi-modal imaging-guided survival time prediction of brain tumor patients,” Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2016.
- D. Ferstl, C. Reinbacher, G. Riegler, M. Ruether, and H. Bischof, “Learning depth calibration of Time-of-Flight cameras,” British Machine Vision Conference (BMVC), 2015.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” Conference on Computer Vision and Pattern Recognition (CVPR), 2009.