A Two-Stream Siamese Neural Network for Vehicle Re-Identification by Using Non-Overlapping Cameras (ICIP 2019)
Code: https://github.com/icarofua/siamese-two-stream
We describe in this paper a novel Two-Stream Siamese Neural Network for vehicle re-identification. The proposed network is fed simultaneously with small coarse patches of the vehicle's shape (96 x 96 pixels) in one stream, and with fine features extracted from license plate patches, easily readable by humans (96 x 48 pixels), in the other. We then combine the strengths of both streams by merging the Siamese distance descriptors with a sequence of fully connected layers, as an attempt to tackle a major problem in the field: false alarms caused by the huge number of car designs and models with nearly the same appearance, or by similar license plate strings. In our experiments, with 2 hours of videos containing 2982 vehicles, extracted from two low-cost cameras on the same roadway, 546 ft away, we achieved an F-measure of 92.6% and an accuracy of 98.7%. Our network, available at https://github.com/icarofua/siamese-two-stream, outperforms other One-Stream architectures, even if they use higher resolution image features.
This paper addresses the problem of matching moving vehicles that appear in two videos taken by cameras with non-overlapping fields of view (see Fig. 1). This is a common sub-problem in several intelligent transportation systems applications, such as enforcement of road speed limits, criminal investigations, monitoring of commercial transportation vehicles, and traffic management.
Some of these applications traditionally use physical sensors placed over, near, or under the road, such as pressure-sensitive cables and inductive loop detectors [1, 2]. However, such detectors have limitations; for instance, they cannot account for vehicles that enter or leave the road between the two measurement points. Other applications use optical character recognition (OCR) algorithms [3] to translate license plate image regions into character codes, such as ASCII. However, this translation is not straightforward when two or more lanes are recorded at the same time, producing small license plate regions that are very hard to read. Recognition of vehicles by shape and color is not sufficiently reliable either, since vehicles of the same brand and model often look exactly the same [4].
For such reasons, our solution identifies vehicles across non-overlapping cameras by using a hybrid strategy: we developed a Two-Stream Siamese Neural Network that is fed, simultaneously, with two of the most distinctive and persistent features available, the vehicle's shape and the registration license plate. To fuse the two streams, we concatenate the distance descriptors extracted from each single Siamese network and add fully connected layers for classification. We also show that the combination of small image patches produces a fast network that outperforms more complex architectures, even if they use higher resolution image patches. The rest of this paper is organized as follows. In Sec. 2, we discuss the related work. In Sec. 3, we describe the Two-Stream Siamese Network. Experiments are reported in Sec. 4. Finally, in Sec. 5 we state our conclusions.
Vehicle re-identification is an active field of research with many algorithms and an extensive bibliography [1, 2, 5, 6, 7, 8]. The survey of Tian et al. [9] lists this problem as an open challenge for intelligent transportation systems. Traditionally, algorithms for this task were based on the comparison of electromagnetic signatures. However, as observed by Ndoye et al. [2], such signature-matching algorithms are exceedingly complex and depend on extensive calibration or complicated data models.
Video-based algorithms have proven to be powerful for vehicle re-identification [2, 6, 7, 8, 10]. Such algorithms need to address fine-grained vehicle recognition issues [11], that is, to distinguish between subordinate categories with similar visual appearance, caused by the huge number of car designs and models that look nearly alike. As an attempt to solve these issues, many authors proposed hand-crafted image descriptors such as SIFT [12]. More recently, inspired by the tremendous progress of Siamese Neural Networks, Tang et al. [8] proposed in 2017 to fuse deep and hand-crafted features for vehicle re-identification in traffic surveillance environments by using a Siamese Triplet Network [13]. In 2018, Yan et al. [6] proposed a novel deep learning metric, a Triplet Loss Function, that takes into account the inter-class similarity and intra-class variance of vehicle models, considering only the vehicle's shape. Also in 2018, Liu et al. [7] proposed a coarse-to-fine vehicle re-identification algorithm that first filters out potential matchings by using hand-crafted and deep features based on shape and color, and then uses the license plates in a Siamese Network, together with a spatiotemporal re-ranking, to refine the search.
The idea of a two-stream convolutional neural network (CNN) is not new. Ye et al. [14] proposed an architecture that uses static video frames as input in one stream and optical-flow features in the other for video classification. Chung et al. [15] also proposed a two-stream architecture, composed of two Siamese CNNs fed with spatial and temporal information extracted from RGB frames and optical-flow vectors, for person re-identification. Zagoruyko et al. [16] described distinct Siamese architectures for learning to compare image patches; in particular, their Central-Surround Two-Stream architecture is similar to the one proposed here. Finally, some authors [1] use self-adaptive time-window constraints to define upper and lower bounds that restrict the search space and narrow down the potential matches. That is, they predict a time-window size based on the distance between the cameras and the traffic conditions, e.g. free flow or congested. However, we are not trying to solve the travel-time estimation problem here; thus, we considered the maximum number of true or false matchings available in order to evaluate the robustness of the architectures.
The inference flowchart of the proposed Two-Stream Siamese Network is shown in Fig. 2. The left stream processes the vehicle's shape while the right stream processes the license plate. The network weights are shared only within each stream. We merge the distance vectors of each Siamese branch — whose similarity is measured by a Mahalanobis distance — and combine the strengths of both features by using a sequence of fully connected layers with dropout regularization (20%) in order to avoid over-fitting. Finally, a softmax activation function classifies matching pairs from non-matching pairs.
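To make the fusion step concrete, below is a minimal sketch of the two-stream wiring in Keras/TensorFlow. It is an illustration under stated assumptions rather than the authors' released code: the embedding CNNs are placeholders (a Small-VGG variant is sketched in the next section), the distance descriptor is approximated here by an element-wise absolute difference instead of the learned Mahalanobis-style distance, and all layer widths are illustrative.

```python
# Minimal sketch of the two-stream Siamese wiring, assuming Keras/TensorFlow.
# Patch sizes follow the paper: 96 x 96 shape patches and 96 x 48 plate
# patches (assumed here to be 48 rows by 96 columns).
import tensorflow as tf
from tensorflow.keras import layers, Model

def embedding_cnn(input_shape, name):
    """Placeholder embedding network; the paper uses a simplified 'Small-VGG'."""
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters in (32, 64, 128):  # illustrative depth, not the paper's exact config
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    return Model(inp, x, name=name)

def siamese_stream(cnn, input_shape):
    """One stream: two inputs, weights shared via the same `cnn`, one distance vector."""
    a = layers.Input(shape=input_shape)
    b = layers.Input(shape=input_shape)
    # Element-wise absolute difference as a stand-in distance descriptor.
    dist = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([cnn(a), cnn(b)])
    return (a, b), dist

shape_cnn = embedding_cnn((96, 96, 3), "shape_cnn")  # shared within this stream only
plate_cnn = embedding_cnn((48, 96, 3), "plate_cnn")  # shared within this stream only

(sa, sb), d_shape = siamese_stream(shape_cnn, (96, 96, 3))
(pa, pb), d_plate = siamese_stream(plate_cnn, (48, 96, 3))

# Fuse both streams and classify match vs. non-match.
x = layers.Concatenate()([d_shape, d_plate])
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.2)(x)  # 20% dropout, as stated in the paper
output = layers.Dense(2, activation="softmax")(x)

model = Model(inputs=[sa, sb, pa, pb], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The essential points are visible in the wiring: weights are shared between the two inputs of each stream but not across streams, and the classifier only ever sees the concatenation of both distance descriptors.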
We extracted the vehicle rear end and the vehicle license plate by using the real-time motion detector and algorithms described by Luvizon et al. [17, 18].
The CNN used in our Siamese network is shown in Fig. 3. Basically, it is a simplified VGG-based [19] network, with a reduced number of convolutional and fully connected layers so as to save computational effort.
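As a rough, self-contained illustration of such a network, the sketch below stacks VGG-like 3x3 convolution blocks but keeps far fewer of them, ending in a single small fully connected layer. The exact layer counts and filter sizes of the paper's network are those of Fig. 3; the values below are assumptions.

```python
# Hypothetical 'Small-VGG'-style embedding, assuming Keras/TensorFlow:
# VGG-like stacked 3x3 convolutions, but with fewer blocks and a single
# small fully connected layer; filter counts here are illustrative guesses.
from tensorflow.keras import layers, Model

def small_vgg(input_shape=(96, 96, 3), embedding_dim=256):
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)  # halve the spatial resolution per block
    x = layers.Flatten()(x)
    x = layers.Dense(embedding_dim, activation="relu")(x)  # single reduced FC layer
    return Model(inp, x, name="small_vgg")
```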
For our tests, we used 10 videos — 5 from Camera 1 and 5 from Camera 2, each 20 minutes long. They are summarized in Table 1.
| Set | Camera 1 #Vehicles | Camera 1 #Plates | Camera 2 #Vehicles | Camera 2 #Plates | No. Match. |
|---|---|---|---|---|---|
| 01 | 389 | 343 | 280 | 245 | 199 |
| 02 | 350 | 310 | 244 | 227 | 174 |
| 03 | 340 | 301 | 274 | 248 | 197 |
| 04 | 280 | 251 | 233 | 196 | 140 |
| 05 | 345 | 295 | 247 | 194 | 159 |
| Total | 1704 | 1500 | 1278 | 1110 | 869 |
There are multiple distinct occurrences of the same vehicle as it moves across the video. Therefore, instead of only the 869 matchings shown in Table 1, we can generate thousands of true matchings by taking the Cartesian product between the sequences of images of the same vehicle captured by Cameras 1 and 2. This data augmentation is usually necessary for CNN training. We used the MOSSE tracker [20] to extract the first occurrences of each license plate (see Fig. 4). Note that negative pairs are easier to generate, since we can use any combination of distinct vehicles from Cameras 1 and 2.
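A pair-generation step of this kind can be sketched as follows; this is a simplified illustration assuming each vehicle id maps to its lists of image crops per camera, not the authors' actual data pipeline.

```python
# Sketch of pair generation, assuming crops[camera][vehicle_id] -> list of
# image crops collected while the vehicle was tracked in that camera.
import itertools
import random

def make_pairs(crops_cam1, crops_cam2, neg_per_pos=1, seed=0):
    rng = random.Random(seed)
    positives, negatives = [], []
    # Positives: Cartesian product of a vehicle's crops across the two cameras.
    for vid in set(crops_cam1) & set(crops_cam2):
        for img1, img2 in itertools.product(crops_cam1[vid], crops_cam2[vid]):
            positives.append((img1, img2, 1))
    # Negatives: any combination of crops from two *distinct* vehicles.
    ids1, ids2 = list(crops_cam1), list(crops_cam2)
    while len(negatives) < neg_per_pos * len(positives):
        v1, v2 = rng.choice(ids1), rng.choice(ids2)
        if v1 != v2:
            negatives.append((rng.choice(crops_cam1[v1]),
                              rng.choice(crops_cam2[v2]), 0))
    return positives, negatives
```

With `neg_per_pos=1` this reproduces the balanced training sets of Table 2; values above one mimic the negative-heavy testing sets discussed next.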
We also adjusted another parameter that multiplies the number of negative (non-matching) pairs in the testing set, to simulate a real environment in which the network sees many more non-matching pairs than matching ones. Table 2 shows the parameter settings used in our experiments. Note, however, that we kept the same proportion of positive and negative pairs during training in order to avoid class imbalance.
| Setting | Training #positives | Training #negatives | Testing #positives | Testing #negatives |
|---|---|---|---|---|
| 1 | 3867 | 3867 | 3903 | 19515 |
| 2 | 42130 | 42130 | 42707 | 427070 |
The quantitative criteria we used to evaluate the performance of the architectures are precision (P), recall (R), accuracy (Acc), and F-measure (F). As shown in Table 3, the Two-Stream Siamese outperforms two distinct One-Stream Siamese Networks: the first, Siamese-Car, is fed only with the shape of the vehicles (96 x 96 pixels); the second, Siamese-Plate, uses only patches of license plates (96 x 48 pixels). Note that even when we increased the number of false matchings in the negative testing set, the F-measure of the Two-Stream Siamese was similar in both scenarios. The accuracy is usually much higher since the number of negative pairs is much larger. Some inference results are shown in Fig. 6.
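For reference, these metrics have their standard definitions in terms of the true/false positives and negatives (TP, FP, TN, FN) counted over all test pairs:

```latex
P = \frac{TP}{TP + FP}, \quad
R = \frac{TP}{TP + FN}, \quad
F = \frac{2 P R}{P + R}, \quad
Acc = \frac{TP + TN}{TP + TN + FP + FN}
```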
Setting 1:

| Algorithm | Precision | Recall | F-measure | Accuracy |
|---|---|---|---|---|
| Siamese-Car (Stream 1) | 85.8% | 93.1% | 89.3% | 96.3% |
| Siamese-Plate (Stream 2) | 75.9% | 81.8% | 78.8% | 92.6% |
| Siamese (Two-Stream) | 92.7% | 93.0% | 92.9% | 97.6% |

Setting 2:

| Algorithm | Precision | Recall | F-measure | Accuracy |
|---|---|---|---|---|
| Siamese-Car (Stream 1) | 92.4% | 83.5% | 87.8% | 97.9% |
| Siamese-Plate (Stream 2) | 86.8% | 59.5% | 70.6% | 95.5% |
| Siamese (Two-Stream) | 94.7% | 90.6% | 92.6% | 98.7% |
We also tried different CNNs inside our Two-Stream Siamese; their performance is reported in Table 4. Furthermore, as can be seen in Fig. 5, we also evaluated the proposed Two-Stream Siamese against two One-Stream Siamese versions fed with larger image patches. Note that we achieved a higher F-measure by using two small image patches than with a single larger patch containing both features. Another advantage of the Two-Stream Siamese is training time: 1938 seconds per epoch (with the 96 x 96 and 96 x 48 patches) against 3441 seconds per epoch for the Siamese-Car using the same Small-VGG, and 4937 seconds with ResNet [21]. The experiments were carried out on an Intel i7 with 32 GB of DRAM and an Nvidia Titan Xp GPU.
| Siamese (Two-Stream) | Precision | Recall | F-measure | Accuracy |
|---|---|---|---|---|
| CNN = Lenet5 | 89.6% | 85.2% | 87.3% | 97.8% |
| CNN = Matchnet [22] | 94.5% | 87.1% | 90.7% | 98.4% |
| CNN = MC-CNN [23] | 89.0% | 90.1% | 89.6% | 98.1% |
| CNN = GoogleNet | 88.8% | 81.8% | 85.1% | 97.4% |
| CNN = AlexNet | 91.3% | 86.5% | 88.8% | 98.0% |
| CNN = Small-VGG | 94.7% | 90.6% | 92.6% | 98.7% |
We proposed in this paper a fast Two-Stream Siamese network that combines the discriminatory power of two distinctive and persistent features, the vehicle's shape and the registration plate, to address the problem of vehicle re-identification using non-overlapping cameras. Tests indicate that our network is more robust than One-Stream Siamese architectures that are fed with the same features or with larger images. We also evaluated simple and complex CNNs inside the Siamese network to find a trade-off between efficiency and performance.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU and the agencies CNPq, CAPES and SETRAN-Curitiba.