Detection and Tracking of Pallets using a Faster R-CNN based on a 2D LRF
The problem of autonomous transportation in industrial scenarios is receiving a renewed interest due to the way it can revolutionise internal logistics, especially in unstructured environments. This paper presents a novel architecture allowing a robot to detect, localise, and track multiple pallets using machine learning techniques based on an on-board 2D laser rangefinder. The architecture is composed of two main components: the first stage is a pallet detector employing a Faster Region-based Convolutional Neural Network (Faster R-CNN) detector cascaded with a CNN-based classifier; the second stage is a Kalman filter for localising and tracking detected pallets, which we also use to defer commitment to a pallet detected in the first step until sufficient confidence has been acquired via a sequential data acquisition process. For fine-tuning the CNNs, the architecture has been systematically evaluated using a real-world dataset containing 340 labeled 2D scans, which have been made freely available in an online repository. Detection performance has been assessed on the basis of the average accuracy over k-fold cross-validation, and it scored 99.58%. Experiments have been performed in a scenario where the robot is approaching the pallet to fork. Although data have been originally acquired by considering only one pallet, artificial data have been generated as well to mimic the presence of multiple targets in the robot workspace. Our experimental results confirm that the system is capable of identifying, localising and tracking pallets with a high success rate while being robust to false positives.
The adoption of the Industry 4.0 paradigm is thought to intrinsically change the nature of shop-floor and warehouse environments along many dimensions, and the use of autonomous mobile robots for inbound freight transportation and delivery is no exception DAndrea2012 . Traditionally, automated guided vehicles (AGVs) have been adopted in industrial environments for freight transportation and delivery under a number of assumptions, namely:
a well-defined, structured, and obstacle-free workspace for robot navigation, and
unambiguous robot sensing and perception capabilities as far as their interaction with the environment is concerned.
Nowadays, in spite of high levels of shop-floor and warehouse automation, such assumptions largely still hold, even in the case of novel solutions proposed by the start-up ecosystem, with a few notable exceptions such as those commercialised by Otto Motors (www.ottomotors.com) and Fetch Robotics (fetchrobotics.com). However, the tenets of the Industry 4.0 paradigm are expected to require relaxing such assumptions. Given the goal of providing customers with personalised and just-in-time delivery of products, it is foreseen that warehouse environments will become more dynamic and human-friendly, and will host human-robot collaborative processes to a great extent Krugeretal2009 ; Heyer2010 ; Darvishetal2018 . As far as AGVs are concerned, such directives imply higher standards in autonomy, as well as more robust perception and decision making capabilities.
Notwithstanding such ferment, pallets are still considered of the utmost importance in warehouses. According to a survey by Peerless Research Group (www.peerlessresearch.com), pallets are preferred over novel automated logistics systems for a number of reasons: purchase price ( of the qualified responses), strength (), durability (), and reusability (), just to name a few. Among the various materials employed for pallet design and manufacturing, wood pallets are the preferred ones. When asked how many pallets survey respondents are using with respect to what they did one year before, of them declare using approximately the same number of pallets, more pallets, and only are using fewer pallets. Therefore, it is possible to foresee a positive trend in pallets usage.
Given all considerations above, the need arises to (i) provide standard AGVs with the capability of detecting, localising and tracking standard pallets (ii) when the location of pallets cannot be assumed to be precisely known in advance, and (iii) in environments where human co-workers operate and other objects are present. So far, pallet detection, localisation and tracking have received much attention both in scientific literature and in industry-oriented research. A huge number of studies have been presented, which discuss model-based approaches either adopting computer vision or using 2D laser rangefinder data, and the most relevant ones for this work are discussed in Section 2.1. When compared to approaches based on computer vision, 2D rangefinders have the advantages of generating reliable data with a well-characterised statistical sensor noise pfister2003weighted ; Mastrogiovannietal2013a , being more accurate for long distances, and not being influenced by light conditions. However, since laser rangefinders can provide only contour information, they are often coupled with cameras when unique pallet identification is needed schulenburg2003self . As a consequence, the objective of the work described in this paper is two-fold:
developing an architecture for commercially available AGVs, in particular forklifts, which has the capability of detecting, localising and tracking (possibly multiple) pallets, using 2D laser rangefinder information; as a target scenario, we refer to the automation of a warehouse located in Tortona, Italy, where a purposely modified commercial forklift has been put in operation, as shown in Figure 1;
providing an open, freely available dataset (https://github.com/EMAROLab/PDT) to the community for further research activities, comprising a collection of labelled 2D scans related to pallets located in real-world environments.
The major contribution of the paper is an architecture made up of two components: (i) a pallet detector module employing a Faster Region-based Convolutional Neural Network (Faster R-CNN) detector girshick2015fast ; Ren2017 coupled with a CNN-based classifier for classification purposes operating on a bitmap-like representation of 2D range scans; (ii) a Kalman filter based module for localising and tracking the detected pallets, as well as increasing the confidence associated with their detection on-line. In particular, the proposed architecture:
to the best of our knowledge, is the first framework for pallet detection, localisation and tracking using machine learning approaches based on 2D range data exclusively;
is designed to detect, localise and track multiple pallets at the same time;
exhibits independence from a possible a priori knowledge about a pallet’s location;
does not require any modifications to existing standard pallets, as done in other well-known approaches in the literature, for example in lecking2006variable ;
does not require information about the forklift’s pose either in absolute terms or relative to the target pallet;
to the best of our knowledge, is the first attempt to perform object detection, classification and tracking using a 2D laser rangefinder in conjunction with machine learning methods, instead of the more common model-based approaches. Due to the limited and sparse nature of the data provided by this sensor, this poses different challenges compared to approaches based on 3D LiDAR, cameras, or both feng2018towards ; zhou2017voxelnet ; asvadi2017depthcn ; redmon2016you ; liang2018deep .
The paper is organised as follows. Section 2 discusses related work and introduces the reference scenario. Section 3 describes the methods for pallet detection, localisation and tracking employed in the proposed architecture. The overall data flow pipeline as well as the pallet tracking algorithm are described in Section 4. Implementation details and the experimental evaluation are discussed in Section 5. Conclusions follow.
The problem of designing an autonomous forklift able to fork, transport and place pallets is not new, nor is the problem of pallet detection, localisation and tracking. Given the geometric shape of a pallet’s structure, a number of model-based solutions have been proposed in the literature, which make use of either vision or 2D range information, or both.
Vision-based systems. A number of vision-based approaches making use of different features extracted from images to detect and track pallets have been presented, and examples include the work described in byun2008real ; chen2012pallet ; oh2014development ; syu2016computer ; holz2016fast ; varga2016robust .
One of the first approaches to pallet detection and pose estimation has been discussed in garibotto1997service . Soon afterwards, an image segmentation method based on a pallet’s colour and geometric characteristics has been presented in pages2001computer . However, these approaches require very stable illumination conditions and a very precise camera calibration, which is quite a strong assumption in real-world settings. The method proposed in nygards2000docking attempts to estimate a pallet’s pose using a structured light method, which is based on a combination of range and video information. The main problem associated with such an approach is that its accuracy quickly decreases with distance. Being able to detect a pallet when it is still distant is a nice-to-have feature in all those cases where pallets are located in a certain load/unload area without a specific arrangement. The estimate of a pallet’s pose has been attempted also using artificial visual features in the form of markers placed on the pallets to detect seelinger2005automatic ; aref2014macro . While such approaches do not rely on well-defined illumination conditions, nor do they assume a precise camera calibration, it is often difficult to place fiducial markers in real-world environments, because such a process increases setup times to a great extent.
A model-based algorithm using visual information without any fiducial markers or specific illumination conditions has been presented in garibott1996robolift . This algorithm exploits the identification of a pallet’s central cavities to identify two pallet slots and estimate their geometric centre in calibrated images. However, such a system requires an accurate a priori knowledge of a pallet’s pose, which (as described above) is not realistic in real-world settings. A retrofitted autonomous forklift with the capability of stacking racks and forking pallets placed within a certain area with uncertainty was presented in kim2001model . The docking method for pallet forking is based on the detection of specific reference lines for concurrent camera calibration and pallet identification, and it allows for the stacking of well-illuminated racks and the localisation of pallets in front of the vehicle. Unfortunately, such a solution proves to be limited to the stacking task only. The approach described in Cucchiara2000FocusBF is based on a more complex visual processing pipeline, which employs a number of hierarchical visual features like regions, lines and corners using both raw data and template-based detection. In wang2016autonomous , the authors present an autonomous pallet handling method based on a line-structured light sensor, where the design of such a sensor is based on an embedded image processing board containing an FPGA and a DSP. This approach can identify and localise pallets using their geometrical structure based on a model-matching algorithm, and uses a position-based visual servoing method to drive the vehicle while it approaches the pallet to fork. Unfortunately, it also requires the development of custom hardware.
An approach for automated pallet detection combining stereo reconstruction and object detection from monocular images has been presented in varga2014vision . Improvements and extensions for a stereo camera system responsible for autonomous load handling were presented, by the same authors, in varga2015improved . However, the use of stereovision and structure-from-motion algorithms can hardly fit with the real-time requirements typically needed when autonomous vehicles are present. The work described in cui2010robust introduces a method to identify pallets using colour segmentation in real time. However, such a method is prone to false positives, unless assumptions about pallet colour are posed. A comparison between two common 3D vision technologies, namely the photonic mixer device (PMD) and typical stereo camera systems, was presented in beder2007comparison . The authors conclude that the PMD system is characterised by a greater accuracy than a typical stereo camera system. On the basis of such an insight, a solution for pallet loading and de-palletising detection employing a PMD camera has been introduced in weichert2013automated . Again, such an approach requires the introduction of ad hoc hardware, at the expense of cost and maintenance.
Overall, vision-based systems are characterised by a number of drawbacks, which make their use still limited to specific conditions, including: (i) the need for fiducial markers or similar mechanism to reduce false positives; (ii) the need for stable environmental conditions; (iii) computational load of the associated computer vision algorithms; (iv) the need for custom hardware solutions to enable real-time operations.
Rangefinder-based systems. Traditionally, 2D laser rangefinders have been extensively employed for robot localisation and mapping, and such techniques have been also successfully applied to environments characterised by a high degree of human presence Mastrogiovannietal2007 ; Mastrogiovannietal2008 ; Capezioetal2011 . In the past few years, a number of model-based approaches have been presented and discussed in the literature, which constitute effective methods to detect, localise and track pallets in range data. In contrast to vision-based algorithms, such approaches do not suffer from image distortions (related to camera calibration), varying illumination issues or object scaling problems, which can result in false detections or mis-detections of significant features, and are characterised by lower computational requirements. The early work by Hebert et al. hebert1986outdoor describes techniques for scene segmentation, object detection, and object recognition with an outdoor robot using range data. In hoffman1987segmentation , the authors present a method for detecting and classifying objects using range information. A model-based technique that leverages prior knowledge of an object’s surface geometry to jointly classify and estimate the surface structure was proposed in newman1993model . However, such models are characterised by bold assumptions related to perfect data association and absence of noise.
Starting from these initial results, range data have been applied to the pallet detection problem. Data acquired from a laser rangefinder are used in baglivo2008object to detect and localise pallets, but the approach cannot deal with ambiguous matches, i.e., it requires perfect data association. The solution discussed in walter2010closed uses a fast linear programming method for detecting line segments in range data, as applied to pre-filtered points selected by a human using an image provided by a camera mounted on a forklift. In particular, pallets are identified by the classification of detected line segments belonging to their front, and their position is therefore computed. However, such a method requires a pre-processing step, and its precision can be hampered in the case of specific pallet poses. In lecking2006variable , the authors present two approaches based on 2D range data: the former assumes the availability of pallets modified with reflectors to compute their position and orientation, whereas the latter uses only their geometrical characteristics, as it may be unfeasible to place reflector marks on all pallets. In the second case, the Iterative Closest Point (ICP) algorithm is used to match range data with the pallet model. However, the main drawback of the approach is that ICP needs an initial (although approximate) pallet location, otherwise the iterative computation can become very time consuming and lead to inaccurate results. As discussed above, this may be unrealistic in real-world situations. The work presented in he2010feature discusses a feature-to-feature matching for pallets, which first detects line segments, and then matches them with the pallet’s geometric model. However, such an approach can lead to a number of false positives and to ambiguous pose estimations. Other methods for 2D data segmentation, feature detection, fitting, and matching have been presented in premebida2005segmentation ; bostelman2006visualization , but all these approaches are characterised by the same drawbacks. An integrated laser and camera sensory system, for solving the problem of simultaneously identifying and localising pallets whose location is characterised by a great uncertainty, has been presented in baglivo2011autonomous . However, such an approach suffers from a number of drawbacks typically associated with vision processing.
In summary, rangefinder-based systems avoid certain limitations associated with vision-based approaches, but are nonetheless limited as far as detection capabilities are concerned, due to: (i) the need for a model-based approach grounded on pallet geometry; (ii) the necessity of computing features enabling model matching processes; (iii) the ease with which detection estimates can diverge.
Discussion. At first glance, model-based approaches seem appealing because pallets are characterised by a well-defined shape and geometrical features. However, in order to enable a reliable and robust detection, experience suggests that many assumptions are to be posed. In fact, all the approaches in the literature, in one way or another, are characterised by recurrent limitations. Vision-based methods are highly dependent on light conditions (or assume them to be stable), camera calibration issues, and pallet-to-camera distance, or assume that the environment is retrofitted with fiducial markers on each pallet. Range-based methods are based on models grounded on a pallet’s geometry, and require the stable detection of certain characterising features. From our analysis, it appears that the use of machine learning techniques for the problem of pallet detection and localisation, coupled with solutions for tracking a pallet’s pose, and when only range information is used, has not been explored in the literature. Such an approach is expected to avoid the drawbacks associated with vision-based approaches, does not assume any a priori model for pallets, does not compute high level features, and adopts a sequential classification procedure to reduce the occurrence of false positives.
Furthermore, it can be seen generally that the most used sensors for object detection, classification and tracking based on machine learning techniques are 3D LiDARs feng2018towards ; zhou2017voxelnet ; asvadi2017depthcn , cameras (see http://cs-chan.com/source/FADL/Online_Paper_Summary_Table.pdf) or a combination of both liang2018deep ; matti2017combining , while 2D laser rangefinders are usually avoided for this task despite their convenience, as they provide only partial contour information. This fact poses the challenge of how to make use of sparse data with limited information content, while still achieving a system with robust tracking capabilities and a small number of false positive detections.
The scenario we target in this paper includes a purposely modified model EXU low lift pallet truck manufactured by STILL GmbH, which has been put in operation in a warehouse environment in Tortona, Italy. The forklift, depicted in Figure 1, can lift up to at a height. It has been equipped with two safety laser rangefinders for obstacle detection, placed so as to cover a full scan around the truck, and one of them can be used to provide the data required by our architecture. Furthermore, it has been extended with a localisation system performing tri-lateration using a number of intelligent devices distributed in the environment Mastrogiovannietal2009a ; Mastrogiovannietal2010 ; Capezioetal2011 ; Mastrogiovannietal2013a . Being able to localise itself and avoid obstacles, the forklift can freely move in the warehouse. The map of the environment is assumed to be known (either available a priori or built off-line), and a number of relevant locations, such as forking and placing areas, are identified as semantic tags in the map. It is noteworthy that forking and placing areas are loosely defined regions, and a pallet can be located anywhere inside them. Therefore, it is not possible to assume in advance a specific location or pose for the pallet, as is typically done by other approaches in the literature described above, but it can be fairly assumed that it lies within the area.
Missions are defined using a knowledge representation and planning framework previously developed for mobile robots Mastrogiovannietal2004 . The framework is able to express a high-level goal in an ontology-based representation, and to determine a corresponding set of planning problems, whose solutions (i.e., plans) are guaranteed to achieve the goal, if available. Missions are typically configured as sequences of forklift motions to a given goal location, approaching the pallet to fork, forking, and delivering the pallet to the placing area. Once the forklift moves towards the forking area, it needs to detect, localise and then track the pallet (which is stationary) to compensate for its own motion. As per the functional requirements of the use case we consider, only one pallet is present in the forking area, and therefore the specification is related to the detection of one pallet only. In the paper, we also consider the case in which two pallets may be present in the forking area at the same time, in order to better discuss the capabilities of our approach. In principle, given the forklift’s localisation system, tracking would not be necessary, as pallets do not move and robot motion information could be used for pallet localisation. However, in our case we decided to determine whether it is possible to perform pallet tracking without relying on the robot’s localisation capabilities, therefore taking inspiration from the literature on the use of minimal information for localisation and navigation in mobile robotics Mastrogiovannietal2009b .
Obviously enough, the problem has been already explored in the literature. In particular, the work described in aref2013position ; aref2014macro integrates visual information with robot’s odometry to implement a smooth and non-stop transition from autonomous navigation to visual servoing. However, in order to perform such a transition, it is necessary to understand when visual servoing can be activated to avoid scattering motions. When using commercially available rangefinders, the typical maximum pallet detection range from a robot is approximately aref2016multistage . A few systems include multiple-view rangefinders, and therefore are capable of attaining longer detection distances in the forklift’s workspace, i.e., up to walter2010closed ; walter2015situationally . As we better discuss below, being able to detect a pallet from longer distances is a nice-to-have feature when sequential classification processes are employed.
In our scenario, the forklift does not employ any specific strategy to approach pallets. When moving towards the picking area, pallet detection, localisation and tracking are activated. As soon as 2D range scans are collected (approximately at ), each range scan is processed to detect pallets. We focus on standard EUR/EPAL pallets, whose size is by by . Multiple range scans are subsequently used to localise and track pallets, and to remove false positive detections. When a sufficient confidence level on a tracked pallet is reached, the pallet is considered successfully detected. Such a sequential classification process can benefit from the fact that, while the forklift is approaching the forking area, it can already ascertain whether a pallet is present and where it is located. The whole process is described in detail in Section 4.
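The deferred commitment described above can be sketched as follows. This is a minimal illustration and not the tracker used in the paper: it assumes an independent scalar Kalman filter per axis with a random-walk motion model, and it blends detector scores into a running confidence with a simple moving average. The class name `PalletTrack`, the noise variances, the blending factor and the accept/reject thresholds are all hypothetical.

```python
class PalletTrack:
    """One candidate pallet tracked in the sensor frame (metres).

    Assumptions (not from the paper): random-walk motion model, one scalar
    Kalman filter per axis, confidence as a moving average of detector scores.
    """

    def __init__(self, x, y, score, meas_var=0.05 ** 2, process_var=0.02 ** 2):
        self.pos = [x, y]
        self.var = [meas_var, meas_var]
        self.meas_var = meas_var
        self.process_var = process_var
        self.confidence = score
        self.missed = 0

    def predict(self):
        # Position estimate is unchanged; its uncertainty grows between scans.
        for i in range(2):
            self.var[i] += self.process_var

    def update(self, x, y, score, blend=0.5):
        # Standard scalar Kalman correction, applied to each axis.
        for i, z in enumerate((x, y)):
            gain = self.var[i] / (self.var[i] + self.meas_var)
            self.pos[i] += gain * (z - self.pos[i])
            self.var[i] *= 1.0 - gain
        self.confidence = (1.0 - blend) * self.confidence + blend * score
        self.missed = 0

    def miss(self, decay=0.8):
        # No detection was associated with this track in the current scan.
        self.missed += 1
        self.confidence *= decay

    def status(self, accept=0.9, reject=0.2, max_missed=5):
        if self.confidence < reject or self.missed > max_missed:
            return "discarded"
        if self.confidence >= accept:
            return "confirmed"
        return "tentative"
```

A candidate therefore stays "tentative" until several consistent detections have raised its confidence, which is the behaviour that filters out isolated false positives.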
Convolutional Neural Networks (CNNs) are a class of neural networks specifically suited for image processing applications Lecun2015 . In a generic neural network, each neuron is connected to several others and is characterised by a bias and a weight for every connection with a peer. Neurons are organised in layers, each one performing some kind of transformation on its input data. As a result, the network as a whole expresses a differentiable score function, and it can be trained by minimising its statistical error on a training dataset. This can be achieved by defining a loss function and applying the back-propagation algorithm to consequently adjust the network parameters (i.e., the weights and biases of each neuron).
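As a minimal illustration of training by loss minimisation, the sketch below fits a single sigmoid neuron to the logical AND function using full-batch gradient descent on the cross-entropy loss. With one neuron, back-propagation reduces to a single gradient step per parameter. The function names, the learning rate and the number of epochs are illustrative assumptions, unrelated to the networks used in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Score function of one neuron: weighted sum, bias, sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=5000, learning_rate=1.0):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for inputs, target in samples:
            # For sigmoid + cross-entropy, dLoss/dz is (output - target).
            error = predict(weights, bias, inputs) - target
            for i, x in enumerate(inputs):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


# Toy training set: the logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
```

After training, `predict` returns a score above 0.5 only for the input `(1, 1)`.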
CNNs are based on the same general idea, but they encode certain properties characterising images into the network’s architecture, improving its efficiency and vastly reducing the number of parameters to identify. This is a necessary prerequisite, since in generic neural networks the layers are usually fully connected, meaning that each neuron is connected to all the neurons in the previous layer, rapidly leading to an unmanageable number of parameters to train even for small sized images. Convolutional layers have a few unique features, as:
they are able to transform 3D input into 3D output volumes, with each depth level corresponding to a different feature computed on the input;
they exploit local connectivity and the topographic arrangement in images, i.e., each neuron is connected only to a few neighbouring neurons in a given width and height range of the previous layer output (but they are still fully connected along the depth);
they assume that the features learned to process a certain part of an image (e.g., being able to detect an edge) prove equally useful in other parts too.
As a consequence, each convolutional layer can apply a series of local filters panned unmodified over the image span. Since each filter is defined only by a small number of parameters compared to a fully connected layer, the number of parameters to be trained in the network is greatly reduced. A number of hyper-parameters are introduced and should be defined when designing a convolutional layer, such as: (i) its depth, i.e., the number of filters to be used; (ii) the receptive field, i.e., the size of each filter; (iii) the stride, i.e., how much to slide the filters; and finally (iv) the zero-padding, i.e., how many zeros to pad a layer’s input with, so as to control the output’s size.
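The effect of these hyper-parameters can be made concrete with the standard relations for square inputs and square filters. The sketch below is illustrative only: the function names are hypothetical, and the example sizes (a 32 by 32 input with 3 channels and 16 filters of size 5) are assumptions that do not come from the paper.

```python
def conv_output_size(input_size, receptive_field, stride=1, zero_padding=0):
    """Spatial output size of a convolutional layer along one dimension."""
    span = input_size - receptive_field + 2 * zero_padding
    if span < 0 or span % stride != 0:
        raise ValueError("filters do not tile the padded input with this stride")
    return span // stride + 1


def conv_parameters(receptive_field, input_depth, depth):
    """Learnable parameters: one weight per filter cell, plus one bias per filter."""
    return (receptive_field * receptive_field * input_depth + 1) * depth


def fully_connected_parameters(n_inputs, n_outputs):
    """Learnable parameters of a fully connected layer with biases."""
    return (n_inputs + 1) * n_outputs


# Assumed example: 32x32x3 input, 16 filters of size 5, stride 1, padding 2.
side = conv_output_size(32, 5, stride=1, zero_padding=2)        # 32
shared = conv_parameters(5, input_depth=3, depth=16)            # 1216
dense = fully_connected_parameters(32 * 32 * 3, side * side * 16)  # 50348032
```

In this assumed configuration, producing the same output volume with a fully connected layer would require about fifty million parameters instead of 1216, which is the reduction the paragraph above refers to.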
A CNN is usually composed of more than a single convolutional layer, and often includes other kinds of layers. A typical structure involves:
an input layer, having the same dimensions as the input data (e.g., a colour image having dimensions pixels by colour channels);
the convolutional layer, as described in the previous paragraph, usually increasing the depth of the volume by computing multiple filters;
a pooling layer, to perform downsampling on spatial dimension (i.e., width and height);
a final fully connected layer, to classify the input data based on the previously computed features, i.e., producing a vector with length equal to the number of possible classes.
It is noteworthy that not all the layers introduced above have learnable parameters, as ReLU and pooling layers do not, but all of them except ReLU are characterised by some hyper-parameters to be defined. Typical deep learning neural networks Lecun2015 differentiate themselves by the number of employed layers (called again depth, not to be confused with the depth of a convolutional layer), the way they are dimensioned and connected, and the choice of said hyper-parameters.
Based on the advancements in CNNs, Region-based Convolutional Neural Networks (R-CNNs) have been proposed to perform object detection tasks girshick2014rich ; girshick2015fast ; Ren2017 , i.e., the task of associating a number of bounding boxes with an image, each one possibly corresponding to (i.e., enclosing in image space) an object of interest. The R-CNN family includes R-CNNs girshick2014rich , Fast R-CNNs girshick2015fast and Faster R-CNNs Ren2017 . In general, these region-based approaches are organised as a 2-step process: they generate a set of bounding box proposals, and submit those regions of interest to a classifier to determine whether any of them is an object (i.e., their objectness) and which object they correspond to. In brief, R-CNNs and Fast R-CNNs rely on external region proposals generated by Selective Search uijlings2013selective , and present a rather complex training pipeline. On the contrary, Faster R-CNNs add a fully convolutional layer, called Region Proposal Network (RPN), on top of the feature maps generated by the last convolutional layer. The RPN works by passing a sliding window over the set of convolutional feature maps of the last convolutional layer, so as to propose bounding box candidates of predefined scales and aspect ratios. The RPN defines a number of region boxes in the image space (called anchors) and ranks them on the basis of their likelihood of containing objects, in our case pallets. As is customary in Faster R-CNNs, for each sliding window on the convolutional feature map anchors are generated with different sizes and 3 different aspect ratios in all possible combinations girshick2014rich ; girshick2015fast ; Ren2017 , all of them compatible with standard EUR pallets. Furthermore, for each anchor a value is computed (1) by thresholding the overlap ratio between the areas of the anchor and of the ground truth bounding boxes, where the overlap ratio is measured as the IoU (which stands for intersection over union), defined as:
\[
\mathrm{IoU} = \frac{\mathrm{area}(A \cap G)}{\mathrm{area}(A \cup G)},
\]
where $A$ is an anchor and $G$ is a ground truth bounding box. In (1), the thresholds over the IoU can be tuned experimentally. Finally, these features are fed to a network with two main tasks, namely regression and classification. The regression output determines the predicted bounding boxes, while the output of the classification network is the value indicating whether each predicted bounding box contains an object, according to (1).
This implies that Faster R-CNNs achieve efficient and fully end-to-end training, as a single CNN is used for region proposal and classification. Hence, Faster R-CNNs address the limitations of other architectures and achieve greatly improved performance, being much faster than regular R-CNNs.
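The anchor generation and overlap-based labelling described above can be sketched as follows. This is an illustration under stated assumptions rather than the configuration used in the paper: boxes are axis-aligned tuples `(x_min, y_min, x_max, y_max)`, the anchor sizes in the usage below are arbitrary, and the default thresholds of 0.7 and 0.3 are the values commonly quoted for Faster R-CNN, not values confirmed by this text.

```python
from itertools import product


def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def make_anchors(cx, cy, sizes, aspect_ratios):
    """All size/aspect-ratio combinations, centred on one sliding-window position."""
    anchors = []
    for size, ratio in product(sizes, aspect_ratios):
        width = size * ratio ** 0.5   # ratio is width / height
        height = size / ratio ** 0.5  # so that width * height == size ** 2
        anchors.append((cx - width / 2, cy - height / 2,
                        cx + width / 2, cy + height / 2))
    return anchors


def label_anchor(anchor, ground_truth_boxes, high=0.7, low=0.3):
    """Label an anchor from its best overlap with any ground truth box."""
    best = max((iou(anchor, gt) for gt in ground_truth_boxes), default=0.0)
    if best >= high:
        return "positive"
    if best <= low:
        return "negative"
    return "ignored"
```

For instance, `make_anchors(0, 0, [8, 16, 32], [0.5, 1, 2])` yields nine anchors, one per combination, and two unit-offset boxes such as `(0, 0, 2, 2)` and `(1, 1, 3, 3)` have an IoU of 1/7.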
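The simplest sequential analysis method applies a Bayesian analysis to compare the joint class-conditional probabilities of the observations so far, by evaluating their ratio. If, at observation number $n$, the posterior probabilities of Class 1 (e.g., a pallet is present) and Class 0 (no pallet is present) given the observations $x_1, \dots, x_n$ are $P(C_1 \mid x_1, \dots, x_n)$ and $P(C_0 \mid x_1, \dots, x_n)$ respectively, and if $A$, $B$ ($A > B$) are two thresholds related to the balance between errors related to false positives and to false negatives, then the decision criterion is:
\[
R_n = \frac{P(C_1 \mid x_1, \dots, x_n)}{P(C_0 \mid x_1, \dots, x_n)}, \qquad
\begin{cases}
R_n \geq A & \Rightarrow \text{decide Class 1,} \\
R_n \leq B & \Rightarrow \text{decide Class 0,} \\
B < R_n < A & \Rightarrow \text{acquire a further observation.}
\end{cases}
\]
In the original formulation, the probabilities are assumed to be known and observations to be mutually independent, so if at step $n$ the class-conditional probabilities of the current observation are $p(x_n \mid C_1)$ and $p(x_n \mid C_0)$, respectively, we can write the basic sequential probability ratio as:
\[
R_n = R_{n-1} \, \frac{p(x_n \mid C_1)}{p(x_n \mid C_0)}.
\]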
In the present case, this method has been applied with scores generated by a soft classifier rather than true probabilities, as described in the previous Section. More importantly, the assumption of independence is not realistic when considering consecutive 2D scans acquired at the sensor's scan rate. Methods taking dependencies into account, for instance under a Markov assumption Novikov2001 , are available. These have a higher computational time, possibly incompatible with real-time operation. Moreover, they have a higher number of parameters, since they explicitly model the expected dynamics (for instance as a Markov chain), so they have higher sample complexity and are more prone to overfitting. It should be noted that the independence assumption in this context is safe, although possibly suboptimal, since it gives a worst-case estimate. As shown in Section 5, this worst-case approach proved to yield good results, so, considering its advantages in computational and learning complexity, it was the preferred approach.
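As an illustration of the sequential test discussed above, the following minimal sketch accumulates the log of the probability ratio over successive classifier scores. It assumes, for illustration only, that each score `s` in (0, 1) is treated as the probability of Class 1 and `1 - s` as that of Class 0, and that the two thresholds are derived from target error rates `alpha` and `beta` using Wald's classical approximations. The function name and all default values are hypothetical.

```python
import math
from typing import Iterable, Optional, Tuple


def sprt_decision(scores: Iterable[float], alpha: float = 0.05,
                  beta: float = 0.05, eps: float = 1e-6
                  ) -> Tuple[Optional[int], int]:
    """Sequential probability ratio test under the independence assumption.

    Returns (decision, n): decision is 1 (pallet), 0 (no pallet) or None
    (undecided), and n is the number of observations consumed.
    """
    upper = math.log((1.0 - beta) / alpha)   # accept Class 1 above this
    lower = math.log(beta / (1.0 - alpha))   # accept Class 0 below this
    log_ratio = 0.0
    n = 0
    for n, s in enumerate(scores, start=1):
        s = min(max(s, eps), 1.0 - eps)      # keep the logarithm finite
        # Independence: the ratio is updated by multiplying in the ratio
        # for the current observation (a sum in the log domain).
        log_ratio += math.log(s / (1.0 - s))
        if log_ratio >= upper:
            return 1, n
        if log_ratio <= lower:
            return 0, n
    return None, n
```

Working in the log domain avoids numerical underflow when many observations are accumulated.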
In this Section, we discuss the structure of the proposed architecture, which is depicted in Figure 2. It consists of three parts. First, raw range data are acquired, and each scan is converted into a 2D bitmap-like image, so that it is in the most appropriate format for a CNN. Then, a dataset of real-world 2D scans (each one converted into a bitmap) is collected and offline training is performed. Once training is complete, the pallet detection module trained in the previous step is coupled with a Kalman filter-based tracker, which is used online to match potential pallet detections over time. The novelty of this step is that, instead of immediately accepting a potential pallet, the decision can be deferred until sufficient confidence in the candidate is achieved, reducing the chance of pursuing a false positive and stabilising true positives in case of temporary occlusion or sensor noise. On the other hand, if the candidate's confidence falls below a certain threshold or the candidate disappears for several frames, it is simply discarded.
A single laser rangefinder scan taken at a given time instant can be represented as a set of polar coordinates, i.e., one pair of range and bearing angle for each range point, where the number of points is related to the angular resolution of the sensor. Hence, each range is the measured distance of an object with respect to the rangefinder location in the direction given by the corresponding angle. A binary image of the operating area's floor plan can be obtained by converting each point of the acquired data to Cartesian coordinates, i.e., by multiplying its range by the cosine and the sine of its angle, respectively.
This second representation is preferred for object detection and tracking, as it allows for recovering correlations among neighbouring pixels in 2D images, which can be exploited by the CNN layers. In particular, we first convert data from the laser rangefinder, limited to a maximum depth, into binary images. Such images are then resized, so that each pixel covers a fixed area of the floor plan. Such discretisation has been deemed sufficient to take into account motion noise during pallet forking actions.
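The conversion from a polar scan to a binary image can be sketched as follows. The placement of the sensor at the image centre, the square image, and the names `scan_to_image`, `max_range` and `size` are assumptions introduced for illustration; the actual image geometry used in this work may differ.

```python
import math
from typing import List, Sequence


def scan_to_image(ranges: Sequence[float], angles: Sequence[float],
                  max_range: float, size: int) -> List[List[int]]:
    """Rasterise a polar scan into a size x size binary grid.

    Assumption: the sensor sits at the image centre and the image spans
    [-max_range, max_range] along both axes.
    """
    image = [[0] * size for _ in range(size)]
    cell = 2.0 * max_range / size            # side of one pixel, in metres
    for r, theta in zip(ranges, angles):
        if r <= 0.0 or r > max_range:        # discard invalid or far readings
            continue
        x = r * math.cos(theta)              # polar to Cartesian
        y = r * math.sin(theta)
        col = min(size - 1, int((x + max_range) / cell))
        row = min(size - 1, int((max_range - y) / cell))  # image rows grow downwards
        image[row][col] = 1
    return image
```

For instance, with `max_range = 2.0` and `size = 4`, each pixel covers a square of side 1 m.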
Once this operation is done, images are ready to be used for online detection, localisation and tracking. However, two additional steps are required to prepare the necessary datasets for training purposes, namely the use of artificially generated images and the definition of specific regions of interest (ROIs). ROIs are 2D bounding boxes of the objects (e.g., pallets) in the dataset images and, as described above, are defined by their top-left and bottom-right corner points, which uniquely identify a region's position and size. First, the available dataset of real-world data is augmented with artificially generated images. The new images can be obtained by translations and rotations of the original ones, so that the network generalises better with respect to pallet locations and poses. The generation of artificial data with the aim of reducing overfitting in image-based training has already been used teng2010real ; brust2015convolutional . Such a technique also has the advantage of reducing the time and effort devoted to: (i) collecting a large amount of real-world data, and (ii) labelling such data with ground truth so that they can be used for training, as it is possible to infer the new labels whilst the corresponding data are generated. This first dataset is used to train the Faster R-CNN detector. Once this first network has been trained, it can be used to extract the ROIs associated with the objects in the dataset. The ROIs dataset, as described in Section 5, is then used to train the CNN-based classifier, which determines which ROIs may correspond to a pallet. A summary of these steps can be found in Figure 3.
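The remark that new labels can be inferred while artificial data are generated can be illustrated for the rotation case. In the sketch below, which is only one possible way of doing it, the ROI of a rotated image is taken as the axis-aligned box enclosing the four rotated corners of the original ROI. The function names and the corner-based box layout `(x_min, y_min, x_max, y_max)` are hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def rotate_point(x: float, y: float, cx: float, cy: float,
                 angle_deg: float) -> Tuple[float, float]:
    """Rotate (x, y) about (cx, cy); positive angles follow the standard
    rotation matrix (counter-clockwise with y up, clockwise in image rows)."""
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))


def rotate_roi(box: Box, centre: Tuple[float, float], angle_deg: float) -> Box:
    """Label for the rotated image: the box enclosing the rotated corners."""
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    rotated = [rotate_point(x, y, centre[0], centre[1], angle_deg)
               for x, y in corners]
    xs = [p[0] for p in rotated]
    ys = [p[1] for p in rotated]
    return (min(xs), min(ys), max(xs), max(ys))
```

The enclosing box is slightly larger than the rotated object for angles that are not multiples of 90 degrees, which is acceptable for small rotations.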
In order to track pallets in 2D images obtained via range data, we first need an approach to reliably detect them in each single image. As anticipated, we designed such a module using neural networks. This Section focuses on the general architecture of such neural networks. Details on the training process for our specific experiments, such as the size and composition of the datasets, are given in Section 5.2.
The pallet detection process is made up of two steps: a state-of-the-art Faster R-CNN detector, which detects the ROIs in each image, and a CNN-based classifier, which takes as input the detections from the previous step and determines which of them could be a pallet candidate. In the first step, we use a Faster R-CNN for two reasons: (i) it allows for detecting possibly multiple pallets while being robust to false positives, and in this sense the Faster R-CNN provides us with a number of ROIs that can undergo further inspection in the second step; (ii) we want to estimate the position of each detected pallet, and the centroid of an associated ROI can be used for that purpose. However, the Faster R-CNN is not sufficient for a reliable identification of each ROI, and this is why a CNN-based classifier is necessary. It is noteworthy that, although the 2D laser rangefinder is a robust and reliable sensor, the amount of data it provides is limited to partial object contours on a plane. A CNN-based classifier is able to detect pallets in a ROI with a small number of false positives despite the limited amount of cues. As anticipated above, the two networks are completely independent, and are trained with different training sets.
The Faster R-CNN detector is composed of several layers, divided into three main stages: the input layer, an intermediate convolutional stage, and a final fully connected stage. The input layer consists of the input image corresponding to the 2D scan, downscaled to a smaller grey-scale or RGB image to improve general performance. The central convolutional stage is made up of two convolutional layers, interleaved with two ReLU layers, and followed by a final max-pooling layer. Each convolutional layer applies a bank of filters with a fixed size, stride and padding, whereas the max-pooling layer employs pooling regions with a fixed size and stride. The final stage is composed of two fully connected layers, followed respectively by a ReLU layer and a softmax classification layer. The output of the first fully connected layer, after the ReLU layer, is an array representing the most significant features in the image. Such features are then used by the last fully connected layer combined with the softmax classification layer to determine whether a ROI proposed by the RPN belongs to one of the object classes (i.e., pallets) or to the background, using sequential classification. The overall output is a list of candidate ROIs, defined by two corner points as described in the previous Section.
The CNN-based classifier that follows is trained to classify the most promising ROIs detected by the Faster R-CNN as pallets. The classifier is trained using a dataset obtained from the ROI bounding boxes and the original images, as detailed in Section 5.2. Compared to the first network, the CNN-based classifier has a simpler structure. The input layer receives full-size images filtered so that they contain only the ROIs, and therefore it has the same size as those images. The input layer is followed by a convolutional layer, a ReLU layer, and a max-pooling layer, each with a fixed filter or pooling size, stride and padding. Finally, a fully connected layer and a softmax classification layer are employed to compute the salient features and classify the image accordingly. The performance of the proposed networks is evaluated empirically with k-fold cross-validation, as it provides a reliable assessment of the network accuracy arlot2010survey without excessively burdening computation time.
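For reference, the way k-fold cross-validation partitions a dataset can be sketched as follows. The function name and the interleaved, unshuffled assignment of samples to folds are illustrative assumptions; in practice the samples are typically shuffled first, and the number of folds used in this work is not implied by the sketch.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    Every sample appears in exactly one test fold; the reported accuracy
    is then the average of the k per-fold accuracies.
    """
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = sorted(idx for j, fold in enumerate(folds) if j != i
                       for idx in fold)
        yield train, test
```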
The two networks described in the previous Section are trained to detect ROIs as bounding boxes and determine which of them correspond to a pallet. It is tempting to assume that the pallet detection problem is solved and to consider tracking just as a step necessary to approach pallets. We argue that deferring the decision about pallet detection until a sufficient confidence level is reached is a wiser approach, and that tracking should play an important role in the detection process. The aim is to avoid all those situations where a single spurious sensor reading can mislead the system towards a false positive, or make it immediately give up on a promising candidate pallet. Consequently, we do not immediately accept a certain ROI as a true positive pallet classification, but rather as a candidate that must be validated. This can be achieved by adopting a sequential classification approach, i.e., by tracking all candidate pallet detections but taking a final decision only later, when several scans have been acquired and the system has gained sufficient confidence.
As is typical when performing object tracking, the overall tracking process involves two steps:
(i) detecting candidate pallets in each single frame using the pre-trained networks as perception models, therefore obtaining a set of ROIs;
(ii) performing data association, i.e., associating ROIs related to the same pallet over different scans, which we refer to as a track.
In our system, data association for each track is based on perceived track displacements. Assuming that the robot is moving with a constant velocity, each detected pallet in the current frame is located in a slightly different location with respect to the previous one. In reality, pallets are still, but robot perceptions create the illusion of displacement because of ego-motion effects. Data association is achieved by estimating the motion of each candidate pallet over several frames using a linear Kalman filter cuevas2005kalman ; mohamed2017detection . The filter is used to predict the position of the centroid of each track in the current frame based on past positions, and the corresponding bounding box is updated accordingly. In the current version of the system, rotations are not considered. Then, the associations between ROIs and tracks are computed and ranked using the Hungarian algorithm kuhn1955hungarian . The algorithm minimises a cost function computed using the overlap between the bounding box location predicted by the Kalman filter and the bounding box detected by the pre-trained networks. The minimum is achieved when the predicted bounding box is perfectly aligned with the detected bounding box, i.e., the overlap ratio is one. In any frame, some of the ROIs may be assigned to tracks, while others may remain unassigned. At the same time, already available tracks may not be associated with any ROI in the current scan. Estimated centroids of assigned tracks are updated using the corresponding ROIs, inversely weighted by the corresponding confidence values acting as covariance matrices, while unassigned tracks are not updated. Each unassigned ROI originates a new track. In order not to propagate older tracks, each track is associated with a counter of the consecutive frames in which no association has been made with that track, and with its recent average confidence value. If the counter exceeds a specified threshold, or the average score associated with the ROI's likelihood of being an object is below a certain threshold, the algorithm assumes that the pallet associated with the ROI is no longer in the rangefinder's field of view, or that it is a false positive, and therefore it deletes the track.
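The association step can be illustrated with a minimal sketch that uses one minus the overlap ratio as cost. For brevity, it finds the minimum-cost assignment by exhaustive search, which yields the same optimum as the Hungarian algorithm on small problems but does not scale; it is a stand-in, not the implementation used in this work. The gating value `max_cost` and all names are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Overlap ratio (intersection over union) of two boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(predicted: List[Box], detected: List[Box], max_cost: float = 0.7):
    """Match predicted track boxes to detected ROIs.

    Returns (matches, unassigned_tracks, unassigned_detections), where
    matches is a list of (track_index, detection_index) pairs.
    """
    cost = [[1.0 - iou(p, d) for d in detected] for p in predicted]
    n_t, n_d = len(predicted), len(detected)
    if n_t <= n_d:   # every track receives a distinct detection
        candidates = (list(enumerate(perm))
                      for perm in permutations(range(n_d), n_t))
    else:            # every detection receives a distinct track
        candidates = ([(t, d) for d, t in enumerate(perm)]
                      for perm in permutations(range(n_t), n_d))
    best_pairs, best_total = [], float("inf")
    for pairs in candidates:
        total = sum(cost[t][d] for t, d in pairs)
        if total < best_total:
            best_pairs, best_total = pairs, total
    # Gating: pairs that overlap too little are not accepted as matches.
    matches = [(t, d) for t, d in best_pairs if cost[t][d] <= max_cost]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    return (matches,
            [t for t in range(n_t) if t not in matched_t],
            [d for d in range(n_d) if d not in matched_d])
```

Unassigned detections would start new tracks, while unassigned tracks would have their miss counter incremented.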
The data acquisition process is described in detail in Algorithm 1, which employs the two networks described above. The set of candidate pallets (i.e., the corresponding ROIs) is first initialised (line 2). A scan is acquired and converted to a bitmap-like image. Then, the Faster R-CNN, embedded in a function called DetermineROIs(), detects all possible ROIs. For each ROI, if the associated objectness score is above a given threshold, the ROI is passed down to the CNN-based classifier in function Score() to compute the associated confidence score. If such a score is higher than a second threshold, the ROI is included in the set of candidate pallets.
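The acquisition routine can be summarised by the following sketch, in which the scan-to-image conversion and the two networks are passed in as callables. The Python names `to_image`, `determine_rois` and `score` mirror the steps described above but are placeholders, and the two default thresholds are arbitrary illustrative values.

```python
from typing import Any, Callable, Iterable, List, Tuple

Roi = Tuple[float, float, float, float]


def acquire_candidates(scan: Any,
                       to_image: Callable[[Any], Any],
                       determine_rois: Callable[[Any], Iterable[Tuple[Roi, float]]],
                       score: Callable[[Any, Roi], float],
                       objectness_thr: float = 0.5,
                       confidence_thr: float = 0.5) -> List[Tuple[Roi, float]]:
    """Return the candidate pallets (ROI, confidence) found in one scan."""
    candidates: List[Tuple[Roi, float]] = []   # initialise the candidate set
    image = to_image(scan)                      # scan -> bitmap-like image
    for roi, objectness in determine_rois(image):   # first network
        if objectness <= objectness_thr:
            continue                            # weak proposal: skip
        confidence = score(image, roi)          # second network
        if confidence > confidence_thr:
            candidates.append((roi, confidence))
    return candidates
```

Passing the networks as callables keeps the sketch independent of any specific deep learning toolbox.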
Online pallet tracking is described in more detail in Algorithm 2. The set of candidate pallets to track (i.e., the corresponding ROIs) is initially empty (line 3), and it is updated as the Algorithm proceeds. The set of unassigned possible candidates and the set of updated pallet candidates are initialised (lines 4 and 5), and updated afterwards. At each iteration, the Algorithm first calls the Acquisition routine (line 7), and the set of candidate pallets is determined. Then, for each already tracked candidate pallet, a number of parameters are retrieved, i.e., the number of times it has been detected (line 9), its average confidence (line 10), and its pose (line 11). Afterwards, its predicted pose is computed using the robot velocity (line 12).
For each candidate pallet in the current acquisition, one of the following cases is foreseen:
if the associated ROI closely matches the expected pose of an already tracked candidate pallet (line 16), the ROI is associated with the same candidate, the candidate pallet's pose is updated with the new observation (line 17), the number of times the pallet has been detected is increased (line 18), the confidence in the candidate is updated by taking the average of each detection's confidence score in a recent time window (line 19), and the candidate is labelled as updated (line 20); data association is achieved by computing how much the two relative bounding boxes overlap and comparing the result to an acceptance threshold;
if the candidate pallet does not match any currently tracked candidate with sufficient precision, it starts to be tracked as a new candidate, and it is labelled as unassigned (line 22).
For all unassigned candidate pallets, a corresponding tracked candidate pallet is generated and initialised (lines 27-29). If a tracked candidate pallet does not match any detected ROI (line 25), then it is assumed to be currently not visible and the average confidence in that candidate pallet decreases (line 32). The Algorithm can attempt to take a decision on every currently tracked candidate based on the associated confidence:
if the average confidence exceeds a given acceptance threshold, and the candidate has been detected more than a minimum number of times (line 35), then the candidate is recognised as a pallet;
if the average confidence decreases below a given rejection threshold, or the candidate has not been detected for a given number of frames (line 37), it is removed.
Also note that more than one candidate can be confirmed at any time, effectively allowing the system to track multiple pallets in the environment, if present. Finally, the detection, localisation, and tracking system loops through these steps, and whenever a detection is confirmed, it communicates the pallet's pose to the robot control architecture, so that further action can take place (e.g., approaching the pallet).
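The deferred decision on each tracked candidate can be sketched as follows. The `Track` class, the window length, and the choice of recording a zero confidence for every missed frame (one simple way of making the average decrease) are assumptions made for illustration; the threshold values passed to `decide` in any usage are likewise hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class Track:
    window: int = 5                     # length of the recent time window
    detections: int = 0                 # times the candidate was detected
    missed: int = 0                     # consecutive frames without a match
    recent: Deque[float] = field(init=False)

    def __post_init__(self) -> None:
        self.recent = deque(maxlen=self.window)

    def update(self, confidence: float) -> None:
        """The candidate was matched to a ROI in the current frame."""
        self.detections += 1
        self.missed = 0
        self.recent.append(confidence)

    def miss(self) -> None:
        """No ROI matched: assumed choice, record a zero confidence."""
        self.missed += 1
        self.recent.append(0.0)

    def average_confidence(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0


def decide(track: Track, accept_thr: float, reject_thr: float,
           min_detections: int, max_missed: int) -> str:
    """Return 'pallet', 'removed' or 'candidate' for a tracked candidate."""
    avg = track.average_confidence()
    if avg > accept_thr and track.detections > min_detections:
        return "pallet"
    if avg < reject_thr or track.missed > max_missed:
        return "removed"
    return "candidate"
```

A candidate that is neither confirmed nor removed simply remains tracked until more scans are acquired.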
Considering trade-offs, such an approach adds a small delay between the instant a pallet is first detected by the classifier and the moment when it is actually recognised as such by the system, allowing the robot to act on the pallet. On the other hand, we argue that such a delay is usually very short even on modest hardware, and can be managed by acting on the choice of the parameters used in Algorithm 2. Considering also the moderate speed at which these robots are meant to operate, this is a reasonable trade-off in order to achieve extremely few false positives and a more stable detection of true positives.
Our setup employs a commercially available 2D laser rangefinder from SICK AG, model S3000 Pro CMS. The rangefinder is connected to a PC through a RS422-USB converter. The sensor has a maximum range of ( at reflectivity), a resolution of , a maximum refresh frequency, and an empirical error of . The maximum field of view of the rangefinder is , which is largely sufficient for the detection of objects in front of the robot. As mentioned in Section 4, the sensor generates an array of points expressed in polar coordinates, each array having size .
Two different computers have been used for experimental validation: a lower-end PC is used online for range data acquisition and pallet detection, whereas a more powerful workstation has been adopted for offline training and testing the proposed networks. In particular, the former is equipped with an Intel® Core i5-4210U CPU and GB of RAM, and runs Ubuntu 16.04 bit. The latter mounts an Intel® Core i7-4790 CPU, GB of RAM and an Nvidia Geforce® GTX970 GPU, and runs Ubuntu 14.04 bit.
The overall architecture has been implemented using MATLAB- and ROS-based software components. In particular, range data are acquired using an ad hoc ROS node developed in C++, whereas pallet detection, classification and tracking are implemented in MATLAB using the Computer Vision System Toolbox. The Robotics System Toolbox has been used to interface MATLAB with ROS in order to perform online pallet detection, classification and tracking, and (offline) the training of the two neural networks.
The first step in our experiments consisted in training the neural networks described in the previous Section. A dataset has been collected in an indoor environment including pallets, trolleys, furniture, and multiple obstacles such as walls and other robots. The dataset consists of 340 2D range scans, each one corresponding to a frame from the 2D laser rangefinder located in a different position. Raw data are converted to 2D bitmap-like images and augmented by generating new artificial images, obtained by rotating the original images clockwise and anticlockwise by a fixed angle. As a consequence, the total dataset comprises both the original and the rotated images. Each image is resized and indexed in a CSV file. In the file, each line corresponds to a single 2D bitmap-like image and has two main entries for training the Faster R-CNN detector. The first entry is the path to the image, while the second entry contains the ROI labels in the image, i.e., pallet. Further entries are added after the Faster R-CNN processing step, so as to list all the ROIs detected in the image in two classes for training the CNN-based classifier.
In order to train the Faster R-CNN detector, the whole dataset has been divided into two parts, a training set and a test set. Stochastic Gradient Descent (SGD) with a fixed initial learning rate has been used to train the network. The training process runs for a fixed number of epochs and takes minutes on our workstation. Once training is complete and all the ROIs are generated, the corresponding bounding boxes are additionally filtered using non-maximum suppression with a threshold on their overlap ratio, as shown in (1). Figure 4(a) shows a sample image, the ROIs detected in it, and their corresponding confidence scores, while Figure 4(b) shows only the ROIs remaining after suppression.
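The non-maximum suppression step can be illustrated with a minimal greedy sketch: boxes are visited in decreasing order of confidence, and a box is kept only if it does not overlap too much with any box already kept. The function names and the default overlap threshold are illustrative assumptions and do not reflect the value used in the experiments.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Overlap ratio (intersection over union) of two boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: Sequence[Box], scores: Sequence[float],
                        overlap_thr: float = 0.5) -> List[int]:
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep the box only if it is sufficiently distinct from all kept boxes.
        if all(iou(boxes[i], boxes[j]) <= overlap_thr for j in keep):
            keep.append(i)
    return keep
```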
As anticipated above, the set of all ROIs obtained through this procedure can be labelled and used as an input for training the CNN-based classifier. Considering for example the case in Figure 4(b), several ROIs are detected, and therefore as many new images are generated with the same size as the original one, but including only data inside the ROI's bounding box. This process is depicted in Figure 5. Class 0 objects (i.e., objects unlikely to be pallets) are separated from Class 1 objects (i.e., pallets) on the basis of the confidence score associated with the related bounding box, as shown in (3). Considering that each image has to be labelled manually, a smaller set is actually used to train the CNN-based classifier, i.e., only as many images as are strictly necessary to achieve satisfactory accuracy on the test set. Hence, a subset of images has been randomly selected among the available samples and labelled: some of them represent a pallet (Class 1 in Figure 5), while the others represent some undefined object (Class 0 in Figure 5). SGD and k-fold cross-validation are used to train the CNN-based classifier, with a fixed initial learning rate and mini-batch size.
After having developed the architecture described in Section 4.3, and having trained the Faster R-CNN detector and the CNN-based classifier, we ran several real-world experiments to validate the proposed approach. We recorded the data generated by the rangefinder as it moved with constant velocity in our test area. Four different trajectories, each involving a sequence of scans (i.e., successive frames), have been recorded. The trajectories differ in the initial and final position of the rangefinder, the pallet's pose (i.e., position and orientation), and the location, size and shape of obstacles in the area. As far as obstacles are concerned, dynamic obstacles have also been considered: either moving obstacles (including humans) or static obstacles that have been removed from the area during data acquisition.
In this Section, we report our analysis of the employed pallet tracking approach. As far as the used parameters are concerned, the time window size on which we compute the average confidence score on a tracked candidate pallet has been set to , whereas the maximum time a tracked candidate pallet can be invisible before being discarded by the system has been set to . The average confidence acceptance and the average confidence rejection thresholds have been set to and , respectively, whereas the minimum number of times (i.e., frames) a tracked candidate pallet must be acquired before it can be confirmed as a pallet given a sufficient average confidence score is set to .
Single pallet tracking. We applied the process illustrated in Algorithm 2 to real-world data obtained by imposing the trajectories described above. For all four trajectories, the approach is able to detect the pallet and avoid false positives. As an example, Figure 6 shows the salient events in one of these trajectories. In the frames, each ROI represents a tracked candidate pallet. For the sake of clarity, ROIs are generically contoured in yellow when they first appear in the robot's field of view, and are assigned an identification number and a characteristic colour only after they have been detected more than the minimum number of times with a high confidence score. In frame 1, a ROI is immediately detected (T1), while a second one appears in a later frame (T2). T2 is a weak candidate, and its average confidence falls below the rejection threshold in frame 16. Consequently, the Algorithm stops tracking that candidate, considering it a likely false positive (i.e., a Class 0 detection). It is noteworthy that higher values of the rejection threshold lead the Algorithm to delete weak candidates faster, but also increase the likelihood of deleting a true positive pallet. As a consequence, this threshold should be kept fairly small. On the other hand, in frame 17 the Algorithm takes a positive decision on T1, as its average confidence surpasses the acceptance threshold and it has been detected at least the required number of times.
As far as performance aspects are concerned, the average computation time for frames is with a variance of and root mean square variation of . This result seems to suggest that it is possible to run the pallet tracking Algorithm at a frequency of about , which is sufficient to apply it online even with robots moving at a moderately high speed, such as the ones employed in indoor logistics applications.
Finally, note that in theory, due to the simple nature of the data provided by the 2D laser rangefinder, the system should be prone to false positives. Yet, we have not experienced any such issue in our testing, even though objects with similar contours, such as trolleys and tables, were present in the environment. This may be partially due to the online pallet tracking strategy outlined in Section 4.3.
Multiple pallets tracking. As anticipated above, we used artificially generated data to get preliminary results in scenarios possibly involving multiple pallets. This was not part of the reference scenario presented in Section 2.2, but it was worth exploring as it is an intrinsic property of the proposed system. Our results confirm the ability to detect pallets and avoid false positives even when two pallets are present in front of the robot. Figure 7 depicts one example using the same graphical representation as the one introduced above. In this case, the Algorithm detects three possible candidates in the first five frames, but the third one (T3) is dropped in frame 16, due to its low average confidence score. In contrast, the first two (T1 and T2) are stronger candidates and they are finally accepted as pallets in frame 17. Concerning performance metrics, the average computation time for frames is with a variance of and root mean square variation of .
At the moment, the main limitation in the multiple pallets case is that the system is not able to unequivocally identify pallets, but only to distinguish them from each other. This was not a limitation in our reference scenario as described in Section 2.2, but it can be an issue in other applications. In future work, we will explore solutions to this problem, for example employing identification markers, which are much less prone to robustness issues than localisation ones.
The paper discusses a possible solution to the problem of detecting, localising, and tracking pallets using machine learning techniques and 2D laser rangefinder data only. This is achieved by converting 2D rangefinder data into bitmap-like images where CNNs can look for possible candidate pallets. Pallet candidates detected by the two CNNs in cascade are passed down to a Kalman filter-based tracker, which provides an estimate of pallet positions at any time, even when they are momentarily not visible, and helps the system filter out false positives. In the paper, we detail the proposed architecture and present detection and classification results. We conclude that our approach is a viable solution to detect, localise and track pallets reliably, while attaining reasonable performance for real-world applications.
Future work includes refining the precision of the position estimate for pallets with respect to a robot-centred reference frame, as well as integrating orientation estimation and running a series of experiments on site to further validate the approach. We intend to explore the possibility of targeting several pallet types or different objects at the same time, since the approach offers a certain degree of modularity: it is possible to add more CNN-based classifiers in parallel without the need to retrain the existing networks or substantially modify the tracking algorithm. Finally, we will explore methods that may allow the system to uniquely identify each item it detects and tracks.
In: Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Minneapolis, MN, USA (2007)
In: Proceedings of the 2008 IEEE International Conference on Tools with Artificial Intelligence (ICTAI). Dayton, OH, USA (2008)