Machine Vision in the Context of Robotics: A Systematic Literature Review

by Javad Ghofrani, et al.

Machine vision is critical to robotics because a wide range of applications, such as autonomous mobile robots and smart production systems, rely on input from visual sensors. To create the smart homes and systems of tomorrow, an overview of current challenges in the research field, created in a systematic and reproducible manner, would help to identify possible further directions. In this work, a systematic literature review was conducted covering research from the last 10 years. We screened 172 papers from four databases and selected 52 relevant papers. While robustness and computation time have improved greatly, occlusion and lighting variance are still the biggest problems faced. From the number of recent publications, we conclude that the observed field is of relevance and interest to the research community. Further challenges arise in many areas of the field.




I Introduction

Robotics is a fast-growing research field with widespread applications in areas such as smart production systems, cyber-physical systems and smart homes. Robots are now able to inspect their environments and react independently to changes, discrepancies and unforeseen situations. Because less human maintenance is needed, the results are higher production quality, lower rejection rates and reduced costs. Intelligent robots would not be possible without the capability to react to their environment through vision and other sensors. A future without intelligent industrial and personal robots is hard to imagine. As this development shows no signs of slowing, it is difficult to capture all current trends and challenges. The aim of this study is to determine the current state of the art and possible research gaps in the field of machine vision in the context of robotics through a systematic literature review, as proposed by Kitchenham et al. [1]. The advantages of a systematic literature review are its reproducibility and repeatability, which are ensured by its systematic execution and strict documentation. The review is performed using a strict review protocol to achieve maximum replicability and minimal bias. If it is performed again by independent researchers, the results should differ only slightly, owing to newly published papers. The method shares its initial steps with the systematic mapping study proposed by Petersen et al. [2, 3], but additionally factors in the quality of the papers and provides much greater detail.

I-A Related Work

Before carrying out our own literature review, it is important to look for related works that show how other researchers have structured their evaluations and which aspects they have examined. Surprisingly few papers were found that have objectives similar to those of our literature review. The first paper, from 2012 [4], is a review of the image recognition techniques that are or could be used in automated agriculture. Aspects of image recognition such as camera technology, recognizable features and recognition algorithms are discussed in the context of agricultural applications. The problems described are very similar to those in industrial applications, but sometimes even more challenging. The second paper, from 2016 [5], gives an overview of the current methods for object recognition based on local invariant features. For each stage of object recognition, the most important technologies are listed and described in detail. Finally, a research project based on the described technologies is carried out. Table 3 lists multiple facets which have or have not been part of these two studies. The work of Kapach et al. [4] gives a broad overview of its field, while [5] shows a narrow slice of it in great detail. Furthermore, an object tracking survey was conducted in [6]. Finally, the third paper [7] reviews 19 neural network techniques used in image processing and its applications, weighing the pros and cons. Our approach is different in that it covers a broader field with no specific domain while following a strict review protocol. Additional material to our study is available to the interested reader on Figshare.


I-B Objective

This review is performed using a strict review protocol to achieve maximum repeatability and minimal bias. If it is performed again by independent researchers, the results should differ only slightly, owing to newly published papers. Therefore, if this study were performed again at a later time under the same conditions, the differences would show the progress of research in this field.

The rest of this paper is organized as follows: Section II describes the methodology used to perform the literature review in detail, Section III presents the results and answers the research questions, and Section IV summarizes the work and gives an outlook.

II Research Methodology

The methodology closely follows the approach proposed by [1] and shares the initial procedure steps of the systematic mapping studies proposed by [3]. First, the research questions for the review were defined; they should aim to produce the most relevant results in the chosen field of research. Next, a review protocol was created to specify and pin down all following steps. This allows the study to be recreated under the same conditions in the future. The selection of primary studies included building and fine-tuning a search string, followed by searching appropriate databases with it. Next, inclusion and exclusion criteria derived from the research questions were applied to these results. The quality of each paper was evaluated by checking off a list of quality metrics. Suitable metrics were derived from the research questions to extract data from the data set. Afterwards, the acquired data was synthesized by collating, summarizing and documenting the obtained results. Finally, the review was completed by answering the research questions with the acquired data. The general steps of the procedure are displayed in Figure 1.

Fig. 1: Procedure conducted for Systematic Literature Review

II-A Review Protocol

The predefined protocol specifies the methods that were used to carry out the systematic review. The protocol is necessary to reduce the possibility of researcher bias and to improve replicability.

II-B Research Questions

The research questions to be addressed by this study are:

  • RQ1: Studies on which image processing techniques result in the most practical applications?

  • RQ2: Which problems are typically solved by the studies in this research field and which ones are still unsolved?

II-C Search Process

The search for papers was conducted with a predefined search string, which was adapted to each research database used. The aim was to collect around 150 initial papers, which were later reduced to around 50 papers by applying inclusion and exclusion criteria. We limited the search period to 2009 through October 2018, because Nvidia began supporting GPU computation by providing the CUDA platform at that time [9], which allowed developers to directly utilize the power of GPUs for computational tasks. The databases used, together with the respective search strings and the number of results, are shown in Table I.

Database           Search string   Number of results
IEEE               Figure 5        40
Scopus             /               105
ACM                Figure 6        9
Web of Knowledge   /               18
TABLE I: Databases used, search strings and number of results.

II-D Inclusion and Exclusion of Papers

Inclusion and exclusion criteria define whether a paper that matches the search string is suitable for the final set. They were applied manually after reading the abstract and keywords of each paper. This process was used to further reduce the number of papers and keep only the most relevant ones. Two researchers completed this task independently to reduce subjectivity. Differences in decisions were discussed and resolved. The final set contained 52 papers.

Inclusion Criteria

papers and articles since 2009, from a computer science or engineering background which uses a deep learning approach, peer reviewed, experience reports, pose estimation systems, vision based gesture recognition, visual servoing, experiments

Exclusion Criteria

non-English articles or sources of subjective quality like summaries or keynotes, keywords only in background of abstract, extreme or very specialized applications, secondary studies, no reference to the topic given

II-E Quality Assessment

Each paper's quality was evaluated with a set of 14 predefined boolean questions, each covering a specific area and carrying an assigned numeric value. The value reflects the importance of the respective aspect for the quality of a paper. The higher the overall score of a paper, the higher its apparent quality. The quality measurements used are presented in Table II. The quality was measured by both researchers, and the mean of the two scores was then calculated.
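
The scoring scheme described above can be illustrated with a minimal sketch. The three checklist questions and their weights below are placeholders chosen for illustration only; the actual 14 questions and their values are those of Table II and are not reproduced here. The function names are likewise assumptions.

```python
from statistics import mean

# Hypothetical checklist of (question, weight) pairs. The review uses 14
# questions (Table II); these three entries and weights are placeholders.
CHECKLIST = [
    ("states research goal", 2.0),
    ("documents experimental setup", 3.0),
    ("includes statistical reasoning", 1.5),
]


def quality_score(answers):
    """Sum the weights of all checklist questions answered with True."""
    if len(answers) != len(CHECKLIST):
        raise ValueError("one boolean answer per checklist question is required")
    return sum(weight for (_, weight), ok in zip(CHECKLIST, answers) if ok)


def combined_quality(answers_a, answers_b):
    """Mean of two reviewers' scores, as points and as a fraction of the maximum."""
    points = mean([quality_score(answers_a), quality_score(answers_b)])
    max_points = sum(weight for _, weight in CHECKLIST)
    return points, points / max_points


# Two reviewers disagree on the third question; the mean absorbs the difference.
points, fraction = combined_quality([True, True, False], [True, True, True])
print(f"{points} points ({fraction:.0%} of the maximum)")
```

Reporting the score both in points and as a fraction of the maximum mirrors how quality is quoted later in the text, where both forms appear.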

II-F Data Collection

The answers to the following questions were extracted from each paper:

  • The year when the paper was published

  • Quality score for the study

  • From which field of research does this study come? e.g. visual servoing, robot navigation, robot manipulation.

  • How directly can the results be transferred into practical applications? (0: purely theoretical treatise, 10: finished product was presented)

  • On which existing methods is the approach based?

  • Which improvements were made through the study?

  • In which areas were improvements made?

  • What is the magnitude of the progress made?

The data was extracted by both researchers, differences being resolved manually.

II-G Data Analysis

The data was entered into a spreadsheet to show basic information and the collected answers for each paper.

III Results

In this section, the collected data from the reviewed papers will be examined and analyzed.

III-A Metadata

III-A1 Publication Year

Section II-C explains why we only included papers released since 2009. The collected papers show that the number of publications per year is stagnant between 2009 and 2012 at one to three papers. The years between 2012 and 2017 show a slight but steady increase, and in 2018 the number of papers doubled compared to the preceding year. The data collection was carried out in October 2018; since then, a number of fitting papers have been released that have not been reviewed in this study. This suggests that the trend towards increasing research in this field is far from ending. The exact numbers can be seen in Figure 2.

Fig. 2: Publications per year

At the beginning of the time period covered, in 2009, Song et al. [10] proposed a pose-variant face recognition system based on BPNNs and the Active Appearance Model (AAM). In 2018, Shen et al. [11] trained a CNN based on the YOLO architecture to detect flames in video sequences.

III-A2 Publication Form

28 papers were released as journal articles, 23 were published at conferences and a single one as part of a workshop. This means there is only a slight trend towards publication in the form of journal articles. Articles are generally considered to be of higher quality due to their more self-contained nature, whereas conference papers are designed to be openly discussed.

III-A3 Measured Quality

We measured the quality of each paper by applying the quality measurements explained in Section II-E. The yes/no questions were mostly aimed at aspects of the craftsmanship of the paper, such as its completeness, its quality of documentation or the inclusion of statistical reasoning. The scientific value of the publication had to be evaluated subjectively by the researchers by questioning the credibility, importance and magnitude of the research. Another widely acknowledged parameter of quality is the number of citations a paper has received. Since many papers were published only recently, only a few have received more than a single-digit number of citations. We measured the quality of the four most cited papers, which are Miljković et al. [12] with a measured quality of 75%, Ghesu et al. [13] with 93%, Pinto et al. [14] with 73%, and Franceschini [15] with 100%. This shows a close correlation with our metrics and suggests that our measurement is sufficient for estimating the quality of the reviewed papers. 14 papers scored above 80% on our index, indicating a very high quality of work. Notable examples are: (i) Franceschini [15], which presented a very extensive collection of insect-inspired robot developments over the last decades. (ii) Martins et al. [16], where a very promising new shape coding approach using proto-object categorization was presented. (iii) Wen et al. [17], where a novel object recognition system using radar spectrograms as environment-independent input data was presented.

37 papers scored above 50% and only two papers had quality ratings below 20%. These numbers suggest that the majority of research is executed with sufficient rigor, showing the maturity of the field. Many authors are not native English speakers, which led to varying readability and comprehensibility. As a result, a portion of the papers was very time-consuming to read. Since the quality of the language does not reflect the quality of the content, we decided to give it only a low weight in the measurement. While a large number of researchers give a longer introduction for readers who are not familiar with the research field, a large proportion assume domain knowledge, which hinders readability for the untrained reader.

III-B Field of Research

The reviewed papers stem from a wide range of research areas, all of which are more or less closely related to robotics. This means that the results were either developed directly for use in a robotic environment or are otherwise beneficial for robot development. In the first step, we categorized the areas precisely, which resulted in only a few overlaps between separate studies, for example in robot navigation or object recognition. In the second step, we condensed these categories into seven upper categories.

III-B1 General Development

Under the term fundamental research, we grouped multiple papers that presented very low-level approaches with no immediate real-world usability. Using their findings as a foundation, however, great future works are possible. Cui et al. [18] optimized restricted Boltzmann machines (RBM) by restricting their data to sparse matrices, greatly improving the efficiency of RBMs. Angeletti et al. [19] refine the foundations of image recognition by training neural networks with image background data instead of foreground data, enabling more flexibility when detecting actual objects. The next field of research is automated visual inspection. A system or object is monitored with a camera or a set of sensors to ensure either its correct function or the absence of errors. Kadmin et al. [20] used radial basis function (RBF) networks in combination with a robotic arm to determine which class of consumer goods an object belongs to. Pei et al. [21] introduced a camera-based system with location-specific ANNs, which lets a robot arm reach its target in a simulated environment while being very efficient. Qiu et al. [22] used a camera to determine the magnitude of the vibrations of a component. They were also able to use an RBF network to reduce these vibrations, avoiding the need for a cost- or space-intensive sensor. Object tracking/detection summarizes the attempts at localizing, classifying, recognizing and tracking objects in single images or image sequences. Karayaneva and Hintea implemented different functions of the OpenCV computer vision library on a NAO robot, including the recognition of different colors and shapes for the purpose of child education. A more refined approach was presented by Kuremoto et al., who correctly classified a number of hand gestures using self-organizing maps (SOM) together with an asymmetric neighborhood function. Song et al. [10] proposed an algorithm based on a back-propagation neural network (BPNN) that detects faces even when they are not pointing directly at the camera.

III-B2 Robot Development

Visual servoing is the procedure of using collected image data directly to control a robotic system. Typically, a specific feature is selected and the robot heads towards it. Petković et al. [25] present a visual servoing controller that applies fuzzy control and neural networks to the task, improving it greatly. Robot manipulation covers all activities in which a robot is used to alter a system, mainly by grasping and placing small objects. Haochen et al. [26] described the training of a convolutional neural network (CNN) that lets a robot arm correctly localize and grasp three different types of circuit boards. Zhihong et al. [27] propose a robot arm grasping system which automatically detects and localizes items on a garbage conveyor belt with a fast region-based CNN (Fast R-CNN). Robot navigation has the goal of enabling robots to traverse known or unknown environments fully or partially autonomously. A very simple example, although only used as a test bed for a vastly more complex learning algorithm, is the line-following robot of Murali et al. [28]. Utilizing Q-learning with ANNs, the algorithm can use arbitrary sensory input data to perform its tasks. For the line following, a live video feed from the front of the robot was used as the input. Prieto et al. [29] presented research on swarm-based robots. These monitor neighboring swarm robots and mimic their behaviour using automatic neural-based pattern classifiers (ANPAC). These were tested in simulations.
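
To make the Q-learning idea behind the line-following example more concrete, the following is a minimal sketch of the standard Q-learning update rule. It deliberately swaps the ANN used by Murali et al. [28] for a plain lookup table, and the states, actions and rewards are hypothetical; it is not a reconstruction of their system.

```python
# Toy tabular Q-learning for a line follower (illustrative only).
# States: where the line appears in the camera image; actions: steering commands.
# A table replaces the neural network of the reviewed work (assumption).
STATES = ("line_left", "line_center", "line_right")
ACTIONS = ("steer_left", "straight", "steer_right")


def new_q_table():
    """Q-values start at zero for every state/action pair."""
    return {s: {a: 0.0 for a in ACTIONS} for s in STATES}


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    target = reward + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


def greedy_action(q, state):
    """Pick the action with the highest learned value in the given state."""
    return max(q[state], key=q[state].get)


q = new_q_table()
# Hypothetical experience: steering left while the line is on the left re-centers it.
first = q_update(q, "line_left", "steer_left", reward=1.0, next_state="line_center")
# Driving straight on a centered line keeps it centered (observed twice).
second = q_update(q, "line_center", "straight", reward=1.0, next_state="line_center")
third = q_update(q, "line_center", "straight", reward=1.0, next_state="line_center")
print(greedy_action(q, "line_left"), greedy_action(q, "line_center"))
```

With a neural network in place of the table, the same target value would serve as the regression target for the network output of the chosen action, which is what allows arbitrary sensory input such as a video feed.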

III-B3 Quantitative Differences

Of the 52 papers, the largest field of research is object tracking/detection with 15 papers, followed by robot manipulation and robot navigation with 12 and 11 publications, respectively. Seven papers describe automatic visual inspection systems, five are fundamental research and two center around visual servoing. We observed no evidence that the quality of the research correlates with specific research fields. All research areas had an equal distribution of high- and low-quality papers.
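
The distribution above can be cross-checked with a few lines of arithmetic. The snippet only restates the counts given in the text; the variable names are chosen for illustration.

```python
# Paper counts per research field as stated in the text.
papers_per_field = {
    "object tracking/detection": 15,
    "robot manipulation": 12,
    "robot navigation": 11,
    "automated visual inspection": 7,
    "fundamental research": 5,
    "visual servoing": 2,
}

total = sum(papers_per_field.values())
shares = {field: count / total for field, count in papers_per_field.items()}

# Print each field with its count and its share of the final set.
for field, count in papers_per_field.items():
    print(f"{field:28s} {count:2d}  {shares[field]:5.1%}")
print("total:", total)
```

The six counts add up to the 52 papers of the final set, with object tracking/detection accounting for a little under 30% of them.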

III-C Practicability

We measured practicability using a self-defined numeric metric in the interval [1,10], which scores each paper's practical relevance based on how its proposal was evaluated. We gave a score of 1 to papers that laid out foundations for other research without any evaluation, and a score of 10 when researchers tested their system or method, for example, on a real robotic system or mobile robot. Practicability depends on numerous factors, namely the type of benchmark or experiment used, the type of evaluation and the application area. Further typical indicators are the number of test objects and included pictures of the real setup. On average, we gave papers a score of 5.1 points, which means that the included papers are not purely theoretical research; we had ruled such work out through our search string. It also means that a portion of researchers have spent resources on non-simulation test scenarios.

We conclude that practicability does not correlate with quality, because the chosen metrics evaluate the quality of writing rather than the proposal itself. For example, [30] got a quality score of 37.5 (84%) while not being very practicable. On the other hand, [31] scored only 17 points but has a higher practical relevance (7). A similar case is [32], with high practicality but a low quality score. The most practical papers in our opinion were [33], which shows the system setup including a camera and a robotic arm, and [34], whose proposal was evaluated on real pig eyes. On the other side of the spectrum, [35] experimented with workpiece recognition on very simple geometric shapes and scored low.

Why are some approaches practicable and others not? We expect that this depends on the amount of time and resources spent on the project (a simulation is less expensive when funding is lacking), the currently achievable state of the art and the amount of improvement made through the study. For example, Gerrard et al. [36] carried out fundamental research, creating neural networks based on chemical reaction chains in cells and proving that they can provide complex behaviour without complex neural systems. For a proposal at an early stage, it makes more sense to evaluate it on a small example; if its usefulness is proven, then subsequent work can be implemented on real systems.

III-D How are the developments of the field structured?

Many papers describe the development of a new method, algorithm or system that is based on an already existing approach. If the base system is described in another paper from our review, interesting trees of development can be observed. Two such trees were observed, one building on the widely popular CNN, the other on the related Regions with CNN (R-CNN).

III-D1 CNN

Many papers found by our review are based on or utilize the concepts of CNNs. As Figure 4 shows, a very wide graph can be constructed from these connections, which demonstrates the relevance of CNNs in the field of robot vision. This popularity stems from their impressive successes in many image classification benchmarks. The first such demonstration is described by Krizhevsky et al. [37], where a top-1 error rate of 39.7%, groundbreaking at the time, was attained. Naturally, all subsequent ranking leaders used CNNs. CNNs are neural networks that utilize one or multiple convolutional layers followed by a pooling layer. Multiple such combinations are executed in sequence and finished by a fully connected layer. Because the learned weights of a convolutional layer are shared across all image positions, much less RAM is needed to extract the wanted features from images, and respectable outcomes are achievable without the use of supercomputers. Researchers stated multiple reasons for using CNNs in their work or basing it on them. The most common one is, unsurprisingly, that they wanted to base their work on methods that are commonly viewed as the state of the art. The development of CNNs themselves is usually outside their research scope. Such a paper is presented by Farazi et al. [38]. Papers like Quin et al. [39] compare multiple methods and ultimately choose CNNs. Yeboah et al. [40] recognized CNNs as the state of the art and used that argument to try to enhance them. The most elaborate reasoning is presented by papers like Wen et al. [17], where all the required features of an approach were listed and CNNs were finally chosen for the task. Therefore, the ease of computation of complex computer vision tasks can be stated as the main reason for the use of CNNs. After this was demonstrated and functional examples and frameworks were made available, large parts of the research community simply relied on this insight.
One may be tempted to critically question this development, since other, still unexplored approaches may be even more powerful for the specified tasks but are not being researched as widely because of the focus on CNNs. On the other hand, many great discoveries and breakthroughs are made simply because a simple, reliable platform for computer vision research exists.
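
The layer sequence described above (convolution, pooling, fully connected layer) can be sketched in a few lines of plain Python. This is a minimal illustration with hand-picked weights, a single channel and a single kernel; the function names and all numbers are assumptions, and no training is involved.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one kernel.
    The same few kernel weights are reused at every image position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Element-wise rectified linear activation."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


def fully_connected(fmap, weights, bias):
    """Flatten the feature map and compute one weighted sum per output unit."""
    flat = [v for row in fmap for v in row]
    return [sum(w * x for w, x in zip(w_row, flat)) + b for w_row, b in zip(weights, bias)]


# 5x5 toy image with a vertical edge between the second and third column.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]          # four shared weights (hand-picked)
pooled = max_pool(relu(conv2d(image, edge_kernel)))
scores = fully_connected(pooled, [[0.5, 0, 0.5, 0], [0, 1, 0, 1]], [0.0, 0.1])
print(pooled, scores)
```

The convolution here uses only four weights regardless of the image size, which illustrates why the memory footprint stays small compared to a fully connected layer applied to raw pixels.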

III-D2 R-CNN

The second tree (Figure 8) shows that all the major steps of the development of R-CNN and adjacent developments are present in our review scope. R-CNN was presented by Girshick et al. [41] and utilized for a partial problem in Lee et al. [42]. R-CNN uses a selective search algorithm to divide a given image into regions, which are then analyzed by a CNN. Fast R-CNN was again developed by Girshick [43]. Instead of running the CNN on each of the many regions, the whole image is fed into the CNN once to generate a single convolutional feature map, reducing the computation time immensely. Zhihong et al. [27] use Fast R-CNN to recognize and localize objects on a garbage conveyor belt, enabling a robot to grasp them in real time. The next step is Faster R-CNN by Ren et al. [44], where a separate network predicts the region proposals. Lee et al. [42] integrate Faster R-CNN into multiple CNN architectures, concluding that ResNet provides the highest precision of the tested models. Fu et al. [45] use Faster R-CNN in combination with a Zeiler and Fergus network (ZFNet) to detect the exact count and position of kiwifruits in photos taken in the field. Another, comparable approach is the You Only Look Once (YOLO) method proposed by Redmon et al. [46]. Instead of dividing the image into separate parts, a single convolutional network generates class probabilities for predicted bounding boxes. This makes the model very fast, while accuracy, especially for small details, is sacrificed. The model is used by Llopart et al. [47] to detect doors and door handles for an autonomous robot navigation system. The system described by Wang et al. [33] needs to localize, classify and finally sort many small objects in a short amount of time. Faster R-CNN and YOLO are both considered but ultimately rejected in favor of region-based fully convolutional networks (R-FCN) described by Dai et al. [48]. This network follows the same approach as YOLO by using position-sensitive score maps to classify whole images. Ultimately, it provides a better balance between accuracy and speed than YOLO. Interestingly, no other reviewed paper used R-FCNs, which may be due to their recency.
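
The speed-up of Fast R-CNN over R-CNN comes from computing one feature map per image and reusing it for every region. Fast R-CNN extracts the per-region features with an RoI pooling layer; the sketch below is a strongly simplified version of that idea in plain Python. The feature map values and the two region proposals are hypothetical, and the CNN itself is replaced by a hand-written array.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one region of a shared feature map into an out_size x out_size grid.

    roi = (top, left, bottom, right) in feature-map cells, bottom/right exclusive.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    if height < out_size or width < out_size:
        raise ValueError("region must span at least out_size cells per axis")
    pooled = []
    for i in range(out_size):
        r0 = top + (i * height) // out_size
        r1 = top + ((i + 1) * height) // out_size
        row = []
        for j in range(out_size):
            c0 = left + (j * width) // out_size
            c1 = left + ((j + 1) * width) // out_size
            row.append(max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Stand-in for the single feature map that the CNN computes once per image.
feature_map = [
    [1, 2, 0, 0, 5, 1],
    [3, 4, 0, 1, 0, 2],
    [0, 0, 7, 0, 0, 0],
    [0, 1, 0, 0, 9, 3],
]

# Two hypothetical, overlapping region proposals reuse the same feature map.
proposals = [(0, 0, 4, 4), (0, 2, 4, 6)]
region_features = [roi_max_pool(feature_map, roi) for roi in proposals]
print(region_features)
```

Every proposal yields a fixed-size feature grid without another pass through the convolutional layers, whereas the original R-CNN would run the CNN once per region.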

III-D3 Lessons Learned

Both observed trees show interesting properties of popular robot vision approaches. In the case of CNNs, a first-come, first-served mindset is visible. The first effective, functional approach to machine vision is used the most as a base for other developments. R-CNNs, which are also based on CNNs, show a different course of development, with a steady stream of improvements and the subsequent displacement by other, more performant developments.

III-E Research Questions

In the following, we discuss our findings relating to the research questions.

III-E1 Studies on which image processing techniques result in the most practical applications

We investigate this issue using a numeric metric that gives each paper a score in [1,10] for its practical relevance. A lower score indicates that only a theoretical foundation was proposed or that the evaluation was conducted in a simulation. A high score, on the other hand, means that the proposal was evaluated on a real robotic system, preferably under real-world conditions. The numbers in between represent gradations between these two extremes. The most practical papers included [34], [33], [17], [45] and [14], where researchers used robotic systems in the evaluation of their proposals. The method of evaluation was already discussed in Section III-C. We took the papers we evaluated as practical and observed the techniques used. Practicality often requires a sufficiently low computation time. Which of the techniques produce the most practical applications depends highly on the type of problem solved. In object recognition, 2D CNN architectures are used, relying on architectures built upon ImageNet [49] classifiers such as AlexNet, ResNet and GoogLeNet. These have had great success in the popular ILSVRC challenge, starting with the breakthrough by AlexNet [37], which used GPU computation in the training step. CNNs are used in vision applications because this type of data nearly always has spatial relationships between related objects in the image. The computational complexity of high-resolution images is reduced by downsampling and by scanning the whole image with sliding windows to select a region of interest (RoI). In object localization (by which we mean the detection of objects in an image and their segmentation from the background), CNNs as well as improvements building upon them, like R-CNN, are being used. For example, in [47] the YOLO [46] model (proposed in 2015) is used to identify door handles and estimate their pose for a grasping application tested on a mobile robot. There is an even faster, lightweight variant called Faster YOLO, which achieves a higher processing speed. ZFNet [50], combined with Faster R-CNN, is used in [45] to detect multiple kiwifruits in images of clustered scenes. When video sequences are taken as input, typical problems are the classification of actions or motion planning. Here, DCNNs are used, for example the VGG [51] architecture in [52], as well as a self-organizing map network, which in [53] receives information about the human's location and pose in a robot workspace based on pressure-activated nodes in a safety mat.
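
The combination of downsampling and sliding-window scanning mentioned above can be sketched as follows. The scoring function is a placeholder (sum of pixel values) standing in for whatever classifier a real system would apply to each window; the image, window size and stride are likewise assumptions for illustration.

```python
def downsample(image, factor=2):
    """Keep every factor-th pixel in both directions (no smoothing)."""
    return [row[::factor] for row in image[::factor]]


def sliding_windows(height, width, window, stride):
    """Yield (top, left) corners of all window x window patches that fit in the image."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left


def best_region(image, window, stride, score):
    """Return (score, top, left) of the highest-scoring window."""
    best = None
    for top, left in sliding_windows(len(image), len(image[0]), window, stride):
        patch = [row[left:left + window] for row in image[top:top + window]]
        candidate = (score(patch), top, left)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


def brightness(patch):
    """Placeholder score: sum of all pixel values in the patch."""
    return sum(sum(row) for row in patch)


# 8x8 toy image with a bright 4x4 object at rows 4-7, columns 2-5.
image = [[1 if 4 <= r < 8 and 2 <= c < 6 else 0 for c in range(8)] for r in range(8)]

factor = 2
small = downsample(image, factor)                 # 4x4 instead of 8x8
value, top, left = best_region(small, window=2, stride=1, score=brightness)
roi_full_res = (top * factor, left * factor, 2 * factor)  # (top, left, side length)
print(value, roi_full_res)
```

Scanning the downsampled image requires 9 window evaluations instead of the 25 needed for the equivalent 4x4 window at full resolution, and the selected region is mapped back to full-resolution coordinates afterwards.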

Iii-E2 Which problems are typically solved by the studies in this research field and which ones are still unsolved

Most reviewed papers are trying to solve one or more discrete problems which were imposed by their previous research. As a result, multiple classes of papers can be found. Approaches showcase the foundations and a short proof of work for a novel or improved, small subarea of a research field. The papers Enikov and Escareno [54], Olaque et al. [55] and Pinto et al. [14] were classified as approaches. Methods are more rigorous, as they have the aim of presenting a ready to deploy method, tested and validated for the intended use. Chen et al. [56], Peretroukhin and Kelly [57], Shirzadeh et al. [58] and Rupprecht et al. [59] present methods. Systems, frameworks and architectures present a whole environment, in which multiple approaches and methods are combined and their interaction with each other and the outside world is designed. Examples are Lin et al. [60], Sanders et al. [61] and Calli et al. [62]. The class algorithm presents the conception and testing of a single algorithm, as demonstrated by Li et al. [63]. Other classes like reviews, comparisons and implementations are trying to solve different, tertiary problems and shall not be discussed here. Each of these classes has a different tendency to what problems are trying to be solved and which problems arise during the research. Most researchers define a narrow range, in which their solution is settled and therefore works best. The limitations imposed by such a restriction were also considered problems.

  • Approaches mostly solved non-complex object recognition tasks. Shaker and ElHelw [64] present an easier to train OCR model, Shen et al. [11] trained a CNN to detect flames in an image sequence and Qiu et al. [22] use a RBF neural network to reduce vibrations in a flexible manipulator. The common flaw of these approaches is their very narrow field of work. Each is tested only on a small sample size or in a simple environment, so a lack in generalization abilities can be assumed. Of course, as stated before, this is the intended area of approaches.

  • The majority of methods try to solve object classification and localization problems, like Chen et al. [56] and Llopart et al. [47], both of which detect doors in rooms. The methods are tested in appropriately complex environments, which reduces the simplification error seen in approaches. Errors are therefore harder to overlook and easier to address. The most frequently reported flaw is the accumulation of accuracy errors.

  • Papers on systems, frameworks and architectures present whole environments and can therefore address a wide range of problems. All kinds of robot vision research fields, such as action recognition, motion planning and gesture recognition, are dealt with. Occlusion, where parts of an object are covered by the environment or by other objects, is a major problem, as described by Fu et al. [45] and Wang et al. [33]. The amount of training data needed for sufficient recognition ability is problematic for papers like Farazi et al. [38]. Many papers state the goal of achieving real-time results, but ultimately have to sacrifice accuracy to reach the desired run times. Insufficient sensors are named as a problem by papers like Najmaei and Kermani [53] and Qin et al. [39].

A major problem addressed by researchers in all classes is the variability of lighting in scenes. It drastically changes the conditions under which object recognition algorithms operate and increases the amount of training data needed. It has therefore already become a field of research in its own right. Wen et al. [17] approach the problem by using radar spectrograms as a light-invariant form of imaging. Angeletti et al. [19] try to train a CNN on domain-invariant features so that it recognizes object features which do not change with lighting.
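To make the lighting problem concrete, the following minimal sketch (not taken from any of the reviewed papers) models a lighting change as a global gain and offset applied to pixel intensities, which is a strong simplifying assumption. It illustrates that raw pixel values shift considerably under such a change, whereas a per-image standardization to zero mean and unit variance cancels this particular kind of change. The function names `relight`, `standardize` and `max_abs_diff` and the toy pixel values are hypothetical.

```python
from statistics import mean, pstdev


def relight(pixels, gain, offset):
    """Simulate a global lighting change: every intensity is scaled and shifted."""
    return [gain * p + offset for p in pixels]


def standardize(pixels):
    """Rescale intensities to zero mean and unit variance (per image)."""
    mu = mean(pixels)
    sigma = pstdev(pixels)
    if sigma == 0:
        # A constant image carries no contrast; map it to all zeros.
        return [0.0 for _ in pixels]
    return [(p - mu) / sigma for p in pixels]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally long lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Toy "image" as a flat list of grey values (hypothetical data).
scene = [10.0, 40.0, 90.0, 160.0, 250.0]
brighter = relight(scene, gain=1.5, offset=20.0)

# Raw intensities differ strongly between the two lighting conditions ...
raw_gap = max_abs_diff(scene, brighter)
# ... while the standardized versions agree up to floating-point error.
normalized_gap = max_abs_diff(standardize(scene), standardize(brighter))

print(raw_gap)
print(normalized_gap)
```

Real lighting changes such as shadows, highlights or colour shifts are not global affine transformations, so a normalization of this kind only covers the simplest case. This may be one reason why the papers cited above turn to alternative sensors or learned invariant features instead.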

III-F Threats to validity

Identified threats to the validity of our study and possible causes are:

External validity

Because the research field selected for this study is very broad, we assume that the study represents a cross-section of the available research. However, it may be limited by not covering the whole research field. The coverage depends on the chosen search string and is hard to measure. A more general search string would include more potentially relevant papers, but would also imply a more time-consuming process.

Internal and construct validity

The internal validity may be threatened by the fact that only two researchers extracted the data, which could lead to human errors and biased results. However, in critical situations, the results were double-checked by a third researcher.

IV Conclusion

Finally, based on the results of this systematic literature review (SLR), we suggest potential directions for future research. As robustness and computation time are two key components of real-time applications, we assume that the research field covered in this study will continue to find improvements in these areas, just like the numerous approaches inspired by AlexNet, the ILSVRC 2012 winner. As occlusion and lighting variance still pose challenges to the field, we further expect that these will be addressed as well. A portion of the reviewed work was at an early stage of research, so subsequent papers demonstrating the capabilities on more than simple experiments are anticipated. In conclusion, finding the balance between efficiency and accuracy is a dilemma in robot vision as much as in any other research area. In our future work, we will use the knowledge acquired in this study to solve an object recognition problem in a cyber-physical systems environment in real time.


  • [1] B. Kitchenham and S. Charters, “Guidelines for performing systematic literature reviews in software engineering,” 2007.
  • [2] K. Petersen, S. Vakkalanka, and L. Kuzniarz, “Guidelines for conducting systematic mapping studies in software engineering: An update,” Information and Software Technology, vol. 64, pp. 1–18, 2015.
  • [3] K. Petersen, R. Feldt, S. Mujtaba, and M. Mattsson, “Systematic mapping studies in software engineering.” in EASE, vol. 8.   Swindon, UK: BCS Learning & Development Ltd., 2008, pp. 68–77. [Online]. Available:
  • [4] K. Kapach, E. Barnea, R. Mairon, Y. Edan, and O. Ben-Shahar, “Computer vision for fruit harvesting robots–state of the art and challenges ahead,” International Journal of Computational Vision and Robotics, vol. 3, no. 1/2, pp. 4–34, 2012.
  • [5] P. Loncomilla, J. Ruiz-del-Solar, and L. Martínez, “Object recognition using local invariant features for robotic applications: A survey,” Pattern Recognition, vol. 60, pp. 499–514, 2016.
  • [6] Q. Liu, X. Zhao, and Z. Hou, “Survey of single-target visual tracking methods based on online learning,” IET Computer Vision, vol. 8, no. 5, pp. 419–428, 2014.
  • [7] M. Jena and S. Mishra, “Review of neural network techniques in the verge of image processing,” in Advances in Intelligent Systems and Computing.   Springer Singapore, 2017, pp. 345–361.
  • [8] D. Roßburg, R. Kirschner, and J. Ghofrani, “Online Material to systematic mapping study,” 2019.
  • [9] R. Shams and N. Barnes, “Speeding up mutual information computation using nvidia cuda hardware,” in 9th Biennial Conference of the Australian Pattern Recognition Society on Digital Image Computing Techniques and Applications (DICTA 2007), 2007, pp. 555–560.
  • [10] K. Song, S. Wang, M. Han, and C. Kuo, “Pose-variant face recognition based on an improved lucas-kanade algorithm,” in 2009 IEEE Workshop on Advanced Robotics and its Social Impacts.   IEEE, 2009, pp. 87–92.
  • [11] D. Shen, X. Chen, M. Nguyen, and W. Q. Yan, “Flame detection using deep learning,” in 2018 4th International Conference on Control, Automation and Robotics (ICCAR).   IEEE, 2018, pp. 416–420.
  • [12] Z. Miljković, M. Mitić, M. Lazarević, and B. Babić, “Neural network reinforcement learning for visual control of robot manipulators,” Expert Systems with Applications, vol. 40, no. 5, pp. 1721–1736, 2013.
  • [13] F. C. Ghesu, E. Krubasik, B. Georgescu, V. Singh, Y. Zheng, J. Hornegger, and D. Comaniciu, “Marginal space deep learning: Efficient architecture for volumetric image parsing,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1217–1228, 2016.
  • [14] A. Pinto, L. Rocha, and A. Paulo Moreira, “Object recognition using laser range finder and machine learning techniques,” Robotics and Computer-Integrated Manufacturing, vol. 29, no. 1, pp. 12–22, 2013.
  • [15] N. Franceschini, “Small brains, smart machines: From fly vision to robot vision and back again,” Proceedings of the IEEE, vol. 102, no. 5, pp. 751–781, 2014.
  • [16] J. Martins, J. Rodrigues, and J. du Buf, “Proto-object categorisation and local gist vision using low-level spatial features,” BioSystems, vol. 135, pp. 35–49, 2015.
  • [17] Z. Wen, D. Liu, X. Liu, L. Zhong, Y. Lv, and Y. Jia, “Deep learning based smart radar vision system for object recognition,” Journal of Ambient Intelligence and Humanized Computing, pp. 1–11, 2018.
  • [18] Z. Cui, S. S. Ge, Z. Cao, J. Yang, and H. Ren, “Analysis of Different Sparsity Methods in Constrained RBM for Sparse Representation in Cognitive Robotic Perception,” Journal of Intelligent & Robotic Systems, vol. 80, no. 1, SI, pp. S121–S132, 2015.
  • [19] G. Angeletti, B. Caputo, and T. Tommasi, “Adaptive deep learning through visual domain localization,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 7135–7142.
  • [20] A. Kadmin, K. Aziz, A. Soufhwee, S. Abd Razak, M. Salehan, N. Abdul Hadi, R. Hamzah, and W. Abd Rashid, “Performance analysis of neural network model for automated visual inspection with robotic arm controller system,” Journal of Telecommunication, Electronic and Computer Engineering, vol. 10, no. 2-2, pp. 19–22, 2018.
  • [21] J. Pei, S. Yang, and G. Mittal, “Vision based robot control using position specific artificial neural network,” in 2010 International Conference on Computational Intelligence and Communication Networks.   IEEE, 2010, pp. 110–115.
  • [22] Z. Qiu, B. Ma, and X. Zhang, “End edge feedback and rbf neural network based vibration control of flexible manipulator,” in 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO).   IEEE, 2012, pp. 1680–1685.
  • [23] Y. Karayaneva and D. Hintea, “Object recognition algorithms implemented on nao robot for children’s visual learning enhancement,” in Proceedings of the 2018 2nd International Conference on Mechatronics Systems and Control Engineering, ser. ICMSCE 2018.   ACM, 2018, pp. 86–92.
  • [24] T. Kuremoto, T. Otani, M. Obayashi, K. Kobayashi, and S. Mabu, “A hand shape instruction recognition and learning system using growing som with asymmetric neighborhood function,” Neurocomputing, vol. 188, pp. 31–41, 2016.
  • [25] D. Petković, S. Shamshirband, N. Anuar, A. Sabri, Z. Abdul Rahman, and N. Pavlović, “Input displacement neuro-fuzzy control and object recognition by compliant multi-fingered passively adaptive robotic gripper,” Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 82, no. 2, pp. 177–187, 2016.
  • [26] L. Haochen, Z. Bin, S. Xiaoyong, and Z. Yongting, “CNN-Based Model for Pose Detection of Industrial PCB,” in 2017 10th International Conference on Intelligent Computation Technology and Automation (ICICTA 2017), ser. International Conference on Intelligent Computation Technology and Automation.   IEEE, 2017, pp. 390–393.
  • [27] C. Zhihong, Z. Hebin, W. Yanbo, L. Binyan, and L. Yu, “A Vision-based Robotic Grasping System Using Deep Learning for Garbage Sorting,” in Proceedings of the 36th Chinese Control Conference (CCC 2017), ser. Chinese Control Conference.   IEEE, 2017, pp. 11 223–11 226.
  • [28] N. Murali, K. Gupta, and S. Bhanot, “Analysis of q-learning on anns for robot control using live video feed,” in 2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA).   IEEE, 2017, pp. 524–529.
  • [29] A. Prieto, F. Bellas, P. Caamaño, and R. Duro, “Automatic neural-based pattern classification of motion behaviors in autonomous robots,” Neurocomputing, vol. 75, no. 1, pp. 146–155, 2012.
  • [30] W. Pan, M. Lyu, K.-S. Hwang, M.-Y. Ju, and H. Shi, “A neuro-fuzzy visual servoing controller for an articulated manipulator,” IEEE Access, vol. 6, pp. 3346–3357, 2018.
  • [31] M. Du, J. Wang, L. Wang, H. Cao, J. Fang, Z. Gao, J. Lv, and S. Zhang, “A research on autonomous position method for mobile robot manipulator based on fusion feature,” in 2013 IEEE International Conference on Mechatronics and Automation.   IEEE, 2013, pp. 1447–1452.
  • [32] S. Chen, W. Wang, and H. Ma, “Intelligent control of arc welding dynamics during robotic welding process,” Materials Science Forum, vol. 638-642, pp. 3751–3756, 2010.
  • [33] T. Wang, Y. Yao, Y. Chen, M. Zhang, F. Tao, and H. Snoussi, “Auto-sorting system toward smart factory based on deep learning for image segmentation,” IEEE Sensors Journal, vol. 18, no. 20, pp. 8493–8501, 2018.
  • [34] T. Probst, K. Maninis, A. Chhatkuli, M. Ourak, E. V. Poorten, and L. V. Gool, “Automatic tool landmark detection for stereo vision in robot-assisted retinal surgery,” IEEE Robotics and Automation Letters, vol. 3, no. 1, pp. 612–619, 2018.
  • [35] K. Tang, F. Hu, W. Liu, Y. Deng, X. Wu, and D. Luo, “Corner detection based real-time workpiece recognition for robot manipulation,” in 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO).   IEEE, 2017, pp. 2185–2190.
  • [36] C. Gerrard, J. McCall, G. Coghill, and C. Macleod, “Exploring aspects of cell intelligence with artificial reaction networks,” Soft Computing, vol. 18, no. 10, pp. 1899–1912, 2014.
  • [37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12.   USA: Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available:
  • [38] M. Farazi, M. J. Abbas-Zadeh, and H. Moradi, “A machine vision based pistachio sorting using transferred mid-level image representation of convolutional neural network,” in 2017 10th Iranian Conference on Machine Vision and Image Processing (MVIP).   IEEE, 2017, pp. 145–148.
  • [39] J. Qin, H. Liu, G. Zhang, J. Che, and F. Sun, “Grasp Stability Prediction using Tactile Information,” in 2017 2nd International Conference on Advanced Robotics and Mechatronics (ICARM).   IEEE, 2017, pp. 498–503.
  • [40] Y. Yeboah, C. Yanguang, W. Wu, and Z. Farisi, “Semantic scene segmentation for indoor robot navigation via deep learning,” in Proceedings of the 3rd International Conference on Robotics, Control and Automation, ser. ICRCA ’18.   ACM, 2018, pp. 112–118.
  • [41] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” CoRR, vol. abs/1311.2524, 2013. [Online]. Available:
  • [42] C. Lee, H. Kim, and K. Oh, “Comparison of faster r-cnn models for object detection,” in 2016 16th International Conference on Control, Automation and Systems (ICCAS), vol. 0.   IEEE, 2016, pp. 107–110.
  • [43] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [44] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015. [Online]. Available:
  • [45] L. Fu, Y. Feng, Y. Majeed, X. Zhang, J. Zhang, M. Karkee, and Q. Zhang, “Kiwifruit detection in field images using Faster R-CNN with ZFNet,” IFAC-PapersOnLine, vol. 51, no. 17, pp. 45–50, 2018.
  • [46] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” CoRR, vol. abs/1506.02640, 2015. [Online]. Available:
  • [47] A. Llopart, O. Ravn, and N. A. Andersen, “Door and cabinet recognition using convolutional neural nets and real-time method fusion for handle detection and grasping,” in 2017 3rd International Conference on Control, Automation and Robotics (ICCAR).   IEEE, 2017, pp. 144–149.
  • [48] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: object detection via region-based fully convolutional networks,” CoRR, vol. abs/1605.06409, 2016. [Online]. Available:
  • [49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2009, pp. 248–255.
  • [50] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” CoRR, vol. abs/1311.2901, 2013. [Online]. Available:
  • [51] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [52] G. Yao, X. Liu, and T. Lei, “Action recognition with 3d convnet-gru architecture,” in Proceedings of the 3rd International Conference on Robotics, Control and Automation, ser. ICRCA ’18.   ACM, 2018, pp. 208–213.
  • [53] N. Najmaei and M. Kermani, “Applications of artificial intelligence in safe human-robot interactions,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 2, pp. 448–459, 2011.
  • [54] E. Enikov and J.-A. Escareno, “Application of sensory body schemas to path planning for micro air vehicles (mavs),” in ICINCO 2015: Proceedings of the 12th International Conference on Informatics in Control, Automation and Robotics, J. Filipe, K. Madani, O. Gusikhin, and J. Sasiadek, Eds., vol. 1.   IEEE, 2015, pp. 25–31.
  • [55] G. Olague, D. E. Hernández, E. Clemente, and M. Chan-Ley, “Evolving head tracking routines with brain programming,” IEEE Access, vol. 6, pp. 26 254–26 270, 2018.
  • [56] W. Chen, T. Qu, Y. Zhou, K. Weng, G. Wang, and G. Fu, “Door recognition and deep learning algorithm for visual based robot navigation,” in 2014 IEEE International Conference on Robotics and Biomimetics (ROBIO 2014).   IEEE, 2014, pp. 1793–1798.
  • [57] V. Peretroukhin and J. Kelly, “Dpc-net: Deep pose correction for visual localization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2424–2431, 2018.
  • [58] M. Shirzadeh, A. Amirkhani, A. Jalali, and M. Mosavi, “An indirect adaptive neural control of a visual-based quadrotor robot for pursuing a moving target,” ISA Transactions, vol. 59, pp. 290–302, 2015.
  • [59] C. Rupprecht, C. Lea, F. Tombari, N. Navab, and G. D. Hager, “Sensor Substitution for Video-based Action Recognition,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2016).   IEEE, 2016, pp. 5230–5237.
  • [60] C. Lin, C. Tsai, Y. Lai, S. Li, and C. Wong, “Visual object recognition and pose estimation based on a deep semantic segmentation network,” IEEE Sensors Journal, vol. 18, no. 22, pp. 9370–9381, 2018.
  • [61] D. Sanders, G. Lambert, J. Graham-Jones, G. Tewkesbury, S. Onuh, D. Ndzi, and C. Ross, “A robotic welding system using image processing techniques and a cad model to provide information to a multi-intelligent decision module,” Assembly Automation, vol. 30, no. 4, pp. 323–332, 2010.
  • [62] B. Calli, W. Caarls, M. Wisse, and P. Jonker, “Active vision via extremum seeking for robots in unstructured environments: Applications in object recognition and manipulation,” IEEE Transactions on Automation Science and Engineering, vol. 15, no. 4, pp. 1810–1822, 2018.
  • [63] Q. Li, Y. Qiao, and J. Yang, “Robust visual tracking based on local kernelized representation,” in 2014 IEEE International Conference on Robotics and Biomimetics (ROBIO 2014).   IEEE, 2014, pp. 2523–2528.
  • [64] M. Shaker and M. ElHelw, “Optical character recognition using deep recurrent attention model,” in Proceedings of the 2nd International Conference on Robotics, Control and Automation, ser. ICRCA ’17.   ACM, 2017, pp. 56–59.
  • [65] F. Husain, B. Dellen, and C. Torras, “Action recognition based on efficient deep feature learning in the spatio-temporal domain,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 984–991, 2016.
  • [66] S. Soyguder, “Intelligent control based on wavelet decomposition and neural network for predicting of human trajectories with a novel vision-based robotic,” Expert Systems with Applications, vol. 38, no. 11, pp. 13 994–14 000, 2011.