Results of the Survey: Failures in Robotics and Intelligent Systems

08/24/2017 · Johannes Wienke, et al. · Bielefeld University


Abstract

In January 2015 we distributed an online survey about failures in robotics and intelligent systems across robotics researchers. The aim of this survey was to find out which types of failures currently exist, what their origins are, and how systems are monitored and debugged – with a special focus on performance bugs. This report summarizes the findings of the survey.

* Research Institute for Cognition and Robotics (CoR-Lab) & Center of Excellence Cognitive Interaction Technology (CITEC), Bielefeld University, Germany. Contact: {jwienke,swrede}@techfak.uni-bielefeld.de

1 Introduction

Despite strong dependability requirements in actual application scenarios, robotics systems are still known to be error-prone, with regular failures. However, few publications have systematically analyzed this situation. We have therefore carried out a survey to assess the current situation in research robotics (parts of the results have previously appeared in Wienke2016). The aim of this survey was to collect the impressions of robotics developers on the reliability of systems, the reasons for failures, and the tools used to ensure successful operation and to debug systems in case of failures. The survey specifically focused on software issues and software engineering aspects. Apart from general bugs, performance bugs were specifically addressed to understand their impact on robotics systems and to determine how they differ from other bugs. A considerable amount of work in this direction exists in other computing domains such as high-performance computing or cloud services [Gunawi2014, Jin2012, Zaman2012]. In robotics, however, such work is missing. To our knowledge, only Steinbauer2013 presents a systematic study on general faults in robotics systems, but without a specific focus on performance aspects.

Our survey was implemented as an online questionnaire (following methodological advice from GonzalezBanales2007) which was distributed among robotics researchers via the well-known mailing lists euron-dist (https://lists.iais.fraunhofer.de/sympa/info/euron-dist) and robotics-worldwide (http://duerer.usc.edu/mailman/listinfo.cgi/robotics-worldwide) as well as more focused mailing lists. The detailed structure of the survey can be found in Appendix A. Please refer to this appendix for details on the phrasing of questions and the permitted answers. Results presented in the following sections are linked to the respective questions of the survey.

In total, complete submissions and incomplete ones were collected (incomplete submissions also include visitors who only opened the welcome page and then left). Of the participants, were researchers or PhD candidates at universities, regular students, and from an industrial context (A.12.1). On average, participants had years of experience in robotics (sd: , A.12.2). Participants spent their active development time primarily on software architecture and integration as well as component development, despite individual differences visible in the broad range of answers (Figure 1, A.12.3). Other activities, such as hardware or driver development, are pursued only for a limited amount of time.

Figure 1: Development time spent by survey participants on different development aspects of robotics systems. Individual answers have been normalized to sum up to . Inside the violins, a box plot is shown with the white dot representing the median and the red dot the mean value. Numbers above the plot express the sample size, which differs as answers were optional.
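Because the raw development-time answers in question A.12.3 were not forced to sum to 100, they must be normalized per participant before being compared in Figure 1. A minimal sketch of this normalization, assuming skipped answers are simply ignored (the function name and the example answer are hypothetical, not taken from the survey's analysis code):

```python
def normalize_shares(raw_percentages):
    """Normalize one participant's raw percentage answers so they sum to 1.

    raw_percentages: mapping from development domain (codes from A.12.3)
    to the raw percent value the participant entered. Skipped answers are
    represented as None and are ignored.
    """
    answered = {k: v for k, v in raw_percentages.items() if v is not None}
    total = sum(answered.values())
    if total == 0:
        return {}
    return {k: v / total for k, v in answered.items()}

# Hypothetical participant answer; the raw values sum to 115, not 100.
answer = {"HW": 10, "DRV": 5, "COMP": 40, "ARCH": 60, "COMM": None, "ANY": None}
shares = normalize_shares(answer)
# shares["ARCH"] == 60 / 115, and all shares sum to 1.0
```

Per-participant normalization preserves each individual's relative emphasis while making answers comparable across participants.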

2 Tool usage

A first set of questions assessed which software tools are used to monitor and debug robotics systems in general. For different types of tools, participants could rate on a 5-point scale from 0 (Never) to 4 (Always) how often the respective type of tool is used during the usual development and operation of their systems. For general monitoring tools (A.2.1), the answers are depicted in Figure 2.

Figure 2: Usage frequency for different categories of monitoring tools. For each category, the answer counts are displayed as a histogram and the grey point marks the median value. The categories are ordered by median and – if equal – mean values. Numbers above the plot express the sample size, which might differ as answers were optional.
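The ordering rule used for Figure 2, by median rating and, on ties, by mean, can be sketched as follows. The category codes are those of question A.2.1, but the ratings shown are hypothetical illustrations, not the actual survey counts:

```python
from statistics import mean, median

# Hypothetical ratings per tool category on the 0 (Never) to 4 (Always)
# scale used in question A.2.1.
ratings = {
    "VIS": [4, 3, 4, 2, 3],
    "OS": [3, 3, 2, 4, 1],
    "LOG": [3, 2, 3, 3, 3],
    "FD": [0, 1, 0, 2, 0],
}

# Order categories by median and, if medians are equal, by mean
# (most frequently used first).
ordered = sorted(
    ratings,
    key=lambda c: (median(ratings[c]), mean(ratings[c])),
    reverse=True,
)
print(ordered)  # ['VIS', 'LOG', 'OS', 'FD']
```

Here VIS, LOG, and OS all share the median 3, so the mean breaks the tie, exactly the secondary criterion described in the figure caption.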

According to the developers’ opinions, special-purpose visualization tools like rviz or debug windows for image processing operations are most frequently used to monitor systems. These are followed by low-level operating system tools like ps or htop and by log files. Tools related to distributed systems, such as utilities of the IPC mechanism, form the last category of tools that is regularly used. Remote desktop connections are used only sometimes. In contrast, autonomous fault detection methods and special dashboards for visualizing system metrics are only rarely used, despite the fact that such tools are well-established for operating large-scale systems with high dependability requirements.

A second question regarding monitoring tools asked for the exact names of the tools used (A.2.2). The answers to this question are summarized in subsection B.1. The most frequently mentioned category of tools matched the previous question (visualization tools, most notably rviz). These tools are followed by middleware-related tools, most notably the ROS command line tools and rqt, as well as operating system tools, with htop and ps being the most frequently mentioned examples. Finally, manual log reading, remote access tools, custom mission-specific tools, and generic network monitoring tools like wireshark are used. Additionally, one participant explicitly mentioned hardware indicators like LEDs for this purpose.

Regarding tools used to debug robotics systems (A.3.1), participants mostly use basic methods like printf or log files as well as simulation (Figure 3).

Figure 3: Usage frequencies for different categories of debugging tools and methods.

General-purpose and readily available debuggers are less frequently used than these basic methods. Unit testing seems to be only partially practiced and accepted in the robotics and intelligent systems community.

The actual tools being used are summarized in subsection B.2 as a result of question A.3.2. Debuggers represent the most frequently mentioned category of tools, with gdb leading this category. Another frequently used debugging tool is valgrind for checking memory accesses. Besides printf debugging, other categories of used tools comprise middleware utilities, simulation and visualization (with gazebo being the most frequently mentioned software), and unit testing.

3 Bugs and their origins

In a second set of questions we addressed the reasons for and the effects of bugs in robotics systems. As actual numbers for failure rates in robotics systems are rarely available, one question asked participants for the mean time between failures (MTBF) they have observed in the systems they are working with (A.4.1). As visible in Figure 4, the answers form a bimodal distribution: one part of the participants rates the MTBF of their systems to be within the range of minutes to a few hours, whereas others indicate MTBF values in the range of days to weeks.

Figure 4: Participant answers for the observed MTBF in their systems.

One can think of multiple explanations for these diverging replies:

  • The systems participants have been working with are different in nature and some are closer to production systems.

  • Answers with higher MTBF include the system’s idle time in the calculation, despite an explicit indication in the explanation of the question that the operation time is the basis for this number.

  • Differences can be explained by the way people use debugging or monitoring tools in their systems. However, no significant relations could be found in the results.

As no data is available to validate the first two hypotheses and the third cannot be proven using the survey results, the effective reason for the bimodal distribution remains unknown.
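The MTBF definition used in question A.4.1, total operation time divided by the number of failures with idle time excluded, can be made concrete with a small sketch. The function and the log values are hypothetical illustrations of the definition, not survey data:

```python
def mtbf_hours(operation_periods, failure_count):
    """Mean time between failures, based on operation time only.

    operation_periods: list of (start, end) timestamps in hours covering
    only the time the system was actually operating; idle time is
    excluded, as the explanation of question A.4.1 requested.
    """
    total_operation = sum(end - start for start, end in operation_periods)
    if failure_count == 0:
        return float("inf")
    return total_operation / failure_count

# Hypothetical log: three runs totalling 9 hours of operation, 3 failures.
runs = [(0.0, 2.0), (5.0, 9.0), (12.0, 15.0)]
print(mtbf_hours(runs, 3))  # 3.0 hours of operation per failure
```

Including the idle gaps between the runs (hours 2 to 5 and 9 to 12) would inflate the result to 5.0 hours, which illustrates the second hypothesis above: answers computed over wall-clock time instead of operation time would report a systematically higher MTBF.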

To understand in general why systems fail, participants were asked to rate how often different bug categories were the root cause of system failures (A.4.2). The categories were selected based on related survey work from robotics and other domains [Steinbauer2013, Gunawi2014, Jin2012, McConnell2004].

Figure 5: Rating of different bug categories being the reason for system failures.

Figure 5 displays the results for this question. Hardware bugs represent the most frequent category, followed by a set of categories representing high-level issues (configuration, coordination, deployment, IPC) as well as general logic bugs. Most of the high-level issues seem to be technical problems rather than specification problems, because specification issues only rarely cause failures (median).

Apart from the aforementioned categories, participants could provide further causes in text form (A.4.3). After removing items that relate to categories already presented in the previous question, the answers can be summarized as a) environment complexity and changes (8 mentions), b) low-level driver and operating system failures (3 mentions), c) hardware configuration management (1 mention), and d) hardware limitations (1 mention). Subsection B.3 shows the answers in detail as well as how the categories were assigned. In the survey, we explicitly excluded the environment as an origin of system failures because it does not represent a real defect in any component of the system. However, the results still show how important the discrepancy between the intended usage scenarios of systems and their capabilities in real application areas is in robotics and intelligent systems.

4 Performance bugs

In order to understand performance bugs in robotics and intelligent systems, a dedicated set of questions was added to the survey. First, participants were asked for the percentage of bugs that affected resource utilization (A.5.1). On average, (sd ) of all bugs affected system resources.

Figure 6: Frequency of bug effects on system resources.

Participants also had to rate how frequently different system resources were affected by performance bugs (A.6.1). The results are visualized in Figure 6. Memory, CPU, and network bandwidth are the most frequently affected resources. The prominence of network bandwidth can be explained by the distributed nature of many current robotics systems. These three primarily affected resources are followed by disk space. Countable resources like processes or network connections are only rarely affected. A question about further affected resources (A.6.2) yielded IPC-related virtual resources like event queues as well as I/O bandwidth in addition to the previous categories (subsection B.4).

To get an impression of common causes of performance issues in robotics and intelligent systems, participants were asked to rate how frequently different categories of root causes were the origin of performance bugs in their systems (A.7.1). The categories are the ones of the previous general question on bug origins (A.4.2), extended with two items specifically targeting performance bugs: skippable computation, i.e. unnecessary computation that does not affect the functional outcomes (based on the results in Gunawi2014), and algorithmic choices.

Figure 7: Frequency of reasons for performance bugs.

Figure 7 depicts the results for this question. The most frequent reason for performance bugs is the choice of inappropriate algorithms, followed by resource leaks and unnecessary computations. Interestingly, configuration issues are also among the frequent causes of performance bugs. When comparing the answers to this question with the answers for the origins of general bugs (A.4.2), most categories, apart from resource leaks, are less likely origins for performance bugs than for general bugs (Table 1). Interestingly, the frequency of threading issues does not differ significantly between performance bugs and general bugs.

Category Change
Communication
Configuration **
Coordination ****
Deployment ****
Error handling *
Hardware ****
Logic *
Resource leaks **
Specification *
Threading
Others
Table 1: Changes to the mean ratings for different categories being the origins of failures when comparing performance bugs to general bugs. A change of would indicate a shift from one answer category to the next higher one. Significances have been computed using a Mann-Whitney U test.
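The significance computation behind Table 1 can be sketched in pure Python. The report does not state which implementation was used, so this is only an illustration of the Mann-Whitney U test with the normal approximation; note that it applies no tie correction to the variance, so for heavily tied Likert-scale data a library routine (e.g. scipy.stats.mannwhitneyu) would be preferable in practice:

```python
import math

def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U, p) for the first sample. Pure-Python sketch without
    tie correction; not the report's actual implementation.
    """
    n1, n2 = len(xs), len(ys)
    combined = sorted((value, idx) for idx, value in enumerate(xs + ys))
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < len(combined):
        # Find the extent of the tie group starting at position i.
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j + 1) / 2.0  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[combined[k][1]] = avg_rank
        i = j
    rank_sum_x = sum(ranks[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    # Two-sided p-value from the standard normal distribution.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return u, p

# Hypothetical per-participant ratings (0-4 scale) for one category,
# once for general bugs and once for performance bugs.
general = [3, 3, 4, 2, 4, 3, 4, 4, 3, 4]
perf = [1, 2, 2, 1, 3, 2, 1, 2, 2, 1]
u, p = mann_whitney_u(general, perf)
```

The stars in Table 1 presumably encode p-value thresholds; the exact thresholds are not stated in the text above.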

5 Bug examples

Finally, participants were asked to provide detailed descriptions of bugs they had observed in their systems. Two questions in this direction were asked, each with four sub-answers explicitly requesting a) the visible effects on the system, b) the underlying defect causing the bug, c) the steps performed to debug the problem, and d) the affected system resources. These questions were added to the survey to get an impression of the actual problems current robotics developers are facing with their systems and how these problems are addressed.

The first of these questions asked for a description of any type of bug that participants remembered from their systems and that is particularly representative of the kinds of bugs frequently observed (A.9). In total, answers were submitted for this question, with a complete listing of the answers available in subsection B.5. Most notably, of the answers () were related to basic programming issues like segmentation faults or memory leaks, for instance caused by C/C++ peculiarities. answers () described an issue that can be classified as a performance bug. Issues related to IPC usage or infrastructure were mentioned in answers (). Also, answers indicated bugs related to the coordination of the system (for instance, loops in the controlling state machines), of which answers were related to unexpected environment situations. Additionally, answers were related to timing aspects, and another answers indicated that a bug was never or only accidentally understood and solved. Please refer to the tagging in subsection B.5 for details.

A second question asked participants to describe the most interesting bug they could remember, in the same format. This was done to get an impression of which extreme types of bugs are possible in robotics systems. participants answered this question and their answers are listed in subsection B.6. In line with the previous question, programming problems related to low-level issues again represent the most frequently mentioned type of bug, with answers (). Furthermore, answers () described bugs caused by driver or operating system problems.

The answers to both questions indicate that memory-management-related programming issues are often debugged using established tools like gdb or valgrind, however with varying success. One answer specifically mentioned that these tools are often not helpful for distributed systems.

6 Result interpretation

The presented survey results show that there is still great potential for improvements in the dependability of robotics systems. With MTBF values in the range of hours, a major part of the surveyed systems is far from being reliable enough for longer-term operations, and work in this direction is needed, even if the majority of the developers reached with this survey are working on research systems, which rarely end up in production use cases. Nevertheless, an appropriate level of dependability is also required in this domain to allow efficient experimentation and reliable studies. Still, monitoring tools that are specifically geared towards operating systems with high availability and reliability, like fault detection or dashboards for a quick manual inspection of system states, are only rarely applied in robotics. The survey does not provide answers as to why this is the case. Reasons could include the overhead of deploying such approaches, which might not be feasible in smaller, relatively short-lived systems, or the lack of knowledge about such approaches, especially as many robotics researchers do not have a strong background in maintaining large-scale systems. Therefore, improving these approaches and making them more easily usable is one promising direction to foster their application.

With respect to system failures and their origins, the quantitative results from this survey indicate that hardware faults are among the most frequent causes of failure. This contradicts the findings from Steinbauer2013, which might be caused by the wider range of applications covered in this survey. Generally, system failures seem to originate more frequently from bugs occurring at high levels of abstraction, like coordination, deployment, or configuration, and less often from component-internal issues like crashes. Still, a majority of the requested descriptions of representative bugs actually dealt with such component-internal issues. One reason for this might be that, while frequently being observed, such component-related issues are often noticed immediately and are therefore perceived as part of the development work and not as system failures. In any case, these issues are strikingly often caused by basic programming mistakes, often related to the manual memory management and syntax idiosyncrasies of C/C++. A general shift in robotics towards less error-prone languages with automatic memory management, clearer syntax, and better debugging abilities has the potential to avoid a large share of the bugs currently found during development and operations.

With respect to the performance aspects, one quarter of the bugs found in current systems can be classified as performance bugs. In the descriptions of representative bugs, even more than one third of the answers were performance-related. Therefore, specifically addressing such issues is not merely a niche but instead provides the potential to avoid a major share of failures in the future. The survey has indicated that performance bugs are significantly less often caused by high-level aspects like coordination or deployment, and also less often by hardware issues. Therefore, addressing them at the component level should already result in reasonable improvements.

Generally, systems are often debugged using log files and printf instructions specifically placed for debugging. Participants have indicated that debuggers and memory checkers like valgrind are used less frequently. This is probably caused by the fact that such tools cannot be used for all kinds of problems. The detailed bug reports still show that these tools are frequently used to debug programming issues at the component level. Participants have also indicated that these tools cannot easily be used for problems related to the distributed nature of current robots. Further work on debugging infrastructure that respects this fact might improve the situation. Finally, simulation seems to be an important tool for debugging robotics systems, and explicit support for simulation-based testing and debugging might provide one future avenue towards more dependable robotics systems.

7 Threats to validity

The survey results represent the opinions and recalled impressions of the interviewed developers, not objective measurements of the real effects. As such, the results may be biased. However, general tendencies derived from the results should still be valid, as a complete misassessment across all participants is unlikely.

Because the survey was distributed primarily via research-related mailing lists, the results are only representative for systems developed in this context and cannot be generalized to industrial, production-ready systems.

The categories used in the questions regarding the frequencies of bug origins may have been hard to distinguish from each other in some cases. Therefore, ratings might be blurred between multiple categories due to imprecise definitions. Where possible, the conclusions drawn from the survey have been based on a grouping of multiple categories to mitigate this effect.

Appendix A Questionnaire structure

The following sections represent the structure of the online survey. This is a direct export of the survey structure without modifications.

a.1 Introduction

Thank you very much for taking the time to participate in this survey. This survey is part of my PhD project, pursued at Bielefeld University, with a focus on exploiting knowledge about computational resource consumption in robotics and intelligent systems. Therefore, in order to participate, you should be or have been involved in the development or maintenance of such systems. In case you have worked or are working with multiple systems in parallel, please provide answers on the combination of all these systems.

Participating in this survey should not take longer than 15 minutes. The survey consists of several questions and you are free to skip questions in case you do not want to answer them. Moreover, you can go back and forth between the questions you have already answered in order to revise them. All data you enter in this survey will be anonymized.

Johannes Wienke

jwienke [at] techfak.uni-bielefeld.de

a.2 Monitoring Tools

The first part of this survey addresses how robotics and intelligent systems are monitored at runtime in order to assess their health and understand the ongoing operations. Monitoring includes the ongoing collection of runtime data, the observation of operations as well as the assessment of system health.

a.2.1 How often do you use the following kinds of tools to monitor the operation of running systems? 

Rate individually for:

  • Operating system command line tools e.g. htop, iotop, ps (OS)

  • Logfiles (LOG)

  • Dashboard views e.g. munin, graphite, nagios (DASH)

  • Inter-process communication introspection e.g. middleware logger (IPC)

  • Autonomous fault or anomaly detectors (FD)

  • Special-purpose visualizations e.g. rviz, image processing debug windows (VIS)

  • Remote desktop connections e.g. VNC, rdesktop (RDP)

  • Others (OTH)

Answer type

Fixed choice

  • Never (0)

  • Rarely (1)

  • Sometimes (2)

  • Regularly (3)

  • Always (4)

a.2.2 Please name the concrete tools that you use for monitoring running systems.

Separate different tools with a comma.

Answer type

longtext (length: 40)

a.3 Debugging Tools

This part of the survey addresses tools that are used in order to debug systems in case a failure has been detected. Debugging is the process of identifying the root cause of an observed abnormal system behavior.

a.3.1 How often do you use the following tools for debugging?

Rate individually for:

  • Console output e.g. printf, cout (PRNT)

  • Logfiles (LOG)

  • Debuggers e.g. gdb, pdb (DBG)

  • Profilers e.g. kcachegrind, callgrind (PROF)

  • Memory checkers e.g. valgrind (MEMC)

  • System call introspection e.g. strace, systemtap (SYSC)

  • Inter-process communication introspection e.g. middleware logger (IPC)

  • Network analyzers e.g. wireshark (NWAN)

  • Automated testing e.g. unit tests (TEST)

  • Simulation (SIM)

  • Others (OTH)

Answer type

Fixed choice

  • Never (0)

  • Rarely (1)

  • Sometimes (2)

  • Regularly (3)

  • Always (4)

a.3.2 Please name the concrete tools that you use for debugging.

Separate different tools with a comma.

Answer type

longtext (length: 40)

a.4 General Failure Assessment

Please provide information about failures you have observed in the systems you are working with.

a.4.1 Averaging over the systems you have been working with, what do you think is the mean time between failures for these systems?

The mean time between failures is the average amount of operation time of a system until a failure occurs.

Answer type

Fixed choice

  • < 0.5 hours (0)

  • < 1 hour (1)

  • < 6 hours (2)

  • < 12 hours (3)

  • < 1 week (4)

  • > 1 week (5)

a.4.2 Please indicate how often the following items were the root cause for system failures that you know about.

Rate individually for:

  • Hardware issues (HW)

  • System coordination e.g. state machine (COORD)

  • Deployment (DEPL)

  • Configuration errors e.g. component configuration (CONF)

  • Logic errors (LOGIC)

  • Threading and synchronization (THRD)

  • Wrong error handling code (ERR)

  • Resource leaks or starvation e.g. RAM full, CPU overloaded (LEAK)

  • Inter-process communication failures e.g. dropped connection, protocol error (COMM)

  • Specification error / mismatch e.g. component receives other inputs than specified (SPEC)

  • Others (OTH)

Answer type

Fixed choice

  • Never (0)

  • Rarely (1)

  • Sometimes (2)

  • Regularly (3)

  • Very often (4)

a.4.3 Which other classes of root causes for failures did you observe? 

Separate items by comma.

Answer type

text (length: 24)

a.5 Resource-Related Bugs

The following questions deal with the consumption of computational resources like CPU, memory, disk, network etc.

a.5.1 How many of the bugs you have observed or know about had an impact on computational resources, e.g. by consuming more or less of these resources than expected?

Please approximate the amount with a percentage value of the total number of bugs you can remember. A quick guess is ok here.

Answer type

integer (length: 10)

a.6 Impact on Computational Resources

The following questions deal with the consumption of computational resources like CPU, memory, disk, network etc.

a.6.1 Please indicate how often the following computational resources were affected by bugs you have observed.

A computational resource was affected by a bug in case its consumption was higher or lower than expected, e.g. in comparable or non-faulty situations.

Rate individually for:

  • CPU (CPU)

  • Working memory (MEM)

  • Hard disc space (HDD)

  • Network bandwidth (NET)

  • Number of network connections (CON)

  • Number of processes and threads (PROC)

  • Number of file descriptors (DESC)

Answer type

Fixed choice

  • Never (0)

  • Rarely (1)

  • Sometimes (2)

  • Regularly (3)

  • Very often (4)

a.6.2 If there are other computational resources that have been affected by bugs, please name these.

Answer type

longtext (length: 40)

a.7 Performance Bugs

The following question specifically addresses performance bugs. A system failure or bug is a performance bug in case it is visible either through degradation in the observed performance of the system (e.g. delayed or very slow reactions) or through an unexpected consumption of computational resources like CPU, memory, disk, network etc.

a.7.1 Please rate how often the following items were the root causes for performance bugs you have observed.

Rate individually for:

  • Hardware issues (HW)

  • System coordination e.g. state machine (COORD)

  • Deployment (DEPL)

  • Configuration errors e.g. component configuration (CONF)

  • Logic errors (LOGIC)

  • Threading and synchronization (THRD)

  • Wrong error handling code (ERR)

  • Unnecessary or skippable computation (SKIP)

  • Resource leaks or starvation e.g. RAM full, CPU overloaded (LEAK)

  • Inter-process communication failures e.g. dropped connection, protocol error (COMM)

  • Specification error / mismatch (SPEC)

  • Algorithm choice (ALGO)

  • Others (OTH)

Answer type

Fixed choice

  • Never (0)

  • Rarely (1)

  • Sometimes (2)

  • Regularly (3)

  • Always (4)

a.8 Case Studies

For the following questions, please provide descriptions of any kind of bug that you remember.

a.8.1 Thinking about the systems you have worked with so far, is there a bug that you remember which happened several times or which is representative for a class of comparable bugs?

Answer type

Fixed choice

  • Yes (Y)

  • No (N)

a.9 Case Study: Representative Bug

Please briefly describe the representative bug that you remember.

a.9.1 How was the representative bug noticed?

Please explain the observations that were made and how they diverged from the expectations.

Answer type

longtext (length: 40)

a.9.2 What was the root cause for the bug?

Please explain which component(s) of the system failed and in which way.

Answer type

longtext (length: 40)

a.9.3 Which steps were necessary to analyze and debug the problem?

Please include the information sources that had to be observed and the tools that got applied.

Answer type

longtext (length: 40)

a.9.4 Which computational resources were affected by the bug?

Computational resources include CPU, working memory, hard disc space, network bandwidth & connections, number of processes and threads, number of file descriptors etc.

Answer type

longtext (length: 40)

a.10 Case Studies

For the following questions, please describe any kind of bug that you remember.

a.10.1 Thinking about the systems you have worked with so far, is there a bug that you remember which was particularly interesting for you?

Answer type

Fixed choice

  • Yes (Y)

  • No (N)

a.11 Case Study: Interesting Bug

Please describe briefly the most interesting bug that you remember from one of the systems you have been working with.

a.11.1 How was the interesting bug noticed?

Please explain the observations that were made and how they diverged from the expectations.

Answer type

longtext (length: 40)

a.11.2 What was the root cause for the bug?

Please explain which component(s) of the system failed and in which way.

Answer type

longtext (length: 40)

a.11.3 Which steps were necessary to analyze and debug the problem?

Please include the information sources that had to be observed and the tools that got applied.

Answer type

longtext (length: 40)

a.11.4 Which computational resources were affected by the bug?

Computational resources include CPU, working memory, hard disc space, network bandwidth & connections, number of processes and threads, number of file descriptors etc.

Answer type

longtext (length: 40)

a.12 Personal Information

As a final step, please provide some information about your experience with robotics and intelligent systems development.

a.12.1 In which context do you develop robotics or intelligent systems?

Answer type

Fixed choice

  • Student (excluding PhD students) (STUD)

  • Researcher at a university (PhD students, scientific staff) (RES)

  • Industry (IND)

  • Other (OTHER)

a.12.2 How many years of experience in robotics and intelligent systems development do you have?

Answer type

integer (length: 10)

a.12.3 How much of your time do you spend on developing in the following domains?

Please indicate in percent of total development time. Numbers may not sum up to 100.

Rate individually for:

  • Hardware (HW)

  • Drivers (DRV)

  • Functional components (COMP)

  • Inter-process communication infrastructure (COMM)

  • Software architecture and integration (ARCH)

  • Other (ANY)

Answer type

integer (length: 3) Hint: Percent of development time

a.13 Final remarks

Thank you very much for participating in this survey and thereby supporting my research.

In case you have further questions regarding this survey or the research topic in general, please contact me via email.

Johannes Wienke

jwienke [at] techfak.uni-bielefeld.de

Appendix B Result details

b.1 Used monitoring tools

The following table presents the results for question A.2.2. The free text answers have been grouped into categories (caption lines in the table). For each answer that included at least one item belonging to a category, the counter of that category was incremented. Hence, the counts represent the number of answers that mentioned a category at least once. Additionally, for each category, representative entries have been counted the same way. Some of the answers include uncommon or special-purpose tools or techniques. These have not been counted individually and, hence, are only visible in the category counts.

Tool Answer count
Visualization
          rviz 22
          gnuplot 2
          matplotlib 1
Middleware Tools (represents entries that are specific to the middleware-related aspects of an ecosystem; for instance, ROS_DEBUG has not been counted here but instead belongs to the “Manual log reading” category)
          ROS command line 14
          rqt 5
          RSB 4
Basic OS Tools
          htop 12
          ps 7
          top 7
          acpi 1
          du 1
          free 1
          lsof 1
          procman (gnome) 1
          pstree 1
          screen 1
          tmux 1
Manual Log Reading
Remote Access
          ssh 5
          putty 1
          rdesktop 1
          vnc 1
Custom Mission-Specific
Generic Network
          netstat 1
          tcpdump 1
          wireshark 1
Hardware Signals

b.2 Used debugging tools

The following table presents the results for question A.3.2. The free text answers have been grouped into categories (highlighted lines in the table). For each answer that included at least one item belonging to a category, the counter of each category was incremented. Hence, the counts represent the number of answers that mentioned a category at least once. Additionally, for each category, representative entries have been counted the same way. Some of the answers include uncommon or special-purpose tools or techniques. These have not been counted individually and, hence, are only visible in the category counts.

Tool Answer count
Debuggers
          gdb 17
          pdb 3
          VS debugger 2
          ddd 1
          jdb 1
Runtime Introspection
          valgrind 12
          callgrind 2
          kcachegrind 1
          strace 1
Generic
          printf, cout, 14
          logfiles 4
          git 1
Middleware Tools (represents entries that are specific to the middleware-related aspects of an ecosystem)
          ROS command line 5
          RQT 2
          RSB 2
Simulation & Visualization
          gazebo 4
          rviz 1
          Vortex 1
          stage 1
Functional Testing
          gtest 2
          junit 2
          cppunit 1
          rostest 1
IDEs
          Qt Creator 2
          KDevelop 1
          LabVIEW 1
          Matlab 1
          Visual Studio 1
Generic Network
          wireshark 2
          tcpdump 1
Dynamic Analysis
          Daikon 1

b.3 Summarization of free form bug origins

The following table presents all answers to question A.4.3. Individual answers have been split into distinct aspects. These aspects have either been assigned to an existing answer category from question A.4.2 or to new categories.


Answer Category
Existing New
unknown driver init problems (starting a driver works only after a second trial) Driver & OS
environment noise (lighting condition variation, sound conditions in speech recognition) hard to adapt to every possible variation Environment
Insufficient Component Specifications Specification
Changed maps/environments Environment
lossy WiFi connections Hardware
unreliable hardware Hardware
in Field robotics, the environment is the first enemy… Environment
Environment changes Environment
sensor failures Hardware
unprofessional users Environment
Operation System / Firmware failure Driver & OS
network too slow Hardware
Loose wires Hardware
other researchers changing the robot configuration Config mgmt
coding bugs Logic
algorithm limitations Environment
sensor limitations Hardware lim
perception limitations Environment
wrong usage Environment
Failures in RT OS timing guarantees Driver & OS

b.4 Summarization of other resources affected by bugs

The following table presents the free text results of question A.6.2. Answers have been split into distinct aspects and these aspects have either been assigned to one of the existing categories from question A.6.1 or – if these did not match – new categories have been created to capture the answers. Parts of answers that did not represent system resources with a capacity that can be utilized have been ignored. These are marked as strikethrough text.


Answer Resource
Existing New
USB bandwidth and or stability IO bandwidth
locks on files/devices/resources File descriptors
permissions
file system integrity
interprocess communication queues, queue overflow IPC
Files (devices) left open. File descriptors
Wrong operation in GPU leads to restart.
Memory leak – not sure why or where Memory

b.5 Representative bugs

The following subsections present answers to the questions for representative bugs (A.9). For the analysis, answers have been tagged for various aspects and types of bugs being mentioned in them. Raw submission texts have been reformatted to match the document and typographical and grammatical errors have been corrected.

b.5.1 Representative bug 8

Observation

computer system unresponsive

Cause

memory leak

Debugging
  • find faulty process

  • analyze memory usage (valgrind/gdb)

  • repair code

Affected resources

main memory

Tags

basic programming issue; performance bug
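The pattern behind this bug – per-cycle allocations that are never released – can be sketched in a few lines of C. The function names and the averaging task are hypothetical, chosen only to put the leak next to its fix; running such a process under valgrind makes the difference visible immediately:

```c
#include <stdlib.h>
#include <string.h>

/* Leaky variant: allocates a scratch buffer on every call and never
 * releases it, so resident memory grows with each processing cycle
 * until the computer becomes unresponsive. */
double leaky_mean(const double *in, size_t n) {
    double *scratch = malloc(n * sizeof *scratch);
    memcpy(scratch, in, n * sizeof *scratch);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += scratch[i];
    return sum / (double)n;            /* scratch is leaked here */
}

/* Fixed variant: every allocation is paired with a free. */
double fixed_mean(const double *in, size_t n) {
    double *scratch = malloc(n * sizeof *scratch);
    memcpy(scratch, in, n * sizeof *scratch);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += scratch[i];
    free(scratch);                     /* allocation released */
    return sum / (double)n;
}
```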

b.5.2 Representative bug 10

Observation

System got stuck in infinite loop.

Cause

Unexpected infinite loop in the behaviour (state machine). Noise in the data caused the system to infinitely switch between two states.

Debugging
  1. Detection of which states were affected.

  2. Detection of the responsible subsystem(s).

  3. Detection of the responsible functions.

  4. Recording data that caused the problem.

  5. Analyzing the data and searching for unexpected situations.

  6. Modification of the system in order to handle such situations correctly.

Affected resources

CPU

Tags

coordination; environment-related
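The noise-driven oscillation between two states can be reproduced with a minimal C sketch. The state names, thresholds, and hysteresis band below are hypothetical; they illustrate a common mitigation (hysteresis) rather than the respondent's actual system:

```c
/* A state machine switching on a noisy scalar reading. A single shared
 * threshold flips the state on every noise spike around the boundary;
 * a hysteresis band only switches once the reading clearly leaves the
 * current state's region, suppressing the infinite back-and-forth. */
enum state { NEAR, FAR };

/* Naive: one threshold -> oscillates when the reading hovers near 1.0 */
enum state step_naive(enum state s, double reading) {
    (void)s;
    return reading < 1.0 ? NEAR : FAR;
}

/* Hysteresis: switch to NEAR below 0.8, to FAR above 1.2, else keep state */
enum state step_hysteresis(enum state s, double reading) {
    if (reading < 0.8) return NEAR;
    if (reading > 1.2) return FAR;
    return s;                /* within the band: no state change */
}
```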

b.5.3 Representative bug 14

Observation

high latency in spread communication

Cause

wrong spread configuration / wrong deployment of components

Debugging

trial & error: reconfiguration, stopping and starting components, monitoring of latency via rsb-tools

Affected resources

network-latency

Tags

communication; performance bug

b.5.4 Representative bug 21

Observation

Incorrect response of the overall system to the requested task. The system thinks it did not grasp an object although it did, and restarts the grasping operation or cancels the task due to the missing object in hand.

Cause

State machine design and/or logic error and/or untriggered event due to a sensor not triggering as expected (hardware) or too much noise (environment noise). The root cause is often a case not being handled correctly in a big system with a lot of sensors and possible cases.

Debugging

event logger analysis over XCF XML data, unit test of single sensor output to see noise level or false positives.

Affected resources

Hardware (noise in the sensor)

Tags

coordination; environment-related

b.5.5 Representative bug 26

Observation

Segfault

Cause

Segfault

Debugging

gdb

Affected resources
Tags

basic programming issue

b.5.6 Representative bug 30

Observation

Unexpected overall behavior.

Cause

Wrong logic in the abstract level.

Debugging

Run simulation in the abstract layer.

Affected resources

None.

Tags

coordination

b.5.7 Representative bug 41

Observation

Failure to observe expected high-level output. More specifically, a map that was being built was lacking data.

Cause

Congested wireless network connection. The amount of data could not be transmitted within the expected time frame.

Debugging

Logging of signals between modules on the deployed system to verify data was being produced and transmitted correctly, and logging of data received.

Affected resources

Network connection

Tags

communication; timing

b.5.8 Representative bug 42

Observation

Because of timing mismatch the planning system was working with outdated data.

Cause

Non-event based data transfer.

Debugging

Going through multiple log files in parallel to find the data that was transmitted in comparison to the data that was used in the computation.

Affected resources

None. Mostly a mismatch between specification and performed actions.

Tags

coordination; timing

b.5.9 Representative bug 46

Observation

Navigation did not work correctly

Cause

Algorithmic errors

Debugging

Dig in and verify steps in the algorithm

Affected resources
Tags

b.5.10 Representative bug 60

Observation

delays in robots command execution

Cause

supervision and management part of the framework

Debugging

benchmarking, profiling

Affected resources
Tags

performance bug

b.5.11 Representative bug 69

Observation

memory leak

Cause

resource management, dangling pointers

Debugging

check the object/resource timeline; usually start with resources that are created often and handed over regularly and therefore might have unclear ownership

Affected resources

memory, CPU

Tags

basic programming issue; performance bug

b.5.12 Representative bug 70

Observation

constantly increasing memory consumption

Cause

Memory leaks

Debugging

Running the code in offline mode with externally provided inputs and observing the memory consumption pattern. Tools like valgrind or a system process monitor help to discover the problem.

Affected resources

Working memory

Tags

basic programming issue; performance bug

b.5.13 Representative bug 76

Observation

Visually in system operation. In one case, elements within a graphical display were misdrawn. In another, command codes were misinterpreted, resulting in incorrect system operation.

Cause

Variable type mismatch (integer vs. unsigned integer), such as when a number intended to be a signed integer is interpreted as an unsigned integer by another subsystem.

Debugging

Debugger using single step and memory access.

Affected resources

None

Tags

basic programming issue; performance bug
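The signed/unsigned reinterpretation described above can be shown in a few lines of C. The decoder name and the 16-bit width are assumptions for illustration only:

```c
#include <stdint.h>

/* Hypothetical command decoder: one subsystem sends a signed 16-bit
 * command code, another reads the same bytes as unsigned. A negative
 * code such as -2 then shows up as 65534 (two's complement) and is
 * misinterpreted downstream. */
uint16_t as_unsigned(int16_t code) {
    return (uint16_t)code;   /* -2 becomes 65534 */
}
```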

b.5.14 Representative bug 81

Observation

segfault

Cause

C++ pointers

Debugging

gdb, valgrind

Affected resources

none

Tags

basic programming issue

b.5.15 Representative bug 96

Observation

segmentation fault

Cause

logical errors, bad memory management

Debugging

using debuggers, looking and studying code

Affected resources

working memory, number of process and threads

Tags

basic programming issue

b.5.16 Representative bug 128

Observation

Robot software is not working / partially working (recognizing and grasping an object)

Cause

Wrong configuration and/or API changes that haven't been applied in all components (a problem with scripting languages like Python)

Debugging
  • identify error message and component via log files / console output

  • Think about what could have caused the problem (look into source code, git/svn commit messages/diffs)

  • try to fix it directly, or talk with other developers in case of bigger changes or issues outside of my responsibility

Affected resources

none

Tags

b.5.17 Representative bug 135

Observation

middleware communication stopped / was only available within small subsets of components

Cause

unknown

Debugging
Affected resources
Tags

not/accidentally solved; communication

b.5.18 Representative bug 136

Observation
  1. Application/process hang.

  2. core usage on idle

  3. Unbalanced load between cores (Monolithic code).

Cause
  1. Loose wire/coupling (mostly USB)

  2. Active wait:
    while(1) { while(!flag); process(); flag = 0; }

  3. A bad design. No threads were used, but time measurements to switch between tasks.

Debugging
  1. Check everything, realize that the file device is open but the device is no longer present, has a different pointer, or has reset

  2. Check every code file. People tend to use old-style structured programming when using C/C++

When you notice the performance drop, check CPU/memory usage with OS tools and notice that one process is using everything but is idle.

Affected resources

Mostly CPU

Tags

basic programming issue; performance bug
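The active wait from cause 2 can be contrasted with a sleep-based poll in a short C sketch. The flag/process names follow the respondent's snippet; the polling variant is a hypothetical mitigation (a condition variable would be the idiomatic fix, but even a short sleep between polls removes the 100%-core symptom):

```c
#include <stdbool.h>
#include <time.h>

volatile bool flag = false;  /* set by another thread in the real system */
int work_done = 0;

void process(void) { work_done++; }

/* The reported pattern: spins at full CPU while waiting for the flag. */
void step_busy(void) {
    while (!flag) { /* spin */ }
    process();
    flag = false;
}

/* Gentler sketch: sleep briefly between polls so the idle loop no
 * longer saturates a core. */
void step_polling(void) {
    struct timespec ts = { 0, 1000000 };  /* 1 ms */
    while (!flag)
        nanosleep(&ts, NULL);
    process();
    flag = false;
}
```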

b.5.19 Representative bug 156

Observation

Difficult to reproduce, random segmentation faults

Cause

of the time it has been either accessing unallocated memory (off-by-one errors) or threading issues

Debugging

When working with a system with many processes, threads, inter-process communications, etc., the standard tools (gdb, valgrind) are often not that helpful. If they can't immediately point me to the error, I'll often resort to print-statement debugging.

Affected resources

Memory leaks, CPU usage

Tags

basic programming issue

b.5.20 Representative bug 190

Observation

unforeseen system behavior, decreased system performance

Cause

misconfiguration of middleware

Debugging
  • monitoring middleware configuration of concerned components

  • checking log-files

  • sometimes debug print-outs

Affected resources

CPU, network load

Tags

communication; performance bug

b.5.21 Representative bug 191

Observation

The software controlling the robot crashed immediately after being started on the robot, or the robot stopped moving when it had to perform a certain operation.

Cause

The error was caused by not checking the range of allocated memory in some object's constructor; we used sprintf instead of snprintf.

Debugging
  • gdb – did not find anything

  • valgrind – did not find anything

Both tools were run on a PC where the error did not occur, but we did not use them on the robot's PC. The bug was found accidentally.

Affected resources

Access to non-allocated memory led immediately to a crash of the program.

Tags

basic programming issue; not/accidentally solved
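The sprintf/snprintf issue generalizes to any unbounded formatting into a fixed buffer. A minimal C sketch, with a hypothetical format_label function, shows how snprintf bounds the write where sprintf would overflow adjacent memory:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical message formatter: with sprintf, a name longer than the
 * destination buffer silently overflows adjacent memory (the
 * crash-on-the-robot scenario); snprintf bounds the write and
 * truncates instead, never writing past dst. */
void format_label(char *dst, size_t dstlen, const char *name) {
    /* was: sprintf(dst, "robot-%s", name);  -- no bound check */
    snprintf(dst, dstlen, "robot-%s", name);
}
```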

b.6 Interesting bugs

The following subsections present answers to the questions for interesting bugs (A.11). Answers have been processed the same way as for subsection B.5.

b.6.1 Interesting bug 5

Observation

There are too many to remember. A recent one got noticed by surprisingly high latency in a multithreaded processing and visualization pipeline.

Cause

Sync to vblank was enabled on a system and due to a possible bug in Qt multiple GL widgets contributed to the update frequency. The maximum display update frequency dropped below .

Debugging

Compare systems and analyze timing inside the application. Google the problem.

Affected resources

None

Tags

driver & OS

b.6.2 Interesting bug 21

Observation

On an arm-and-hand system, with hand and arm running on separate computers linked via an Ethernet bus, timestamped data got desynchronized. This was noticed in the internal proprioception when the fingers moved on the display but the arm did not, although both moved in the physical world.

Cause

NTP was not set up correctly. The university had a specific NTP configuration requirement that was not set on some computers, so they could actually never synchronize.

Debugging

Looking at timestamps in the messages over rosbag or rostopic tools. Analysing system clock drift with command line tools.

Affected resources

Working memory and CPU would be used more due to more interpolation/extrapolation computation between unsynced data streams.

Tags

configuration

b.6.3 Interesting bug 32

Observation

PCL segfaulted on non-Debian/Ubuntu machines when trying to compute a convex hull.

Cause

The code was written to support Debian’s libqhull, ignoring the fact that Debian decided to deviate from the upstream module in one compile flag that changed a symbol in the library from struct to struct*. That way all non-Debian ports of libqhull failed to work with PCL, and instead segfaulted while trying to access the pointer.

Debugging
  • minimal example

  • printf within the PCL code

  • printf within an overlayed version of libqhull

  • gdb

  • Debian package build description for libqhull

  • upstream libqhull package

  • 12 hours of continuous debugging.

Affected resources

Well, segfault, the entire module stopped working. So basically everything was affected to some degree.

Tags

driver & OS; basic programming issue

b.6.4 Interesting bug 46

Observation

The robot kept asking someone's name.

Cause

Background noise in the microphone

Debugging

The bug was obvious: no limit on the number of questions asked. Simply drawing/viewing the state machine made this very obvious.

Affected resources
Tags

coordination; environment-related

b.6.5 Interesting bug 60

Observation

signal processing in component chain gave different results after several months

Cause

unknown

Debugging
Affected resources
Tags

not/accidentally solved

b.6.6 Interesting bug 69

Observation

segfault

Cause

timing and location of allocated memory

Debugging

memory dumps… many, many memory dumps

Affected resources

it did not affect resources constantly, but system stability in general; maybe CPU and memory

Tags

basic programming issue

b.6.7 Interesting bug 76

Observation

While operating, a robot system normally capable of autonomous obstacle avoidance would unexpectedly drop communication with its wireless base station and drive erratically with high probability of collision.

Cause

The main process was started in a Linux terminal and launched a thread that passed wheel velocity information from the main process to the robot controller. When the terminal was closed or otherwise lost, the main process was terminated but the thread continued to run, supplying old velocities to the robot controller.

Debugging

top, debugger, thought experiments

Affected resources

None

Tags

coordination
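One hedged mitigation for this failure mode is a staleness check on incoming commands, so an orphaned thread feeding old velocities can no longer drive the robot. The following C sketch with hypothetical command_t and safe_velocity names illustrates the idea, not the respondent's actual controller:

```c
/* Hypothetical safeguard: the controller rejects velocity commands
 * whose timestamp is older than a deadline. If the producing process
 * dies but stale data keeps arriving (or keeps being re-read), the
 * robot stops instead of driving erratically on old values. */
typedef struct {
    double velocity;   /* commanded wheel velocity */
    double stamp;      /* seconds, from a monotonic clock */
} command_t;

/* Returns the velocity to apply: the commanded one if fresh, else 0. */
double safe_velocity(command_t cmd, double now, double max_age) {
    if (now - cmd.stamp > max_age)
        return 0.0;    /* watchdog: stop instead of reusing old data */
    return cmd.velocity;
}
```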

b.6.8 Interesting bug 83

Observation

Random segfaults throughout system execution.

Cause

Bad memory allocation: malloc for sizeof(type) rather than sizeof(type*).

Debugging

Backtrace with gdb, profiling with valgrind, eventual serendipity to realize the missing * in the code.

Affected resources

Memory

Tags

basic programming issue
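The missing `*` in a sizeof expression under-allocates whenever the element type is smaller than a pointer, corrupting adjacent heap memory and causing seemingly random segfaults later. A minimal C sketch (hypothetical helper names, assuming the reported bug was allocating a table of pointers) makes the size difference explicit:

```c
#include <stdlib.h>

/* Hypothetical reconstruction: an array of pointers allocated with
 * sizeof(char) instead of sizeof(char *). One byte per slot instead of
 * one pointer per slot -> writing the pointers runs past the buffer. */
size_t bytes_buggy(size_t n) { return n * sizeof(char); }    /* too small */
size_t bytes_fixed(size_t n) { return n * sizeof(char *); }  /* correct   */

char **alloc_table(size_t n) {
    /* was: malloc(n * sizeof(char)) -- the missing '*' */
    return malloc(n * sizeof(char *));
}
```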

b.6.9 Interesting bug 133

Observation

memory mismatch, random crashes

Cause

different components using different boost versions

Debugging

debugger, printf. Finally solved after hint from colleague

Affected resources
Tags

basic programming issue

b.6.10 Interesting bug 149

Observation

Erratic behaviour of logic

Cause

Error in mathematical modeling

Debugging

Unit tests

Affected resources

None

Tags

b.6.11 Interesting bug 150

Observation

An algorithm was implemented in both C++ and MATLAB exactly the same way. However, only the MATLAB implementation was working correctly.

Cause

Difference in storing floating point variables in MATLAB and C++: MATLAB rounded the numbers, whereas C++ truncated them.

Debugging

Step by step tracing and debugging, and watching variables. Then, comparing with each other.

Affected resources

Working memory

Tags

basic programming issue
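The truncation-vs-rounding divergence can be demonstrated in a few lines of C. The helper names are hypothetical, and the rounding variant only approximates MATLAB's behavior (half away from zero):

```c
/* Converting a floating point value to an integer truncates toward
 * zero in C/C++, while MATLAB's int32()/round() round to the nearest
 * integer, so "identical" implementations diverge. */
int to_int_truncating(double x) { return (int)x; }  /* C/C++ default */

/* MATLAB-like rounding, half away from zero (sketch without libm). */
int to_int_rounding(double x) {
    return (int)(x >= 0.0 ? x + 0.5 : x - 0.5);
}
```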

b.6.12 Interesting bug 153

Observation

Control program crashed after a consistent length of time.

Cause

Presumably memory leak. Never knew for sure.

Debugging

Not sure

Affected resources

Not sure

Tags

basic programming issue; performance bug

b.6.13 Interesting bug 156

Observation

Visualization window crashing of the time I open it. Running the program inside of gdb resulted in the program successfully running of the time.

Cause

Unknown. Likely something internal to the closed-source graphics drivers interacting with OpenGL/OGRE.

Debugging

Was able to eventually generate a backtrace that pointed to graphics drivers.

Affected resources

CPU/memory/GPU were all affected because I had to run the program inside of gdb.

Tags

driver & OS

b.6.14 Interesting bug 162

Observation

Bad localization of a mobile robot in an outdoor campus environment; jumps in the estimation.

Cause

Bad wheel odometry reading.

Debugging

Analyze log file

Affected resources

None. Loss of performance due to incorrect position tracking

Tags
