Open Problems in Robotic Anomaly Detection

by Ritwik Gupta, et al.
Carnegie Mellon University

Failures in robotics can have disastrous consequences that worsen rapidly over time. Thus, the ability to rely on robotic systems depends on our ability to monitor them and intercede when necessary, manually or autonomously. Prior work in this area surveys intrusion detection and security challenges in robotics, but a discussion of the more general anomaly detection problems is lacking. As such, we provide a brief, insight-focused discussion and frameworks of thought on some compelling open problems with anomaly detection in robotic systems. Namely, we discuss non-malicious faults, invalid data, intentional anomalous behavior, hierarchical anomaly detection, distribution of computation, and anomaly correction on the fly. We demonstrate the need for additional work in these areas by providing a case study which examines the limitations of implementing a basic anomaly detection (AD) system in the Robot Operating System (ROS) 2 middleware, showing that if even supporting a basic system is a significant hurdle, the path to more complex and advanced AD systems is even more problematic. We discuss these ROS 2 platform limitations to support solutions in robotic anomaly detection and provide recommendations to address the issues discovered.




I Introduction

Anomaly detection (AD) is an increasingly important area of study in the field of robotics as robotic systems tend towards higher levels of autonomy. Being able to predict, identify, and correct these anomalies is critical, especially when the robotic systems can have a direct or indirect impact on human life. Unfortunately, while all versions of anomaly detection seek to identify things that are anomalous, there is still considerable variation in precisely what this means:

  1. Extreme: The point lies above some threshold.

  2. Isolated: In some metric space, the distance to other points is greater than some threshold, except for at most a small number of other very nearby points (a point at the center of a highly bimodal distribution can be isolated and not extreme).

  3. Abnormal (or inconsistent with a trusted model): As an example, an auditor keeps track of the ratio of total income to total taxes paid for a collection of organizations. One organization is far larger than the others, with income and taxes being both extremely high. However, the ratio of taxes to income for this large organization is comparable to the ratio for smaller organizations, and the auditor considers it normal. Thus, a point can be both extreme and/or isolated and yet still fail to be abnormal.

The differences between the senses above are conceptually superficial. For any space containing an isolated point, there exists a simple transformation of the space that results in the isolated point becoming an extreme value. Similarly, the size (in terms of income and taxes) of an organization is really a distraction if the ratio of income to taxes is what matters, so why not just talk about that ratio? Unfortunately, while these kinds of conceptual connections between competing notions of anomalousness are trivial for simple examples, they become less trivial as the dimension of the space grows.
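As a rough illustration of how the three senses can disagree, the sketch below checks a point at the center of a bimodal sample (isolated but not extreme) and a very large organization whose tax-to-income ratio is ordinary (extreme but not abnormal). All thresholds, radii, and organization figures are made-up values chosen for the example, not data from this work.

```python
import statistics

def is_extreme(value, threshold):
    """Extreme: the value lies above a fixed threshold."""
    return value > threshold

def is_isolated(value, others, radius, max_neighbors):
    """Isolated: at most max_neighbors other points lie within radius."""
    neighbors = sum(1 for other in others if abs(value - other) <= radius)
    return neighbors <= max_neighbors

# A bimodal sample with clusters near 0 and near 10 (illustrative values).
sample = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2]
center = 5.0
center_is_extreme = is_extreme(center, threshold=9.0)
center_is_isolated = is_isolated(center, sample, radius=1.0, max_neighbors=0)

# Hypothetical (income, taxes) figures; "giant" is far larger than the rest.
organizations = {
    "small_a": (100.0, 20.0),
    "small_b": (120.0, 25.0),
    "small_c": (90.0, 18.0),
    "giant": (1_000_000.0, 205_000.0),
}
ratios = {name: taxes / income for name, (income, taxes) in organizations.items()}
typical_ratio = statistics.median(ratios.values())

giant_is_extreme = is_extreme(organizations["giant"][0], threshold=10_000.0)
# Abnormal: inconsistent with the trusted model "taxes are about 20% of income".
giant_is_abnormal = abs(ratios["giant"] - typical_ratio) > 0.05

print(center_is_extreme, center_is_isolated, giant_is_extreme, giant_is_abnormal)
```

In one dimension these checks are easy to write by hand; the difficulty described above is that the right transformation (here, the ratio) is rarely obvious in high-dimensional data.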

The anomaly detection task is especially challenging when we are asked to treat the data as a black box, with no a priori insight into what is “normal”. A general-purpose anomaly detection algorithm will require considerable sophistication to automatically notice the relationship between income and taxes without any prior knowledge of finance. Accordingly, varying techniques of anomaly detection in robotic monitoring focus on predefined relationships of what is a “normal range” of operation [1, 2, 3, 4, 5]. However, as we show in this work, there are still several open problems in robotic anomaly detection that significantly undermine the assumption that this “normal range” can be defined.

Finally, we demonstrate the need for additional work in these areas by providing a case study which examines the limitations of implementing a basic anomaly detection (AD) system in the Robot Operating System (ROS) 2 middleware [6], which is an attempt to revise and improve many engineering decisions from the ROS 1 platform [7]. ROS has often been difficult to work with and requires specific engineering guidelines which are not conducive to real-time anomaly detection. Accordingly, we draw the conclusion that if even supporting a basic system is a significant hurdle, the path to more complex and advanced AD systems is even more problematic. We discuss these ROS 2 platform limitations to support solutions in robotic anomaly detection and provide recommendations to address the issues discovered.

II Open problems with regard to robotic AD systems

II-A Non-malicious faults present many false alarms.

False positives and false negatives have been well studied in AD and intrusion detection systems [8, 9, 10]. It is a long-held belief that an anomaly directly means a failure of a system. However, not all anomalies represent failures. A robot can frequently behave anomalously without ever failing, resulting in a large number of false alarms that, in turn, result in operator overload and cognitive burden. Alarmingly, high false-alarm rates lead to lowered operator trust in robots [11].

A big issue with false alarms is that being able to handle them means that we have a priori information that a given alarm is a false positive, meaning the alarm should not have occurred in the first place. There are not many methods to eliminate false alarms beyond creating methods with better sensitivity and specificity. The current dominant method to handle operator burden due to these false alarms relies on alarm rate thresholding [12]. Tweaking the threshold at which an alarm persists on the operator’s alert panel, in practice, allows many false alarms to be handled at the cost of tuning this subjective parameter and possibly suppressing one-off true alarms. Methods to dynamically change the threshold exist [13], but they are brittle and are still prone to suppressing true alarms. A proposed method of handling false alarms is by using a pseudo human-in-the-loop (HITL) approach, in which all alerts are initially presented to the operator. As the operator dismisses alerts, a simple model can learn the operator’s preferences and subsequently use the feedback to suppress false alarms, or alarms that the operator deems unnecessary. Approaches like these have been explored in security contexts [14] but have yet to be seriously applied in robotics contexts.
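The pseudo-HITL idea can be sketched with a very simple preference model: count how often the operator dismisses each alarm type, and stop showing a type once it has been dismissed almost every time. The class name `DismissalModel`, the alarm type strings, and both parameters are hypothetical choices for this illustration; a real system would likely use richer features than the alarm type alone.

```python
from collections import defaultdict

class DismissalModel:
    """Learns, per alarm type, how often the operator dismisses the alarm."""

    def __init__(self, min_samples=5, suppress_above=0.9):
        self.min_samples = min_samples        # feedback needed before suppressing
        self.suppress_above = suppress_above  # dismissal rate that triggers suppression
        self.shown = defaultdict(int)
        self.dismissed = defaultdict(int)

    def record(self, alarm_type, was_dismissed):
        """Store one piece of operator feedback for an alarm that was shown."""
        self.shown[alarm_type] += 1
        if was_dismissed:
            self.dismissed[alarm_type] += 1

    def should_show(self, alarm_type):
        """Show every alarm until enough feedback says the operator ignores it."""
        shown = self.shown.get(alarm_type, 0)
        if shown < self.min_samples:
            return True
        return self.dismissed.get(alarm_type, 0) / shown < self.suppress_above

model = DismissalModel()
for _ in range(10):
    model.record("wheel_slip", was_dismissed=True)    # operator always dismisses
    model.record("battery_low", was_dismissed=False)  # operator always acts

print(model.should_show("wheel_slip"), model.should_show("battery_low"))
```

Requiring a minimum number of samples before suppressing anything keeps rare alarm types visible, which limits (but does not remove) the risk of hiding a one-off true alarm.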

It is infeasible to handle non-malicious faults as it is impossible to know ahead of time which faults are non-malicious. Since the idea of which alert is alarming is subjective based on the situation and the operator, HITL approaches present a dynamic and robust way to handle alerts related to non-malicious faults. HITL systems need further interest and development with regard to AD systems.

II-B When is invalid data anomalous, and when is it not?

Conceptually, one can separate the data ingestion pipeline into three parts: 1) the sensor, which is the source of the data, 2) the data processing pipeline, which does the initial data transformation/cleaning, and 3) the data interpretation pipeline, which provides meaning to the data in the context of the overall robotic system (often the data processing and interpretation pipelines are merged, but a distinction still exists). It is often the case that the beginning stage of the pipeline (the sensor) may be operating without anomalies but returning data that is “invalid” from the standpoint of the remaining parts of the pipeline that are interpreting the data or monitoring a user. In this case, it is hard to say that invalid data is “anomalous.”

There are two schools of thought on this matter:

  1. The data is never anomalous, but the interpretations are.

  2. The data can be flawed given a static interpretation framework.

The former, data-purist school of thought is compelling because it conditions us to replace human senses and intuition with robotic ones such that the available sensors and all their data (invalid or not) are the ground truth. Similar to an instance where a human believes they saw a UFO, the data itself (the light collected by the eyes) is always valid, but frameworks of interpretation (the way the brain processes the light collected by the eyes) may be incorrect or behave in anomalous fashions. In this model, the AD burden falls to the data interpretation pipeline.

The latter school of thought is arguably more practical because it allows the engineer to define limits on the normalcy of data. Imagine a use case where a LiDAR sensor is collecting point cloud data to map its surrounding region. A splash of water from a passing car covers the sensor for a few seconds. In this perspective, the data from the sensor should not be treated as valid because the sensor’s operation is compromised. The AD burden in this scenario lies with the data processing pipeline.

Overall, invalid data cannot be directly correlated with anomalous behavior, though a correlation certainly exists. It is essential when building AD systems to take a softer stance towards invalid data and provide methods to filter it out without triggering alarms about anomalous behavior. Methods to filter invalid data vary [15, 16, 17], and there is no one right way to approach this problem. Robust sensor and domain-specific filtering methods need to be generalizable to handle unforeseen edge cases.

It is also important to consider that anomalous states are, again, often defined as some deviation from a typical behavior trend. This “normalcy” carries with it the idea that there are base assumptions about operation, environment, and intended behavior. That is, if there is some function $f$ that defines normal behavior, it is only valid under a set of assumptions $A$. Current robotic AD systems do not monitor $A$, even if $A$ is prone to change. For example, a robotic system that is meant to only operate on flat, 2D surfaces must suddenly operate on a 2D plane that exists on a slope. The base assumptions for this robotic system have changed, and this state should be detected as an anomaly. While this change in base assumptions may have an effect on the overall operation of the robot and can still be detected via the classical, deviation-from-$f$ definition of anomalous behavior, it is sometimes more effective to capture changes in $A$.

By neglecting to account for the ever-changing set of assumptions about the environment, current robotic AD systems are severely lacking in their ability to handle the dynamic environments that current robotics research focuses on. In order to reason about changes in $A$, we must reason about the environment with a non-static framework of “normalcy.” As robotic systems progress further into autonomous operation in unconstrained environments, future AD systems will need to capture changes in $A$, which will require new ideas and research into unreliable surroundings.
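For the sloped-surface example, monitoring a base assumption directly can be as simple as checking it against a sensor that is independent of the behavior model. The sketch below assumes an accelerometer reading of the gravity vector in the robot's body frame while the robot is at rest; the function names and the 3-degree tolerance are hypothetical.

```python
import math

def tilt_degrees(ax, ay, az):
    """Angle between the measured gravity vector and the robot's z axis."""
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    if norm == 0.0:
        raise ValueError("accelerometer reading has zero magnitude")
    cosine = max(-1.0, min(1.0, az / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

def flat_ground_assumption_holds(accel, max_tilt_deg=3.0):
    """True while the 'operating on a flat surface' assumption is plausible."""
    return tilt_degrees(*accel) <= max_tilt_deg

g = 9.81
on_flat = (0.0, 0.0, g)
on_slope = (g * math.sin(math.radians(10.0)), 0.0, g * math.cos(math.radians(10.0)))

print(flat_ground_assumption_holds(on_flat), flat_ground_assumption_holds(on_slope))
```

A check like this raises an alert about the assumption itself, rather than waiting for downstream behavior (odometry drift, motor load) to deviate far enough to be noticed.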

II-C Intentional anomalous behavior and emergency stops

Robotics failures are far from monolithic. Some failures are internal, as when a part breaks or when an algorithm fails to anticipate a logical outcome. Some failures are external, when the power grid stops powering the robot or an operator drives the robot over a cliff. And some failures have no simple blame prescription. An operator and a robot may jointly confuse each other; a robot may stumble into an environment for which it was not designed to be successful; or an operator may push the robot to the edge of its limits in a completely rational act of desperation.

In order to characterize intentional anomalous behavior and how to apply emergency stops in particularly alarming cases, we must return to forming a definition of normal operating conditions with respect to the explicit physical bounds of the robotic system. Examples include the amount of weight an arm can lift, degrees of rotation, and the speed at which something can operate. We can imagine these bounds as a closed curve in some operation envelope. As a safety measure, most robots are rated to perform outside their normal operating conditions.

Fig. 1: Example of some arbitrary operation envelope

For some $n$-dimensional set of all possible operating conditions $O$, where each dimension represents an operating attribute (weight, rotation, etc.), we have $N \subseteq S \subseteq O$, where $N$ is the subset of operating conditions that fall within normal operating bounds and $S$ represents the subset of operating conditions that take into account some Factor of Safety (FoS). $O \setminus S$ represents the set of operating conditions that are outside of the FoS and are not accounted for by the engineers of the system.

Given some state $s \in O$, when can one say that it represents anomalous physical behavior? Certainly if $s \in N$, then $s$ is not anomalous. On the other hand, if $s$ is very much outside the bounds of $N$, then it is anomalous. There exists a gray area in between. When $s$ is outside of $N$ by some small margin, it may not necessarily be anomalous, yet when the margin is appreciable then the probability that the state is anomalous increases. An example here is redlining a car. Redlining by itself may not be anomalous behavior; however, the probability of identifying that state as anomalous increases as we maintain the redline.

Assume that there exists a risk surface outside the bounds of $N$. The risk surface defines how “risky” any $s$ is as it moves further from $N$. In the case of robotics, it is necessarily true that the steepness of the risk surface and the “distance” (for some abstract, non-continuous definition of distance) away from $N$ are directly correlated. This risk surface is more often than not unknown since $O$ is generally high-dimensional and many configurations are untested. Therefore, while some of the risk surface may be defined via explicit testing, a large portion of it must be learned or inferred. A risk model then defines a function over the risk surface that quantifies the amount of risk present in any given $s$. It is important to state that a risk model may be explicitly defined via rigorous testing if $O$ is small, but will have to be learned if $O$ is of any sizable cardinality (as in the case of autonomous driving, for example).

Once we have a quantified view of the “riskiness” of a state, we can then create an AD system by integrating time and count components to see how long or how often a risky state is maintained. Using either thresholding or a learned cut-off, it is then feasible — and most importantly, explainable — to trigger emergency stops at relevant times in the robotic system’s execution.
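A minimal sketch of this idea, using the redlining example: risk is taken to be the normalized overshoot beyond the normal operating bounds, and an emergency stop fires once the risk accumulated over time exceeds a budget. The risk function, the reset-on-return rule, the bounds, and the budget are all assumptions made for illustration; as noted above, a real risk model would usually have to be tested or learned.

```python
def risk(state, normal_bounds):
    """Hypothetical risk model: total normalized overshoot beyond normal bounds."""
    total = 0.0
    for name, value in state.items():
        low, high = normal_bounds[name]
        span = high - low
        if value > high:
            total += (value - high) / span
        elif value < low:
            total += (low - value) / span
    return total

def first_estop_index(states, normal_bounds, dt, budget):
    """Index of the first state at which accumulated risk exceeds the budget."""
    accumulated = 0.0
    for index, state in enumerate(states):
        current = risk(state, normal_bounds)
        if current == 0.0:
            accumulated = 0.0  # assumption: returning to normal clears the tally
        else:
            accumulated += current * dt
        if accumulated > budget:
            return index
    return None  # no emergency stop needed

bounds = {"rpm": (0.0, 6000.0)}
sustained_redline = [{"rpm": 6600.0}] * 6
brief_spikes = [{"rpm": 6600.0}, {"rpm": 3000.0}, {"rpm": 6600.0}, {"rpm": 3000.0}]

print(first_estop_index(sustained_redline, bounds, dt=1.0, budget=0.35))
print(first_estop_index(brief_spikes, bounds, dt=1.0, budget=0.35))
```

Because the trigger is a single accumulated quantity compared against a budget, the reason for any stop can be reported to the operator in terms of which attribute overshot and for how long.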

Intentional anomalous behavior is currently not well understood. As such, autonomous emergency stops are difficult to calculate and execute. It is imperative to define risk models ahead of time that can give scales of confidence with which to implement emergency stops; defining risk models in this manner is currently under-explored.

II-D AD systems built around hierarchies of systems with shared functionality

Common systems fail in similar fashions. For example, autonomous room cleaners have a shared mode of failure in that they may localize incorrectly, and different types of load-bearing arms get stressed on common joints. Therefore, instead of building AD systems that are sensitive to each sensor and actuator, it is possible to create an ontology/hierarchy of sensors and actuators to simplify the AD burden. Such approaches have been explored [18] and demonstrate significant improvement in AD capability, but rely on explicitly defined groupings which may not exist in a robotics use-case. Most systems in a robotic platform are not fundamentally unique and failures on these systems do not need to be finely monitored. By clustering their behavior together and focusing on a subset of their message features that are shared across these systems, we can reduce the number of topics a robotic AD system must focus on. This reduces computation load and operator burden, as tailored alerts for each system are no longer provided for non-critical systems.

Robotic systems can be described via their information-processing aspects in terms of a graph structure:

  • a collection of nodes $V$, where some nodes are connected by directed edges $E$ variously representing physical anchoring, energy flow, or information flow of various kinds,

  • the graph is defined as $G = (V, E)$,

  • and all nodes are loyal to some pre-defined objective (at least implicitly, by design).

However, in order to discuss shared functionality, communication graphs are not an ideal representation. An alternate definition of a robotic system that may be more conducive is as follows:

  • a collection of nodes $V$, where nodes are connected by directed edges $E$ representing one-way communication channels,

  • the graph is defined as $G = (V, E)$,

  • nodes can be grouped in the form of $g_p = \{v \in V : p(v)\}$, where $p$ represents a predicate function that returns true if $v$ has a certain functionality, and $\mathcal{G}$ represents the overall set of all groups in the robotic system,

  • and each $v \in V$ is a member of only one group in $\mathcal{G}$.

It is easy to imagine a sub-component of a robotic system as a monolith; if we can abstract away the finer details (e.g., imagine the combination of a robotic “wrist” and “elbow” as just a robotic “arm”) then we can reduce cognitive and computational burden. We call this ability to view things as a simple hierarchy composability: the ability to compose items together into a higher-level merger of objects. Let the behavior of nodes be denoted by $b_i$, where $i \in \{1, \dots, n\}$. Let there also be a vector of constants $c = (c_1, \dots, c_n)$, $c_i \in \mathbb{R}$. Linear composability is then defined by $B = \sum_{i=1}^{n} c_i b_i$, where $B$ is the behavior of the composed group. This necessarily means that the behavior of each node is independent from the behavior of other nodes, which means that this composition is decomposable.

Attributes of nodes that are linearly composable are well-suited to hierarchical treatment. However, most robotic attributes are not inherently decomposable. For example, the dynamics of a robotic “arm” is a non-linear composition of the “wrist” and “elbow”. The dynamics on the wrist interact with the elbow in a dependent fashion. Thus, there is sizable information lost when treating both components as one hierarchy. However, attributes such as individual CPU load can easily be treated as one hierarchy for AD purposes.
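The grouping and linear composition described above can be sketched for the CPU load case. Each node is assigned to exactly one group by a functionality function, and the group-level attribute is a weighted sum of the member attributes. The node names, the `functionality_of` function, and the load figures are hypothetical.

```python
from collections import defaultdict

def group_nodes(nodes, functionality_of):
    """Partition nodes so that each node belongs to exactly one group."""
    groups = defaultdict(list)
    for name in nodes:
        groups[functionality_of(name)].append(name)
    return dict(groups)

def composed_load(cpu_load, members, weights=None):
    """Linear composition: weighted sum of the members' individual CPU loads."""
    if weights is None:
        weights = {member: 1.0 for member in members}
    return sum(weights[member] * cpu_load[member] for member in members)

# Illustrative per-node CPU load as a fraction of one core.
cpu_load = {"wrist": 0.25, "elbow": 0.5, "lidar": 0.125}

def functionality_of(name):
    return "arm" if name in {"wrist", "elbow"} else "sensing"

groups = group_nodes(cpu_load, functionality_of)
arm_load = composed_load(cpu_load, groups["arm"])

print(groups, arm_load)
```

An AD system could then monitor `arm_load` as a single signal. The same shortcut would not be valid for the arm's dynamics, where the wrist and elbow interact and a sum of independent terms loses information.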

There are many gains to be had by abstracting the AD burden into hierarchies of systems, but there is not a clear understanding of what systems can be treated as a group. We present a minimal framework of thought with which to tackle this problem; however, there is not sufficient literature in this area to make definitive conclusions about the merit of this approach.

II-E Distribution of computation across hosts

Anomaly detection algorithms sometimes require complex computation that may not be possible on many robot platforms due to limited computational resources. With the prevalence of network connected devices, it is possible to distribute computation to nodes that possess sufficient resources to run computation and send results back to the robot platform.

Distributed computation will be the cornerstone of robotics in the future. With the large amount of sensor, log, and systems data that can be collected and generated by robotic systems, it will become infeasible to provide enough computation on one system to process all the data in a timely manner, especially with regard to AD systems.

There are many well-established methods to distribute computation. To this point, we will discuss three main architectures we believe are especially pertinent to robotics: peer-based, hub-and-spoke, and local reduction with a hub.

Fig. 2: Notional peer-to-peer architecture

Peer-based computation is a model that relies on all peers in a network performing computation together in order to quickly solve a given task. Many such models have been explored [19, 20, 21], but the central idea is that some function can be passed to all nodes in a system, and then some data can be processed by all nodes with results broadcast to all peers. Additional overhead is incurred if results need to be merged locally. Alternatively, nodes can be assigned unique functions to compute and a pipeline can be created across nodes, such that no one system is responsible for all the work. In robotic AD systems, the peer-to-peer approach is viable as the AD computation could feasibly be passed around to nodes that are not actively utilizing their computational resources, with the benefits being relayed to the entire system. Furthermore, a single node going out of contention does not harm the overall system. This approach has been explored [22, 23], but presents many limitations, such as network flow problems associated with a slow peer, protocol issues, and more.

Fig. 3: Notional hub-and-spoke architecture

Hub-and-spoke (often called server/client architecture) models are simple: all computation is offloaded to a central server. The server receives the data, processes it, and returns results to all the “spokes” (nodes), ad infinitum. This removes the burden of computation from the robotic system entirely, which means that robots can be deployed with less computational power than otherwise necessary. This is ideal when AD systems have to deal with massive amounts of data. This idea has been widely embraced by the field of “cloud robotics”, for which [24] and [25] give a good background. This method, however, introduces multiple modes of failure, such as loss of autonomy when there is a loss of signal, computation server failures, and more.

Fig. 4: Notional local reduction with hub-and-spoke architecture

An extension of the hub-and-spoke architecture to reduce the computation load on the central hub is to introduce the concept of local reduction. Each robotic system or component can perform local computation within the bounds of its computational resources by itself, and then pass intermediate results to the hub, where it can process further with lower computational load. This is an exciting model, as robots can already process their data comfortably, so an AD system can still process massive amounts of data with less computational load across the entire system.
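As a small sketch of local reduction, each robot can reduce its raw samples to a fixed-size summary (count, sum, sum of squares), and the hub can merge those summaries into global statistics without ever seeing the raw data. The function names and sample values are hypothetical, and the choice of mean and standard deviation as the hub-side statistic is only one possibility.

```python
import math

def local_reduce(samples):
    """Runs on the robot: reduce raw samples to (count, sum, sum of squares)."""
    return (len(samples), sum(samples), sum(x * x for x in samples))

def hub_merge(summaries):
    """Runs on the hub: combine summaries into a global mean and std deviation."""
    count = sum(summary[0] for summary in summaries)
    total = sum(summary[1] for summary in summaries)
    total_sq = sum(summary[2] for summary in summaries)
    mean = total / count
    variance = max(total_sq / count - mean * mean, 0.0)  # guard against rounding
    return mean, math.sqrt(variance)

# Illustrative raw readings held locally by two robots.
robot_a = [1.0, 2.0, 3.0]
robot_b = [4.0, 5.0]

global_mean, global_std = hub_merge([local_reduce(robot_a), local_reduce(robot_b)])
print(global_mean, global_std)
```

Each robot sends three numbers regardless of how many samples it collected, so network load stays constant as data volume grows. The hub can then compare any robot's local mean against the global statistics.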

The central problem that needs to be addressed is that of autonomy vs. computation. By relying on systems outside of the robotic system itself, the robot loses autonomy. It now must account for communication failures, node dropouts, and countless other errors that could cause it to no longer have the ability to perform some computation. On the other hand, by allowing for distributed computation, the robotic system gains the ability to process more data than it can by itself. These trade-offs are critical and may vastly differ based on use case. Graceful degradation of communication abilities or node failures need to be studied in detail, as the fundamental operations of a robotic system must be maintained even when all distributed computation loads go awry.

For many existing and emerging robotic use cases, such as manufacturing or self-driving vehicles, distributed computation of AD computations will allow for the greatest flexibility in anomaly detection, correction, and recovery procedures.

II-F Fixing anomalies on the fly

Once anomalies have been identified, many systems do not take much action beyond providing the operator or the log with an alert that can be manually corrected, if the operator deems it fit. However, with autonomous systems, it is essential that automatic correction of errors and anomalous behavior be integrated as a component or extension of an AD system. Flight control systems include fault-tolerance and error correction as one of their core components [26].

A way of introducing fault tolerance into systems is to run multiple versions of the same system simultaneously and switch between the systems if anomalous behavior is detected. This method has been explored for visual odometry [27] and in other fields, such as optimization and distributed computing [28]. This method is successful but computationally inefficient, as computation would increase with a complexity of $O(n)$ with respect to the number $n$ of redundant systems.

Non-blocking snapshots [29] of critical systems during periods in which no anomalous behavior is occurring constitute an alternative way of fixing anomalous behavior or errors within various systems. By reverting to a snapshot that is known to work, the system can recover from anomalous behavior, and may even avoid falling back to its anomalous state given that the initial anomaly was caused via non-deterministic causes. The downside to this is that the system may be stuck in a snapshot-recovery cycle if there is a deterministic and reproducible cause that the system is facing.
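The snapshot-recovery cycle mentioned above can be guarded against with a cap on consecutive reverts. The sketch below uses a plain blocking `copy.deepcopy` for simplicity, so it is not an implementation of the non-blocking snapshots of [29]; the class name `SnapshotGuard` and the `max_reverts` parameter are hypothetical.

```python
import copy

class SnapshotGuard:
    """Keeps the last known-good state and limits consecutive reverts."""

    def __init__(self, state, max_reverts=3):
        self.state = state
        self.max_reverts = max_reverts
        self.consecutive_reverts = 0
        self._snapshot = copy.deepcopy(state)

    def commit(self):
        """Call while no anomaly is observed: record the state as known-good."""
        self._snapshot = copy.deepcopy(self.state)
        self.consecutive_reverts = 0

    def revert(self):
        """Restore the snapshot; return False once the revert cap is exceeded."""
        self.consecutive_reverts += 1
        if self.consecutive_reverts > self.max_reverts:
            return False  # likely a deterministic cause: escalate instead
        self.state = copy.deepcopy(self._snapshot)
        return True

guard = SnapshotGuard({"gain": 1.0}, max_reverts=2)
guard.state["gain"] = 99.0  # an anomaly corrupts the live state
first_revert = guard.revert()
restored_gain = guard.state["gain"]
second_revert = guard.revert()
third_revert = guard.revert()

print(first_revert, restored_gain, second_revert, third_revert)
```

When `revert` returns `False`, the system has reverted repeatedly without a healthy period in between, which suggests a reproducible cause that snapshots alone cannot fix.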

A middle ground to non-blocking snapshots and running multiple copies of a system is the concept of shadow computing and shadow replication [30], in which a process is duplicated with shadow processes. These shadow processes are an exact replica of the main process but are executed at a lower quality of service (QoS). In robotics, mission-critical software and hardware drivers can be shadow replicated, and upon the detection of anomalous behavior due to a non-deterministic fault, a healthy shadow process can be substituted. The anomalous process can then be shadowed and restarted as needed.

Robotic systems can also learn to correct anomalous behavior and incorporate error correction into various algorithms in the overall system. This approach is explored in the context of planning [31], map learning [32], and many other application domains [33, 34, 35]. Better error correction is a topic of active research in many fields [36, 37, 38]. A truly global method of error correction is out of scope given current technology, since the combined dimensionality of all systems in a robot is too vast for learning algorithms to be able to map.

In order to properly address these concerns, it is imperative for anomaly correction to be seen as a basic research task rather than an applied one. Furthermore, borrowing existent research from other fields is critical, as a lot of fundamental work already exists and needs to be adapted for the robotics-specific domain. Finally, AD correction also needs to be quantified in terms of operational risk, i.e., it is currently not well understood how much the lack of anomaly correction affects missions in terms of cost and resources.

III Monitoring our own robot: a case study

We implemented a basic AD system in ROS 2 to demonstrate the need for more work in examining robotic AD systems. Due to the popularity and decisions behind the ground-up redesign of ROS 2 [6], we show that basic architectural constraints to implementing a real-time AD system reinforce the conclusion that supporting future AD systems is currently not possible without changes to the system.

The platform we chose for implementation is the Turtlebot 2, which is a low-cost, open-source robot that provides a framework on which to build. Traditionally, the Turtlebot 2 runs on ROS. Efforts have been made to run ROS 2 on the base, but they largely rely on the ROS 1 bridge. Nevertheless, it provides us a target to start developing new ROS 2 technologies.

III-A Control

Tele-operation (tele-op) is a common way to control robots in many environments. In order to support full tele-op for the Turtlebot on ROS 2, we created a ROS 2 node that allows an Xbox controller to be used to control all degrees of motion for the Kobuki base.

  • This is a complete ROS 2 rclpy implementation and does not rely on the ROS 1 bridge.

  • Both publishing and subscription occur in the same node to listen to Joy messages and send messages to /cmd_vel.

III-B Capturing Messages

We explored two methods to capture the messages published on topics in the ROS system. Both methods have their upsides and downsides. For the sake of modularity, we focus only on the upsides here and capture the limitations later.

  1. We created an rclpy ROS 2 node that would automatically subscribe to all the topics in the system and write their messages to disk.

    • We can debug messages in real-time with existent Python tooling.

    • New topics can be subscribed to on the fly without modifying the ROS environment.

    • The behavior of the topic collector can be changed on the fly by publishing messages to it.

    • Can be extended to perform actions on the ROS 2 system in real-time.

  2. Using rosbag over the ROS 1 bridge.

    • Use existent ROS infrastructure without having to reinvent the wheel.

    • Easy tooling to analyze and simulate rosbag files.

Both methods give us files that we can interpret as a chronological feed of the robot’s operation, which we can analyze and perform actions on. Of the two, Method 1 is preferred since we can perform actions immediately when we detect an anomaly. However, as we will see later, there are significant limitations with ROS 2 to make this a reality.

III-C A basic AD system

Our data consists of timestamps for every message passed by each node in the system. A single data point is a sensor-timestamp pair, down to the millisecond. For intuition to drive our AD model, consider that we may define many different metrics or features from the data. Surprising values or combinations of values for any of these metrics could suggest the presence of anomalies:

  • Rate of messages per topic or across all topics

  • Rate of messages per time, day of the week, etc.

  • Autocorrelation of messages: How often does a message of one type tend to follow a message of another type within a particular time-lag window?
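The features listed above can be computed directly from (topic, timestamp) pairs. The sketch below derives a per-topic message rate and counts how often a message on one topic is followed by a message on a different topic within a lag window. The event list, topic names, and window sizes are illustrative, and the lag count is a simple co-occurrence measure rather than a full autocorrelation.

```python
from collections import Counter

def rate_per_topic(events, window_s):
    """Messages per second for each topic over a window of window_s seconds."""
    counts = Counter(topic for topic, _ in events)
    return {topic: count / window_s for topic, count in counts.items()}

def follow_counts(events, max_lag_ms):
    """Count pairs (a, b) where topic b follows topic a within max_lag_ms."""
    ordered = sorted(events, key=lambda event: event[1])
    counts = Counter()
    for index, (first_topic, first_time) in enumerate(ordered):
        for next_topic, next_time in ordered[index + 1:]:
            if next_time - first_time > max_lag_ms:
                break  # events are sorted, so later ones are even further away
            if next_topic != first_topic:
                counts[(first_topic, next_topic)] += 1
    return counts

# Illustrative (topic, timestamp in milliseconds) pairs covering one second.
events = [
    ("/joy", 0), ("/cmd_vel", 5),
    ("/joy", 100), ("/cmd_vel", 104),
    ("/odom", 500),
]

rates = rate_per_topic(events, window_s=1.0)
follows = follow_counts(events, max_lag_ms=10)
print(rates, dict(follows))
```

A sudden drop in the `("/joy", "/cmd_vel")` count relative to the `/joy` rate would, for instance, be one plausible signature of a controller disconnect or a stalled tele-op node.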

IV Results

In order to test our basic AD system, we ran the Turtlebot along specified paths, injecting behaviors such as jerky direction changes, controller disconnects, counter-intuitive path following, and varying speeds. Each path was tested with a random combination of all behavior modifications multiple times. The basic AD system performed reasonably well, capturing anomalies approximately 70% of the time. Due to limitations, discussed below, we were unable to verify how the AD system would react after correcting any observed anomalies in the behavior.

While building utilities to capture messages and a real-time AD system, we encountered several limitations with ROS 2, the most pressing of which we enumerate here.

  1. A node that is solely a Subscriber gets no results from get_topic_names_and_types() in class Node. In order to get all topic names and types, the node must also be a Publisher.

  2. There are no easily available API methods to access the underlying ROS 2 DDS implementation (eProsima RTPS) that we could find for the Python interface.

  3. Writing messages to disk using the all-subscriber node (Method 1, III-B) causes system hangups and dropped messages in the pub-sub queues. This is a significant hurdle to real-time AD systems.

  4. rosbag cannot listen to new topics if they appear after the command to launch rosbag is executed.

  5. There is no simple utility to convert rosbags to formats such as JSON, YAML, CSV, and Protobufs. Data scientists may not have the easiest time handling the rosbag API, but their toolkits already support the listed formats.
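To make the last limitation concrete, the sketch below shows the kind of small export utility that is missing. It does not read rosbag files: it assumes the message records have already been extracted (for example, through the rosbag API) into plain dictionaries. The field names `topic` and `timestamp_ms`, the function names, and the sample records are hypothetical.

```python
import csv
import io
import json


def to_csv(records, fieldnames=("topic", "timestamp_ms")):
    """Serialize a list of flat dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


# Hypothetical records, standing in for data already pulled out of a bag.
sample_records = [
    {"topic": "/odom", "timestamp_ms": 5},
    {"topic": "/scan", "timestamp_ms": 100},
]

csv_text = to_csv(sample_records)
jsonl_text = to_json_lines(sample_records)
```

Both outputs load directly into common data-science toolkits, which is the gap the limitation describes.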

Auxiliary limitations included a lack of good documentation for many basic ROS 2 tasks, an ever-changing development environment, and hard-to-find community support for the platform. We expect these problems to be solved naturally as ROS 2 matures as a platform.

V Recommendations for ROS 2

The decentralized (by default) graph structure of a ROS 2 robot sets it apart from its predecessor, as explained in [6]. The design of ROS 2 allows for easier development, faster and more understandable inter-node communication dynamics, and a closer alignment to the ideal “pub-sub” design architecture. However, this decentralization of ROS 2 raises questions about the available goals of an AD system implemented in ROS 2. As described in Section IV, the current design and implementation of ROS 2 prove to be a hindrance to simple, non-real-time AD applications. In order to be future-facing and to handle the open problems outlined in Section II, ROS 2 must evolve and implement some alterations to its design specification. Our recommendations relate to optimizing the design of ROS 2 to enable advanced and future AD methods to operate most effectively. The basis for our recommendations is a first-principles analysis, as there is not much of an open-source precedent for AD in robotics, and particularly not one for ROS 2.

The following are the core ingredients that we see as necessary for a successful AD system on a robot:

  1. Data: Sensor data of all kinds are the fundamental inputs to AD algorithms. If real-time AD is needed, real-time data must be collected. If OS temperature is an important indicator of a near-future node malfunction, temperature data must be collected.

  2. Data transport/centralization: Specifically, any node running AD software needs to have data feeds from all other nodes or processes that are relevant for the targeted class of anomalies. This introduces a communication overhead cost.

  3. Computational nodes: Once all data relevant for anomaly detection is being supplied to a central node, the challenge becomes how to derive insight from it. In general, such analysis is computationally costly. To maintain a strong AD capability on a complex robot, ROS needs to support large compute nodes.

Given these core ingredients, we make the following recommendations for ROS 2 based on our findings:

  1. Introduce strict value ranges into messages (II-B).

    • Messages define what data a listener can accept, but the listener considers only a set range of values valid for any given field. When values out of this range are received, the listener can error out.

    • Listeners must implement their own range-checking, which is a hurdle and can often be poorly managed.

    • A message whose type carries built-in range checks can be rejected on the publisher’s side before the listener ever has to interact with it.

  2. Provide automatic subscription to new topics and messages in rosbag (III-B).

    • In order to build a real-time AD system, we need to be able to detect when nodes enter and leave the system. In both its current form and its newly proposed design [6], rosbag cannot detect when a node creates a new topic or when an existing one is removed.

    • Having a utility that can track the dynamism in node publishing behavior is critical, especially during a node crash.

  3. Develop a better profiling environment

    • In order to understand the system and to collect valuable diagnostic data, a robust profiling environment needs to exist within the ROS ecosystem.

    • Currently, many people rely on language-specific profiling tools, the /statistics topic, and self-created packages to collect system data.

  4. Integrate best-known-state tracking and recovery (II-F)

    • There is no way to back up the state of a node and restore it to a given point, a method which can quickly resolve many issues with a node without resorting to a complete restart.

    • System snapshots in general can help with debugging and profiling of performance.

  5. Introduce state introspection (II-D, II-F)

    • The ROS graph needs to be known and all states need to be dynamically verified in order to build and maintain a structure of hierarchy.

    • Knowing the initial (or best) structure and state of all nodes in the ROS system will provide a known target from which to recover.

  6. Provide a safe mode (II-F)

    • When nodes fail, it is essential that core functionalities are still available.

    • In a safe mode, basic functionalities will be provided to recover the robotic system, such as when in tele-operation.
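As a minimal sketch of the first recommendation, the example below attaches a valid range to each field of a message and rejects out-of-range values on the publisher's side, before anything is sent. It is plain Python, not a ROS 2 API: the `Bounded` class, the `build_message` helper, the field names, and the numeric limits are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounded:
    """Inclusive valid range for one numeric message field."""
    low: float
    high: float

    def check(self, name, value):
        # NaN fails both comparisons, so it is rejected as well.
        if not (self.low <= value <= self.high):
            raise ValueError(f"{name}={value} outside [{self.low}, {self.high}]")
        return value


# Hypothetical schema: field names and limits are assumptions.
VELOCITY_LIMITS = {
    "linear_x": Bounded(-0.5, 0.5),
    "angular_z": Bounded(-2.0, 2.0),
}


def build_message(limits, **fields):
    """Validate every field against its range; raise before publishing."""
    missing = set(limits) - set(fields)
    unknown = set(fields) - set(limits)
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    return {name: limits[name].check(name, value) for name, value in fields.items()}


ok_message = build_message(VELOCITY_LIMITS, linear_x=0.2, angular_z=-1.0)
```

With the ranges declared alongside the message definition, listeners no longer need to re-implement the same checks, which is the burden the recommendation aims to remove.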

VI Conclusion

Recently, the NTSB and BEA [39, 40] filed reports that discussed fatal accidents in autonomous systems on land and in air due to improper edge case and anomaly handling by autonomous components in both robotic systems. The reports highlight how these robotic systems were unable to react outside of well-understood operating envelopes and unable to alert their operators about their failures in meaningful ways. These reports echo the necessity of understanding the various issues in anomaly detection when deploying a robotic system, so that spurious or intentional anomalous behavior can be understood and handled before the resulting actions become dangerous to the robot and to the objects around it.

As anomaly detection for robotic systems continues to grow as a point of interest, upcoming robotics platforms need to support these capabilities on a first-class basis. Furthermore, it is clear that robotic anomaly detection is still in its nascent stage. Much work needs to be done to fully understand and handle anomalous behaviors in robotic systems in order to achieve complete, trustworthy autonomy.


The authors thank Heather Evans, William Shaw, and Hollen Barmer at CMU SEI for their support of this research. Drs. John Dolan and David Bourne at The Robotics Institute at CMU gave us insight into various aspects of robotics.

DISTRIBUTION A. Approved for public release: distribution unlimited. Copyright 2018 Carnegie Mellon University. All Rights Reserved. This material is based upon work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. References herein to any specific commercial product, process, or service by trade name, trade mark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute. DM18-1034


  • [1] A. Siffer, P.-A. Fouque, A. Termier, and C. Largouet, “Anomaly Detection in Streams with Extreme Value Theory,” in Proc. of the 23rd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.   ACM Press, 2017, pp. 1067–1075.
  • [2] A. Schein, J. Paisley, D. M. Blei, and H. Wallach, “Bayesian Poisson Tensor Factorization for Inferring Multilateral Relations from Sparse Dyadic Event Counts,” ArXiv e-prints, vol. 1506.03493, June 2015.
  • [3] L. Friedland, A. Gentzel, and D. Jensen, “Classifier-Adjusted Density Estimation for Anomaly Detection and One-Class Classification,” in Proc. of the SIAM Int. Conf. on Data Mining.   Society for Industrial and Applied Mathematics, April 2014, pp. 578–586.
  • [4] D. Bruns-Smith, M. M. Baskaran, J. Ezick, T. Henretty, and R. Lethin, “Cyber Security through Multidimensional Data Decompositions,” in Cybersecurity Symp. (CYBERSEC), April 2016, pp. 59–67.
  • [5] S. Speakman, S. Somanchi, E. McFowland III, and D. B. Neill, “Penalized Fast Subset Scanning,” Journ. of Computational and Graphical Statistics, vol. 25, no. 2, pp. 382–404, April 2016.
  • [6] B. Gerkey. (2018, Aug.) Why ROS 2.0? Open Source Robotics Foundation, Inc. San Francisco, CA, USA. [Online]. Available:
  • [7] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS: An Open-Source Robot Operating System,” in ICRA Workshop on Open Source Software, vol. 3, 2009.
  • [8] A. Patcha and J.-M. Park, “An Overview of Anomaly Detection Techniques: Existing Solutions and Latest Technological Trends,” Computer Networks, vol. 51, no. 12, pp. 3448–3470, Aug. 2007.
  • [9] M. Grill, T. Pevný, and M. Rehak, “Reducing False Positives of Network Anomaly Detection by Local Adaptive Multivariate Smoothing,” Journ. of Computer and System Sciences, vol. 83, pp. 43–57, 2017.
  • [10] K. Timm. (2001, Sept.) Strategies to Reduce False Positives and False Negatives in NIDS. [Online]. Available:
  • [11] J. Y. C. Chen and P. I. Terrence, “Effects of Imperfect Automation and Individual Differences on Concurrent Performance of Military and Robotics Tasks in a Simulated Multitasking Environment,” Ergonomics, vol. 52, no. 8, pp. 907–920, Aug. 2009.
  • [12] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection: A Survey,” ACM Computing Surveys, vol. 41, July 2009.
  • [13] C. H. R. Everett and G. A. Gilbreath, “A Supervised Autonomous Security Robot,” Robotics and Autonomous Systems, vol. 4, no. 3, pp. 209–232, Nov. 1988.
  • [14] L. F. Cranor, “A Framework for Reasoning About the Human in the Loop,” Usability, Psychology, and Security, p. 15, April 2008.
  • [15] R. R. Murphy and D. Hershberger, “Handling Sensing Failures in Autonomous Mobile Robots,” Int. Journ. of Robotics Research, vol. 18, no. 4, pp. 382–400, April 1999.
  • [16] L. F. D. Goncalves, L. N. Karlsson, P. Pirjanian, and E. D. Bernardo, “Systems and Methods for Filtering Potentially Unreliable Visual Data for Visual Simultaneous Localization and Mapping,” USA Patent US7 272 467B2, Sept., 2007.
  • [17] M. Pignati, L. Zanni, S. Sarri, R. Cherkaoui, J. Y. L. Boudec, and M. Paolone, “A Pre-Estimation Filtering Process of Bad Data for Linear Power Systems State Estimators Using PMUs,” in Power Systems Computation Conf., Aug. 2014, pp. 1–8.
  • [18] L. Xiong, B. Póczos, J. G. Schneider, A. J. Connolly, and J. VanderPlas, “Hierarchical Probabilistic Models for Group Anomaly Detection,” in AISTATS, 2011.
  • [19] D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu, “Peer-to-Peer Computing,” HP Laboratories, Tech. Rep. HPL-2002-57, July 2003. [Online]. Available:
  • [20] D. Bickson, D. Dolev, G. Bezman, and B. Pinkas, “Peer-to-Peer Secure Multi-Party Numerical Computation,” in Int. Conf. on Peer-to-Peer Computing, Sept. 2008, pp. 257–266.
  • [21] V. King, J. Saia, V. Sanwalani, and E. Vee, “Towards Secure and Scalable Computation in Peer-to-Peer Networks,” in IEEE Symp. on Foundations of Computer Science (FOCS), Oct. 2006, pp. 87–98.
  • [22] A. Bijayendrayodhin, “Design of the Peer Agent for Multi-robot Communication in an Agent-based Robot Control Architecture,” Master’s thesis, May 2002.
  • [23] C. A. C. Parker and H. Zhang, “A Practical Implementation of Random Peer-to-Peer Communication for a Multiple-Robot System,” in IEEE Int. Conf. on Robotics and Automation, April 2007, pp. 3730–3735.
  • [24] Z. Du, W. Yang, Y. Chen, X. Sun, X. Wang, and C. Xu, “Design of a Robot Cloud Center,” in Int. Symp. on Autonomous Decentralized Systems, March 2011, pp. 269–275.
  • [25] G. Hu, W. P. Tay, and Y. Wen, “Cloud robotics: architecture, challenges and applications,” IEEE Network, vol. 26, no. 3, pp. 21–28, May 2012.
  • [26] C. Edwards, T. Lombaerts, and H. Smaili, Eds., Fault Tolerant Flight Control: A Benchmark Challenge, ser. Lecture Notes in Control and Information Sciences.   Springer-Verlag, 2010, vol. 399.
  • [27] K. Holtz, D. Maturana, and S. Scherer, “Learning a Context-Dependent Switching Strategy for Robust Visual Odometry,” in Field and Service Robotics, D. S. Wettergreen and T. D. Barfoot, Eds.   Springer International Publishing, 2016, vol. 113, pp. 249–263.
  • [28] C. Karakus, Y. Sun, S. Diggavi, and W. Yin, “Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning,” ArXiv e-prints, vol. 1803.05397, March 2018.
  • [29] K. M. Chandy and L. Lamport, “Distributed Snapshots: Determining Global States of Distributed Systems,” ACM Trans. on Computer Systems, vol. 3, no. 1, pp. 63–75, Feb. 1985.
  • [30] B. Mills, T. Znati, and R. Melhem, “Shadow Computing: An Energy-Aware Fault Tolerant Computing Model,” in Int. Conf. on Computing, Networking and Communications (ICNC), Feb. 2014, pp. 73–77.
  • [31] J. P. Mendoza, “Regions of Inaccurate Modeling for Robot Anomaly Detection and Model Correction,” Ph.D. dissertation, Pittsburgh, PA, USA, April 2017.
  • [32] S. P. Engelson and D. V. McDermott, “Error Correction in Mobile Robot Map Learning,” in Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), vol. 3, May 1992, pp. 2555–2560.
  • [33] R. Mojarad, H. Kordestani, and H. R. Zarandi, “A Cluster-Based Method to Detect and Correct Anomalies in Sensor Data of Embedded Systems,” in Euromicro Int. Conf. on Parallel, Distributed, and Network-Based Processing (PDP), Feb 2016, pp. 240–247.
  • [34] H. Wu, H. Lin, S. Luo, S. Duan, Y. Guan, and J. Rojas, “Recovering from External Disturbances in Online Manipulation through State-Dependent Revertive Recovery Policies,” ArXiv e-prints, vol. 1708.00200, Aug. 2017.
  • [35] P. Guo, H. Kim, N. Virani, J. Xu, M. Zhu, and P. Liu, “Nonlinear Unknown Input and State Estimation Algorithm in Mobile Robots,” ArXiv e-prints, vol. 1804.02814, 2018.
  • [36] M. C. Stumpe, J. C. Smith, J. E. Van Cleve, J. D. Twicken, T. S. Barclay, M. N. Fanelli, F. R. Girouard, J. M. Jenkins, J. J. Kolodziejczak, S. D. McCauliff, and R. L. Morris, “Kepler Presearch Data Conditioning I-Architecture and Algorithms for Error Correction in Kepler Light Curves,” ArXiv e-prints, vol. 1203.1382, Sept. 2012.
  • [37] F. Santini, A. Palombo, R. J. Dekker, S. Pignatti, S. Pascucci, and P. B. W. Schwering, “Advanced Anomalous Pixel Correction Algorithms for Hyperspectral Thermal Infrared Data: The TASI-600 Case Study,” IEEE Journ. of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2393–2404, June 2014.
  • [38] B. C. Bush, F. P. J. Valero, A. S. Simpson, and L. Bignone, “Characterization of Thermal Effects in Pyranometers: A Data Correction Algorithm for Improved Measurement of Surface Insolation,” Journ. of Atmospheric and Oceanic Technology, vol. 17, pp. 165–175, 2000.
  • [39] “Preliminary Report: Crash and Post-crash Fire of Electric-powered Passenger Vehicle,” National Transportation Safety Board, Tech. Rep. HWY18FH011, June 2018. [Online]. Available:
  • [40] B. d’Enquêtes ed d’Analyses, “Final Report on the Accident on 1st June 2009 to the Airbus A330-203 Registered F-GZCP Operated by Air France flight AF 447 Rio de Janeiro - Paris,” Ministère de l’Écologies, Tech. Rep. BEA f-cp090601, July 2012. [Online]. Available: