With the widespread use of multicore systems, even in everyday phones, concurrent programming has become mainstream. However, concurrent programming is known to be hard and error-prone. Unlike traditional sequential programs, concurrent programs often exhibit non-deterministic behavior which makes it difficult to reason about their behavior. Many bugs involving concurrent entities, e.g. processes, threads, actors, manifest themselves only in rare execution traces. Identifying and analyzing concurrency bugs is thus an arduous task, perhaps even an art.
When studying techniques to support the development of complex concurrent programs, our first research question is what types of concurrency bugs appear in such programs. The answer to this question depends on the concurrency model in which the program is written. Most existing studies about concurrency bugs focus on thread-based concurrency [6, 43, 10, 39, 56, 37, 1, 2].
The established frame of reference, however, does not directly apply to other concurrency models which are not based on a shared memory model such as the actor model, communicating sequential processes (CSP), etc. In this paper we study concurrency bugs in message passing concurrent software, in particular, in actor-based programs.
The actor model is attractive for concurrent programming because it avoids by design some concurrency bugs associated with thread-based programs. Since actors do not share mutable state, programs cannot exhibit memory-level race conditions such as data races. In addition to that, deadlocks can be avoided if communication between actors is solely based on asynchronous message passing. However, this does not mean that programs are inherently free from concurrency issues.
This paper surveys concurrency bugs in the literature on actor-based programs and aims to answer three research questions: (1) which kind of concurrency bugs can be avoided by the actor model and its variants, (2) what kind of patterns cause concurrency bugs in actor programs, and (3) what is the observable behavior in the programs that have these bugs?
To provide a common frame of reference to distinguish different types of concurrency bugs that appear in actor-based programs, we propose a taxonomy of concurrency bugs in actor-based programs (in Section 3). The taxonomy aims to establish a conceptual framework for concurrency bugs that facilitates communication amongst researchers. It is also meant to help practitioners in developing, testing, debugging, or even statically analyzing programs to identify the root cause of concurrency bugs by offering more information about the types of bugs and their observable properties.
Based on our taxonomy of bugs, we analyze actor literature that reports concurrency bugs and map them to the proposed classification. Furthermore, we identify which types of bugs have been addressed in literature so far, and which types have been studied less.
The contributions of this paper are:
A systematic study of concurrency bugs in actor-based programs based on a literature review. To the best of our knowledge it is the first taxonomy of bugs in the context of actor-based concurrent software.
An analysis of the patterns and observable behaviors of concurrency bugs found in different actor-based programs.
A review of the state of the art in static analysis, testing, debugging, and visualization of actor-based programs to identify open research issues.
2 Terminology and Background Information
Before we delve into the classification of concurrency bugs in actor-based programs, we discuss the terminology used in this paper and the basic concepts on actor-based programs and concurrency issues.
A concurrency bug is a failure related to the interactions among different concurrent entities of a system. Following Avizienis’s terminology, a failure is an event that occurs when the services provided by a system deviate from the ones it was designed for. The discrepancy between the observed behavior and the theoretically correct behavior of a system is called an error. Hence, an error is an event that may lead to a failure. Finally, a fault is an incorrect step in a program which causes an error (e.g. the cause of a message transmission error in a distributed system may be a broken network cable). A fault is said to be active when it causes an error, and dormant when it is present in a system but has not yet manifested itself as an error. Throughout this paper, we use the terms concurrency bug and issue interchangeably.
Although actors were originally designed to be used in open distributed environments, they can be used on a single machine, e.g. in multicore programming. This paper analyses concurrency bugs that appear in actor-based programs used in either concurrent or distributed systems. However, bugs that are only observable in distributed systems (e.g. due to network failures) are out of the scope of this paper.
3 Classification of Concurrency Bugs in Actor-based Programs
While there is a large number of studies on concurrency bugs in thread-based programs, there are only a few studies on bugs in the context of message passing programs. Zhang et al. study bug patterns, manifestation conditions, and bug fixes in three open source applications that use message passing. In this context, the literature typically uses general terms to refer to a certain issue, for example ordering problems. For actor-based programs, however, there is so far no established terminology for concurrency bugs.
This section introduces a taxonomy of concurrency bugs for the actor model derived from bugs reported in literature and from our own experience with actor languages. The taxonomy table first summarizes the well-known terminology for thread-based programs from literature, and then introduces our proposed terminology for concurrency bugs in actor-based programs. Our overall categorization starts out from the distinction of shared-memory concurrency bugs in literature, which classifies bugs in two general categories: lack of progress issues and race conditions.
Depending on the guarantees provided by a specific actor model, programs may be subject to different concurrency bugs. Therefore, not all concurrency bugs are applicable to all actor variants. In the rest of the section we define each type of bug, and detail in which variants it cannot be present.
3.1 Lack of Progress Issues
Two different kinds of conditions can lead to a lack of progress in an actor-based program: deadlocks and livelocks. However, these issues manifest themselves differently in actor-based programs compared to thread-based programs.
3.1.1 Communication Deadlock.
A communication deadlock is a condition in a system where two or more actors are blocked forever waiting for each other to do something. This condition is similar to traditional deadlocks known from thread-based programs. We base the terminology on prior work on Erlang concurrency bugs.
Communication deadlocks can only occur in variants of the actor model that feature a blocking receive operation. This is common in variants of the actor model based on processes. Examples of such actor systems include Erlang and the Scala Actors framework. A communication deadlock manifests itself when an actor only has messages in its inbox that cannot be received with the currently active receive statement. The ping-pong listing shows a communication deadlock example in Erlang. In that listing, the pong process is blocked because it waits for a message that the ping process never sends; instead, the ping process returns ok.
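The ping-pong scenario described above can be approximated outside Erlang. The following is a minimal Python sketch, not the original listing: the `Process` class, its `receive` method, and the `blocked_forever` helper are hypothetical names introduced here, and the selective receive is only a rough stand-in for Erlang's semantics.

```python
from collections import deque

class Process:
    """Toy process with a mailbox and a selective receive (not real Erlang semantics)."""

    def __init__(self, name):
        self.name = name
        self.mailbox = deque()
        self.waiting_for = None   # tag this process is blocked on, if any
        self.finished = False

    def receive(self, tag):
        """Return the first message carrying `tag`, or None if the process has to block."""
        for msg in self.mailbox:
            if msg[0] == tag:
                self.mailbox.remove(msg)
                self.waiting_for = None
                return msg
        self.waiting_for = tag
        return None

def blocked_forever(processes):
    """Names of processes stuck in receive while no other process can still run."""
    if any(not p.finished and p.waiting_for is None for p in processes):
        return []
    return [p.name for p in processes if p.waiting_for is not None]

ping, pong = Process("ping"), Process("pong")
ping.mailbox.append(("pong",))        # pong sends `pong` to ping
got = ping.receive("pong")            # ping receives it ...
ping.finished = True                  # ... and returns ok without sending `ping`
reply = pong.receive("ping")          # pong now waits for a `ping` message
stuck = blocked_forever([ping, pong])
```

In this sketch `stuck` contains only the pong process: it is blocked in a receive, and the only process that could have sent the expected message has already terminated.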
3.1.2 Behavioral Deadlock.
A behavioral deadlock happens when two or more actors conceptually wait for each other because the message to complete the next step in an algorithm is never sent. In this case, no actor is necessarily suspended or otherwise unable to receive messages. We call this situation a behavioral deadlock, because the mutual waiting prevents local progress. However, these actors might still process messages from other actors. Since actors do not actually block, detecting behavioral deadlocks can be harder than detecting deadlocks in thread-based programs.
We illustrate a behavioral deadlock in an implementation of the dining philosophers concurrency problem written in Newspeak, shown in the dining philosophers listing. The behavioral deadlock has the effect that some philosophers cannot eat (as they never acquire two consecutive forks), preventing global progress. In the listing, the left fork has the same value as the id of the philosopher, but for the right fork the program computes its value. For example, philosopher 1 will eat with forks 1 and 2, and so on. The error occurs when the philosopher puts down its forks: the right fork gets a wrong value because the implementation swapped the numForks and leftForkId variables. This programming mistake is the fault that causes forks 2 and 4 to be always taken. Consequently, there is no global progress since philosophers 2 and 4 never eat and philosophers 1 and 3 eat only once. Philosopher 5 can always eat, however, showing local progress.
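The effect of the swapped variables can be replayed with a small simulation. The Python sketch below is not the Newspeak program: it assumes the right fork is computed as `(leftForkId mod numForks) + 1` and that the fault swaps the two operands, which is one reading of the description that reproduces the reported behavior. The function names and the sequential round-robin loop, which stands in for the message exchange with a fork arbiter, are illustrative assumptions.

```python
NUM_FORKS = 5

def right_fork_correct(left_fork_id, num_forks=NUM_FORKS):
    # Assumed intended computation: the fork to the right of the philosopher.
    return (left_fork_id % num_forks) + 1

def right_fork_swapped(left_fork_id, num_forks=NUM_FORKS):
    # Assumed fault: the two operands are swapped.
    return (num_forks % left_fork_id) + 1

def simulate(put_down_right, rounds=3):
    """Each round, every philosopher asks for both forks, eats if possible, and puts them down."""
    taken = set()
    meals = {phil: 0 for phil in range(1, NUM_FORKS + 1)}
    for _ in range(rounds):
        for phil in range(1, NUM_FORKS + 1):
            left, right = phil, right_fork_correct(phil)
            if left in taken or right in taken:
                continue  # request denied; the philosopher is not blocked and retries later
            taken.update((left, right))
            meals[phil] += 1
            taken.discard(left)
            taken.discard(put_down_right(phil))  # the fault only affects putting forks down
    return meals, taken

good_meals, good_taken = simulate(right_fork_correct)
bad_meals, bad_taken = simulate(right_fork_swapped)
```

Under these assumptions the faulty run leaves forks 2 and 4 permanently marked as taken, philosophers 1 and 3 eat once, philosophers 2 and 4 never eat, and philosopher 5 eats in every round. No philosopher is ever suspended, which is what distinguishes this situation from a communication deadlock.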
3.1.3 Bad message interleaving.
We define a bad message interleaving as the condition when a message is processed between two messages which are expected to be processed one after the other, causing some misbehavior of the application or even a crash.
In the original actor model, when an actor sends a message to a recipient actor, the message is placed in a mailbox and is guaranteed to be eventually delivered by the actor system. All messages are thus expected to be delivered in the order in which the sender actor sent them. However, there are two sources of bad interleavings. First, messages from different senders may be interleaved in between messages from one sender. In other words, even if the actor model enforces that messages from a sender actor are received in a FIFO order, messages from different sender actors may occur between them. The second source of bad interleavings of messages occurs in variants of the actor model which do not guarantee in-order delivery of the messages. This can be found in actor models used to build distributed systems, like Scala or ActorFoundry  in which communication between actors is not enforced to work in a FIFO manner.
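The first source of bad interleavings can be sketched with a single stateful actor and two senders. The Python snippet below is an illustration under stated assumptions: the `Cell` class and the message tuples are hypothetical, and both schedules respect the FIFO order of each individual sender.

```python
class Cell:
    """Actor-like object that processes one message at a time."""

    def __init__(self):
        self.value = None

    def handle(self, msg):
        kind, payload = msg
        if kind == "set":
            self.value = payload
            return None
        return self.value  # "get"

def reply_to_get(schedule):
    """Process the schedule in order and return what the single `get` observes."""
    cell = Cell()
    observed = None
    for msg in schedule:
        result = cell.handle(msg)
        if msg[0] == "get":
            observed = result
    return observed

# Sender A sends set(1) followed by get and expects to read 1.
# Sender B sends set(2) at some point.
expected = reply_to_get([("set", 1), ("get", None), ("set", 2)])
interleaved = reply_to_get([("set", 1), ("set", 2), ("get", None)])
```

In the second schedule the message of sender B is processed between the two messages of sender A, so A reads 2 instead of 1 even though no per-sender ordering guarantee was violated.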
3.2 Comparison with Existing Terminology in Actor Literature
As pointed out in the introduction, the goal of establishing a taxonomy is to provide a common vocabulary for concurrency bugs in actor-based programs. In what follows we relate our terminology to the one presented in other efforts tackling concurrency bugs for actor-based programs.
Bad message interleavings have been denoted as ordering problems by Lauterburg et al. and Long et al. and as atomicity violations by Zheng et al. and Hong et al. We consider ordering problems to be too coarse-grained a term. We decided to use the term bad message interleaving to avoid confusion with atomicity violations in thread-based concurrent programs, which are due to low-level memory access errors.
Message order violations have been collected under many different names in literature: data races by Petrov et al. , harmful races by Raychev et al. , order violations by Hong et al. , and message ordering bugs by Tasharofi et al. . We consider message order violations to be a descriptive name while avoiding confusion with low-level data races present in thread-based programs.
Memory inconsistency problems have been denoted as race conditions by Hughes and Bolinder . D’Osualdo  tackled this problem by proving a correctness property referred to as “mutual exclusion”.
In literature, the term orphan messages  refers to messages that an actor sends but that the receiver actor(s) will never handle. Rather than a kind of concurrency bug, we consider orphan messages as an observable property of an actor system which may be a symptom of a concurrency bug like communication deadlocks or message ordering violations. We use this terminology in the next section when we classify concurrency bugs reported in literature with our taxonomy. Orphan messages can for example be present in actor languages that allow flexible interfaces such as Erlang, the Scala Actors framework and the Akka library . An actor may change the set of messages it accepts after another actor has already sent a message which can only be received by an interface which is no longer supported.
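The interface change described above can be sketched as follows. This is a minimal Python model, not the API of Erlang, Scala Actors, or Akka: the `Actor` class, its `become` method, and the message tags are hypothetical names chosen for illustration.

```python
class Actor:
    """Toy actor whose set of accepted message tags can change over time."""

    def __init__(self, accepted):
        self.accepted = set(accepted)
        self.mailbox = []
        self.handled = []

    def send(self, tag):
        self.mailbox.append(tag)

    def become(self, accepted):
        self.accepted = set(accepted)

    def step(self):
        """Handle the first acceptable message; return False if none matches."""
        for index, tag in enumerate(self.mailbox):
            if tag in self.accepted:
                self.handled.append(self.mailbox.pop(index))
                return True
        return False

    def run(self):
        """Process messages until nothing matches; what remains is orphaned."""
        while self.step():
            pass
        return list(self.mailbox)

server = Actor(accepted={"configure", "request"})
server.become({"request"})     # the server drops the "configure" interface
server.send("configure")       # a client's message arrives too late
server.send("request")
orphans = server.run()
```

Here `orphans` holds the `configure` message, which stays in the mailbox indefinitely, while the `request` message is handled normally.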
4 Concurrency Bugs in Actor-based Programs
In this section, we review various concurrency bugs reported in literature, and classify them according to the taxonomy introduced in Section 3. The goal is twofold: (1) to classify concurrency bugs collected in prior research in the bug categories according to our taxonomy and (2) to identify bug patterns and observable behaviors that appear in programs exhibiting a particular concurrency bug. The latter is useful to design mechanisms for testing, verification, static analysis, or debugging of such concurrency issues.
Table 2 shows the catalog of analyzed concurrency bugs collected from literature. In the first column we categorized these bugs according to the taxonomy presented in Section 3. For each bug scenario we describe the bug pattern as a generalized description of the fault by identifying the actions that trigger the error. In the remainder, we highlight the identified bug patterns in italic. We also describe the observable behavior of the program that has the concurrency issue, i.e. the failure.
4.1 Lack of Progress Issues
To the best of our knowledge, the literature reports on communication deadlocks mostly in the context of Erlang programs. Bug-4 in Table 2 is an example of a communication deadlock collected by Christakis and Sagonas, which corresponds to the ping-pong example of Section 3.1.1. Christakis and Sagonas distinguish two causes for communication deadlocks in Erlang programs:
receive-statement with no messages i.e. empty mailbox,
receive with the wrong kind, i.e. the messages in the mailbox are different from the ones expected by the receive statement.
We classify these conditions as bug patterns for orphan messages, which can lead to communication deadlocks in Erlang.
Christakis and Sagonas  mention also other conditions that can cause mailbox overflows or potentially indicate logical errors. Such conditions include no matching receive, i.e. the process does not have any receive clause matching a message in its mailbox, or receive-statement with unnecessary patterns, i.e. the receive statement contains patterns that are never used.
Bug-9 is similar in kind to bug-4. Bug-9 was identified by Gotovos et al. when implementing a test program in Erlang which has a server process that receives and replies to messages inside a loop. The server process blocks indefinitely because it waits for a message that is never sent. They also identify as problematic the case in which a message is sent to an already finished process, which is exhibited by bug-10. This can happen in two situations. First, if a client process sends a message to an already finished server process, the client process will throw an exception. Second, if the server process exits without replying after the message was received, the client process will block waiting for a reply that is never sent. We categorize bug-4, bug-9, and bug-10 as communication deadlocks and the observable behaviors as orphan messages.
D’Osualdo et al.  identified three other bug patterns leading to abnormal process termination in Erlang programs, which might cause deadlocks: sending a message to a non-pid value, applying a function with the wrong arity and spawning a non-functional value. These bug patterns could result in a communication deadlock or in a message order violation if the termination notification is not handled correctly.
Aronis and Sagonas studied built-in operations that can cause races in Erlang programs. Because the studied built-ins can access memory that is shared by processes, races can be observed in the form of different outputs. Their classification of observable interferences of Erlang/OTP built-ins can help to diagnose communication deadlocks, message order violations, and memory inconsistencies.
4.2 Message Protocol Violations
4.2.1 Message order violation.
In Erlang, updating certain resources such as the global name registry requires careful coordination to avoid concurrency issues. For example, we categorize bug-1 as a message order violation, which as a result makes a race on the global process registry visible. The bug is caused because two processes try to register processes for the same global name more than once, which is done with non-atomic operations. For correctness, these processes would need to coordinate with each other.
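A non-atomic check followed by a registration can be sketched as a replay of explicit schedules. The Python model below is an illustration, not the code of bug-1: the step names `whereis` and `register` mirror Erlang's `whereis/1` and `register/2`, the failure outcome loosely follows `register/2` raising `badarg` for a name that is already in use, and the `run` function and process names are assumptions.

```python
def run(schedule):
    """Replay (process, step) pairs against a shared name registry."""
    registry = {}
    seen = {}      # what each process observed in its lookup
    outcome = {}
    for proc, step in schedule:
        if step == "whereis":
            seen[proc] = registry.get("server")
        elif step == "register":
            if seen[proc] is not None:
                outcome[proc] = "reused"       # name was already taken at lookup time
            elif "server" in registry:
                outcome[proc] = "badarg"       # stale lookup: the name was taken meanwhile
            else:
                registry["server"] = proc
                outcome[proc] = "registered"
    return outcome

serial = run([("p1", "whereis"), ("p1", "register"),
              ("p2", "whereis"), ("p2", "register")])
racy = run([("p1", "whereis"), ("p2", "whereis"),
            ("p1", "register"), ("p2", "register")])
```

In the serial schedule the second process sees the existing registration and reuses it. In the racy schedule both lookups happen before either registration, so the second registration fails, which is the kind of outcome that coordination between the two processes would prevent.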
Bug-11 reported by Christakis et al.  is another example of a message order violation exhibited when a spawned process terminates before the parent process registers its process id. The application expects the parent process to register the id of the spawned process before the spawned process is finalized, but as the execution of spawn and register functions are not atomic, an unexpected termination can cause a message order violation.
Tasharofi et al.  identified twelve bugs in five Scala projects using the Akka actor library, which we categorize as message ordering problems. Bug-13 gives details of one of these bugs. The study found two bug patterns in Scala and Akka programs that can cause concurrency bugs in actors. First, when changing the order of two receives in a single actor (consecutive or not), which can provoke a message order violation. Second, when an actor sends a message to another actor which does not have the suitable receive for that message. This last issue corresponds to an orphan message, and can also lead to other misbehaviors such as communication deadlocks.
4.2.2 Bad message interleaving.
Bug-12 corresponds to the example of bad message interleaving collected by Lauterburg et al., which was shown in the bad message interleaving listing. The bug pattern occurs when an actor processes a third message between two consecutive messages because the actor model implementation does not guarantee FIFO delivery.
4.2.3 Memory inconsistency.
To the best of our knowledge, memory inconsistency issues have only been reported in the context of Erlang programs. Christakis and Sagonas show an example of high-level races between processes using the Erlang Term Storage in bug-2. In this case the error is due to insert and lookup operations on tables that have public access, so two or more processes may try to read from and write to them simultaneously. A second example, detailed in bug-3, shows a similar issue that can happen when accessing tables of the Mnesia database. The cause is the use of read and write operations that can lead to race conditions. We categorize both issues as memory inconsistency problems.
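A high-level race on a publicly accessible table can be illustrated with a read-modify-write sequence. The Python sketch below is not the code of bug-2 or bug-3: the dictionary stands in for a shared table, and the `lookup` and `insert` step names, the `counter` key, and the `final_counter` function are hypothetical.

```python
def final_counter(schedule):
    """Replay lookup/insert steps of several processes on one shared table."""
    table = {"counter": 0}
    local = {}   # value each process read in its last lookup
    for proc, op in schedule:
        if op == "lookup":
            local[proc] = table["counter"]
        elif op == "insert":
            table["counter"] = local[proc] + 1
    return table["counter"]

serial = final_counter([("p1", "lookup"), ("p1", "insert"),
                        ("p2", "lookup"), ("p2", "insert")])
racy = final_counter([("p1", "lookup"), ("p2", "lookup"),
                      ("p1", "insert"), ("p2", "insert")])
```

Each individual table operation is well defined, yet in the racy schedule one increment is lost because both processes base their insert on the same stale lookup.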
Hughes and Bolinder detected four bugs corresponding to memory inconsistencies in dets, the disk storage back end used in the Erlang database Mnesia. Bug-5 refers to insert operations that run in parallel instead of being queued in a single queue. They can cause inconsistent return values or even exceptions. The observable behavior of bug-6 corresponds to an inconsistent view of the dets content. This issue can occur when reopening a file that is already open and executing insert and get_contents operations in parallel. Bug-7 and bug-8 are caused by failing integrity checks. Of the four bugs that were found, these two are the ones that occur with the least probability. Bug-7 is reproduced only in one specific scenario when running three processes in parallel, and bug-8 can occur only in those language implementations that can keep new and old versions of the server state.
Huch  and D’Osualdo et al.  conducted studies to verify mutual exclusion in Erlang programs. LABEL:lst:memory shows an example. The bug pattern identified corresponds to the wrong definition of the behavior of the actor, and the observable property is that two actors can store different values for the same key which leads to inconsistencies, i.e. the actors can share the same resource.
4.3 Actor Variants and Possible Bugs
Based on our review of concurrency bugs above, we summarize which concurrency bugs can occur for each variant of the actor model. Furthermore, we identify the patterns that can cause a concurrency bug and the behavior that can be observed in the programs that have these bugs.
In languages that implement the process actor model, e.g. Erlang and Scala, programs can exhibit communication deadlocks because the actor implementation provides blocking operations. A common observable behavior of this concurrency bug is the presence of orphan messages. This means an actor with this issue is blocked, i.e. the process is in a waiting state. These languages can also suffer from message order violations and memory inconsistencies. For message order violations, possible bug patterns are delays in managing responses or an unsupported interleaving of messages, i.e. the actor protocol does not correspond to the executed message interleavings. These can result in a program crash or inconsistent computational results. Memory inconsistencies are typically caused by a wrong message order when accessing shared resources.
Similarly to the process actor variant, event-loop based programs can suffer from message order violations and bad message interleavings. Generally, message order violations, bad message interleaving, and memory inconsistencies are race conditions that can happen in all actor-based programs including in programs using the class or active object actor model variants.
5 Advanced Development Techniques
This section surveys the current state of the art of techniques that support the development of actor-based programs. The goal is to identify the relevant subfields of study and problems in the literature. Furthermore, for each of these techniques we analyze, based on the literature, how it relates to the bug categories of our taxonomy in order to identify open issues.
Specifically, we survey techniques for static analysis, testing tools, debuggers, and visualization. Table 1 gives an overview of the categories of bugs that static analysis and testing techniques address. It leaves out debugging and visualization techniques, since they are typically not geared towards a specific set of bugs.
5.1 Static Analysis
The static analysis approaches surveyed in this section include all approaches that identify concurrency issues without executing a program. This includes approaches based on typing, abstract interpretation, symbolic execution, and model checking. The following descriptions are organized by the category of concurrency bugs these approaches address.
5.1.1 Lack of progress issues.
In the field of actor languages, Erlang has been subject to extensive studies. Dialyzer is a static analysis tool that uses type inference in addition to type annotations to analyze Erlang code. The static analysis uses information on control flow and data flow to identify problematic usage of Erlang built-in functions that can cause concurrency issues. Dialyzer also has support for detecting message order violations as well as memory inconsistencies [48, 13]. Christakis and Sagonas extended Dialyzer to also detect communication deadlocks in Erlang using a technique based on communication graphs.
Another branch of work uses type systems to prevent concurrency issues. For actor languages, this includes for instance the work of Colaço et al. Based on a type system for a primitive actor calculus, they can prevent many situations in which messages would be received but never processed, i.e., so-called orphan messages. However, static analysis cannot detect all possible orphan messages. Therefore, the approach relies on dynamic type checks to detect the remaining cases. Similar work was done for Erlang, where orphan messages are also detected based on a type system.
Dam and Fredlund proposed an approach using static analysis to verify properties such as the boundedness of mailboxes. The verification of this property can avoid the presence of orphan messages in a program. Their technique applies local model checking in combination with temporal logic and extensions to the μ-calculus for basic Erlang systems.
Similarly, Stiévenart et al. used abstract interpretation techniques to statically verify the absence of errors in actor-based programs and upper bounds of actor mailboxes. As mentioned before, the verification of mailbox bounds can avoid the presence of orphan messages. The proposed technique is based on different mailbox abstractions which allow preserving the order and multiplicity of the messages. Thus, this verification technique can be useful to avoid message order violations.
5.1.2 Message protocol violation.
D’Osualdo et al.  also worked on Erlang and used static analysis and infinite-state model checking. Their goal is to check specific properties for programs that are expressed with annotations in the code. With this approach, they are able to verify for instance correct mutual exclusion semantics modeled with messages. However, their current approach cannot model arbitrary message order violations, because the used analysis abstracts too coarsely from messages.
Garoche et al.  verify safety properties statically for an actor calculus by using abstract interpretation. Their work focuses on orphan messages and specific message order violations. Their technique is especially suited for detecting unreadable behavior, detecting unboundedness of resources, and determining whether linearity constraints hold.
5.2 Testing Tools
This section describes work on testing actor-based programs to identify concurrency bugs. Some of the approaches are based on recording the interleaving of messages, the usage of state model checkers, and techniques to analyze message schedules.
5.2.1 Lack of progress issues.
Sen and Agha  present an approach to detect communication deadlocks in a language closely related to actor semantics. They use a concolic testing approach that combines symbolic execution for input data generation with concrete execution to determine branch coverage. The key aspect of their technique is to minimize the number of execution paths that need to be explored while maintaining full coverage.
Concuerror is a systematic testing tool for Erlang that can detect abnormal process termination as well as blocked processes, which might indicate a communication deadlock. To identify these issues, Concuerror records process interleavings for test executions and implements a stateless search strategy to explore all interleavings.
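The idea of exploring all interleavings without storing visited states can be sketched with brute-force enumeration. The Python snippet below is not Concuerror's algorithm: it lists every merge of two per-sender message sequences and checks a property on each, whereas real tools prune equivalent schedules. The `interleavings` and `violates` functions and the message tuples are assumptions made for illustration.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps the internal order of each sequence."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def violates(schedule):
    """True if some `get` observes a value other than the one its sender expects."""
    value = None
    for kind, payload in schedule:
        if kind == "set":
            value = payload
        elif kind == "get" and value != payload:
            return True
    return False

client_a = [("set", "a"), ("get", "a")]   # expects to read back its own value
client_b = [("set", "b")]

all_schedules = list(interleavings(client_a, client_b))
failing = [s for s in all_schedules if violates(s)]
```

For these two clients there are three possible schedules, and the only failing one is the schedule in which the message of the second client is processed between the two messages of the first.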
5.2.2 Message protocol violation.
Claessen et al.  use a test-case-generation approach based on QuickCheck in combination with a custom user-level scheduler to identify race conditions. The focus is specifically on bad message interleavings and process termination issues. To make their approach intuitive for developers, they visualize problematic traces. Hughes and Bolinder  use the same approach and apply it to a key component of the Mnesia database for Erlang. They demonstrate that the system is able to find race conditions at the message level that can occur when interacting with the shared memory primitives used by Mnesia.
Basset [35, 36] is an automated testing tool based on Java PathFinder, a state model checker, that can discover bad message interleavings in Scala and ActorFoundry programs. Follow-up work improves Basset with a technique that reduces the number of schedules to be explored, which improves its performance. The key insight is to exploit the transitivity of message send dependencies to prune the search space for relevant execution schedules. For Scala programs that use Akka there is another testing tool called Bita, which can also detect message order violations. It is based on a technique called schedule coverage, which analyzes the order of the receive events of an actor.
The Setac framework for the Scala Actors framework enables testing for race conditions on actor messages, specifically message order violations. A test case defines constraints on schedules and assertions to be verified, while the framework identifies and executes all relevant schedules on the granularity of message processing. The Akka actor framework for Scala also provides a test framework called TestKit (Akka.io: Testing Actor Systems, Lightbend Inc., access date: 8 February 2017, http://doc.akka.io/docs/akka/current/scala/testing.html). However, it does not seem to provide any sophisticated automatic testing capabilities, which seems to indicate that the current techniques might not yet be ready for adoption in industry.
Cassar and Francalanza  investigate how to minimize the overhead of instrumentation to detect race conditions. Instead of relying exclusively on synchronous instrumentation, they use asynchronous monitoring in combination with a logic to express correctness constraints on the resulting event traces.
|Communi.||Behav.||Live-||Message Or.||Bad Msg.||Mem.|
|Christakis and Sagonas ||X|
|Christakis and Sagonas ||X||X|
|Colaço et al. ||p|
|Dagnat and Pantel ||p|
|Dam and Fredlund ||p|
|Stiévenart et al. ||p||p|
|Garoche et al. ||p||p|
|Zheng et al. ||p||p|
|Petrov et al. ||X||X|
|Raychev et al. ||X|
|Sen and Agha ||X|
|Claessen et al. ||X|
|Christakis et al. ||X|
|Lauterburg et al. ||X|
|Tasharofi et al. ||X|
|Tasharofi et al. ||p||p|
|Tasharofi et al. ||p||X|
|Hughes and Bolinder ||p||X|
|Hong et al. ||X||X|
|Cassar and Francalanza ||p||p||p|
This section reviews the main features provided by current debuggers for actor-based systems. It includes techniques for both online and postmortem debugging.
Causeway is a postmortem debugger for distributed communicating event-loop programs in E. It focuses on displaying the causal relation of messages to enable developers to determine the cause of a bug. Causality is modeled as the partial order of events based on Lamport’s happened-before relationship. We consider that this approach can be useful for detecting message protocol violations.
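The happened-before relation mentioned above can be sketched as reachability in a graph of events. The Python snippet below is not Causeway's implementation: the event names, the `edges` dictionary, and the `happened_before` function are hypothetical, and edges stand for either program order within one actor or a send followed by its matching receive.

```python
def happened_before(edges, a, b):
    """True if event b is reachable from event a via program-order or message edges."""
    stack, seen = [a], {a}
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

edges = {
    "A:send(m1)": ["B:recv(m1)"],   # message edge
    "B:recv(m1)": ["B:send(m2)"],   # program order inside actor B
    "B:send(m2)": ["C:recv(m2)"],   # message edge
    "D:send(m3)": ["C:recv(m3)"],   # message edge from an unrelated actor
}

causal = happened_before(edges, "A:send(m1)", "C:recv(m2)")
concurrent = (not happened_before(edges, "D:send(m3)", "C:recv(m2)")
              and not happened_before(edges, "C:recv(m2)", "D:send(m3)"))
```

Events that are not ordered in either direction, such as the send of `m3` and the receive of `m2` here, are concurrent. Pairs like these are the ones a developer would inspect when looking for a message protocol violation.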
REME-D is an online debugger for distributed communicating event-loop programs written in AmbientTalk. REME-D provides message-oriented debugging techniques such as state inspection, in which the developer can inspect an actor’s mailbox and objects while the actor is suspended. It also supports a catalog of breakpoints, which can be set on asynchronous and future-type messages sent between actors. Like Causeway, REME-D allows inspecting the history of messages that were sent and received when an actor is suspended, also known as causal link browsing. Therefore, we consider the debugging techniques provided in REME-D to be helpful for detecting message order violations. In addition, inspecting the state of an actor can facilitate debugging lack of progress issues such as behavioral deadlocks and livelocks.
Kómpos is an online debugger for SOMns. For debugging actor-based programs, Kómpos provides a wide set of message-oriented breakpoints and stepping operations. For example, Kómpos breakpoints allow developers to inspect the program state before a message is sent or after the message is received, but before it is processed on the receiver side. Moreover, it is possible to pause the program execution before a promise is resolved with a value or before the first statement of a callback to that promise is executed, i.e. once the promise has been resolved. Breakpoints that pause on the first and last statement of methods activated by an asynchronous message send can also be set. Stepping operations can be triggered from the mentioned breakpoint locations. Furthermore, one can continue the actor’s execution and pause in the next turn or pause before the execution of the first statement of a callback registered to a promise. This set of debugging operations gives actor developers more flexible tools to deal with lack of progress issues such as behavioral deadlocks and livelocks. In addition, a specific actor visualization is offered that shows actor turns and message sends. This can be useful when trying to identify the root cause of a message protocol violation.
Erlang also has an online debugger (Debugger, Ericsson AB, access date: 14 February 2017, http://erlang.org/doc/apps/debugger/debugger_chapter.html) that supports line, conditional, and function breakpoints. The Erlang processes can be inspected from a list, and for each process a view with its current state as well as its current location in the code can be opened, which allows one to inspect and interact with each process independently. It also supports stepping through processes and inspecting their state. We consider that process inspection information could help in finding both message protocol violations and lack of progress issues.
The ScalaIDE also includes facilities for debugging actor-based programs (Asynchronous Debugger, ScalaIDE, access date: 14 February 2017, http://scala-ide.org/docs/current-user-doc/features/async-debugger/index.html). It is a classic online debugger with support for stepping, line and conditional breakpoints. Furthermore, one can follow a message send and stop in the receiving actor. Additionally, the debugger supports asynchronous stack traces similar to Chrome. We consider these techniques useful for debugging message protocol violations. They can also be used to identify behavioral deadlocks and livelocks when inspecting the state of the receiving actor.
The recently proposed Actoverse debugger enables reverse debugging of Akka programs written in Scala. It uses snapshots of the state of actors to enable back-in-time debugging in a postmortem mode. Furthermore, Actoverse provides message-oriented breakpoints and a message timeline that visualizes the messages exchanged by actors similar to a sequence diagram. The authors aim to ease finding the cause of message protocol violations in Akka programs.
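The snapshot idea behind back-in-time debugging can be illustrated with a minimal sketch. It assumes that an actor's state is copied before every message is processed, so earlier states can be inspected after the fact. This is not how Actoverse is implemented: `RecordingActor`, `state_before` and the bank-account example are hypothetical names chosen for illustration.

```python
import copy


class RecordingActor:
    """Toy actor that snapshots its state before each message (hypothetical)."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.history = []  # list of (message, state before processing it)

    def process(self, message, handler):
        # Take a deep copy so later mutations do not alter the snapshot.
        self.history.append((message, copy.deepcopy(self.state)))
        handler(self.state, message)

    def state_before(self, index):
        """Return the recorded state just before the index-th message."""
        return copy.deepcopy(self.history[index][1])


def deposit(state, message):
    state["balance"] += message["amount"]


account = RecordingActor({"balance": 0})
for amount in (10, -30, 5):
    account.process({"amount": amount}, deposit)

# Postmortem: scan the recorded history for the first message
# that drove the balance below zero.
culprit = next(
    i for i, (message, before) in enumerate(account.history)
    if before["balance"] + message["amount"] < 0
)
```

Here `culprit` is the index of the second message, and `account.state_before(culprit)` returns the state the actor had just before that message was processed. Snapshotting the full state on every message is the simplest possible strategy; it trades memory for the ability to inspect any earlier turn.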
This section discusses mechanisms and approaches to visualize actor-based systems for debugging. Some of the techniques represent the actor communication flow with Petri nets. Other techniques detail an actor’s state, its mailbox, and the traces of causal messages that are sent and received.
Miriyala et al. proposed the use of predicate transition nets for visualizing actor execution. Based on the classic model of actors, the proposal focuses on the representation of the actor behavior and sent messages. The activation of each transition in the Petri net corresponds to a behavior execution. The authors emphasize that the order of net transitions should be represented in the same order as the execution of messages of the actor system. The main idea is that the user interacts with a visual editor for building the execution of an actor system in the Petri net.
Coscas et al. present a similar approach in which predicate transition nets are used to simulate actor execution in a step-by-step mode. When a user fires a specific transition, he or she only observes a small part of the whole net. The approach also detects messages that do not match the ones expected by the actor, i.e. messages that do not match the actor’s interface.
The Causeway debugger also visualizes the program’s execution based on views for process order, message order, stack, and source code. The process order view shows all messages executed for each actor in chronological order, e.g. a parent item with asynchronous message sends. The message order view shows the causal messages for a message sent, i.e. other messages that were executed before the message was sent and that provoked the send of the message we want to debug. In this view it is also possible to distinguish processes by color, which helps users to see when a message flow (known as activation order) corresponds to a different process. The stack view shows a partial causality of messages. It is considered partial because the call chain shown in the stack only visualizes the messages that have been executed; it does not show the other possible messages that can cause the invocation of a message (known as happened-before relation). The source code view shows the code location where the message was sent. Because all views are synchronized, it is possible to navigate through the messages related to the execution of the actor’s behavior that led to the bug.
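The causal relation that such a message order view presents can be sketched as a walk over a message log. The sketch below assumes a hypothetical log format in which every message records the identifier of the message whose processing caused it to be sent. It does not reflect Causeway's actual trace format, and `causal_chain` and the field names are illustrative only.

```python
def causal_chain(log, message_id):
    """Follow 'cause' links from message_id back to the root message.

    Returns the chain ordered from the root cause to the given message.
    """
    by_id = {entry["id"]: entry for entry in log}
    chain = []
    current = message_id
    while current is not None:
        entry = by_id[current]
        chain.append(entry["id"])
        current = entry["cause"]
    return list(reversed(chain))


# Hypothetical log: each message names the message that caused its send.
log = [
    {"id": "m1", "actor": "client", "selector": "start", "cause": None},
    {"id": "m2", "actor": "server", "selector": "lookup", "cause": "m1"},
    {"id": "m3", "actor": "logger", "selector": "log", "cause": "m1"},
    {"id": "m4", "actor": "client", "selector": "reply", "cause": "m2"},
]

chain = causal_chain(log, "m4")
```

For `m4` the chain contains `m1`, `m2` and `m4` but not `m3`, since the logging message was caused by `m1` yet did not contribute to the send of `m4`.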
Gonzalez Boix et al. show the actor state in their REME-D debugger. The actor view shows the messages in the actor’s mailbox that are going to be executed. It also shows the state of the actor and its objects. This view allows the user to interact with the objects and messages of the inspected actor. One of the main advantages of this online debugger is the possibility of pausing and resuming the actor’s execution.
Recently, Beschastnikh et al. developed ShiViz, a visualization tool with which developers can visualize logs of distributed applications. The mechanism is based on representing happens-before relationships of messages through interactive time-space diagrams. The tool also offers search fields with which messages can be searched in the diagram using keywords. Additionally, it is possible to find ordering patterns, which could help to identify wrong behaviors in an execution.
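A happens-before relation of this kind is commonly derived from vector clocks attached to logged events. The following is a minimal sketch under that assumption. The functions `happens_before` and `concurrent` and the dictionary-based clock representation are illustrative and are not taken from ShiViz.

```python
def happens_before(a, b):
    """True if the event with vector clock a happened before the one with b.

    Clocks are dicts mapping a process name to a counter; a missing
    entry is treated as 0.
    """
    keys = set(a) | set(b)
    all_le = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    some_lt = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return all_le and some_lt


def concurrent(a, b):
    """True if neither event happened before the other."""
    keys = set(a) | set(b)
    identical = all(a.get(k, 0) == b.get(k, 0) for k in keys)
    return (not identical
            and not happens_before(a, b)
            and not happens_before(b, a))


# Process A sends a message; process B receives it; C acts independently.
send_event = {"A": 1}
receive_event = {"A": 1, "B": 1}
unrelated_event = {"C": 1}

ordered = happens_before(send_event, receive_event)
independent = concurrent(receive_event, unrelated_event)
```

In this example the send is ordered before the receive, while the event on process C is concurrent with both. Pairs of concurrent events are the ones whose relative order may differ between executions, which makes them candidates for message order violations.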
6 Conclusion and Future Work
To enable research on debugging support for actor-based programs, we proposed a taxonomy of concurrency bugs for actor-based programs. Although the actor model avoids data races and deadlocks by design, it is still possible to have lack of progress issues and message-level race conditions in actor-based programs.
Our literature review shows that actor-based programs exhibit a range of different issues depending on the specific actor model variant. In languages like Erlang and Scala, programs can suffer from communication deadlocks because the actor implementation uses blocking operations. In languages that implement the event-loop concurrency model this issue cannot occur. However, they can suffer from other lack of progress issues such as behavioral deadlocks and livelocks. Behavioral deadlocks and livelocks are hard to identify because actors are not blocked, but still do not make any progress. Both lack of progress issues can be seen in all variants of the actor model. Message order violations, bad message interleavings, and memory inconsistencies are race conditions that can also occur in programs written in any of the variants of the actor model.
Most work on identifying concurrency bugs is done in the fields of static analysis and testing. Current techniques are effective for some specific cases, but often they are not yet general and do not necessarily scale to the complexity of modern systems. Debugging support for actor languages currently provides features such as message-oriented breakpoints, inspecting the history of messages together with recording their causal relations, and support for asynchronous stack traces. However, better tools are needed to identify the cause of complex concurrency bugs.
6.0.1 Future work.
For future work, there seems to be an opportunity for debuggers that combine strategies such as recording the causality of messages with message-oriented breakpoints and rich stepping. Today, few debuggers support a full set of breakpoints that, for example, allows one to step through messages on both the sender and the receiver side. Of the debuggers investigated in Section 5.3, only Kómpos allows us to set breakpoints on promises to inspect the computed value before it is used to resolve the promise. We argue that flexible breakpoints that adjust to the needs of actor-based programs are needed. For instance, a breakpoint set on the sender side of a message will suspend an actor’s execution before the message is sent. This can be useful when debugging lack of progress issues such as livelocks and behavioral deadlocks because the developer will be able to see whether the message has the correct values. Ideally, a debugger not only allows us to inspect the turn flow, but also combines message stepping with the possibility of seeing the sequential operations that the actor executes inside of a turn. This gives developers better ways to identify the root cause of a bug.
Currently, only a few debuggers allow developers to track the causality of messages. However, we consider this an important debugging technique. Recording the causal relationships of messages can help to diagnose, e.g., message protocol violations. Back-in-time debugging techniques could be of great benefit for this. They are often used for postmortem debugging, because they allow developers to identify message order violations.
Moreover, visualization techniques could be explored to give developers a better understanding of the debugging information. To offer better visual support for actor systems, a visualization could combine information about the actor’s state and its objects, the order of execution of messages including the happens-before relation between them, and stack information. Such a combination should give the user a better comprehension of the program that is debugged. Nevertheless, further research is needed on tooling for identifying complex concurrency bugs. For example, a visualization is needed to distinguish between the stepping of messages that are exchanged by actors and stepping through the sequential code of each actor. Ideally, a visualization could also highlight, based on the source code, that certain messages are independent of each other, because there is no direct ordering relationship between them.
This research is funded by a collaboration grant of the Austrian Science Fund (FWF) with the project I2491-N31 and the Research Foundation Flanders (FWO Belgium).
Appendix: Table 3 Catalog of Bugs Found in Actor-based Programs
|Bug Type||Id||Bug Pattern||Observable Behavior||Source Reporting the Bug||Language|
|Message order violation||bug-1||incorrect execution order of two processes when registering a name for a pid in the Process Registry||runtime exception||Fig. 1 in ||Erlang|
|Memory inconsistency||bug-2||insert and write in tables of Erlang Term Storage with public access||inconsistency of values in the tables||Fig. 2 in ||Erlang|
|Memory inconsistency||bug-3||insert and write in tables (dirty operations in Mnesia database)||inconsistency of values in the tables||Fig. 2 in ||Erlang|
|Communication deadlock||bug-4||receive statement with no messages||process in waiting state due to an orphan message||Fig. 1 in ||Erlang|
|Memory inconsistency||bug-5||testing insert operations in parallel (Mnesia database)||exception or inconsistent return values||Sec. 5 of ||Erlang|
|Memory inconsistency||bug-6||testing open_file in parallel with other operations of dets API (Mnesia database)||inconsistency when visualizing the table’s contents||Sec. 5 of ||Erlang|
|Memory inconsistency||bug-7||open, close and reopen the file, besides running three processes in parallel (Mnesia database)||integrity checking failed due to premature_eof error||Sec. 5 of ||Erlang|
|Memory inconsistency||bug-8||changes in the dets server state||integrity checking failed (Mnesia database)||Sec. 5 of ||Erlang|
|Communication deadlock||bug-9||receive statement with no messages||process in waiting state due to an orphan message (server waits for ping requests)||Program 2 and Test code 2 in ||Erlang|
|Communication deadlock||bug-10||message sent to a finished process, the finished process exits without replying||process blocks due to an orphan message||Test code 5 in ||Erlang|
|Message order violation||bug-11||spawned process that terminates before its Pid is registered by the parent process||process will crash and exit abnormally due to an orphan message||Fig. 1 in ||Erlang|
|Bad message interleaving||bug-12||actor executes a third message between two consecutive messages||inconsistent values of variables||Fig. 2 in ||Actor-Foundry|
|Message order violation||bug-13||incorrect order of execution of two message receives||the program throws an exception because of a null value||Listing 1 in ||Scala|
-  Abbaspour, S., Sundmark, D., Eldh, S., Hansson, H., Afzal, W.: 10 years of research on debugging concurrent and multicore software: a systematic mapping study. Software Quality Journal pp. 1–34 (2016)
-  Abbaspour, S., Sundmark, D., Eldh, S., Hansson, H., Enoiu, E.P.: A study of concurrency bugs in an open source software. In: IFIP International Conference on Open Source Systems. pp. 16–31. Springer (2016)
-  Agha, G.: Actors: A model of concurrent computation in distributed systems. Ph.D. thesis, MIT, Artificial Intelligence Laboratory (Jun 1985)
-  Armstrong, J., Virding, R., Wikström, C., Williams, M.: Concurrent Programming in ERLANG. Prentice Hall (1993)
-  Aronis, S., Sagonas, K.: The shared-memory interferences of erlang/otp built-ins. In: Chechina, N., Fritchie, S.L. (eds.) Erlang Workshop. pp. 43–54. ACM (2017), http://dblp.uni-trier.de/db/conf/erlang/erlang2017.html#AronisS17
-  Artho, C., Havelund, K., Biere, A.: High-level data races. Softw. Test., Verif. Reliab. 13(4), 207–227 (2003), http://dblp.uni-trier.de/db/journals/stvr/stvr13.html#ArthoHB03
-  Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secur. Comput. 1(1), 11–33 (Jan 2004)
-  Beschastnikh, I., Wang, P., Brun, Y., Ernst, M.D.: Debugging distributed systems. Commun. ACM 59(8), 32–37 (Jul 2016)
-  Bracha, G., von der Ahé, P., Bykov, V., Kashai, Y., Maddox, W., Miranda, E.: Modules as Objects in Newspeak. In: ECOOP 2010 – Object-Oriented Programming, Lecture Notes in Computer Science, vol. 6183, pp. 405–428. Springer (2010)
-  Brito, M., Felizardo, K.R., Souza, P., Souza, S.: Concurrent software testing: A systematic review. On testing software and systems: Short papers p. 79 (2010)
-  Cassar, I., Francalanza, A.: On Synchronous and Asynchronous Monitor Instrumentation for Actor-based Systems. In: Proceedings 13th International Workshop on Foundations of Coordination Languages and Self-Adaptive Systems. pp. 54–68. FOCLASA 2014 (September 2014)
-  Christakis, M., Gotovos, A., Sagonas, K.: Systematic testing for detecting concurrency errors in erlang programs. In: Software Testing, Verification and Validation (ICST), 2013 IEEE Sixth International Conference on. pp. 154–163. IEEE (2013)
-  Christakis, M., Sagonas, K.: Static Detection of Race Conditions in Erlang. pp. 119–133. PADL 2010 (January 2010)
-  Christakis, M., Sagonas, K.: Detection of Asynchronous Message Passing Errors Using Static Analysis. In: Rocha, R., Launchbury, J. (eds.) Practical Aspects of Declarative Languages: 13th International Symposium. pp. 5–18. PADL 2011, Springer (January 2011)
-  Christakis, M., Sagonas, K.: Static Detection of Deadlocks in Erlang. Tech. rep. (Jun 2011)
-  Claessen, K., Palka, M., Smallbone, N., Hughes, J., Svensson, H., Arts, T., Wiger, U.: Finding Race Conditions in Erlang with QuickCheck and PULSE. In: Proceedings of the 14th ACM SIGPLAN International Conference on Functional Programming. pp. 149–160. ICFP ’09, ACM (2009)
-  Colaço, J.L., Pantel, M., Sallé, P.: A Set-Constraint-based analysis of Actors, pp. 107–122. Springer (1997)
-  Coscas, P., Fouquier, G., Lanusse, A.: Modelling Actor Programs using Predicate/Transition Nets. In: Proceedings Euromicro Workshop on Parallel and Distributed Processing. pp. 194–200 (Jan 1995)
-  Dagnat, F., Pantel, M.: Static analysis of communications in erlang programs (November 2002), http://rsync.erlang.org/euc/02/dagnat.ps.gz
-  Dam, M., Fredlund, L.Å.: On the Verification of Open Distributed Systems. In: Proceedings of the 1998 ACM Symposium on Applied Computing. pp. 532–540. SAC ’98, ACM (1998)
-  De Koster, J., Van Cutsem, T., De Meuter, W.: 43 years of actors: A taxonomy of actor models and their key properties. In: Proceedings of the 6th International Workshop on Programming Based on Actors, Agents, and Decentralized Control. pp. 31–40. AGERE 2016, ACM (2016)
-  Dedecker, J., Van Cutsem, T., Mostinckx, S., D’Hondt, T., De Meuter, W.: Ambient-oriented programming in ambienttalk. In: European Conference on Object-Oriented Programming. pp. 230–254. Springer (2006)
-  Dijkstra, E.W.: Cooperating sequential processes. In: Genuys, F. (ed.) Programming Languages: NATO Advanced Study Institute, pp. 43–112. Academic Press (1968)
-  D’Osualdo, E., Kochems, J., Ong, C.H.L.: Automatic verification of erlang-style concurrency. In: Logozzo, F., Fähndrich, M. (eds.) 20th International Symposium on Static Analysis. pp. 454–476. SAS 2013, Springer (June 2013)
-  Dragos, I.: Stack Retention in Debuggers For Concurrent Programs (July 2013), http://iulidragos.com/assets/papers/stack-retention.pdf
-  Garoche, P.L., Pantel, M., Thirioux, X.: Static safety for an actor dedicated process calculus by abstract interpretation. In: Gorrieri, R., Wehrheim, H. (eds.) Formal Methods for Open Object-Based Distributed Systems. pp. 78–92. FMOODS 2006, Springer (June 2006)
-  Gonzalez Boix, E., Noguera, C., De Meuter, W.: Distributed debugging for mobile networks. Journal of Systems and Software 90, 76–90 (2014)
-  Gotovos, A., Christakis, M., Sagonas, K.: Test-driven development of concurrent programs using concuerror. In: Proceedings of the 10th ACM SIGPLAN workshop on Erlang. pp. 51–61. ACM (2011)
-  Haller, P., Odersky, M.: Scala Actors: Unifying thread-based and event-based programming. Theoretical Computer Science 410(2-3), 202–220 (Feb 2009)
-  Hewitt, C., Bishop, P., Steiger, R.: A universal modular actor formalism for artificial intelligence. In: Proceedings of the 3rd International Joint Conference on Artificial Intelligence. pp. 235–245. IJCAI’73, Morgan Kaufmann Publishers Inc. (1973)
-  Hong, S., Park, Y., Kim, M.: Detecting Concurrency Errors in Client-Side Java Script Web Applications. In: 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation (ICST). pp. 61–70. IEEE (Mar 2014)
-  Huch, F.: Verification of erlang programs using abstract interpretation and model checking. In: Proceedings of the Fourth ACM SIGPLAN International Conference on Functional Programming. pp. 261–272. ICFP ’99, ACM, New York, NY, USA (1999), http://doi.acm.org/10.1145/317636.317908
-  Hughes, J.M., Bolinder, H.: Testing a database for race conditions with quickcheck. In: Proceedings of the 10th ACM SIGPLAN Workshop on Erlang. pp. 72–77. Erlang ’11, ACM (2011)
-  Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21(7), 558–565 (1978)
-  Lauterburg, S., Dotta, M., Marinov, D., Agha, G.A.: A Framework for State-Space Exploration of Java-Based Actor Programs. In: 2009 IEEE/ACM International Conference on Automated Software Engineering. pp. 468–479 (Nov 2009)
-  Lauterburg, S., Karmani, R.K., Marinov, D., Agha, G.: Basset: A Tool for Systematic Testing of Actor Programs. In: Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp. 363–364. FSE ’10, ACM (2010)
-  Leesatapornwongsa, T., Lukman, J.F., Lu, S., Gunawi, H.S.: Taxdc: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. In: Conte, T., Zhou, Y. (eds.) ASPLOS. pp. 517–530. ACM (2016), http://dblp.uni-trier.de/db/conf/asplos/asplos2016.html#Leesatapornwongsa16
-  Long, Y., Bagherzadeh, M., Lin, E., Upadhyaya, G., Rajan, H.: On ordering problems in message passing software. In: Proceedings of the 15th International Conference on Modularity. pp. 54–65. ACM (2016)
-  Lu, S., Park, S., Seo, E., Zhou, Y.: Learning from mistakes: A comprehensive study on real world concurrency bug characteristics. In: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems. pp. 329–339. ASPLOS XIII, ACM, New York, NY, USA (2008)
-  Marr, S., Torres Lopez, C., Aumayr, D., Gonzalez Boix, E., Mössenböck, H.: A concurrency-agnostic protocol for multi-paradigm concurrent debugging tools. In: Proceedings of the 13th ACM SIGPLAN International Symposium on Dynamic Languages. pp. 3–14. DLS’17, ACM (2017)
-  Miller, M.S., Tribble, E.D., Shapiro, J.: Concurrency among strangers. In: International Symposium on Trustworthy Global Computing. pp. 195–229. Springer (2005)
-  Miriyala, S., Agha, G., Sami, Y.: Visualizing actor programs using predicate transition nets. Journal of Visual Languages & Computing 3(2), 195–220 (1992)
-  Peierls, T., Goetz, B., Bloch, J., Bowbeer, J., Lea, D., Holmes, D.: Java Concurrency in Practice. Addison-Wesley Professional (2005)
-  Petrov, B., Vechev, M., Sridharan, M., Dolby, J.: Race detection for web applications. In: ACM SIGPLAN Notices. vol. 47, pp. 251–262. ACM (2012)
-  Prasad, S.K., Gupta, A., Rosenberg, A.L., Sussman, A., Weems, C.C.: Topics in Parallel and Distributed Computing: Introducing Concurrency in Undergraduate Courses. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edn. (2015)
-  Raychev, V., Vechev, M., Sridharan, M.: Effective race detection for event-driven programs. In: Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications. pp. 151–166. OOPSLA ’13, ACM (2013)
-  Sagonas, K.: Experience from developing the dialyzer: A static analysis tool detecting defects in erlang applications. In: Proceedings of the ACM SIGPLAN Workshop on the Evaluation of Software Defect Detection Tools (2005)
-  Sagonas, K.: Using static analysis to detect type errors and concurrency defects in erlang programs. In: International Symposium on Functional and Logic Programming. pp. 13–18. Springer (2010)
-  Sen, K., Agha, G.: Automated Systematic Testing of Open Distributed Programs. In: Baresi, L., Heckel, R. (eds.) 9th International Conference on Fundamental Approaches to Software Engineering. pp. 339–356. FASE 2006, Springer (2006)
-  Shibanai, K., Watanabe, T.: Actoverse: A reversible debugger for actors (2017)
-  Stanley, T., Close, T., Miller, M.: Causeway: A message-oriented distributed debugger. Tech. rep., HP Labs (Apr 2009)
-  Stiévenart, Q., Nicolay, J., De Meuter, W., De Roover, C.: Mailbox abstractions for static analysis of actor programs (artifact). DARTS 3(2), 11:1–11:2 (2017), http://dblp.uni-trier.de/db/journals/darts/darts3.html#StievenartNMR17
-  Tasharofi, S., Gligoric, M., Marinov, D., Johnson, R.: Setac: A Framework for Phased Deterministic Testing Scala Actor Programs (2011), https://days2011.scala-lang.org/sites/days2011/files/ws1-2-setac.pdf
-  Tasharofi, S., Karmani, R.K., Lauterburg, S., Legay, A., Marinov, D., Agha, G.: TransDPOR: A Novel Dynamic Partial-Order Reduction Technique for Testing Actor Programs. In: Giese, H., Rosu, G. (eds.) Formal Techniques for Distributed Systems: Joint 14th IFIP WG 6.1 International Conference, FMOODS 2012 and 32nd IFIP WG 6.1 International Conference, FORTE 2012, Stockholm, Sweden, June 13-16, 2012. Proceedings. pp. 219–234. Springer (2012)
-  Tasharofi, S., Pradel, M., Lin, Y., Johnson, R.E.: Bita: Coverage-guided, automatic testing of actor programs. In: 2013 28th IEEE/ACM International Conference on Automated Software Engineering. pp. 114–124. ASE’13 (Nov 2013)
-  Tchamgoue, G.M., Kim, K.H., Jun, Y.K.: Testing and debugging concurrency bugs in event-driven programs. International Journal of Advanced Science and Technology 40, 55–68 (2012)
-  Van Cutsem, T., Mostinckx, S., Gonzalez Boix, E., Dedecker, J., De Meuter, W.: Ambienttalk: object-oriented event-driven programming in mobile ad hoc networks. In: Inter. Conf. of the Chilean Computer Science Society (SCCC). pp. 3–12. IEEE Computer Society (2007)
-  Yonezawa, A., Briot, J.P., Shibayama, E.: Object-oriented concurrent programming in abcl/1. In: Conference Proceedings on Object-oriented Programming Systems, Languages and Applications. pp. 258–268. OOPSLA ’86, ACM, New York, NY, USA (1986)
-  Zhang, M., Wu, Y., Chen, K., Zheng, W.: What is wrong with the transmission? a comprehensive study on message passing related bugs. In: ICPP. pp. 410–419. IEEE Computer Society (2015), http://dblp.uni-trier.de/db/conf/icpp/icpp2015.html#ZhangWCZ15
-  Zheng, Y., Bao, T., Zhang, X.: Statically Locating Web Application Bugs Caused by Asynchronous Calls. In: Proceedings of the 20th International Conference on World Wide Web. pp. 805–814. WWW ’11, ACM (2011)