Usability Methods for Designing Programming Languages for Software Engineers

12/10/2019 ∙ by Michael Coblenz, et al. ∙ University of Michigan, Carnegie Mellon University, The University of British Columbia

Programming language design requires making many usability-related design decisions. We explored using user-centered methods to make languages more effective for programmers. However, existing HCI methods expect iteration with appropriate users, who must learn to use the language to be evaluated. These methods were impractical to apply to programming languages: they have high iteration costs, programmers require significant learning time, and user performance has high variance. To address these problems, we adapted HCI methods to reduce iteration and training costs and designed tasks and analyses that mitigated the high variance. We evaluated the methods by using them to design two languages for professional developers. Glacier extends Java to enable programmers to express immutability properties effectively and easily. Obsidian is a language for blockchains that includes verification of critical safety properties. Summative usability studies showed that programmers were able to program effectively in both languages after short training periods.




1. Introduction

Programming languages serve as interfaces through which programmers and software engineers can create software. The ability of these users to achieve their goals, as with other kinds of interfaces, depends on the usability of the languages in which they do their work. For example, the presence of null in Java results in a particular kind of error-proneness, since programmers can easily accidentally write code that dereferences null (Hoare, 2009). These kinds of mistakes persist in spite of the training and experience that professional software engineers have.
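As a concrete illustration of the kind of mistake the paper refers to, the following Java fragment compiles without complaint even though it can dereference null at run time. The class and method names here are hypothetical examples of ours, not code from the paper's studies:

```java
import java.util.HashMap;
import java.util.Map;

class NullDemo {
    // Map.get returns null when the key is absent, so the unguarded version
    // throws NullPointerException at run time; nothing at compile time warns
    // the programmer that the check is missing.
    static int nameLengthUnsafe(Map<String, String> names, String key) {
        return names.get(key).length(); // NPE if key is absent
    }

    static int nameLength(Map<String, String> names, String key) {
        String name = names.get(key); // may be null; the guard is easy to forget
        return name == null ? 0 : name.length();
    }

    public static void main(String[] args) {
        Map<String, String> names = new HashMap<>();
        names.put("id1", "Ada");
        System.out.println(nameLength(names, "id1")); // 3
        System.out.println(nameLength(names, "id2")); // 0: guarded lookup
    }
}
```

Both versions typecheck; only careful discipline (or a language design that rules out null dereferences) separates them.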

Our overall research question, then, is: How should programming languages be designed so that designers can predict the effects of their design decisions on programmers and software engineers (Stefik and Hanenberg, 2014)? Some authors, such as Stefik and Hanenberg, have focused on using quantitative approaches (Stefik and Hanenberg, 2017). Others have proposed using interdisciplinary methods to design programming languages (Coblenz et al., 2018; Myers et al., 2016) in order to integrate user research into many different stages of the design process.

We were interested in adapting traditional HCI methods to the context of the design of programming languages that target professional software engineers. Our high-level research question can be refined in terms of three research questions:


How can we obtain insights as to what language designs will be natural for programmers, given that we are trying to obtain particular static safety guarantees in the language?


How can we iterate on a particular design to make it more effective for users?


How can we compare two language designs to see which is more effective for users?

We wanted to apply known HCI methods, such as natural programming (Myers et al., 2004), Wizard of Oz (Dahlbäck et al., 1993), interviews, and rapid prototyping. However, we found that the study design process was very challenging due to the nature of programming and the complexity of the design space. These challenges included:


Training::

how could we train participants in a new programming language in a short enough amount of time to make studies practical?


Recruiting::

how could we recruit participants who have sufficient programming skill and whose results would generalize beyond the population of students, despite having limited access to professional software engineers?

High prototyping cost::

how could we conduct user studies on programming languages that have only informal designs and no implementations, since cost of building working prototypes is high?

Variance and external validity::

how would we mitigate high variance, which is typical in programming tasks, without constraining the tasks so much that they were no longer representative of real-world programming tasks?

The problems of variance and external validity were particularly relevant for quantitative studies, which needed to be practical in the context of our university setting. Programming tasks that are not extremely constrained tend to produce results with high variance, making statistical significance hard to obtain. On the other hand, tasks that are highly constrained suffer from low external validity, since real-world programming tasks are typically long and complex.

We developed a collection of study design innovations across a variety of HCI methods that address the above problems, and applied them in the design of two programming languages: Glacier and Obsidian.

  • By adapting the natural programming technique to allow progressive prompting, we were able to obtain both unbiased responses as well as data that were relevant to the particular designs we were considering.

  • By back-porting language design questions to languages with which participants were familiar and by using the Wizard of Oz evaluation technique, we were able to obtain usability insights on incomplete designs, and isolate the design questions of interest from confounding variables.

  • By dividing large tasks into multiple, smaller tasks, and by using pilot studies to set task time limits effectively in quantitative studies, we were able to reduce variance sufficiently to obtain meaningful results in complex programming tasks, which otherwise would have had very high variance.

  • By recruiting participants who were representative of at least some junior-level professional developers, we were able to maximize external validity in our studies while still conducting them practically at a university setting. We were also able to show usability impacts of the language designs under consideration.

  • By developing incremental tutorials with integrated practice opportunities, we were able to teach the languages to participants in a short period of time (for Obsidian, about 90 minutes was typical).

Together, we call our methods PLIERS: Programming Language Iterative Evaluation and Refinement System. We evaluated PLIERS by using it to create two different languages. Then, we observed how the methods helped us create and iterate on the language designs.

Glacier (Coblenz et al., 2017) is an extension to Java that supports transitive class immutability. Although security experts had recommended expressing state in an immutable way whenever possible, it was unclear how programming languages should support immutability. For example, Java includes the final keyword, but because final only restricts assignment to variables and not mutation of referenced state, actually enforcing immutability in Java is very difficult. We found as a result that our lab study participants were unable to successfully express immutability using only final, but they were able to do so with Glacier.

Obsidian is targeted at programming blockchains (Herlihy, 2019), in which a decentralized network of computers maintains system state and executes transactions. Blockchains support deploying smart contracts, which are programs that maintain state. Typically, each deployment is an instance of a class, though in a blockchain context, the keyword contract is used instead of class. In contrast to most of the existing user-centered programming language work, which often focuses on novice or end-user programmers (Kelleher and Pausch, 2005), Obsidian is intended for use by professional software engineers. This presents additional challenges, since we are interested in evaluating how the language will be used in the long term despite being limited in our ability to recruit software engineers to work for extended periods of time in our studies. After our design was complete, however, we found in a summative study that most of the participants were able to complete programming tasks successfully in Obsidian.

In this paper, we describe how we integrated both formative and summative human-centered methods into the design processes for Glacier and Obsidian. The Obsidian work is new in this paper. We have two main contributions:

  1. We show by example how we have adapted several formative study techniques, such as natural programming, Wizard of Oz, rapid prototyping, cognitive dimensions of notations analysis, and interview studies to inform the design of Glacier and Obsidian. We informed the results of these techniques with implications of the theory of programming languages to develop languages that were effective for users and also achieved our safety objectives. We found that our adapted methods were effective when used with particular kinds of study designs, which we describe in §3.

  2. We show how we conducted summative usability studies on new programming languages. By developing ways of teaching the languages efficiently, effectively, and consistently, we were able to conduct usability studies of programmers using novel programming languages.

We summarize recommendations for others who want to use these methods in their own programming language design work. We believe that these recommendations may also be useful in other domains. Some properties of programming distinguish it from many kinds of tasks, but those same properties are shared with some other domains:


Unpredictable problem-solving::

Programming languages exist to facilitate problem-solving. However, problem-solving can be unpredictable (Loksa et al., 2016); in user studies, some participants typically complete tasks almost instantly whereas others can spend hours working and still not finish. This large variance makes running quantitative user studies very challenging.

Range of working styles::

Bergström and Blackwell described a diverse collection of different approaches to programming problems (Bergström and Blackwell, 2016), such as bricolage/tinkering and engineering. These different styles may be used even by different people using the same language, impeding a designer’s attempts to anticipate a user’s strategy or behavior.

High stakes::

Errors when programming can contribute to serious real-world safety problems, e.g., in avionics or health care systems.

For example, CAD tools affect their users’ creative processes (Robertson and Radcliffe, 2009); likewise with process engineering tools (Braunschweig and Gani, 2002) and even drug design tools (Stewart et al., 2006). All of these domains involve expert problem-solving by a variety of different people with high costs of failure.

1.1. Glacier

In order to contextualize the methods we describe in this paper, we now explain the two languages that we used to develop the methods.

In designing Glacier, we sought to show how a language design might support the use of immutability in practical programming languages. Immutability means that objects cannot be changed after they are created. Several organizations recommend the use of immutability to prevent security vulnerabilities in software. For example, Oracle’s Secure Coding Guidelines for Java (Oracle Corp., [n.d.]) and Microsoft’s Framework Design Guidelines (Microsoft Corp., [n.d.]) both recommend using immutability for security reasons. However, we found that there were hundreds of different possible designs for immutability protection in programming languages, and it was unclear which approaches might be usable by programmers and which might actually support programmers’ needs (Coblenz et al., 2016).

To determine a point in the design space that might be useful and effective, we conducted semi-structured interviews with eight software engineers. We combined their input with an analysis of expert recommendations to propose that transitive class immutability might be a useful point in the design space to pursue. Transitive means that the restriction applies not just to a class, but recursively to all of its fields. Class refers to the fact that the restriction applies to all instances of a given class, not to individual objects or references. Immutability means that objects to which the restriction applies cannot have any of their data modified through any reference, as opposed to the restriction only applying to certain references to a given object.
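The gap between Java's `final` and transitive class immutability can be sketched in plain Java. The `Account` class below is a hypothetical example of ours, not code from the Glacier studies: `final` forbids reassigning the field, but nothing prevents mutation of the object the field refers to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: this class looks immutable ('final' field, no setters),
// but it is not, because 'final' only forbids reassigning the field itself.
class Account {
    private final List<String> owners = new ArrayList<>();

    Account(List<String> initialOwners) {
        owners.addAll(initialOwners);
    }

    // Leaks a mutable reference to internal state.
    List<String> getOwners() {
        return owners;
    }
}

class FinalDemo {
    public static void main(String[] args) {
        Account a = new Account(List.of("alice"));
        // Compiles and runs: the "immutable" account's state changes anyway.
        // A transitive class immutability checker would reject this mutation
        // at compile time.
        a.getOwners().add("mallory");
        System.out.println(a.getOwners()); // [alice, mallory]
    }
}
```

Enforcing immutability manually requires defensive copies or unmodifiable wrappers at every boundary, which is exactly the error-prone discipline a transitive class immutability checker is meant to replace.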

Our initial prototype was based on IGJ (Zibin et al., 2007), but we found in a usability study that there were significant usability challenges with this approach. This led us to create Glacier, which we were able to show (a) could be used effectively by our Java study participants to specify immutability, and (b) detected improper-mutation bugs that participants frequently inserted into the codebase when they were using regular Java.

In this paper, we describe how we used qualitative methods to ground our initial design in user data and how we conducted a quantitative summative study to compare Glacier to Java’s final feature.

1.2. Obsidian

Blockchains, which have been proposed for high-stakes applications such as financial transactions, health care (Harvard Business Review, 2017), supply chain management (IBM, 2019), and others (Elsden et al., 2018), are an ideal testbed for a new language design process. The need for a safer language is motivated by the history of security vulnerabilities, through which over $80 million worth of virtual currency has been stolen (Sirer, 2016; Graham, 2017). However, ordinary programmers and software engineers need to be able to write blockchain applications; it does not suffice to assume that the developers will be experts in formal verification or that companies will invest the resources required to formally verify that their programs are correct. Instead, we seek a more lightweight approach that provides additional safety guarantees at low cost to developers.

We established several objectives in our design of Obsidian:

  1. Improve safety by detecting more bugs than current smart contract languages do, preferably at compile time, to prevent deployment of buggy programs.

  2. Maximize usability by ensuring that programmers can complete domain-appropriate programming tasks, ideally with little training.

  3. Advance the science of programming language design by developing user-centered methods that can contribute to a more usable language.

In this paper, we describe techniques we used to triangulate data about individual language changes. Then, we describe the final design of Obsidian and explain the methods that we used to create and evaluate it.

2. Related Work

Newell and Card argued for the use of HCI methods in programming language design in 1985: “Millions for compilers but hardly a penny for understanding human programming language use (Newell and Card, 1985).” Morrisett reiterated this problem in 2009, arguing that a programming language is a medium for communication among humans, but we lack principles for evaluating this aspect of languages (Pierce and Mandelbaum, 2009). Our earlier essay argued for using many different methods in language design (Coblenz et al., 2018). Although that article promoted the use of formative methods (among others), this paper describes the methods in much more detail, giving recommendations for how other designers might use them in their own work. This paper also includes our experiences with Obsidian, including techniques we developed during that work.

A substantial amount of prior work on the usability of programming languages focuses on novices. For example, HANDS (Pane et al., 2002), Helena (Chasins, 2017), and Scratch (Resnick et al., 2009) aimed to make it easier for novices to write programs. HANDS, in particular, introduced the Natural Programming technique, which we leveraged and adapted in this work. Stefik et al. also focused on novices, collecting quantitative data on their error rates (Stefik et al., 2011). Designing languages for novices is substantially different from designing languages for experienced programmers. Languages for novices typically focus on learnability. Languages for professionals commonly include additional complexity, in part resulting from the kinds of safety properties that are beneficial when building real systems. We adapt methods developed for novices to obtain insights from experienced programmers.

Other work has focused on programming tools for end-user programmers, whose primary goal is not to write software but rather to accomplish goals in some particular domain (Ko et al., 2011). For example, Peyton Jones et al. used Cognitive Dimensions (Green and Petre, 1996) and Attention Investment to provide a new kind of user-defined functions in Excel (Jones et al., 2003). Separately, Blackwell and Burnett applied Attention Investment to a research spreadsheet tool, Forms/3 (Blackwell and Burnett, 2002). Our work extends these methods for use with languages targeted at professional programmers and software engineers.

Quantitative methods have been used to compare different programming language designs. For example, Uesbeck et al. investigated the impact of lambdas in C++ (Uesbeck et al., 2016), and Endrikat et al. (Endrikat et al., 2014) looked at static typing. That work is a useful complement to this work, but the focus here is on using low-cost, practical qualitative methods to inform the entire language design process, not just on quantitative summative studies, which can only be used once a design has been implemented.

The HCI literature includes many different language designs as well as other kinds of tools for programmers. For example, Dog/Jabberwocky (Ahmad et al., 2011), Protovis (Bostock and Heer, 2009), Reactive Vega (Satyanarayan et al., 2015), and InterState (Oney et al., 2014) are all languages or APIs that make it easier for programmers to accomplish their goals. Those papers describe only the final designs of those systems and summative usability studies. This paper focuses on methods that can be used during the design process and gives recommendations that are useful in preparing a summative evaluation.

Finally, there is a variety of methodological guidance in SE and HCI that is applicable to studies of programming languages. Ko et al. discussed techniques for doing empirical studies of tools for software engineers (Ko et al., 2015). Buse et al. conducted a systematic literature review, observing increasing use of user evaluations in software engineering research (Buse et al., 2011). Verner et al. gave guidelines for industrial case studies in software engineering research (Verner et al., 2009). Perry et al. gave a tutorial on case study methodology for software engineers at ICSE 2004 (Perry et al., 2004). Likewise, Shneiderman and Plaisant gave recommendations for using case studies for information visualization tools (Shneiderman and Plaisant, 2006).

The technical details of the Obsidian language are described in a separate paper (Coblenz et al., 2019); Glacier is also described in more detail separately (Coblenz et al., 2017). This paper focuses on the user-centered process that we used to design and evaluate the languages.

3. Study Design Challenges and Solutions

Our primary interest is in programmers’ abilities to achieve their goals after they have become proficient in the programming language, not on how easy it is for novices to learn the language. Thus, our evaluation approach requires first teaching people a language and then observing their performance on tasks.

When we initially tried to apply HCI methods in our language design work, we were thwarted by several challenges, described in the introduction: training, recruiting, high prototyping cost, and variance. We also encountered additional challenges related to high prototyping cost, interdependence of features, time management in studies, participant bias toward familiar languages, and unsound proposals by participants. In this section, we describe techniques we used when designing user studies in order to address each challenge.

3.1. Training

Evaluating a programming language requires first teaching the programming language. Many universities offer term-length courses in specific programming languages or techniques; requiring this kind of time commitment would make it extremely difficult to recruit participants, and the teaching effort would likely be considerable. Furthermore, most courses ensure a consistent experience for all students by having all students learn the material in parallel (for example, with one session per topic, where all students participate at the same time). When doing iterative design, however, it is preferable to teach one participant at a time, refining the prototype between participants (Dumas et al., 1999). We were interested in addressing a variant of our training challenge that asks: what would be an effective way to teach a programming language in a consistent way to many participants in sequence?

Initially, we created a textual guide to the new programming language, and asked participants to read it before doing the tasks relevant to each study. The guide was relatively short; it could be read thoroughly in under an hour. Unfortunately, this approach had very significant limitations. Although it was effective for some participants, others only skimmed the material and were then unable to complete the programming tasks. Because the guide was not structured as reference material and it included substantial conceptual information, skimming the guide was insufficient.

We were able to solve the problem with two adaptations: (1) breaking the guide into much smaller pieces; (2) asking participants to answer questions or complete small tasks to ensure they had absorbed the material in each piece. For example, we broke the Obsidian tutorial into ten parts, yet the average participant still completed it in under 90 minutes. We found that we could design brief understanding-check tasks that did not require substantial experimenter intervention (helpful for ensuring consistency). We used a web survey tool (Qualtrics) to guide participants through the tutorial and to ask questions that checked understanding. The tool also offered automatic feedback on participants’ answers to multiple-choice questions. For example, Figure 1 shows a question about a code fragment with the correct answer selected. The relevant language details are explained in §5.1.

contract Money {
  int amount;

  transaction getAmount() returns int {
    return amount;
  }
}

contract Wallet {
  Money@Owned m;

  Wallet@Owned() {
    m = new Money();
  }

  transaction spendMoney() {
    // ...
  }

  transaction receiveMoney(Money@Owned >> Unowned mon) returns Money@Owned {
    Money temp = m;
    m = mon;
    return temp;
  }

  transaction checkMoney() returns Money@Owned {
    return m;
  }
}
Figure 1. One question from the Obsidian tutorial. The question assesses whether the participant has understood that at the ends of transactions, fields must have types that match their declarations, and that returning a variable consumes any ownership in the variable. If a participant submits an incorrect answer, the survey tool informs them of their error so they can fix their misunderstanding.

Although we originally wanted to make the tutorial stand alone so that every participant would have the same experience, we found that to be impractical; participants inevitably had questions about the materials, and forcing them to continue without having their questions answered resulted in them being unable to complete the tasks. However, we found that if an experimenter was available to answer questions, most participants asked only a small number of questions, which could be addressed rapidly. This approach is arguably more similar to a real-world language learning experience, in which learners can search the Internet for answers to their questions, ask friends for help, etc.

In summary, although our initial tutorial was not an effective way of teaching the language, and the final tutorial was not sufficient by itself, dividing the tutorial into small pieces, providing tasks to help participants check and reinforce their understanding, and having an expert who could answer questions allowed most of our participants to learn the needed material in a short period of time.

3.2. Recruiting

Evaluation requires participants who are sufficiently skilled that they can rapidly learn a new programming language and then complete tasks using the new language. This would seem to require lengthy user studies with skilled participants, who can be challenging to recruit and retain for the required period of time. Iterative evaluation requires a large number of participants, since participants who learned an earlier version of the language can no longer provide fresh perspectives on new ideas. Although some user interfaces for experts in other domains require recruiting members of a small population, many of those interfaces are for short-term, focused tasks rather than lengthy problem-solving tasks. Furthermore, although it is typical to conduct studies with students, this relates to our external validity challenge: to what extent do results from students apply to the professional software engineers that are the target of our language?

We found in our work on Glacier and Obsidian that we were able to usefully combine results from different populations. Rather than trying to exclusively obtain professional software engineers, we found that we could design studies that yield meaningful results from students; for other aspects of the research, we recruited limited numbers of professionals. When we wanted to interview software engineers to find out their experiences of using immutability constructs in the Glacier work, we recruited senior-level professional software engineers. However, for the other studies, we made three observations that enabled us to do our studies with various kinds of students.

First, about 41% of professional developers have been programming professionally for less than five years (Stack Overflow, 2019). Many graduate students have some professional experience. For example, students at the Professional Master’s program in Computer Science & Engineering at the University of Washington were reported to have an average of five years of professional experience (University of Washington, 2019). Similarly, the Carnegie Mellon Master of Software Engineering program requires all students to have at least two years of experience (Carnegie Mellon University, 2019). By recruiting from graduate students, we were able to attract a population that is similar to a significant fraction of professional programmers and software engineers.

Second, in usability studies, it is typical to assume that usability problems encountered by even one user may be experienced by many others. Not every usability problem can be addressed without risking introducing new usability problems, but our experience is that many can be. By addressing problems that student participants encounter, we prevent professionals from encountering those problems as well. Of course, some of the problems may not be ones that professionals would encounter, but nonetheless, addressing them may improve learnability, making the system better overall. When changes that would improve the system for the participants might degrade performance for experienced users, then the designer can make an informed tradeoff, potentially addressing the problem in training materials rather than in a design change.

Third, we developed a screening instrument so that we could include only participants with appropriate programming skills. The instrument, which is a web-based survey, takes most participants under ten minutes to complete. The instrument also included more difficult questions; because of the difficulty, we did not use this portion for screening. However, we found that performance on the more difficult portion of the instrument was negatively correlated with time required for one of our programming tasks even in a small, six-participant study.

We found that relatively small incentives were sufficient to motivate students to participate in our studies. For three-hour studies, we offered a $50 Amazon gift card; for four-hour studies, we offered a $75 Amazon gift card. For shorter studies, we paid $10/hour. We recruited professionals from among our personal networks and did not offer them a specific incentive to participate.

3.3. High prototyping cost

Programming language designers are accustomed to creating high-cost implementations, not low-cost prototypes, but traditional HCI methods assume that low-cost prototypes can be created. Traditional ways of evaluating programming languages typically require a compiler or interpreter as well as theoretical work to create a sound design (informally, one in which programs mean what they are supposed to mean and the safety guarantees that the type system claims to provide can actually be provided). If one insists on creating a sound, formal model of the language before evaluating it with users, iteration can require so much time that it is impractical. Furthermore, the cost is increased by the expectation of sophisticated language-dependent tooling in IDEs: syntax highlighting, autocomplete, high-quality error messages, and the like.

Instead, we do not insist on doing this work at the beginning. We outline a potentially sound underlying formalism without proving all the relevant properties. Then, we design a surface language and evaluate it with users so that we can obtain feedback early. In doing so, we accept the risk that the formal system cannot be made sound without invalidating the data we gathered from users, but in practice, we found that usually any mistakes are minor and can be corrected without having to redo the user studies.

Here are some of the techniques that enabled us to obtain insights in user studies of incomplete programming languages:


Back-porting design questions to existing languages::

To study the usability of a design decision in isolation, we start from an existing language with which participants would already be familiar. For example, rather than asking participants in our early formative studies to learn a whole programming language, we told them that we were adding certain features to Java, and then asked them to complete programming tasks in the Java variant. This substantially reduced the training time and allowed us to reason that any confusion was likely related to the new features, since our participants were already familiar with Java.

We also made high-level design decisions that allowed us to attract participants who had relevant background. If we had tried to teach participants a completely novel language paradigm even though the basic assumptions of the language paradigm were not the targets of our research, we would have needed to try to distinguish the relevant mistakes from all the novice-level mistakes that the new programmers would be likely to make.

Wizard of Oz::

Implementing a programming language is expensive. Rather than implementing a full compiler for each language variant we wanted to test, we adapted the Wizard of Oz technique (Dahlbäck et al., 1993). In a classic Wizard of Oz study, an experimenter pretends a system is working by remote-controlling it in order to obtain insights about potential designs without having to actually build the system.

In early Obsidian studies, we gave participants a text editor, documentation, and programming tasks to do. Then, an experimenter provided simulated compiler errors: as a modern IDE would, the experimenter could interject with errors and could provide error messages on request. This allowed efficient iteration on our design ideas, since design changes only required updating the documentation, not a potentially complex implementation. Unlike in a traditional Wizard of Oz study, participants were aware that the feedback was being provided manually, but we observed that this did not present an obstacle to the effectiveness of the technique.

Late in the project, we found that designing and running user studies of low-level features typically required much more time than implementing the features; for those, it made sense to implement the alternatives rather than simulating them. On the other hand, early in the project, many high-level design decisions would have required substantial design and implementation work. Among those, we carefully selected questions for which user input would be the most impactful. A key approach to minimizing the cost of language changes was to re-use training materials across phases of the studies to the extent possible, allowing us to amortize the cost of development across multiple studies. The training materials co-evolved with the implementation and represented a significant investment.

3.4. Interdependence of features

Suppose a comparison between two languages showed that one allowed participants to complete tasks faster or more successfully. If the two languages were very different from each other, it would be unclear which aspects of the new language were actually helpful. For example, a comparison between a particular functional language and a particular object-oriented language would not result in fine-grained, actionable design guidance for a new language. Furthermore, if the study was done in the context of a language that was new to participants, confusion might be due to unfamiliar aspects of the language that are unrelated to the design question of interest.

By using the back-porting approach described above, we isolated particular design questions in the context of an existing language. Although this does not enable us to address very high-level design questions, such as whether the language should be object-oriented or functional, it allowed us to obtain actionable data about particular design decisions.

Of course, it is still the case that the design choices are not orthogonal. To address this, we integrate the results into a new language and use quantitative methods to compare the new language to existing languages in user studies.

3.5. High variance and external validity

The nature of programming is that there is huge variance in performance on tasks among different programmers (Glass, 2001; Sanfilippo, 2017). When asking participants to complete programming tasks to help a designer iterate on a language design, participants frequently get stuck on problems that are not of interest to the designer. For example, in one study, a participant spent significant time writing code to recurse through a data structure, even though code had been provided to do exactly that. Issues involving the details of the data structure were intended to be out of scope for the study. On the other hand, constraining tasks too much may result in artificial tasks that do not represent the complexity of real-world programming problems, which limits the external validity of the studies.

We use three techniques to address these problems. First, we combine the results of different kinds of studies (triangulation). Qualitative studies of varied tasks with varied participants, in which timing is not an important dependent variable, can identify usability problems, and an experimenter can guide participants away from problems that are not intended to be part of the study. Quantitative studies typically involve fairly constrained tasks, but we can hope to obtain statistical significance in a comparison between two different designs. Finally, although this paper does not focus on our case study work, we also use case studies to address questions of expressiveness: elucidating what happens when the language is used to solve a larger programming problem, which cannot be completed in a single-session user study.

Second, particularly in quantitative studies (in which the experimenter cannot provide any guidance), we give several independent tasks rather than one long task. Figure 2 illustrates the benefit. Part (a) shows a long task. Although all participants start in the same configuration, as the task continues, the configurations diverge. Part (b) shows how dividing the task into three subtasks, each of which has a standard starting configuration, reduces the variance at the end of each subtask. Then, the tasks are analyzed separately, although of course the performance on the tasks is not independent because the same participant completed all three tasks. Furthermore, dividing tasks into subtasks enables separate analysis of complete vs. incomplete tasks.

Third, recruiting from a constrained population reduces the impact of uninteresting noise. The primary technique to use is a screening survey, which participants must complete before being selected to participate in the study. This allows the experimenter to ensure that programmers have sufficient programming skills and knowledge. Of course, one must be careful to avoid screening out participants that may, in fact, be representative of the population to which the results should generalize.

3.6. Time management

As a practical matter, one must keep each participant’s commitment brief in order to recruit and retain enough participants and to minimize study cost. However, the experiment designer needs to allow enough time for most participants to finish the given programming tasks (at least in some of the experimental conditions). To address this tension, we conducted enough pilot studies (when preparing for a quantitative study, we typically found that five or so pilots sufficed) to estimate the range of times that most participants would spend on each task. We then allotted enough time that when a participant did not finish, the experimenter usually believed that the participant would not have completed the task even given substantial additional time. This belief was grounded in observing the difficulties participants faced at the end of the time window: sometimes the problem was a design choice by the participant that made the task much more challenging than anticipated; other times we believe it was a lack of programming skill, since we observed some participants making basic programming errors. Of course, it is difficult to generalize about participants who did not finish tasks we expected them to have time for, but this approach resulted in studies that were practical to run and that yielded useful results.

The choice of study pre-screening method introduces a tradeoff. A lax pre-screening procedure makes it easier to obtain enough participants from a population that generalizes to a broader community. A strict pre-screening procedure that admits only the most expert participants may reduce times as well as variance, but may make it difficult to recruit participants and harder to generalize the findings. In university settings, with many novices, we advise erring on the stricter end of the spectrum, since most real practitioners will be more skilled than most students.

Rather than give fixed limits for each task in advance, we aimed to maximize effective use of participants’ time. When participants had additional time remaining in their commitment (for example, in one study, we told participants that the study would take four hours), we could let the participants spend longer than budgeted on the later tasks if their earlier tasks took less time than expected. Then, when reporting results, we could consider what the success rate would have been if everyone had had only the time available of the participant with the minimum time window for that task. In addition, we could report which participants succeeded given the additional time. This allowed us to make the best use of our participants’ time while maintaining experimental validity.

3.7. Bias toward familiar languages

In a user study of a new programming language in which the participants are experienced programmers, one might expect that the language that performs “best” might be one with which participants are already familiar. Furthermore, when asked to join in participatory design exercises, perhaps participants might be likely to guide the design toward languages with which they are already familiar.

We used two techniques to address this problem. First, to find out what approaches might be easily learnable and would make immediate sense to participants, we adapted the natural programming elicitation technique (Myers et al., 2004), in which participants are given blank paper or a text editor and asked to write programs without being given a specific language to use. As a form of participatory design, the goal is to elicit the ways participants would naturally express the ideas in question. Although traditional natural programming studies give the programmer no training at all, we took a staged approach: first, we asked participants to write programs on a blank screen with no training; then, we told them about the language design and asked them to do additional programming tasks with the new (but still underspecified) design. For example, we gave participants a state transition diagram and asked them to write a program that expresses the state transitions. By scaffolding the participants’ work in stages, we were able both to answer questions about participants’ initial expectations and to identify which approaches might be most natural given our preliminary language design assumptions.

Second, in most of the studies, we constrained the participants’ work according to our design ideas. Because the languages were designed to provide particular formal safety guarantees, we were interested in the impact of the language features related to those properties. One might expect that the additional language complexity might make the tasks harder to complete because they require that the program be written so that the compiler can prove the program has certain safety properties. We were interested in whether participants could complete tasks in the language even though they were obtaining stronger safety guarantees. By basing our work on behavior rather than preferences, participants’ prior experience was not an obstacle to overcome, but instead a tool we could leverage in teaching participants our language.

To encourage innovative responses (rather than ones that merely reflected prior training), we used natural programming for situations in which commonly-used languages could not directly represent the requirements we gave participants and for low-level syntactic choices (e.g. keyword selection). We also instructed participants explicitly to be creative and not write in any particular existing language. Finally, we were careful to interpret the results in the context of participants’ prior knowledge. For example, when participants use curly braces to denote blocks, the content of the blocks may be interesting even though the choice of curly braces is not.

3.8. Unsound proposals by participants

Another common limitation of natural programming is that participants lack expertise in language design, resulting in unsound proposals. This problem occurs with participatory design in other domains as well, and the usual solution is to use participant ideas as input to an expert-led design process (Pernice and Whintenton, 2017), which applies here as well. We found it helpful to present participants with several options rather than expecting them to compose designs from scratch. We asked participants to complete tasks using the various options so that we could observe their behavior and come to an informed conclusion about which of the options were best, rather than merely asking participants for their opinions. This allowed us to focus the process on designs that would fulfill the technical requirements while still obtaining relevant design insights.

Figure 2. The horizontal axis represents time; the vertical axis represents a dependent variable measured in a study. Part (a) shows how the variance increases over time. Shading shows how frequently a particular point in the space might be reached over many participants. In part (b), the task has been divided into three subtasks to reduce the variance in each subtask.

3.9. Summary

Table 1 summarizes the approaches we have found effective when designing user studies of programming languages.

Challenge Approaches

Training:

  • Include knowledge assessments and practice problems in guide

  • Divide guide into small pieces

  • Experimenter answers questions during training phase of study

  • Automatically provide feedback for wrong answers

Recruitment:

  • Recruit master’s students, who frequently have professional experience that may be representative of many practitioners

  • Recruit professionals, but only when their expertise is needed

  • Leverage professionals’ altruism for recruiting (they may not be incentivized by typical study budgets)

  • Screen participants carefully; set a high bar for student participation

  • Evaluate language design research questions in the context of a language with which many possible participants are familiar

High prototyping cost:

  • Back-port language design questions to existing languages (also helps isolate effects of independent variables)

  • Use Wizard of Oz to simulate tools that do not exist yet: use a plain text editor rather than a real IDE, and have an experimenter provide feedback in lieu of a real compiler or interpreter

Interdependence of features:

  • Isolate design questions by back-porting them to a familiar language

  • Mitigate non-orthogonality risk with summative studies

Variance and external validity:

  • Triangulate with multiple study types

  • Break tasks into subtasks

  • Recruit from populations with sufficient programming skills and knowledge; pre-screen participants

Time management:

  • Pilot repeatedly to assess how long tasks usually take

  • Set cutoff times so that most people will succeed at most tasks

  • Allow participants extra time when possible, then report these successes separately from the “within time limit” results

Bias toward familiar languages:

  • Staged natural programming approach: sequentially expose additional constraints for participants

  • Request that participants do tasks using specific language designs that are being evaluated

Unsound proposals:

  • Provide sound alternatives and ask participants to use them

  • Give expert feedback on design ideas

Table 1. Approaches we used to apply HCI methods to programming language design studies

4. Usability studies for Glacier

4.1. Formative studies

We used the Cognitive Dimensions of Notations framework (Green and Petre, 1996) to reason about some of the design choices. For example, including features that provided weaker guarantees than programmers actually needed could be error-prone if those features could be easily confused with stronger ones. Likewise, the inverse is error-prone too: applying a weaker specification than the strongest one that actually holds can lead to undesirable tradeoffs. For example, if an interface is annotated as returning a read-only object (one that could still be mutated through other references), the programmer might add locks to ensure safety in a concurrent context. But if the object is actually immutable (that is, no reference can be used to mutate it), then the locks are unnecessary and reduce performance.
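The read-only/immutable distinction can be seen in standard Java. In this sketch (our own illustration, not from the study materials), a read-only view made with Collections.unmodifiableList still changes when the backing list is mutated, whereas a copy made with List.copyOf cannot change through any reference:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReadOnlyVsImmutable {
    public static void main(String[] args) {
        List<String> backing = new ArrayList<>();
        backing.add("a");

        // A read-only *reference*: mutation through 'view' throws, but...
        List<String> view = Collections.unmodifiableList(backing);

        // ...the underlying object can still change through another reference.
        backing.add("b");
        System.out.println(view.size()); // prints 2: the "read-only" list changed

        // A truly immutable *object*: no reference anywhere can mutate it.
        List<String> frozen = List.copyOf(backing);
        backing.add("c");
        System.out.println(frozen.size()); // prints 2: unaffected by the later mutation
    }
}
```

Only the latter guarantee makes defensive locking unnecessary.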

Although the Cognitive Dimensions analysis was lightweight, it did not answer some of our higher-level design questions. In order to narrow the space of possible language designs, we conducted semi-structured interviews with eight software engineers who were working on large software projects at several organizations. Our participants had an average of fifteen years of experience, with a minimum of seven years, and had worked on projects with millions of lines of code and hundreds of people.

In order to both obtain unbiased data on problems with mutability in general as well as to obtain feedback on concrete language designs, we carefully ordered the interview questions. First we asked general questions, such as “How do you make sure that state in running programs remains valid?” We got wide-ranging answers, including ones such as “We’ve essentially done away with mutability to avoid security and concurrency problems” as well as recommendations for regular use of testing and assertions. Afterward, we asked about existing language features, such as const and final and their use. Then we asked about specific related areas, including concurrency and security. Finally, we asked about our own language design ideas, including immutable classes.

Our interview participants said that bugs in which state changes when it is not supposed to are frequent. They also described how the language features they had available did not provide guarantees that were sufficient for their purposes. For example, when reusing existing code, participants could not typically tell whether the code was thread-safe, so they had to assume that it was not. If a component came with an appropriate compiler-checked immutability specification, then they could be confident of safety, but languages did not provide such a feature. We concluded that transitive immutability provided the strong safety properties that our interview participants requested: a transitively immutable object can be shared safely among threads without locks.
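The gap participants described is visible with Java's final, which freezes only the variable binding, not the state reachable from it. A minimal sketch (class and field names are our own):

```java
public class ShallowFinal {
    // 'final' prevents reassigning the field; it says nothing about the
    // mutability of the array object the field points to.
    private final int[] totals = {0, 0, 0};

    public void addSale(int register, int amount) {
        totals[register] += amount; // compiles: the array itself is mutable
    }

    public int total(int register) {
        return totals[register];
    }

    public static void main(String[] args) {
        ShallowFinal s = new ShallowFinal();
        s.addSale(1, 10);
        System.out.println(s.total(1)); // prints 10, despite the final field
    }
}
```

Because final is shallow, it cannot serve as the compiler-checked, transitive immutability specification that participants wanted for safe sharing across threads.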

An interesting observation that came out of the interview studies is that typically, for a given class, either all instances are mutable or all instances are immutable. In contrast, some prior systems, such as IGJ (Zibin et al., 2007), supported immutability at the object level of granularity (object immutability). We evaluated an initial prototype, IGJ-T, that extended IGJ with transitivity. We found that participants had great difficulty managing the complexity, which was in part because IGJ’s syntax focused on object immutability, not class immutability. We reasoned that if we designed our system to support class immutability only, our system would be simpler and therefore likely easier to use without sacrificing much expressiveness. This motivated our new tool, Glacier, which was centered around transitive class immutability.

Triangulation was a key aspect of the design process. In addition to the above user-centered methods, we also considered the theory of programming languages, which we used to ensure that our language would provide the guarantees that our design intended to achieve.

4.2. Summative studies

In addition to doing two case studies to evaluate expressiveness, we conducted a lab study to answer two research questions relating to our comparison question in §1:

  1. Can participants express immutability more successfully in Glacier than with Java’s final keyword?

  2. Without Glacier (using only standard Java), are programmers likely to accidentally insert the kinds of bugs that Glacier detects?

We recruited 20 Java programmers. We randomly assigned participants to use either Glacier or final, and we gave participants a tutorial in their given tool (two pages for Glacier, three pages for final). In addition, we gave the final participants a page from Effective Java (Bloch, 2008) explaining how to safely enforce immutability with final. Then, to address the first question, we asked them to change one class in each of two small projects so that those classes were immutable. None of the participants in the final condition were able to do their task successfully because it was too easy to forget to do one of the changes required, such as copying mutable inputs to constructors. Of the 20 Glacier tasks attempted, participants completed 19 correctly.
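Forgetting to copy a mutable constructor argument is exactly the kind of omission that made the final tasks fail. A hedged sketch of the bug pattern (LeakyRoster and its fields are our own names, not the study's classes):

```java
import java.util.ArrayList;
import java.util.List;

// Intended to be immutable: the field is final and there are no setters.
final class LeakyRoster {
    private final List<String> names;

    LeakyRoster(List<String> names) {
        // Bug: keeps the caller's list. A correct version would copy it,
        // e.g., this.names = List.copyOf(names);
        this.names = names;
    }

    int size() { return names.size(); }
}

public class DefensiveCopyDemo {
    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        input.add("Ada");
        LeakyRoster roster = new LeakyRoster(input);
        input.add("Grace"); // mutates the supposedly immutable roster
        System.out.println(roster.size()); // prints 2
    }
}
```

A checker like Glacier rejects the buggy constructor, whereas with final alone the programmer must remember this rule at every constructor.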

To address the second question, we asked our participants to do two programming tasks (FileRequest.execute() and HashBucket.put()) on two small immutable classes. Although we did not verbally tell them that the classes were immutable, the classes were adapted from real-world code, and the participants had just completed the tasks above pertaining to immutability. In the Glacier condition, each task was completed successfully by all seven participants who finished it; of course, no one accidentally mutated immutable state because Glacier disallowed it. In the final condition, however, only four of the eight participants who finished the first task completed it successfully, and only three of the ten participants who finished the second task completed it successfully. These results are summarized in Table 2.

                                               final    Glacier
Correctly enforced immutability in Person       0/10     10/10
Correctly enforced immutability in Accounts     0/10      9/10
FileRequest.execute() tasks without
  security vulnerabilities                       4/8       7/7
HashBucket.put() tasks without bugs             3/10       7/7

Table 2. Summary of summative study results for Glacier

These results seem surprising: although we tried to design the experiment to be as unbiased as possible, the programming tasks were actually biased toward the control condition (final), in that participants had just been trained to consider immutability. One would expect, then, that in a real-world scenario, programmers might perform even more poorly. The success of this study teaches us some lessons about study design:

Errors are frequent::

Programming is so difficult that participants are likely to make errors very frequently, consistent with the variance challenge. Some of these errors will be ones that the experiment designer was hoping to observe, but many of them will be irrelevant. To mitigate this, ensure that participants are given enough time to correct their mistakes and actually finish tasks. Any task can be made difficult enough that participants will not finish it within a given amount of time, so it is imperative to pilot studies to identify an appropriate amount of time to allocate. A corollary, however, is that it is not difficult to run a study in which at least some participants make a particular error of interest.

Training may have limited effectiveness::

In the Glacier study, participants in the final condition were unable to correctly follow the advice we gave them on using final, despite having both documentation and a relevant page from a textbook. This is an example of the training challenge. Likewise, in the second part of the study, they frequently failed to identify that the class they were working on was immutable, despite having just spent time studying immutability. This leads to two lessons. First, attempts to change programmer behavior with only training materials, without actually modifying the tools programmers use, may have limited effectiveness. Second, in retrospect, considering our observations teaching Obsidian (§6), the training might have been more effective if we had required participants to do exercises with the new knowledge rather than assuming that they could read documentation and follow directions.

Bias toward control condition may be acceptable::

The effect of tool-based interventions can be so dramatic that it is much better to potentially bias the study toward the control condition than it is to introduce threats to validity or make the study harder to execute. For example, we might have seen even more errors in the second part of the experiment if we had not previously trained the participants in immutability, but that would have required either getting a second set of participants or changing the task order. If we had conducted the programming tasks first, then those tasks could have served to bias the other set of tasks. Because recruiting participants is challenging (recruitment challenge), we opted to do both the immutability-specification and the immutable-class-programming tasks with one set of participants.

5. Formative studies for Obsidian

5.1. Obsidian Language Design

Detecting bugs was our initial objective, so we examined known bugs, such as the DAO hack (Daian, 2016), which resulted from a reentrant invocation: a contract allowed itself to be invoked while in an inconsistent state. We also analyzed characteristics of proposed blockchain applications. In general, we observed that proposed blockchain applications typically maintain high-level state, which governs which operations are safe.

For example, a Casino can accept bet invocations only before the Game has been played. Similarly, the authors of Solidity observed that many contracts implement state machines (Ethereum Foundation, 2017). Unfortunately, in Solidity, users must define states via enumerated types and then manually ensure that methods are only invoked when the target object is in an appropriate state. Although methods that can only be invoked in particular states are common (Beckman et al., 2011), writing programs that only invoke methods when appropriate has been shown to be hard for users (Sunshine et al., 2015), and Solidity includes no mechanism to ensure safety.
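The enum-plus-runtime-check pattern looks roughly like the following Java sketch (the Casino API is hypothetical; Solidity contracts take the same shape with an enum and require statements):

```java
public class Casino {
    enum State { BETTING_OPEN, GAME_PLAYED }

    private State state = State.BETTING_OPEN;
    private int pot = 0;

    public void bet(int amount) {
        // The programmer must remember to write this guard in every
        // state-dependent method; the compiler verifies nothing about
        // whether callers respect the protocol.
        if (state != State.BETTING_OPEN) {
            throw new IllegalStateException("bets are closed");
        }
        pot += amount;
    }

    public void playGame() {
        state = State.GAME_PLAYED;
    }

    public int pot() { return pot; }
}
```

A forgotten guard, or a caller that ignores the protocol, is detected only at run time, if at all.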

Smart contracts commonly manipulate assets, which are objects that have value (such as cryptocurrencies). In Solidity, it is possible to lose track of money and other assets (Delmolino et al., 2015), resulting in their value being permanently irretrievable. We were interested in designing a language in which many kinds of loss of assets could be detected by the compiler.

In order to leverage those observations, we became interested in a typestate-oriented approach (Aldrich et al., 2009), in which states of objects are incorporated into types. For example, rather than merely having a LightSwitch type, we can have LightSwitch@On be the type of a reference to an object that is in the On state. Then, if the user attempts an invalid operation, such as turning on a switch that is already on, the compiler can issue an error.

Typestate-based types are in a class called linear types. Unlike traditional types, linear types can change as operations are performed. For example, invoking turnOff() on a reference of type LightSwitch@On changes the type of the reference to LightSwitch@Off. Conveniently, linear types are also what are needed to ensure that assets are never lost. Obsidian includes owned objects: for each owned object, there is an object that owns it via an owning reference. If a local variable that owns an asset goes out of scope, the compiler emits an error message. Fields that own assets can only exist in contracts that are themselves assets. This way, each asset always has an owner.
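Java cannot express typestate directly, but the LightSwitch idea can be approximated by giving each state its own class, so that invalid operations fail to compile. What Java cannot do, and what linearity adds, is invalidating the old reference after a transition. A sketch of this encoding (our own illustration, not Obsidian's implementation):

```java
// One class per state: operations exist only on the states that allow them,
// so invalid invocations are compile-time errors.
class LightSwitchOn {
    LightSwitchOff turnOff() { return new LightSwitchOff(); }
}

class LightSwitchOff {
    LightSwitchOn turnOn() { return new LightSwitchOn(); }
}

public class TypestateSketch {
    public static void main(String[] args) {
        LightSwitchOn on = new LightSwitchOff().turnOn();
        // on.turnOn(); // would not compile: LightSwitchOn has no turnOn()
        LightSwitchOff off = on.turnOff();
        // Java's limitation: nothing invalidates the stale 'on' reference,
        // so this still type-checks even though the switch is conceptually off.
        on.turnOff();
        System.out.println(off != null); // prints true
    }
}
```

With linear types, the use of the stale reference in the last invocation would be rejected, which is the same mechanism that prevents assets from being silently duplicated or lost.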

We selected an object-oriented approach, which is well-suited to representing state and state updates. We avoided inheritance because of the fragility that can result (Mikhajlov and Sekerinski, 1998). A full description of the language cannot fit in this paper; for that, please refer to the formal language specification in the supplement. Instead, Figure 3 shows some of the key features of Obsidian using the example of a tiny vending machine (TVM). TVM is a main contract, so it can be deployed independently to a blockchain. A TVM has a very small inventory: just one candy bar. It is either Full, with one candy bar in inventory, or Empty. Clients may invoke buy on a vending machine that is in the Full state, passing a Coin as payment. When buy is invoked, the caller must initially own the Coin; after buy returns, the caller no longer owns it. buy returns a Candy to the caller, which the caller then owns. After buy returns, the vending machine is in the Empty state.

// TVM is a Tiny Vending Machine.
main asset contract TVM {
  Coins @ Owned coinBin;

  state Full {
    Candy @ Owned inventory;
  }

  // No candy if the machine is empty.
  state Empty;

  TVM() {
    // Start with no coins, and go to the Empty state.
    coinBin = new Coins();
    ->Empty;
  }

  // restock transitions from Empty to Full by taking ownership of candy.
  transaction restock(TVM @ Empty >> Full this,
                      Candy @ Owned >> Unowned candy)
  {
    ->Full(inventory = candy);
  }

  // buy transitions from Full to Empty by taking ownership of a coin.
  // buy returns the purchased candy.
  transaction buy(TVM @ Full >> Empty this,
                  Coin @ Owned >> Unowned coin)
                  returns Candy @ Owned
  {
    coinBin.deposit(coin);
    Candy result = inventory;
    ->Empty;
    return result;
  }

  // withdraw removes any accumulated coins and returns them to the caller.
  transaction withdraw() returns Coins @ Owned
  {
    Coins result = coinBin;
    coinBin = new Coins();
    return result;
  }
}
Figure 3. A tiny vending machine that shows key features of Obsidian.

In this section, we describe studies that helped us identify a suitable design and iterate on our initial design ideas for Obsidian. For each study, we identify our research questions, methodology, and results. We started by assuming that we would use typestate to achieve the desired safety guarantees but that expressing typestate in a usable way would require substantial iteration with users. The latter assumption was based on past work on typestate systems, such as Plural (Bierhoff and Aldrich, 2008) and Plaid (Sunshine et al., 2011), which researchers had found were difficult for users to use. All of the studies were approved by our IRB. Because we needed skilled programmers, we recruited from appropriate academic programs, by posting flyers, and by contacting our acquaintances. Except where noted below, we paid participants $10/hour for participating. Materials used in the studies can be found in the supplement.

Although Fig. 3 shows the final version of the language, the formative studies were conducted earlier and therefore used code from earlier versions of the language. This lets the reader see how we changed the language as a result of the user studies. For example, Figure 5 shows different approaches that we considered for declaring local variables.

5.2. Basic design of typestate

In order to minimize assumptions regarding how Obsidian should best represent typestate, we conducted a natural programming study, which was described by Barnaby et al. (Barnaby et al., 2017). This is an example of the naturalness research question in §1. Here, we summarize the part of the study that investigated whether states are a natural way of approaching the challenges that arise in blockchain programming, and which of several syntaxes for representing these features is most understandable by programmers. The study investigated the lexical relationships between states and transactions available in those states. Experimenters obtained a convenience sample of seven participants. Each participant was given a description of a program to implement and one hour to complete the implementation. In initial tasks, participants implemented the program using pseudocode; in later tasks they were given a brief tutorial about the current version of Obsidian and an Obsidian program to complete.

Only two participants invented syntax denoting states and state transitions; the rest used a conventional approach, such as an enumerated type. However, many of the approaches the remaining five participants used were unsafe, helping to justify using typestate to improve safety. Although most of this study focused on participant behavior, we took the opportunity to also ask participants for their syntactic preferences. Five participants preferred a syntax where all the actions of a state must be lexically encapsulated in that state, as in the first alternative in Figure 4. Likewise, P4 felt it should not be permitted to call transactions from one state while lexically in another state: “I’m calling S1’s transaction from code for Start.”

Transaction lexically nested inside state declaration:

1   contract Wallet {
2       state Empty;
3       state Full {
4           int balance;
5           // spend() is nested inside the declaration of the state it belongs to.
6           transaction spend(Wallet@Full this) {
7               // use 'balance'...
8           }
9       }
10  }

Transaction lexically outside state declaration:

1   contract Wallet {
2       state Empty;
3       state Full {
4           int balance;
5       }
6       // spend() is not nested inside Full, even though it can only be called in Full state.
7       transaction spend(Wallet@Full this) {
8           // use 'balance'...
9       }
10  }

Transitions when inside a state could be confusing:

1   contract Wallet {
2       state Empty {
3           transaction fill(int amount) {
4               ->Full(balance = amount);
5               // Now balance should be in scope, since the new state is Full.
6           }
7       }
8       state Full {
9           int balance;
10          // ...
11      }
12  }
Figure 4. Although participants preferred to have transactions nested inside state declarations (the first alternative), this desire conflicted with the need for transactions to reference only fields that are in lexical scope.

This conflicted with the need for transactions that could be executed in several different states and the fact that code could run after transitions had executed. This conflict exemplifies the interdependence of features challenge. For example, in the third example in Figure 4, line 5 may reference balance, even though line 5 is lexically enclosed in the Empty state, in which balance is not in scope. We addressed the problem by requiring that transactions are lexically outside of state declarations, like the second example in Figure 4. Future IDE tools could show all transactions that are possible for an object in a given state, even though their declarations are lexically outside that state’s declaration.
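For contrast, the conventional alternative that most participants reached for, an enumerated state field, defers all state checking to run time. A minimal Java sketch of the Wallet from Figure 4 (our illustration, not taken from the study materials):

```java
// A Wallet tracking its state with an enum, as most study participants did.
// Unlike Obsidian's typestate, a wrong-state call compiles without complaint
// and fails only when the program runs.
class EnumWallet {
    enum State { EMPTY, FULL }

    private State state = State.EMPTY;
    private int balance;

    void fill(int amount) {
        if (state != State.EMPTY) throw new IllegalStateException("already full");
        balance = amount;
        state = State.FULL;
    }

    int spend() {
        // This check happens at run time; Obsidian would reject a bad call
        // at compile time via the Wallet@Full receiver type.
        if (state != State.FULL) throw new IllegalStateException("empty wallet");
        int spent = balance;
        balance = 0;
        state = State.EMPTY;
        return spent;
    }
}
```

Here, spending from an empty wallet raises an exception only during execution, which is exactly the kind of latent error typestate moves to compile time.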

5.3. Fields in states

States in contracts can have different sets of fields, so transitioning can cause some fields to exit scope and others to enter scope. For example, in Fig. 3, the Full state has the inventory field, but the Empty state has no fields. This study used natural programming and code understanding methods to investigate how users specify cleanup of old fields and initialization of new fields when invoking state transitions.

We recruited four participants, which was enough to provide substantial and useful feedback. All were Ph.D. students studying software engineering. They had an average of seven years of programming experience (ranging from three to fifteen years) and an average of 1.5 years of Java experience.

In Part 1, we gave participants a state transition diagram for a Wallet object, which could hold a license and money, and which had four states corresponding to the possible combinations of contents. Participants were also given code partially implementing the Wallet, with several TODO comments asking participants to invent code to add money to the Wallet, remove money from the Wallet, etc. Participants were told that the money and license should be thought of as assets, so they could not be duplicated, used more than once, or lost. The code they were given was in a language similar to Obsidian but which used some keywords that would be more familiar to a Java programmer, such as class instead of contract. As such, this was a staged natural programming study, since we progressively gave participants more detail about the language we were designing.

All four participants prepared assets for a state transition before making the state transition (corresponding to option (2) in Part 2 below, S::x = a1; ->S). Two participants felt they needed to write code to handle failures during the asset preparation stage, which might otherwise lead to an improperly initialized state upon transition. One of them suggested a try-catch style wrapper for the asset preparation and transition phases.

In Parts 2 through 4 of the study, participants were given several options. Then they were asked to implement each of the options within a given partially-implemented transaction. Finally, they were asked for their preferences.

Part 2 compared approaches for initializing fields in states during transitions. Options were:

  1. Assets are assigned to fields in the transition, e.g. ->S(x = a1) assigns the value of a1 to field x of state S.

  2. Assets are assigned to fields before the transition, e.g. S::x = a1; ->S.

  3. Assets are assigned to fields before the transition, but the fields are in local scope even though the state has not changed yet, e.g. x = a1; ->S.

  4. Assets are assigned to fields after the transition, e.g. ->S; x = a1.

The participants successfully used all the approaches, but most preferred assigning assets to fields before the transition with destination-state scoping (option 2). Before the study, Obsidian supported only atomic assignment (option 1, shown in Figure 3 on line 22). The results of Parts 1 and 2 motivated a language change: Obsidian now also supports option 2.

Part 3 presented two options for handling assets when transitioning from a state with an asset to a state without it:

  1. The transition evaluates to a collection containing the old assets, e.g. x = ->S indicates that x is assigned the leftover assets after the transition to state S. If the current state is unknown statically, the contents of the collection are determined dynamically.

  2. The transition evaluates to a tuple, e.g. (x = a1) = ->S indicates that x will be assigned the asset a1 which is not present in state S.

There was consistent confusion about which leftover assets would be assigned to option 1’s collection after a transition. All participants understood the need for both options in certain cases, but said they would choose the tuple (option 2) for more control and explicitness when either approach was acceptable. We would like to implement the tuple approach in the future but have not yet prioritized it, since the existing approach (described in Part 4, option 1), which requires that ownership of assets be surrendered before transitioning, has been effective for participants.

Part 4 focused on releasing assets owned by state fields when transitioning to states in which those fields do not exist. In contrast to part 3, this approach added the option of releasing assets before the transition. The choices were:

  1. Assets must be released before the transition, e.g. release(a1); ->S.

  2. The transition evaluates to a tuple of assets that are no longer owned, e.g. a1 = ->S.

All the participants understood the options and implemented them without mistakes. Because implementing option 2 (evaluating to a tuple) enables both approaches, participants were asked to indicate scenarios in which one option would be preferred over the other. The participants consistently indicated that assets should be released before a transition if they are no longer needed; otherwise, the transition should evaluate to a tuple. This helped us prioritize our features, since releasing assets before the transition seemed to suffice.

5.4. Permissions: a qualitative study

Soundly enforcing typestate requires knowledge about all references to an object, which is afforded by a permission system (Bierhoff and Aldrich, 2007). Permission systems allow the programmer to express what a particular reference can be used for (and therefore also what it cannot be used for). Is there a permission system that users can understand and use effectively (a question of naturalness)? If so, what can we learn from users about how to design it (a question of iteration)? In this work, we conducted the first studies (of which we are aware) in which people other than the designers of the system were asked to use a permission system to restrict references in a programming language. We found that our initial system design was surprisingly difficult to use, and we iterated on the design until it was more successful.

In order to study permissions while mitigating the interdependency of features, training, and recruiting challenges, we extracted the permission system from Obsidian and re-cast it in Java as a set of annotations. We conducted a Wizard of Oz study in which participants received documentation on a Java extension and the experimenter gave simulated compiler error messages. This approach minimized training time for participants, minimized implementation cost for ourselves, and allowed us to isolate this design decision from the many others that would have otherwise distinguished the language from Java. At this point in the development of Obsidian, we assumed that it would be best to separate the notions of permissions and typestate; this approach was reflected in the study materials but may surprise a reader who has studied Fig. 3, which reflects the final version of Obsidian, in which the two are combined. The training materials explained the annotations: @Asset, which applied to classes; and @Owned, @Shared (no restrictions, but could not coexist with typestate-specifying references), and @ReadOnlyState (restricting state modification), which applied to references. We recruited six participants (P14–P19), which was enough to provide substantial and useful feedback. They had a mean of six years of programming experience (ranging from three to nine years) and a mean of two years of Java experience.
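To give a flavor of the study artifact, the annotations can be declared as ordinary Java annotations. The sketch below is our reconstruction (the declaration details are assumptions); in the Wizard of Oz setup, checking was simulated by the experimenter, so the annotations carry no enforcement logic of their own:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Reconstructed declarations of the study's annotations; the "compiler"
// checks were simulated by the experimenter, so these carry no logic.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Asset {}          // classes whose instances must not be lost or duplicated

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD})
@interface Owned {}          // the single owning reference to an object

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD})
@interface Shared {}         // unrestricted, but cannot coexist with typestate-specifying refs

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD})
@interface ReadOnlyState {}  // may not modify the referenced object's state

@Asset
class Prescription {}

class Pharmacy {
    // The pharmacy takes ownership of a deposited prescription.
    void deposit(@Owned Prescription prescription) { /* ... */ }
}
```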

The study included five parts. Since our goal was to identify as many usability problems as possible, we revised the design and instructions after each participant. The first three participants were given 1.5 hours to do the first four parts; the last three were given two hours to fit in a fifth part of the study. An experimenter was available to answer questions.

Part 1. To motivate the need for language features to prevent bugs, we gave participants a 163-line Java medical records system and asked the first two participants to find a bug in which a patient could refill the prescription more times than specified. The first participant did not find the bug within 30 minutes; the second did so just as time expired. To conserve time, we gave the other participants five minutes to inspect the code and explained the problem to them.

We conclude that at least some programmers who use traditional languages would have difficulty detecting the kind of bug that Obsidian prevents. This provides further evidence that if users use Obsidian, the compiler will help them detect bugs that otherwise might go undetected.
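The essence of the bug can be reconstructed in plain Java (a simplification of the 163-line study program; the names are ours). Because nothing prevents two pharmacies from holding aliases to the same Prescription, each pharmacy honors the refill limit independently:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified reconstruction of the medical-records bug: each Pharmacy
// independently tracks fills per prescription, so depositing the same
// Prescription at two pharmacies doubles the available refills.
class Prescription {
    final int refillsAllowed;
    Prescription(int refillsAllowed) { this.refillsAllowed = refillsAllowed; }
}

class Pharmacy {
    // Per-pharmacy fill counts: the root of the bug.
    private final Map<Prescription, Integer> fills = new HashMap<>();

    void deposit(Prescription p) { fills.putIfAbsent(p, 0); }

    boolean fill(Prescription p) {
        Integer done = fills.get(p);
        if (done == null || done >= p.refillsAllowed) return false;
        fills.put(p, done + 1);
        return true;
    }
}
```

With ownership, the first deposit would consume the owned Prescription reference, making a second deposit a compile-time error rather than a latent overfill.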

Part 2. We told participants we would prevent the previous bug by distinguishing between two kinds of references. “Considering an object o: Kind #1: There is only one reference of kind #1 to o at a time. Kind #2: There may be many references of kind #2 to o at a time.” We asked participants to propose names for the two kinds of references. Note the careful language avoiding bias toward specific vocabulary. Participants’ name suggestions included:

Kind #1: KeyReference, UniqueReference, Owned, Singleton reference, Resource handle, @default

Kind #2: DuplicateReference, ForeignKeyReference, Borrowed, Flyweight pattern reference, const pointer

The results were too inconsistent to justify any particular choice in the language; all the suggestions were distinct, and some of them were not appropriate in context (unsound proposals challenge). Obsidian uses Owned, which is at least consistent with one suggestion, and Unowned.

Part 3. To evaluate the usability of ownership, we gave participants an ownership tutorial and told them we had chosen [no annotation, @ReadOnly] (first participant) or [@Owned, no annotation] (later participants) as keywords. We asked them to modify the code from Part 1 to fix the bug. We hoped participants would require that Prescriptions deposited in a Pharmacy be owned and that the Pharmacy take ownership; thus, a deposited Prescription could not be deposited in a second Pharmacy. Completion times ranged from 3 minutes to 40 minutes (variance challenge). Two participants did not finish, one of whom we stopped after 38 minutes to prioritize other tasks.

We were surprised that many of the participants found this task very difficult. We expanded the tutorial to include a practice section for later participants. In general, participants were not prepared to use a type system to address a bug that they thought of in a dynamic way. For example, P16 wrote if (@Owned prescription), attempting to indicate a dynamic check of ownership. We asked participants who wanted to use dynamic approaches for enforcement to use the language feature instead. P14 commented “I haven’t seen…types that complex in an actual language …enforced at compile time.”

P17 had trouble guessing what the compiler could know, expecting an interprocedural analysis (which would be non-modular). For example, in a case where an owned object was being consumed twice, P17 expected the compiler to give an error on the second spend invocation. Instead, because the second invocation was inside a helper method, the compiler reported the error on the invocation to the helper method, which took an owned argument and invoked the second spend.

P17, P18, and P19 had difficulty determining which variables should be annotated @Owned. In one case, a lookup method took an object to search for, but P17 specified that it should take an owned reference. Then he was stuck after invoking it: “How can I get the annotation back?” But this was impossible except via adding another method, since he had already given away ownership. Likewise P17 was confused by whether accessors should return owned references. Mistakes could be costly. For example, P19 unnecessarily annotated as @Owned a class that was contained in a collection, which caused a problem iterating through the collection. He made the reference to the current list element @Owned, which would require removing each item from the collection when iterating over it in code that was not supposed to modify the container at all.

Parameter-passing and assignment were common points of confusion. P18 asked what happens when passing an @Owned object to a method with an unowned formal parameter (ownership was not passed in this case). P19 said, “when I [annotate this constructor type @Owned], I’m not sure if I’m making a variable owned or I’m transferring ownership.” P17 was surprised that assignment from an owned reference to an unowned-type variable did not transfer ownership. We later addressed this confusion by making assignment always transfer ownership; participants in later studies were generally not confused about which assignments transfer ownership.

From this portion of the study, we came to two general conclusions. First, the semantics of ownership needed to be as explicit and as simple as possible. This likely generalizes to many different kinds of complex language constructs: implicit behavior, although sometimes convenient for experts, can be baffling to novices. When the behavior can be made explicit without making the language inconvenient for experts, that should be done. Second, language design decisions that have structural implications (as is the case for ownership) require substantial high-level training; we refined the training materials in future studies to give more explanation and examples.
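The rule we converged on, that assignment always transfers ownership, can be modeled with a run-time analogue in Java. This sketch is only an illustration (Obsidian enforces the rule statically, and all names here are ours):

```java
// Runtime analogue of move-on-assignment: "assigning" from an Owned cell
// transfers ownership, so the source cell can no longer be used.
// Obsidian enforces the same rule at compile time.
class Owned<T> {
    private final T value;
    private boolean owned;

    Owned(T value) { this.value = value; this.owned = true; }

    // "Assignment" from this cell: ownership moves to the returned cell.
    Owned<T> transfer() {
        if (!owned) throw new IllegalStateException("ownership already transferred");
        owned = false;
        return new Owned<>(value);
    }

    T get() {
        if (!owned) throw new IllegalStateException("reference is no longer owned");
        return value;
    }
}
```

After a transfer, any use of the stale reference fails immediately, mirroring the compile-time error Obsidian reports when code uses a variable whose ownership has been given away.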

Part 4 introduced the notion of assets. After a tutorial explaining the properties of assets, participants were asked to invent code that could indicate a particular owned reference was intentionally going out of scope. Two participants suggested @Disown and free to abandon owned references; the rest did not have time to answer or had no suggestions. We chose disown for Obsidian, since free has additional memory management connotations that are not relevant here.

Part 5 introduced typestate, starting with the fourth participant. Participants read 2.5 pages on typestate in Obsidian (as it existed then), including @ReadOnlyState, @Shared, and @Borrowed (which was for temporary ownership transfer in invocations). Ownership was the default, so no @Owned was needed. The tutorial also explained available in and ends in, which at the time specified state assumptions and guarantees for methods (before we changed to using this parameters instead, e.g., as on lines 19 and 27 of Figure 3). Then, they were asked to annotate uses of Bond in a 212-line Java program implementing a financial market. They were told to use ownership and state specifications whenever possible.

Consistent with Part 3, some participants were more comfortable with a dynamic perspective on ownership rather than a static one. P18 felt that ends in declarations were redundant with the transition code already in the method implementations, but these declarations allow separation of interface and implementation and modular checking. P19 wanted to use borrowing to represent the notion that the BondMarket owns a Bond, but an Investor borrows it for a while. In fact, borrowing was only appropriate for the duration of a method invocation. We later changed the design of the formal parameter syntax to remove the need for @Borrowed; now, if no ownership change is specified (via the >> operator), ownership remains unchanged.

P19 required significant prompting by the experimenter to make maximum use of typestate. First, P19 added annotations on methods but not on any variables. After prompting, he added dynamic checks in one place but required prompting to add static typestate specifications. This suggests that tools may be needed to help users obtain the most benefits from the language. On the other hand, P18 specified @Asset on Bond without being asked to do so, explaining “because it’s something important and I don’t want to get it out of scope…”

Overall, understanding the limitations of the type system and compiler may be an obstacle for some people. Users will need training to reason about what typestate can do, but the observations above motivated language changes that simplified the design without lowering the expressivity or safety. Tools could mitigate the limitations of traditional type systems by providing sophisticated static analyses rather than taking a traditional type checking approach (as Obsidian does), and by providing detailed, explanatory errors.

5.5. Comparing typestate and ownership approaches

Motivated by the confusion we observed in the prior study (in part a question of naturalness and in part a challenge of training), we invented a new approach: fusing the notions of ownership and typestate in order to simplify the type system. The next study refined this design. The design has the benefit of eliminating Shared references that also specify typestate, which would otherwise have to be disallowed to preserve soundness. Thus, the type Bond@S is always implicitly an owned reference for any state S, and users can write any permission instead of S, as in Bond@Unowned.

We were also interested in another usability concern. Consider Approach 1 in Fig. 5. A reader of line 1 might expect that the type of bond would always be Bond@Offered. In fact, after line 2, the type is Bond@Sold due to the call to buy. A fundamental aspect of Obsidian is that ownership can change, so if a variable declaration includes any ownership information, the variable’s ownership status may later be inconsistent with its declaration.

We initially invented two possible solutions to this problem. The first incorporates types into variable names, shown in Approach 2. The annotations pertain to the current type rather than the new type, so the reader need only look at the most recent operation to infer a variable's new type, rather than potentially reading the whole sequence of operations since the declaration.

Approach 3 represents another idea: adding static assertions. Line 3 shows a static assertion that bond references an object in state Sold, which serves as documentation. Unlike traditional assertions, however, the compiler checks correctness. The intent is to make it easier for programmers to determine the types of variables.

We conducted studies with participants in the first three conditions. Inspired by observations of those participants, we invented approach 4. This approach is like approach 3 except that it removes state specifications from local variable declarations. The removal was not part of the original design but was inspired by early results of this study.

Approach 1: traditional declarations

1  Bond@Offered bond = new Bond();
2  bond.buy();

Approach 2: types in variable names

1  Bond bond@Offered = new Bond();

Approach 3: static assertions

1  Bond@Offered bond = new Bond();
2  bond.buy();
3  [bond@Sold];

Approach 4: no states in local variable declarations

1  Bond bond = new Bond();
Figure 5. Variable declaration approaches
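For readers unfamiliar with checked state assertions, approach 3 can be given a run-time analogue in Java (our illustration; Obsidian verifies such assertions at compile time, with no run-time cost):

```java
// Runtime analogue of approach 3's checked state assertions. In Obsidian,
// an assertion that bond is in state Sold is verified by the compiler;
// here the check happens only when the program runs.
class Bond {
    enum State { OFFERED, SOLD }
    State state = State.OFFERED;

    void buy() { state = State.SOLD; }
}

final class States {
    static void assertState(Bond b, Bond.State expected) {
        if (b.state != expected)
            throw new AssertionError("expected " + expected + " but was " + b.state);
    }
}
```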

5.5.1. Participants

We required that participants be familiar with Java and we administered a simple Java pre-test. We recruited five students (P21–P25). Based on self-reports, they had an average of about four years of Java experience (ranging from one to ten years) and an average of one year of professional (paid) software development experience (ranging from zero to three years).

5.5.2. Procedure

Participants spent between 1 and 1.5 hours on the study. We used a Qualtrics survey to ask participants a series of questions regarding Obsidian programs, but the study took place in a lab and an experimenter was available to answer questions. The survey both taught aspects of the language and provided an opportunity for assessment. Most of the questions were typical code understanding questions, which gave snippets of code and asked whether the compiler would give an error or what the code meant. We assigned participants to one of the four conditions above according to what we hoped to learn from each trial: approach 1 for P22, approach 2 for P21, approach 3 for P23 and P24, and approach 4 for P25.

Figure 6. An example question assessing understanding of ownership transfer. The correct answer is selected, since assignment transfers ownership.

5.5.3. Results and Discussion

P22, who was given approach 1 (with permissions and states specified only in declarations), tried to guess the compiler’s behavior, saying things like “If the compiler was smart…”. For example, P22 expected that the language would infer an implicit @Off in the declaration LightSwitch s1 = new LightSwitch(). P22 also expected that although changes of state were permitted via transactions, state-mismatching assignment to variables would be forbidden, even though approach 1 assumes that states can be inconsistent with type declarations. That combination would have been internally inconsistent, and P22’s confusion suggests that the type-declaration approach is problematic.

Including types in variable names seemed confusing as well. P21 expected that ownership was not passed into method calls even when an owned reference was passed. P21 was also surprised that the absence of an ownership annotation meant there was no ownership, expecting instead that it meant ownership was unknown.

Participants in condition 3 seemed to do much better. For example, although the materials did not use the word assertion, P23 observed that the annotations were assertions. P23 liked the system, commenting “Perfect, I like this, this is very nice. I wish Java had this; it would have saved me a lot of bugs.” As we gained confidence in the value of approach 3, we added material. For P24, we changed assertions to use @ rather than the initial >> so that we could use >> to specify type changes in transaction parameters. With P25, we used ? to indicate lack of static state knowledge. We later simplified the system because the ? notation was ambiguous, leaving the notations Owned, Unowned, Shared, and unions of specific states (separated with |).

P24 was confused because state specifications on local variables were redundant. For example, in LightSwitch@Off s = new LightSwitch(), the @Off portion is redundant because the compiler already knows the state of the new object from the constructor’s declaration. To resolve this, we added approach 4, removing typestate and permission annotations from local variable declarations; in contrast, permissions are always specified for fields and formal parameters. In those cases, the annotations are important because they constrain the types of variables at the beginning and end of transactions.

In summary, this study motivated the removal of state specifications from local variable declarations and provided initial evidence that static assertions are likely to be a convenient way for programmers to specify states and permissions of local variables. We also obtained evidence that with these other changes, static state assertions are understandable by current Java users with little extra training.

5.6. Threats to validity

The studies share common threats to validity, many of which correspond to the external validity challenge: our participants may not be representative of the population of blockchain programmers; we had limited numbers of participants in each trial; and our tasks may not reflect the reality of blockchain programming. We believe, however, that the population of likely language users is more skilled than our participant population, which mostly consisted of students, so if the students are successful in completing tasks, that aspect of the result is likely to generalize. We did not seek to identify all possible usability problems, but rather to identify the most common and severe ones associated with particular design decisions so that we could try to address them. Because there were so many different design decisions, we focused on those for which we had prior evidence that there might be usability problems.

6. Summative usability study of Obsidian

In order to assess whether our changes to Obsidian had resulted in a language in which programmers could be effective, we designed a summative usability study. We gave the participants the complete Obsidian language, including its compiler, and asked them to complete three programming tasks. We were interested in whether the participants experienced the same usability problems as the prior participants and whether serious usability problems remained that would prevent them from completing their tasks. We found that, given enough time, most of the participants were effective at completing the tasks we gave them.

The design of the study was informed by several of our methodological contributions, and thus also served to assess their value. We trained the participants in Obsidian as part of the study; we recruited local students to participate; we used multiple programming tasks, rather than one long one; and we managed task times according to our guidelines. We found that the methodology we describe in this paper was useful in designing and executing the study.

6.1. Participants

We solicited experienced Java programmers to take a short screening test online, which took an average of about 9 minutes to complete. We accepted into the three-hour study only those who answered at least five of six basic Java questions correctly, including one question about aliasing. Of 18 completed surveys, 11 people met our screening criteria. Six people participated (P35–P40); we compensated each with a $50 Amazon gift card. The participants had an average of 9 years of programming experience, 2 years of professional experience, and 2 years of Java experience. One self-identified as female; the rest identified as male.

6.2. Procedure

The previous studies focused on particular aspects of the design, in many cases by giving participants languages that were not precisely Obsidian. To evaluate our final design, we conducted a usability evaluation. Because Obsidian provides stronger safety guarantees than existing languages such as Solidity, and because of our prior experience showing that it would be very challenging to develop a linear type system that would be usable at all, we focused our work on whether people could effectively complete tasks, not on whether people could complete tasks faster than in existing languages. We based our work on pilot studies that we previously conducted (Kambhatla et al., 2019).

The experimenter gave low-level guidance, such as how to invoke the compiler, and also provided assistance that simulated more mature tools. For example, when a participant attempted to debug an error that was reported on line 38 by examining line 33, the experimenter pointed out the discrepancy, since the IDE we provided did not highlight the appropriate line.

After completing the tutorial, which included seven programming exercises, we gave participants starter code for the three main tasks, described below. Although participants used the compiler, they were not given tests or a runtime environment, since the focus of our usability study was the type system (recall that Obsidian is designed to detect as many bugs as possible at compile time, since runtime detection may be too late to ensure safety). Although the first two tasks were short in order to reduce variance, we allowed the third task to be more open-ended to see whether participants would be able to complete a more challenging task.

The first task, Auction, simulated an English auction, in which bids are public, and the bidder who offers the highest price must pay that price for the item. We added the additional constraint that bids were required to come with Money so that bids could be guaranteed to be viable (a bidder could not issue a bid and then fail to pay for the item). As a starter task, we asked participants to finish implementing createBid, requiring them to invoke a constructor. They also needed to finish implementing makeBid, which records a new bid from a client. In makeBid, we were interested in whether they initially wrote code that accidentally lost the previous bid, which held the associated Money (before receiving a compiler error), indicating that Obsidian’s typechecker had helped them avoid losing track of an asset.
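The asset-loss hazard in makeBid can be sketched in Java (our reconstruction, not the Obsidian starter code): overwriting the field that holds the current maximum bid silently discards the Money the old bid carried, unless the previous bidder is refunded first.

```java
import java.util.ArrayList;
import java.util.List;

// Reconstruction of the makeBid hazard: overwriting maxBid without
// refunding it first silently discards the Money held by the old bid.
// Obsidian's typechecker rejects overwriting an owned asset reference.
class Money {
    final int amount;
    Money(int amount) { this.amount = amount; }
}

class Bid {
    final String bidder;
    final Money money;
    Bid(String bidder, Money money) { this.bidder = bidder; this.money = money; }
}

class Auction {
    Bid maxBid;
    final List<Money> refunds = new ArrayList<>();

    // Buggy version: the old bid's Money becomes unreachable.
    void makeBidBuggy(Bid bid) {
        maxBid = bid;
    }

    // Corrected version: refund the previous bidder before overwriting.
    void makeBid(Bid bid) {
        if (maxBid != null) refunds.add(maxBid.money);
        maxBid = bid;
    }
}
```

In Java (and Solidity) the buggy version compiles and runs silently; in Obsidian, the overwrite of the owned maxBid reference is a compile-time error.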

The second task, Prescription, corresponded to the medical records system in the Permissions study section (§5.4); we were interested in whether our improvements enabled participants to reason more effectively about the code than we had observed in the previous studies. We asked participants to fill in the type signature for the consumeRefill and depositPrescription transactions, as well as completing the implementation of fillPrescription.

The Casino task was more open-ended and included directions and requirements for what operations should be supported, as well as low-level starter code, such as implementations of Money and Bet. It asked participants to implement a Casino that takes bets on games. When games are complete, the casino enables winners to collect their winnings. We were primarily interested in participants’ abilities to reason about ownership and typestate and to design architectures that could effectively use ownership.

6.3. Results and Discussion

Results for the tasks are summarized in Table 3. With P38, to assess to what extent the tutorial materials stood alone, the experimenter declined to answer Obsidian-related and debugging-related questions. However, this made the first task perhaps unrealistically difficult and lengthy, resulting in insufficient time for the other tasks.

Task completion times (hours:minutes)

      Tutorial   Auction   Prescription   Casino
P35   1:31       0:13      0:18           1:01
P36   2:12       0:28      N/A            N/A
P37   1:03       0:33      0:46           0:36*
P38   2:18       0:46      N/A            N/A
P39   1:14       0:22      0:27           0:51*
P40   1:11       0:12      0:22           0:58
Table 3. Usability test results. * indicates insufficient time to finish the task. N/A indicates insufficient time to start the task.

In the Auction exercise, two of the six participants accidentally introduced a bug in which an asset was lost: they overwrote maxBid, which held money. The compiler gave an error message and they corrected their mistake, but if they had been using Solidity, its compiler would not have caught the bug. After P36, we slightly simplified the Auction exercise by removing a subtask and refactoring to inline a TODO that had been put in a helper transaction. The above times are adjusted to remove the extra time P35 and P36 spent on the removed task (1 and 8 minutes, respectively).

Some participants seemed to think carefully about ownership and wrote the correct code quickly. Others seemed to focus on satisfying the compiler, and their work took longer. For example, P38 got an error message after overwriting the owned maxBid reference, and “fixed” it with disown. This choice may be a result of weaker programming skills and of the lack of help with the tutorial; P38 took the longest on the tutorial and was surprised not to be given a design diagram for the Auction starter code. We changed the tutorial to emphasize that disown should be used only to intentionally throw away assets.

In the Prescription task, as with the other tasks, variance was large. For example, one reason for P38’s long completion time was that P38 had used Python most recently and, despite the tutorial, sometimes wrote Python-like syntax, which did not parse (one example took four minutes to fix). At the time, we hoped that participants would complete the tasks entirely on their own, but in retrospect, we might have obtained more relevant results by carefully providing appropriate help, as we did for all the other participants.

We were interested in participants’ ability to reason effectively about ownership. All of the participants who started Prescription were able to complete it. P37 encountered some difficulties due to shortcomings in Obsidian’s support for dynamic state tests. Currently, Obsidian does not allow dynamic state tests to be combined with arbitrary Boolean expressions, e.g., if (x in S && e), where e is an arbitrary Boolean expression. Likewise, if (x not is Owned) is not supported; P37’s attempt at this syntax was perhaps inspired by Python’s is operator. In the latter case, P37 developed some intuition: “Ownership doesn’t feel like something I should be using in this way…” and restructured the code to check if (maybeRecord in Full), which was correct. In another case, the compiler found a bug in which the code assumed that a collection must contain an element, a benefit of disallowing null in the language.
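As a sketch, the dynamic state test forms at issue are shown below; maybeRecord and Full are from the task, while x, S, and e are schematic:

```
if (x in S && e) { ... }         // not supported: a state test combined
                                 // with another Boolean expression

if (x not is Owned) { ... }      // not supported: P37's attempted
                                 // (Python-like) negated ownership test

if (maybeRecord in Full) { ... } // supported: P37's restructured,
                                 // correct check
```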

The Casino task was substantially more open-ended than the other tasks and required much more time, but participants who had a full hour for it were able to finish. Some participants defined states in the Casino contract (P35, P39), whereas others relied only on the states in the Game contract (P37, P40). Both approaches led to many dynamic state tests, since the Casino object had to check that the Game object was in an appropriate state. These checks could have been avoided if the different states of Casino had different typestate specifications for their references to the Game, an idea that occurred to P40 in retrospect. This observation suggests an opportunity for a future version of Obsidian in which the states of owning objects are coupled to the states of owned objects, reducing the need for dynamic checks.
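A hypothetical sketch of the coupling P40 envisioned, in which each Casino state constrains the typestate of its Game reference (the state and field names are illustrative, and this feature is not part of current Obsidian):

```
contract Casino {
    state BetsOpen {
        Game@Playing currentGame;   // while bets are open, the Game
                                    // must be in its Playing state
    }
    state Payout {
        Game@Finished currentGame;  // during payout, the Game must be
                                    // Finished, so no dynamic state
                                    // test is needed
    }
}
```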

We noticed that participants who did better on the “advanced Java” portion of our screening test seemed to complete tasks faster: those test scores were negatively correlated with completion times of the Auction task (, ), suggesting that much of the variance (91%) in Auction completion times is explained by prior programming background. The correlation between the scores and the tutorial completion times was not significant, likely an artifact of the small number of participants. We regard these results as tentative because there were minor differences among the trials; we defer a definitive conclusion to a future quantitative study. We hypothesize, then, that participants who have sufficient OOP understanding can learn Obsidian and use the language effectively in only about 90 minutes.

7. Future work

These language design methods have been used by the authors on two different projects. In the future, we would like to show that these methods can be used by many different kinds of people to design many different kinds of languages. One particularly challenging aspect of this is that the theoretical aspects of the design work require substantial background. Perhaps in the future, mechanized tools could help those who are not programming language experts design safe languages in their own application domains.

Another limitation of the methods we describe in this paper is that although one can do studies that assess the usability of particular language design choices, in some cases design choices interact with each other. As a result, it is not clear that designers can combine the results of different studies and expect that the resulting language will be usable. In this work, we mitigate this threat with summative studies, but future method development work may be able to address this problem more directly.

8. Conclusion

PLIERS represents a new approach to designing programming languages for software engineers. PLIERS is exemplified by Glacier and Obsidian, which integrate user-centered techniques into many stages of the language design process. By incorporating feedback from users, we obtained insights that led to a language in which programmers can effectively obtain stronger safety guarantees than prior languages provided. We expect that this approach to language design will be applicable to a wide variety of problem-solving tools.

9. Acknowledgments

We appreciate the help of Eliezer Kanal at the Software Engineering Institute, who helped start this project; Jim Laredo, Rick Hull, Petr Novotny, and Yunhui Zheng at IBM, who provided useful technical and real-world insight; and David Gould and Georgi Panterov at the World Bank, with whom we worked on the insurance case study.

This material is based upon work supported by the National Science Foundation under Grants Grant #3 and Grant #3, by the U.S. Department of Defense, and by Ripple. In addition, the first author is supported by an IBM PhD Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.


  • Ahmad et al. (2011) Salman Ahmad, Alexis Battle, Zahan Malkani, and Sepander Kamvar. 2011. The jabberwocky programming environment for structured social computing. In Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 53–64.
  • Aldrich et al. (2009) Jonathan Aldrich, Joshua Sunshine, Darpan Saini, and Zachary Sparks. 2009. Typestate-oriented Programming. In Proceedings of the 24th ACM SIGPLAN Conference Companion on Object Oriented Programming Systems Languages and Applications (OOPSLA ’09). ACM, New York, NY, USA, 1015–1022.
  • Barnaby et al. (2017) Celeste Barnaby, Michael Coblenz, Tyler Etzel, Eliezer Kanal, Joshua Sunshine, Brad Myers, and Jonathan Aldrich. 2017. A User Study to Inform the Design of the Obsidian Blockchain DSL. In PLATEAU ’17 Workshop on Evaluation and Usability of Programming Languages and Tools.
  • Beckman et al. (2011) Nels E. Beckman, Duri Kim, and Jonathan Aldrich. 2011. An Empirical Study of Object Protocols in the Wild. In Proceedings of the 25th European Conference on Object-oriented Programming (ECOOP’11). Springer-Verlag, Berlin, Heidelberg, 2–26.
  • Bergström and Blackwell (2016) Ilias Bergström and Alan F Blackwell. 2016. The practices of programming. In 2016 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, 190–198.
  • Bierhoff and Aldrich (2007) Kevin Bierhoff and Jonathan Aldrich. 2007. Modular Typestate Checking of Aliased Objects. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications (OOPSLA ’07). ACM, New York, NY, USA, 301–320.
  • Bierhoff and Aldrich (2008) Kevin Bierhoff and Jonathan Aldrich. 2008. PLURAL: Checking Protocol Compliance Under Aliasing. In Companion of the 30th International Conference on Software Engineering (ICSE Companion ’08). ACM, New York, NY, USA, 971–972.
  • Blackwell and Burnett (2002) Alan Blackwell and Margaret Burnett. 2002. Applying Attention Investment to End-User Programming. In Proceedings of the IEEE 2002 Symposia on Human Centric Computing Languages and Environments (HCC ’02). IEEE Computer Society, Washington, DC, USA, 28–.
  • Bloch (2008) Joshua Bloch. 2008. Effective Java, Second Edition. Addison-Wesley.
  • Bostock and Heer (2009) Michael Bostock and Jeffrey Heer. 2009. Protovis: A graphical toolkit for visualization. IEEE transactions on visualization and computer graphics 15, 6 (2009), 1121–1128.
  • Braunschweig and Gani (2002) Bertrand Braunschweig and Rafiqul Gani. 2002. Software architectures and tools for computer aided process engineering. Computer Aided Chemical Engineering, Vol. 11. Elsevier.
  • Buse et al. (2011) Raymond PL Buse, Caitlin Sadowski, and Westley Weimer. 2011. Benefits and barriers of user evaluation in software engineering research. ACM SIGPLAN Notices 46, 10 (2011), 643–656.
  • Chasins (2017) Sarah Chasins. 2017. Helena: Web Automation for End Users.
  • Coblenz et al. (2018) Michael Coblenz, Jonathan Aldrich, Brad A. Myers, and Joshua Sunshine. 2018. Interdisciplinary Programming Language Design. In Onward! 2018 Essays (SPLASH ’18).
  • Coblenz et al. (2017) Michael Coblenz, Whitney Nelson, Jonathan Aldrich, Brad Myers, and Joshua Sunshine. 2017. Glacier: Transitive Class Immutability for Java. In Proceedings of the 39th International Conference on Software Engineering - ICSE ’17.
  • Coblenz et al. (2019) Michael Coblenz, Reed Oei, Tyler Etzel, Paulette Koronkevich, Miles Baker, Yannick Bloem, Brad A. Myers, Joshua Sunshine, and Jonathan Aldrich. 2019. Obsidian: Typestate and Assets for Safer Blockchain Programming. In Submission (2019). arXiv:cs.PL/1909.03523
  • Coblenz et al. (2016) Michael Coblenz, Joshua Sunshine, Jonathan Aldrich, Brad Myers, Sam Weber, and Forrest Shull. 2016. Exploring Language Support for Immutability. In International Conference on Software Engineering.
  • Dahlbäck et al. (1993) Nils Dahlbäck, Arne Jönsson, and Lars Ahrenberg. 1993. Wizard of Oz studies - why and how. Knowledge-based systems 6, 4 (1993), 258–266.
  • Daian (2016) Phil Daian. 2016. Analysis of the DAO exploit. Retrieved August 21, 2018 from
  • Delmolino et al. (2015) Kevin Delmolino, Mitchell Arnett, Ahmed E Kosba, Andrew Miller, and Elaine Shi. 2015. Step by Step Towards Creating a Safe Smart Contract: Lessons and Insights from a Cryptocurrency Lab. IACR Cryptology ePrint Archive 2015 (2015), 460.
  • Dumas and Redish (1999) Joseph S Dumas and Janice Redish. 1999. A practical guide to usability testing. Intellect books.
  • Elsden et al. (2018) Chris Elsden, Arthi Manohar, Jo Briggs, Mike Harding, Chris Speed, and John Vines. 2018. Making Sense of Blockchain Applications: A Typology for HCI. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, USA, Article 458, 14 pages.
  • Endrikat et al. (2014) Stefan Endrikat, Stefan Hanenberg, Romain Robbes, and Andreas Stefik. 2014. How Do API Documentation and Static Typing Affect API Usability?. In International Conference on Software Engineering. ACM, New York, NY, USA, 632–642.
  • Ethereum Foundation (2017) Ethereum Foundation. 2017. Common Patterns. (2017). Retrieved November 6, 2017 from
  • Glass (2001) Robert L Glass. 2001. Frequently forgotten fundamental facts about software engineering. IEEE software 3 (2001), 112–110.
  • Graham (2017) Luke Graham. 2017. $32 million worth of digital currency ether stolen by hackers. Retrieved November 2, 2017 from
  • Green and Petre (1996) Thomas R. G. Green and Marian Petre. 1996. Usability analysis of visual programming environments: a ‘cognitive dimensions’ framework. Journal of Visual Languages & Computing 7, 2 (1996), 131–174.
  • Harvard Business Review (2017) Harvard Business Review. 2017. The Potential for Blockchain to Transform Electronic Health Records. (2017).
  • Herlihy (2019) Maurice Herlihy. 2019. Blockchains from a distributed computing perspective. Commun. ACM 62, 2 (2019), 78–85.
  • Hoare (2009) C. A. R. Hoare. 2009. Null References: The Billion Dollar Mistake.
  • IBM (2019) IBM. 2019. Blockchain for supply chain. Retrieved March 31, 2019 from
  • Jones et al. (2003) Simon Peyton Jones, Alan Blackwell, and Margaret Burnett. 2003. A User-centred Approach to Functions in Excel. In Proceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming (ICFP ’03). ACM, New York, NY, USA, 165–176.
  • Kambhatla et al. (2019) Gauri Kambhatla, Michael Coblenz, Reed Oei, Joshua Sunshine, Brad Myers, and Jonathan Aldrich. 2019. A Pilot Study of the Safety and Usability of the Obsidian Blockchain Programming Language. PLATEAU Workshop (2019).
  • Kelleher and Pausch (2005) Caitlin Kelleher and Randy Pausch. 2005. Lowering the barriers to programming: A taxonomy of programming environments and languages for novice programmers. ACM Computing Surveys (CSUR) 37, 2 (2005), 83–137.
  • Ko et al. (2011) Amy J Ko, Robin Abraham, Laura Beckwith, Alan Blackwell, Margaret Burnett, Martin Erwig, Chris Scaffidi, Joseph Lawrance, Henry Lieberman, Brad Myers, et al. 2011. The state of the art in end-user software engineering. ACM Computing Surveys (CSUR) 43, 3 (2011), 21.
  • Ko et al. (2015) Amy J Ko, Thomas D Latoza, and Margaret M Burnett. 2015. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering 20, 1 (2015), 110–141.
  • Loksa et al. (2016) Dastyni Loksa, Amy J. Ko, Will Jernigan, Alannah Oleson, Christopher J. Mendez, and Margaret M. Burnett. 2016. Programming, Problem Solving, and Self-Awareness: Effects of Explicit Guidance. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). ACM, New York, NY, USA, 1449–1461.
  • Mikhajlov and Sekerinski (1998) Leonid Mikhajlov and Emil Sekerinski. 1998. A Study of The Fragile Base Class Problem. In European Conference on Object-Oriented Programming. Springer-Verlag, London, UK, UK, 355–382.
  • Myers et al. (2016) B. A. Myers, A. J. Ko, T. D. LaToza, and Y. Yoon. 2016. Programmers Are Users Too: Human-Centered Methods for Improving Programming Tools. Computer 49, 7 (July 2016), 44–52.
  • Myers et al. (2004) Brad A. Myers, John F. Pane, and Andy Ko. 2004. Natural Programming Languages and Environments. Commun. ACM 47 (2004), 47–52. Issue 9.
  • Newell and Card (1985) Allen Newell and Stuart K. Card. 1985. The Prospects for Psychological Science in Human-computer Interaction. Hum.-Comput. Interact. 1, 3 (Sept. 1985), 209–242.
  • University of Washington (2019) University of Washington. 2019. Professional Master’s Program.
  • Oney et al. (2014) Stephen Oney, Brad Myers, and Joel Brandt. 2014. InterState: a language and environment for expressing interface behavior. In Proceedings of the 27th annual ACM symposium on User interface software and technology. ACM, 263–272.
  • Pane et al. (2002) John F Pane, Brad A Myers, and Leah B Miller. 2002. Using HCI techniques to design a more usable programming system. In Proceedings IEEE 2002 Symposia on Human Centric Computing Languages and Environments. IEEE, 198–206.
  • Pernice and Whitenton (2017) Kara Pernice and Kathryn Whitenton. 2017. How to Deal With Bad Design Suggestions.
  • Perry et al. (2004) Dewayne E Perry, Susan Elliott Sim, and Steve M Easterbrook. 2004. Case studies for software engineers. In Proceedings. 26th International Conference on Software Engineering. IEEE, 736–738.
  • Pierce and Mandelbaum (2009) Benjamin C. Pierce and Yitzhak Mandelbaum. 2009. PL Grand Challenges.
  • Microsoft Corp. ([n.d.]) Microsoft Corp. [n.d.]. Framework Design Guidelines. ([n. d.]). Accessed Feb. 8, 2016.
  • Oracle Corp. ([n.d.]) Oracle Corp. [n.d.]. Secure Coding Guidelines for the Java SE, version 4.0. ([n. d.]). Accessed Feb. 8, 2016.
  • Resnick et al. (2009) M. Resnick, J. Maloney, A. Monroy-Hernández, N. Rusk, E. Eastmond, K. Brennan, A. Millner, E. Rosenbaum, J. Silver, B. Silverman, et al. 2009. Scratch: Programming for all. Commun. Acm 52, 11 (2009), 60–67.
  • Robertson and Radcliffe (2009) BF Robertson and DF Radcliffe. 2009. Impact of CAD tools on creative problem solving in engineering design. Computer-Aided Design 41, 3 (2009), 136–146.
  • Sanfilippo (2017) Salvatore Sanfilippo. 2017. The mythical 10x programmer. (2017).
  • Satyanarayan et al. (2015) Arvind Satyanarayan, Ryan Russell, Jane Hoffswell, and Jeffrey Heer. 2015. Reactive vega: A streaming dataflow architecture for declarative interactive visualization. IEEE transactions on visualization and computer graphics 22, 1 (2015), 659–668.
  • Shneiderman and Plaisant (2006) Ben Shneiderman and Catherine Plaisant. 2006. Strategies for evaluating information visualization tools: multi-dimensional in-depth long-term case studies. In Proceedings of the 2006 AVI workshop on BEyond time and errors: novel evaluation methods for information visualization. ACM, 1–7.
  • Sirer (2016) Emin Gün Sirer. 2016. Thoughts on The DAO Hack. (2016).
  • Stack Overflow (2019) Stack Overflow. 2019. Developer Survey Results 2019.
  • Stefik and Hanenberg (2014) Andreas Stefik and Stefan Hanenberg. 2014. The Programming Language Wars: Questions and Responsibilities for the Programming Language Community (Onward! 2014). ACM, New York, NY, USA, 283–299.
  • Stefik and Hanenberg (2017) Andreas Stefik and Stefan Hanenberg. 2017. Methodological irregularities in programming-language research. Computer 50, 8 (2017), 60–63.
  • Stefik et al. (2011) Andreas Stefik, Susanna Siebert, Melissa Stefik, and Kim Slattery. 2011. An Empirical Comparison of the Accuracy Rates of Novices Using the Quorum, Perl, and Randomo Programming Languages. In Proceedings of the 3rd ACM SIGPLAN Workshop on Evaluation and Usability of Programming Languages and Tools (PLATEAU ’11). ACM, New York, NY, USA, 3–8.
  • Stewart et al. (2006) Kent D Stewart, Melisa Shiroda, and Craig A James. 2006. Drug Guru: a computer software program for drug design using medicinal chemistry rules. Bioorganic & medicinal chemistry 14, 20 (2006), 7011–7022.
  • Sunshine et al. (2015) Joshua Sunshine, James D. Herbsleb, and Jonathan Aldrich. 2015. Searching the State Space: A Qualitative Study of API Protocol Usability. In Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension (ICPC ’15). IEEE Press, Piscataway, NJ, USA, 82–93.
  • Sunshine et al. (2011) Joshua Sunshine, Karl Naden, Sven Stork, Jonathan Aldrich, and Éric Tanter. 2011. First-class state change in Plaid. In ACM SIGPLAN Notices, Vol. 46. ACM, 713–732.
  • Uesbeck et al. (2016) Phillip Merlin Uesbeck, Andreas Stefik, Stefan Hanenberg, Jan Pedersen, and Patrick Daleiden. 2016. An Empirical Study on the Impact of C++ Lambdas and Programmer Experience. In Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). ACM, New York, NY, USA, 760–771.
  • Carnegie Mellon University (2019) Carnegie Mellon University. 2019. Master’s Programs.
  • Verner et al. (2009) June M Verner, Jennifer Sampson, Vladimir Tosic, NA Abu Bakar, and Barbara A Kitchenham. 2009. Guidelines for industrially-based multiple case studies in software engineering. In 2009 Third International Conference on Research Challenges in Information Science. IEEE, 313–324.
  • Zibin et al. (2007) Yoav Zibin, Alex Potanin, Mahmood Ali, Shay Artzi, Adam Kielun, and Michael D. Ernst. 2007. Object and reference immutability using Java generics. In Foundations of Software Engineering. ACM, 75–84.