The programming languages and software engineering research communities are grappling with a fundamental question: how should programming languages be designed so that designers can predict the effects of their design decisions [LanguageWars]? Some authors, such as Stefik and Hanenberg, have focused on using quantitative approaches [stefik2017methodological]. Others have proposed using interdisciplinary methods to design programming languages [Coblenz18:Interdisciplinary, Myers2016:Programmers] in order to integrate user research into many different stages of the design process.
Programming languages offer a particularly interesting and challenging combination of design criteria:
- Problem-solving domain:
Programming languages exist to facilitate problem-solving. However, problem-solving can be unpredictable [Loksa2016:Programming]; in user studies, some participants typically complete tasks almost instantly, whereas others can spend hours working and still not finish. This large variance makes running quantitative user studies very challenging.
- Range of working styles:
Bergström and Blackwell described a diverse collection of approaches to programming problems [bergstrom2016practices], such as bricolage/tinkering and engineering. Different people may use different styles even with the same language, which impedes a designer’s attempts to anticipate a user’s strategy or behavior.
- High stakes:
Programming languages are a fundamental tool in creating software that maintains critical state. Error-prone languages can contribute to serious real-world safety problems, e.g., in avionics or health care systems.
In order to develop user-centered methods that are applicable to the design of programming languages, we designed a new programming language, which we use as a testbed for our proposed design methodology. Obsidian, our new programming language, is targeted at programming blockchains [herlihy2019blockchains], in which a decentralized network of computers maintains system state and executes transactions. Blockchains support deploying smart contracts, which are programs that maintain state. Typically, each deployment is an instance of a class, though in a blockchain context, the keyword contract is used instead of class. In contrast to most of the existing user-centered programming language work, Obsidian is intended for use by professional software engineers. This presents additional challenges, since we are interested in evaluating how the language will be used in the long term despite being limited in our ability to recruit software engineers to work for extended periods of time.
In this paper, we describe how we integrated both formative and summative human-centered methods into the design process for Obsidian. We have three main contributions:
We describe the design methodology that we used in Obsidian, which can be used in the design of other complex problem-solving tools. Drawing on studies with a total of 40 participants, we show how to adapt Wizard of Oz, natural programming, and formative usability studies to prototypes of languages intended for professional software engineers.
We show how to use summative usability studies on a new programming language. By developing a way to teach the language efficiently, effectively, and consistently, we were able to conduct a usability study in which programmers used a novel programming language.
Obsidian integrates sophisticated features into its type system so that the compiler can detect serious bugs. Although prior languages that included similar features have been complex and difficult to use, we show that our approach enables skilled programmers to be effective at using the language after only a 90-minute tutorial.
Blockchains, which have been proposed for high-stakes applications such as financial transactions, health care [HealthCare], supply chain management [SupplyChain], and others [Elsden18:Making], are an ideal testbed for a new language design process. The need for a safer language is motivated by the history of security vulnerabilities, through which over $80 million worth of virtual currency was stolen [DAO, CNBC]. However, ordinary programmers and software engineers need to be able to write blockchain applications; it does not suffice to assume that the developers will have PhDs in formal verification or that companies will invest the resources required to formally verify that their programs are correct. Instead, we seek a lightweight approach that provides additional safety guarantees at low cost to developers.
We established several objectives in our language design:
Improve safety by detecting more bugs than current smart contract languages do, preferably at compile time to prevent deployment of buggy programs.
Maximize usability by ensuring that programmers can complete domain-appropriate programming tasks, ideally with little training.
Advance the science of programming language design by showing that user-centered methods can contribute to a more usable language.
The last objective motivates an important methodological question. Suppose one invents a new programming language and wants to claim that it is more effective for users than an existing language. First, one must teach users the new language and show that they can do tasks successfully. This could be very challenging in a research study, considering that universities commonly offer full-semester courses on new programming paradigms. It is not practical to recruit participants to work with us for a semester, and even if it were, the approach would be costly and might not be practicable for other researchers. Second, even if one completed this process, it would be unclear which aspects of the new language were actually helpful. For example, some people claim that functional programming languages are better; even if this were true, a comparison between a particular functional language and a particular object-oriented language would not yield fine-grained, actionable guidance for designing a new language.
Instead, our approach is to start with an existing language, make particular changes that we hypothesize will be helpful, and then evaluate those hypotheses. One practical advantage of this approach is that we can start with a language for which we can find expert programmers. Then, we can minimize training time by only teaching the differences. As a scientific advantage, we can relate the behavioral differences we observe from our participants to the isolated changes we made.
Nonetheless, in complex programming languages, design decisions are not necessarily orthogonal. In this paper, we describe techniques we used to triangulate data about individual language changes. We then describe the final design of Obsidian and explain the methods that we used to create and evaluate it.
2 Related Work
A substantial amount of prior work on the usability of programming languages focuses on novices. For example, HANDS [Pane2002:Using], Helena [Chasins2017:Helena], and Scratch [resnick2009scratch] aimed to make it easier for novices to write programs. Stefik et al. also collected quantitative data on the errors that novices make [Stefik:2011:ECA:2089155.2089159]. In contrast, this work focuses on languages targeted at experienced programmers. Coblenz et al. used user-centered design techniques to create Glacier [ICSE2017], which extended Java to support transitive class immutability. However, that was a small extension of an existing language, not an entirely new language.
Other work used quantitative methods to compare different programming language designs. For example, Uesbeck et al. investigated the impact of lambdas in C++ [Uesbeck:2016:ESI:2884781.2884849], and Endrikat et al. [Endrikat:2014:ADS:2568225.2568299] looked at static typing. That work is a useful complement to this work, but the focus here is on using low-cost, practical qualitative methods to inform the entire language design process, not on quantitative summative studies, which can only be used once a design has been implemented.
3 A Summary of the Obsidian Language
Detecting bugs was our initial objective, so we examined past bugs such as the DAO hack [DAO-details], which resulted from a reentrant invocation in which a contract allowed itself to be invoked while in an inconsistent state. We also analyzed characteristics of proposed blockchain applications and observed that they typically maintain high-level state, which governs which operations are safe.
For example, a Casino can accept bet invocations only before the Game has been played. Similarly, the authors of Solidity observed that many contracts implement state machines [Solidity-Patterns]. Unfortunately, in Solidity, users must define states via enumerated types and then manually ensure that methods are invoked only when the target object is in an appropriate state. Although methods that can only be invoked in particular states are common [Beckman:2011:ESO:2032497.2032501], writing programs that invoke methods only when appropriate has been shown to be hard for users [Sunshine:2015:SSS:2820282.2820295], and Solidity provides no mechanism to ensure safety.
Smart contracts commonly manipulate assets, which are objects that have value (such as cryptocurrencies). In Solidity, it is possible to lose track of money and other assets [Delmolino2015:Step], rendering their value permanently irretrievable. We were interested in designing a language in which many kinds of asset loss could be detected by the compiler.
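To make the burden of the manual pattern concrete, the following is a minimal sketch in Python rather than Solidity. The Casino, GameState, bet, and play names are hypothetical illustrations, not code from any real contract. The state lives in an enum field and every method must check it by hand, so a misuse is caught only when the faulty call actually runs, and a forgotten guard is not caught at all.

```python
from enum import Enum, auto


class GameState(Enum):
    BEFORE_PLAY = auto()
    PLAYED = auto()


class Casino:
    """Hypothetical Casino using the manual state-machine pattern: the state
    is an enum field, and each method must check it by hand at run time."""

    def __init__(self) -> None:
        self.state = GameState.BEFORE_PLAY
        self.bets: list[tuple[str, int]] = []

    def bet(self, player: str, amount: int) -> None:
        # Hand-written guard; if a programmer forgets it, late bets are accepted.
        if self.state is not GameState.BEFORE_PLAY:
            raise RuntimeError("bet() is only allowed before the game is played")
        self.bets.append((player, amount))

    def play(self) -> None:
        if self.state is not GameState.BEFORE_PLAY:
            raise RuntimeError("the game has already been played")
        self.state = GameState.PLAYED


casino = Casino()
casino.bet("alice", 10)
casino.play()
try:
    casino.bet("bob", 5)  # misuse is detected only when this line executes
    late_bet_rejected = False
except RuntimeError:
    late_bet_rejected = True
```

A typestate-oriented language aims to move exactly this kind of check from run time to compile time.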
In order to leverage those observations, we became interested in a typestate-oriented approach [Aldrich:2009:TP:1639950.1640073], in which states of objects are incorporated into types. For example, rather than merely having a LightSwitch type, we can have LightSwitch@On be the type of a reference to an object that is in the On state. Then, if the user attempts an invalid operation, such as turning off a switch that is already off, the compiler can issue an error.
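One way to approximate this idea in a conventional language is to give each state its own class, so that an operation is available only on references whose type names the right state. The sketch below is a hypothetical Python analogue, not Obsidian code: LightSwitchOn and LightSwitchOff stand in for LightSwitch@On and LightSwitch@Off, and turn_off and turn_on correspond to the turnOff example. A static checker such as mypy would flag a call to turn_off on a LightSwitchOff reference; at run time, Python raises AttributeError.

```python
class LightSwitchOn:
    """Plays the role of a reference of type LightSwitch@On."""

    def turn_off(self) -> "LightSwitchOff":
        return LightSwitchOff()


class LightSwitchOff:
    """Plays the role of a reference of type LightSwitch@Off."""

    def turn_on(self) -> "LightSwitchOn":
        return LightSwitchOn()


switch = LightSwitchOn()
off = switch.turn_off()  # the result has type LightSwitchOff
# off.turn_off()         # invalid: mypy reports that LightSwitchOff has no attribute
#                        # "turn_off"; running it raises AttributeError
back_on = off.turn_on()
```

Unlike Obsidian, this encoding does not change the type of an existing reference: each transition returns a new object, and nothing prevents continued use of the stale switch reference. Ruling that out requires the linear types described in the following paragraph.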
Typestate-based types are in a class called linear types. Unlike traditional types, linear types can change as operations are performed. For example, invoking turnOff() on a reference of type LightSwitch@On changes the type of the reference to LightSwitch@Off. Conveniently, linear types are also what are needed to ensure assets are never lost. Obsidian includes owned objects: for each owned object, there is an object that owns it via an owning reference. If a local variable that owns an asset goes out of scope, the compiler emits an error message. Fields that own assets can only exist in contracts that are themselves assets. This way, each asset always has an owner.
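Obsidian enforces this rule at compile time. As a rough illustration only, the sketch below checks the same rule dynamically in Python; Owned, move, and Scope are hypothetical names. Moving an asset empties the source reference, and leaving a scope while an owning local still holds an asset raises an error, which plays the role of the compiler's error message.

```python
class Owned:
    """Hypothetical owning reference to an asset."""

    def __init__(self, asset: object) -> None:
        self._asset = asset

    @property
    def live(self) -> bool:
        return self._asset is not None

    def move(self) -> "Owned":
        # Transfer ownership: the returned reference owns the asset; this one is emptied.
        if self._asset is None:
            raise RuntimeError("asset already moved out of this reference")
        asset, self._asset = self._asset, None
        return Owned(asset)


class Scope:
    """Approximates, at run time, the compile-time rule that an owning local
    variable must not go out of scope while it still owns an asset."""

    def __init__(self) -> None:
        self._locals: list[Owned] = []

    def own(self, asset: object) -> Owned:
        ref = Owned(asset)
        self._locals.append(ref)
        return ref

    def __enter__(self) -> "Scope":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None and any(ref.live for ref in self._locals):
            raise RuntimeError("asset lost: an owning local variable went out of scope")
        return False  # never suppress an exception raised inside the block


vault: list[Owned] = []
with Scope() as scope:
    coin = scope.own("coin")
    vault.append(coin.move())  # ownership leaves the scope, so nothing is lost

try:
    with Scope() as scope:
        scope.own("coin")  # still owned at scope exit
    loss_detected = False
except RuntimeError:
    loss_detected = True
```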
We selected an object-oriented approach because it is well suited to representing state and the corresponding updates. We avoided inheritance to avoid the fragility that results from it [Mikhajlov:1998:SFB:646155.679700]. A full description of the language cannot fit in this paper; for that, please refer to [coblenz2019obsidian]. Instead, Fig. 1 shows some of the key features of Obsidian using the example of a tiny vending machine (TVM). TVM is a main contract, so it can be deployed independently to a blockchain. A TVM has a very small inventory: just one candy bar. It is either Full, with one candy bar in inventory, or Empty. Clients may invoke buy on a vending machine that is in the Full state, passing a Coin as payment. When buy is invoked, the caller must initially own the Coin, but after buy returns, the caller no longer owns it. buy returns a Candy to the caller, which then owns it. After buy returns, the vending machine is in state Empty.
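The following hypothetical Python sketch mirrors the TVM behavior described above; it is not the Obsidian code in Fig. 1. The state check and ownership transfers that Obsidian performs at compile time happen here at run time, and the Ref wrapper and the coins list (where payment is kept) are assumptions made for illustration.

```python
class Coin:
    pass


class Candy:
    pass


class Ref:
    """Hypothetical owning reference: take() hands over the object and empties the reference."""

    def __init__(self, obj: object) -> None:
        self._obj = obj

    @property
    def owns(self) -> bool:
        return self._obj is not None

    def take(self) -> object:
        if self._obj is None:
            raise RuntimeError("this reference no longer owns its object")
        obj, self._obj = self._obj, None
        return obj


class TinyVendingMachine:
    """State Full has one field, inventory; state Empty has none."""

    def __init__(self, candy: Candy) -> None:
        self.state = "Full"
        self.inventory = candy
        self.coins: list[object] = []  # where payment is kept is an assumption

    def buy(self, payment: Ref) -> Ref:
        # Obsidian checks the receiver's state (Full) at compile time; this sketch checks at run time.
        if self.state != "Full":
            raise RuntimeError("buy requires the machine to be in state Full")
        self.coins.append(payment.take())  # the caller gives up ownership of the Coin
        candy = self.inventory
        del self.inventory                 # the inventory field does not exist in Empty
        self.state = "Empty"
        return Ref(candy)                  # the caller now owns the Candy


tvm = TinyVendingMachine(Candy())
coin = Ref(Coin())
candy = tvm.buy(coin)  # coin is consumed; candy is now owned by the caller

try:
    tvm.buy(Ref(Coin()))  # invalid: the machine is now Empty
    second_buy_ok = True
except RuntimeError:
    second_buy_ok = False
```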
4 Usability methods
We regard programming languages as interfaces that people use to create programs. As such, programming languages for professional software engineers would seem to be a prime candidate for application of HCI principles, since people use them for long periods of time to develop critical software systems at great expense. Unfortunately, traditional HCI approaches are not directly applicable to programming language design:
Evaluating a programming language requires first teaching the programming language. Many companies hire only programmers with years of experience in specific languages; clearly one cannot train participants for anywhere near that amount of time.
Evaluation requires participants who are sufficiently skilled that they can rapidly learn a new programming language and then complete tasks in the new language. This requires lengthy user studies with skilled participants, who can be challenging to recruit and retain for the required period of time. Frequent evaluation requires a large number of participants, since participants who learned an earlier version of the language can no longer provide fresh perspectives on new ideas. Although user interfaces for experts in some other domains also require recruiting members of a small population, many of those interfaces support short, focused tasks rather than lengthy problem-solving tasks.
Programming language designers are accustomed to creating high-cost implementations, not low-cost prototypes, but traditional HCI methods assume that low-cost prototypes can be created. Traditional ways of evaluating programming languages typically require a compiler or interpreter as well as theoretical work to create a sound design (informally, one in which programs mean what they are supposed to mean and the safety guarantees that the type system claims to provide can actually be provided). If one insists on creating a sound, formal model of the language before evaluating it with users, iteration can require so much time that it is impractical. Furthermore, the cost is increased by the expectation of sophisticated language-dependent tooling in IDEs: syntax highlighting, autocomplete, high-quality error messages, and the like.
The nature of programming is that there is huge variance in performance on tasks among different programmers [glass2001frequently, 10xprogrammers]. Although a typical usability researcher may hope to identify (and address) a large fraction of the barriers to success that the users of the system will face, the same is not true of designers of programming tools, since participants get stuck in a very wide variety of difficult-to-predict ways.
To address these problems, we adapted traditional HCI techniques to the domain of programming language design. Although each technique is commonly used in other areas of HCI, we found that changes were necessary to apply the methods in our context. Importantly, our primary interest is in programmers’ ability to achieve their goals after they have mastered a programming language, not in how easy it is for novices to learn the language. The latter is an interesting research problem, but it is not the focus of this work.
4.1 Natural Programming
In the natural programming elicitation technique [naturalprogramming], participants are given blank paper or a text editor and asked to write programs without being given a specific language. As a form of participatory design, the goal is to elicit from participants the way they would naturally express the ideas in question.
In the traditional natural programming approach, the results may be biased toward systems with which participants are familiar. For example, one might expect that a programmer most comfortable with Java would likely produce Java code. In that case, the researcher would learn very little because the observed behavior would result only from training rather than from something more fundamental. Past natural programming work has focused on novices, who are not biased in this way. Because our participants were experienced programmers, in Obsidian we used natural programming for situations in which commonly used languages could not directly represent the requirements we gave participants and for low-level syntactic choices (e.g., keyword selection). We also explicitly instructed participants to be creative and not to write in any particular existing language. Finally, we were careful to interpret the results in the context of participants’ prior knowledge. For example, when participants use curly braces to denote blocks, the content of the blocks may be interesting even though the choice of curly braces is not.
Another common limitation of natural programming is that participants lack expertise in language design, resulting in unsound proposals. This problem also occurs with participatory design in other domains, and the usual solution, which applies here as well, is to use participant ideas as input to an expert-led design process. We found it helpful to present participants with several options rather than expecting them to compose designs from scratch. We asked participants to complete tasks using the various options so that we could observe their behavior and reach an informed conclusion about which options were best, rather than merely asking participants for their opinions. This allowed us to focus the process on designs that would fulfill the technical requirements while still obtaining relevant design insights.
4.2 Wizard of Oz
In a Wizard of Oz study [dahlback1993wizard], an experimenter pretends a system is working by remote-controlling it, in order to obtain insights about potential designs without having to actually build the system. In early Obsidian studies, participants were given documentation about a design for the language and asked to write or edit programs in a text editor so that we could obtain feedback before bearing the cost of language implementation. An experimenter simulated a compiler; since IDEs typically provide live feedback, the experimenter would interject with simulated error messages.
4.3 Mock environments
Participants must first learn the language before they can accomplish programming tasks. This requires researchers to (1) create materials to teach the language; (2) recruit participants who are prepared to learn the new language and who can spend the necessary time learning it before doing the tasks; and (3) separate errors made due to lack of experience from errors caused by an error-prone design. This last point seems nearly impossible to resolve without long-term studies, which are impractical for language prototypes. For example, Obsidian’s constructor syntax is similar to Java’s (e.g., new Square(5)). Some participants who knew Java but had recently programmed in Python omitted new, e.g., writing only Square(5). This mistake is uninteresting and is merely a result of habits acquired from Python.
To address this challenge, we sometimes back-ported our design ideas to a language with which participants were already familiar. By telling participants that we were merely modifying Java, participants could leverage their previous experience. Examples include using Java’s class keyword instead of Obsidian’s contract keyword, and using Java annotations to denote ownership. We also made high-level design decisions that allowed us to attract participants who had relevant background. If we had tried to teach participants a completely novel language paradigm even though the basic assumptions of the language paradigm were not the targets of our research, we would have needed to try to distinguish the relevant mistakes from all the novice-level mistakes that the new programmers would be likely to make.
4.4 Rapid prototyping
Iteration on programming language design can be slow because of the high cost of designing and developing tools and documentation. Academic language designers usually develop a core calculus, which describes the fundamental aspects of the language, and then prove key properties of the core calculus. This work alone can take many person-months. Instead, we do not insist on doing this work at the beginning. We outline a potentially sound underlying formalism without proving all the relevant properties. Then, we design a surface language and evaluate it with users so that we can obtain feedback early. In doing so, we accept the risk that the formal system cannot be made sound without invalidating the data we gathered from users, but usually any mistakes are minor and can be corrected without having to redo the user studies.
We found that designing and running user studies of features typically required much more time than implementing the features, so we carefully selected questions for which user input would be the most impactful. A key approach in minimizing cost of language changes was to re-use training materials across phases of the studies to the extent possible, allowing us to amortize the cost of development across multiple studies. The training materials co-evolved with the implementation and represented a significant investment.
5 Formative studies
In this section, we describe studies that helped us identify a suitable design and iterate on our initial design ideas. For each, we identify our research questions, methodology, and results. We started by assuming that we would use typestate to achieve the desired safety guarantees but that expressing typestate in a usable way would require substantial iteration with users. The latter assumption was based on past work on typestate systems such as Plural [Bierhoff:2008:PCP:1370175.1370213] and Plaid [sunshine2011first], which had been found difficult for users to use. All of the studies were approved by our IRB. Because we needed skilled programmers, we recruited from appropriate academic programs, by posting flyers, and by contacting our acquaintances. Except where noted below, we paid participants $10/hour.
Although Fig. 1 uses the final version of the language, because the formative studies were done earlier, they use code from earlier versions of the language. In this way, the reader can see how we changed the language as a result of the user studies.
5.1 Basic design of typestate
In order to minimize assumptions regarding how Obsidian should best represent typestate, Barnaby et al. conducted a natural programming study [barnaby]. Here, we summarize the part of the study that investigated whether states are a natural way of approaching the challenges that arise in blockchain programming, and which of several syntaxes for representing these features is most understandable by programmers. The study investigated the lexical relationships between states and transactions available in those states. Experimenters gave a convenience sample of seven participants one hour to implement a program with a given description. In initial tasks, participants implemented the program using pseudocode; in later tasks they were given a brief tutorial about the current version of Obsidian and an Obsidian program to complete.
Only two participants invented syntax denoting states and state transitions. However, in many cases the approaches the remaining five participants used were unsafe, helping to justify using typestate to improve safety. Five participants preferred a syntax where all the actions of a state must be lexically encapsulated in that state. This conflicted with their desire for transactions that executed transitions to only reference fields in lexical scope and the need for transactions that could be executed in several different states. To address both kinds of feedback, the authors modified Obsidian so that transactions were lexically outside of state declarations. Future IDE tools could show all available transactions when viewing a state.
5.2 Fields in states
States in contracts can have different sets of fields, so transitioning can cause some fields to exit scope and others to enter scope. For example, in Fig. 1, the Full state has the inventory field, but the Empty state has no fields. This study used natural programming and code understanding methods to investigate how users specify cleanup of old fields and initialization of new fields when invoking state transitions.
We recruited four participants, which was enough to provide substantial and useful feedback. All were Ph.D. students studying software engineering. They had an average of seven years of programming experience (ranging from three to fifteen years) and an average of 1.5 years of Java experience.
In Part 1, we gave participants a state transition diagram for a Wallet object, which could hold a license and money, and which had four states corresponding to the possible combinations of contents. Participants were also given code partially implementing the Wallet, with several TODO comments asking participants to invent code to add money to the Wallet, remove money from the Wallet, etc. Participants were told that the money and license should be thought of as assets, so they could not be duplicated, used more than once, or lost. The code they were given was in a language similar to Obsidian but which used some keywords that would be more familiar to a Java programmer, such as class instead of contract.
All four participants prepared assets for a state transition before making the state transition. Two participants were concerned that a failure during the asset preparation stage could lead to an improperly initialized state upon transition; one of them suggested a try-catch-style wrapper around the asset preparation and transition phases.
In Parts 2 through 4 of the study, participants were given several options. Then they were asked to implement each of the options within a given partially-implemented transaction. Finally, they were asked for their preferences.
Part 2 compared approaches for initializing fields in states during transitions. Options were:
Assets are assigned to fields in the transition, e.g. ->S(x = a1) assigns the value of a1 to field x of state S.
Assets are assigned to fields before the transition, e.g. S::x = a1; ->S.
Assets are assigned to fields before the transition, but the fields are in local scope even though the state has not transitioned yet, not in destination-state scope, e.g. x = a1; ->S.
Assets are assigned to fields after the transition, e.g. ->S; x = a1.
The participants successfully used all the approaches, but most preferred assigning assets to fields before the transition with destination-state scoping (option 2). The results of Parts 1 and 2 motivated a language change: Obsidian now supports this approach in addition to atomic assignment (option 1).
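To clarify how options 1 and 2 differ, here is a small, hypothetical runtime model in Python (not Obsidian syntax). transition(state, **inits) plays the role of ->S(x = a1), while stage(state, field, value) corresponds to S::x = a1 followed by a plain ->S. In both cases, the transition succeeds only when every field of the destination state has been initialized. The model covers only entering a state from one whose fields are all empty; releasing old fields is the subject of Parts 3 and 4.

```python
class Contract:
    """Hypothetical runtime model of a contract whose states have different fields."""

    STATE_FIELDS = {"Empty": set(), "Full": {"inventory"}}

    def __init__(self) -> None:
        self.state = "Empty"
        self.fields: dict[str, object] = {}
        self._staged: dict[tuple[str, str], object] = {}  # (destination state, field) -> value

    def stage(self, state: str, field: str, value: object) -> None:
        """Option 2 (S::x = a1): assign a destination-state field before ->S."""
        if field not in self.STATE_FIELDS[state]:
            raise KeyError(f"state {state} has no field {field}")
        self._staged[(state, field)] = value

    def transition(self, state: str, **inits: object) -> None:
        """->S, optionally with atomic initialization ->S(x = a1) (option 1)."""
        if self.fields:
            # Releasing or carrying over existing fields is out of scope for this sketch.
            raise ValueError("current fields must be released first")
        values = {f: v for (s, f), v in self._staged.items() if s == state}
        values.update(inits)
        missing = self.STATE_FIELDS[state] - set(values)
        unexpected = set(values) - self.STATE_FIELDS[state]
        if missing or unexpected:
            raise ValueError(f"{state}: missing {missing}, unexpected {unexpected}")
        self.state, self.fields = state, values
        self._staged.clear()


atomic = Contract()
atomic.transition("Full", inventory="candy")  # option 1: ->Full(inventory = candy)

staged = Contract()
staged.stage("Full", "inventory", "candy")    # option 2: Full::inventory = candy;
staged.transition("Full")                     #           ->Full

incomplete = Contract()
try:
    incomplete.transition("Full")             # inventory was never initialized
    incomplete_ok = True
except ValueError:
    incomplete_ok = False
```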
Part 3 presented two options for handling assets when transitioning from a state with an asset to a state without it:
The transition evaluates to a collection containing the old assets, e.g. x = ->S indicates that x is assigned the leftover assets after the transition to state S. If the current state is unknown statically, the contents of the collection are determined dynamically.
The transition evaluates to a tuple, e.g. (x = a1) = ->S indicates that x will be assigned the asset a1 which is not present in state S.
There was consistent confusion about which leftover assets are assigned to option 1’s collection after a transition. All participants understood the need for both options in certain cases, but would choose the tuple-like collection for more control and explicitness when the use of either approach is acceptable.
Part 4 focused on releasing assets owned by state fields when transitioning to states in which those fields do not exist. In contrast to Part 3, this part added the option of releasing assets before the transition. The choices were:
Assets must be released before the transition, e.g. release(a1); ->S.
The transition evaluates to a tuple of assets that are no longer owned, e.g. a1 = ->S.
All the participants understood the options and implemented them without mistakes. Implementing using option 2 (evaluating to a tuple) enables both approaches, so participants were asked to indicate scenarios where one option would be preferred over the other. The participants consistently indicated that assets should be released before a transition if they are no longer needed; otherwise, they should evaluate to a tuple. Although we plan to consider tuples in the future, it may suffice to require releasing assets before permitting transitions, which is the current design. This underscores part of the value of user studies, allowing us to prioritize features rather than assuming that the more expressive tuples approach is better or that the simpler assignment approach will suffice.
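A minimal runtime sketch of the current design (release before transition), using a hypothetical two-state Wallet in Python: WithLicense has a license field, and NoLicense has none. The names are illustrative, and Obsidian rejects the premature transition at compile time rather than by raising an exception.

```python
class Wallet:
    """Hypothetical wallet: state WithLicense has a license field; NoLicense has none."""

    def __init__(self, license_asset: object) -> None:
        self.state = "WithLicense"
        self.license = license_asset

    def release_license(self) -> object:
        # release(license): hand the asset to the caller before the transition.
        asset, self.license = self.license, None
        return asset

    def to_no_license(self) -> None:
        # ->NoLicense is permitted only once the license field no longer holds an asset.
        if self.license is not None:
            raise RuntimeError("license must be released before ->NoLicense")
        del self.license  # the field does not exist in NoLicense
        self.state = "NoLicense"


wallet = Wallet("license")
try:
    wallet.to_no_license()            # rejected: the license has not been released
    early_transition_ok = True
except RuntimeError:
    early_transition_ok = False

released = wallet.release_license()   # release(license);
wallet.to_no_license()                # ->NoLicense
```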
5.3 Permissions: a qualitative study
Soundly enforcing typestate requires knowledge about all aliases to an object, which a permission system provides [Bierhoff:2007:MTC:1297027.1297050]. Permission systems allow the programmer to express what a particular reference can be used for (and therefore also what it cannot be used for). Is there a permission system that users will understand and use effectively? If so, what can we learn from users about how to design it? In this work, we conducted the first studies (of which we are aware) in which people other than the designers of a permission system were asked to use it. We found that our initial design was surprisingly difficult to use.
In order to study permissions, we extracted the permission system from Obsidian and re-cast it in Java as a set of annotations. We conducted a Wizard of Oz study in which participants received documentation on a Java extension and the experimenter gave simulated compiler error messages. At this stage of Obsidian’s development, we assumed that it would be best to separate the notions of permissions and typestate. The study materials reflected this approach, which may surprise a reader who has studied Fig. 1: the final version of Obsidian shown there combines the two. The training materials explained the annotations: @Asset, which applied to classes; and @Owned, @Shared (no restrictions, but could not co-exist with typestate-specifying references), and @ReadOnlyState (restricting state modification), which applied to references. We recruited six participants (P14–P19), which was enough to provide substantial and useful feedback. They had a mean of six years of programming experience (ranging from three to nine years) and a mean of two years of Java experience.
The study included five parts. Since our goal was to identify as many usability problems as possible, we revised the instructions after each participant. The first three participants were given 1.5 hours to do the first four parts; the last three were given two hours to allow a fifth part of the study. An experimenter was available to answer questions.
Part 1. To motivate the need for language features to prevent bugs, we gave participants a 163-line Java medical records system and asked the first two participants to find a bug in which a patient could refill the prescription more times than specified. The first participant did not find the bug within 30 minutes; the second did so just as time expired. To conserve time, we gave the other participants five minutes to inspect the code and explained the problem to them.
We conclude that at least some programmers who use traditional languages would have difficulty detecting the kind of bug that Obsidian prevents. This provides further evidence that the Obsidian compiler will help its users detect bugs that might otherwise go undetected.
Part 2. We told participants we would prevent the previous bug by distinguishing between two kinds of references. “Considering an object o: Kind #1: There is only one reference of kind #1 to o at a time. Kind #2: There may be many references of kind #2 to o at a time.” We asked participants to propose names for the two kinds of references. Note the careful language avoiding bias toward specific vocabulary. Participants’ name suggestions included:
- Kind #1:
KeyReference, UniqueReference, Owned, Singleton reference, Resource handle, @default
- Kind #2:
DuplicateReference, ForeignKeyReference, Borrowed, Flyweight pattern reference, const pointer
The results were too inconsistent to justify any particular choice in the language; Obsidian uses Owned, which matches one suggestion, for kind #1 and Unowned for kind #2.
Part 3. To evaluate the usability of ownership, we gave participants an ownership tutorial and told them that we had chosen (no annotation, @ReadOnly) (for the first participant) or (@Owned, no annotation) (for later participants) as the keywords for the two kinds of references. We asked them to modify the code from Part 1 to fix the bug. We hoped participants would require that Prescriptions deposited in a Pharmacy be owned and that the Pharmacy take ownership, so that a deposited Prescription could not be deposited in a second Pharmacy. Completion times ranged from 3 to 40 minutes. Two participants did not finish; we stopped one of them after 38 minutes to prioritize other tasks.
We were surprised that many participants found this task very difficult, so we expanded the tutorial to include a practice section for later participants. In general, participants were not prepared to use a type system to address a bug that they thought of in dynamic terms. For example, P16 wrote if (@Owned prescription), attempting to express a dynamic ownership check. We asked participants who wanted to enforce the property dynamically to use the static language feature instead. P14 commented, “I haven’t seen…types that complex in an actual language …enforced at compile time.”
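For contrast, the dynamic approach that participants gravitated toward can be written in plain Java as a run-time guard: the prescription remembers whether it has already been deposited, and a second deposit throws an exception. This is a hypothetical sketch, and its names (transferTo, isDeposited, DynamicCheckDemo) are ours. It catches the misuse only when the faulty path actually executes. The intended Obsidian solution instead rejects the second deposit at compile time, because the first deposit consumes the owned reference.

```java
// Hypothetical run-time alternative to static ownership, for contrast only.
final class Prescription {
    private Object holder; // null until deposited

    boolean isDeposited() {
        return holder != null;
    }

    // Records the new holder, failing if someone already holds it.
    void transferTo(Object newHolder) {
        if (holder != null) {
            throw new IllegalStateException("prescription already deposited");
        }
        holder = newHolder;
    }
}

final class Pharmacy {
    void deposit(Prescription p) {
        p.transferTo(this); // throws on a second deposit anywhere
    }
}

final class DynamicCheckDemo {
    // Returns true if depositing the same prescription twice is rejected.
    static boolean secondDepositRejected() {
        Prescription p = new Prescription();
        new Pharmacy().deposit(p);
        try {
            new Pharmacy().deposit(p);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```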
P17 had trouble predicting what the compiler could know or check, expecting an interprocedural analysis (which would be non-modular). For example, in a case where an owned object was consumed twice, P17 expected the compiler to report an error on the second spend invocation rather than on the invocation of the method that took an owned argument and then performed the second spend.
P17, P18, and P19 had difficulty determining which variables should be annotated @Owned. In one case, a lookup method took an object to search for, and P17 specified that it should take an owned reference. He was then stuck after invoking it: “How can I get the annotation back?” Likewise, P17 was unsure whether accessors should return owned references. Mistakes could be costly. For example, P19 unnecessarily used @Owned for objects stored in a collection, which caused a problem when iterating through it: he made the reference to the current list element @Owned, which would require removing each item from the collection during iteration, in code that was not supposed to modify the container at all.
Parameter passing and assignment were common points of confusion. P18 asked what happens when an @Owned object is passed to a method with an unowned formal parameter (ownership was not passed in this case). P19 said, “when I [annotate this constructor type @Owned], I’m not sure if I’m making a variable owned or I’m transferring ownership.” P17 was surprised that assigning an owned reference to an unowned-type variable did not transfer ownership. We later addressed this confusion by making assignment always transfer ownership; participants in later studies were generally not surprised by which assignments transferred ownership.
Part 4 introduced the notion of assets. After a tutorial explaining the properties of assets, participants were asked to propose syntax indicating that a particular owned reference was intentionally going out of scope. Two participants suggested @Disown and free for abandoning owned references; the others either ran out of time or had no suggestions. We chose disown for Obsidian, since free has memory-management connotations that are not relevant here.
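Java has no linear types, but two of the rules discussed above (assignment transfers ownership, and an owned asset may be discarded only explicitly, via disown) can be approximated at run time with a wrapper. The OwnedRef class below is hypothetical. Its move method plays the role of an ownership-transferring assignment and leaves the source empty, and disown is the only sanctioned way to drop the contents. Detecting an owned asset that silently goes out of scope, which Obsidian reports at compile time, has no reliable run-time counterpart in Java and is not modeled here.

```java
import java.util.Objects;

// Hypothetical run-time emulation of an Obsidian-style owning reference.
final class OwnedRef<T> {
    private T value; // null once ownership has moved or been disowned

    OwnedRef(T value) {
        this.value = Objects.requireNonNull(value);
    }

    boolean isEmpty() {
        return value == null;
    }

    T get() {
        if (value == null) {
            throw new IllegalStateException("ownership already transferred or disowned");
        }
        return value;
    }

    // Models ownership-transferring assignment: the returned reference is
    // the new owner, and this one can no longer be used.
    OwnedRef<T> move() {
        OwnedRef<T> newOwner = new OwnedRef<>(get());
        value = null;
        return newOwner;
    }

    // Models disown: intentionally abandon the owned object.
    void disown() {
        get(); // fails if there is nothing left to disown
        value = null;
    }
}
```

After OwnedRef<String> b = a.move(), any use of a.get() throws, which mirrors (at run time only) the static error Obsidian reports for using a reference whose ownership has been transferred.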
Part 5, which introduced typestate, was given starting with the fourth participant. Participants read 2.5 pages on typestate in Obsidian (as it existed then), including @ReadOnlyState, @Shared, and @Borrowed (which was used for temporary ownership transfer in invocations). Ownership was the default, so no @Owned annotation was needed. The tutorial also explained available in and ends in, which at the time specified state assumptions and guarantees for methods (we later replaced these with specifications on this parameters). Participants were then asked to annotate uses of Bond in a 212-line Java program implementing a financial market, and were told to use ownership and state specifications wherever possible.
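A common way to approximate typestate in Java is to give each state its own type, so that a state-changing operation is defined only on the type for the state in which it is available and returns a value of the type for the state in which it ends. The receiver type then plays roughly the role of available in, and the return type plays the role of ends in. The sketch below assumes Java 17 or later (sealed interfaces and records), and the Bond states Offered and Sold and the buy operation are hypothetical. Unlike Obsidian, Java cannot prevent code from continuing to use the stale Offered value after buy.

```java
// Hypothetical encoding of typestate as one Java type per state.
sealed interface Bond permits Bond.Offered, Bond.Sold {
    String id();

    // State in which the bond can be bought.
    record Offered(String id, int price) implements Bond {
        // Roughly "available in Offered, ends in Sold": the receiver type
        // fixes the precondition and the return type fixes the postcondition.
        Sold buy(String buyer) {
            return new Sold(id, buyer);
        }
    }

    // State after purchase. There is no buy method here, so calling buy
    // through a Sold value is a compile-time error.
    record Sold(String id, String owner) implements Bond {}
}
```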
Consistent with Part 3, some participants were more comfortable with a dynamic perspective on ownership than a static one. P18 felt that ends in declarations were redundant with the transition code already in the method implementations, but these declarations separate interface from implementation and enable modular checking. P19 wanted to use borrowing to represent the notion that the BondMarket owns a Bond but an Investor borrows it for a while. However, borrowing applied only for the duration of a method invocation. We later changed the formal parameter syntax to remove the need for @Borrowed.
P19 needed substantial prompting from the experimenter to make full use of typestate. At first, P19 annotated methods but not variables. After prompting, he added dynamic checks in one place, but needed further prompting before adding static typestate specifications. This suggests that tools may be needed to help users obtain the full benefits of the language. On the other hand, P18 specified @Asset on Bond without being asked to, explaining “because it’s something important and I don’t want to get it out of scope…”
Overall, understanding the limitations of the type system and compiler may be an obstacle for some people. Users will need training to reason about what typestate can express, and the observations above motivated changes that simplified the language. Tools could mitigate the problem by providing more sophisticated static analyses than a traditional typechecking approach allows, and by giving detailed, explanatory error messages.
5.4 Comparing typestate and ownership approaches
To address some of the confusion we observed in the prior study, we devised a new approach: fusing ownership and typestate to simplify the type system. This rules out Shared references that also specify typestate, which would otherwise have to be disallowed separately to preserve soundness. In the fused design, the type Bond@S is always implicitly owned for any state S, and users can write a permission in place of S, as in Bond@Unowned.
We were also interested in another usability concern. Consider Approach 1 in Fig. 2. A reader of line 1 might expect that the type of bond would always be Bond@Offered. In fact, after line 2, the type is Bond@Sold due to the call to buy.
One idea for addressing this was to incorporate types into variable names, as shown in Approach 2. Here, each annotation describes the variable's type before the operation rather than its new type. A reader then needs to look only at the most recent operation to infer a variable's new type, rather than potentially reading the whole sequence of operations since its declaration.
Approach 3 adds static assertions. Line 3 shows a static assertion that bond references an object in state Sold, which serves as documentation. Unlike traditional assertions, which are checked at run time, static assertions are checked by the compiler. The intent is to make it easier for programmers to determine the types of variables.
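For comparison, the closest plain-Java counterpart to the Approach 3 annotation is an ordinary assert statement over a state field. The hypothetical Bond class and AssertionDemo helper below show that such an assertion documents the expected state just as a static assertion does. However, it is evaluated only when the line executes, and only when the JVM runs with assertions enabled (-ea), whereas Obsidian's compiler verifies a static assertion on every path before the program runs.

```java
// Hypothetical mutable Bond whose state is tracked in a field.
final class Bond {
    enum State { OFFERED, SOLD }

    private State state = State.OFFERED;
    private String owner;

    State state() {
        return state;
    }

    String owner() {
        return owner;
    }

    void buy(String buyer) {
        if (state != State.OFFERED) {
            throw new IllegalStateException("bond is not offered");
        }
        owner = buyer;
        state = State.SOLD;
    }
}

final class AssertionDemo {
    static String purchase(String buyer) {
        Bond bond = new Bond(); // the declared type says nothing about state
        bond.buy(buyer);
        // Documents the expected state, like a static assertion, but is only
        // checked at run time, and only when the JVM runs with -ea.
        assert bond.state() == Bond.State.SOLD;
        return bond.owner();
    }
}
```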
Based on observations of participants in the first three conditions, we devised Approach 4, which is like Approach 3 except that local variable declarations carry no state specifications. This removal was not part of the original design; it was inspired by early results of this study.
We required that participants be familiar with Java and we administered a simple Java pre-test. We recruited five students (P21–P25) with an average of about four years of Java experience (ranging from one to ten years) and an average of one year of professional development experience (ranging from zero to three years).
Participants spent between 1 and 1.5 hours on the study. We used Qualtrics, a web-based survey tool, to ask participants a series of questions about Obsidian programs; the study took place in a lab, and an experimenter was available to answer questions. We assigned participants to one of the four conditions above according to what we hoped to learn from each trial: Approach 1 for P22, Approach 2 for P21, Approach 3 for P23 and P24, and Approach 4 for P25.
5.4.3 Results and Discussion
Participants using the “traditional” approach (Approach 1), in which permissions and states are specified only in declarations, tried to guess the compiler’s behavior, saying things like “If the compiler was smart…”. For example, P22 expected the language to infer an implicit @Off in the declaration LightSwitch s1 = new LightSwitch(). P22 also expected that although transactions could change an object's state, assigning a value in a mismatched state to a variable would be forbidden. Such a design would be inconsistent, and P22’s confusion suggests that specifying types only in declarations is problematic.
Including types in variable names (Approach 2) seemed to be confusing as well. P21 expected that ownership was not passed into method calls even when an owned reference was passed. P21 was also surprised that the absence of an ownership annotation meant that there was no ownership, having expected it to mean that ownership was unknown.
Participants using Approach 3 seemed to do much better. For example, although the materials did not use the word assertion, P23 observed that the annotations were assertions. P23 liked the system, commenting, “Perfect, I like this, this is very nice. I wish Java had this; it would have saved me a lot of bugs.” As we gained confidence in the value of Approach 3, we added further material. For P24, we changed assertions to use @ rather than the initial >> so that we could use >> to specify type changes in transaction parameters. With P25, we used ? to indicate a lack of static knowledge about an object's state. We later simplified the system because this notation was ambiguous, leaving the notations Owned, Unowned, Shared, and unions of specific states (separated with |).
P24 was confused because state specifications on local variables were redundant. For example, in LightSwitch@Off s = new LightSwitch(), the @Off portion is redundant because the compiler already knows the state of the new object from the constructor’s declaration. To resolve this, we added Approach 4, which removes typestate and permission annotations from local variable declarations; in contrast, permissions are always specified for fields and formal parameters. In those cases, the annotations are important because they constrain the types of variables at the beginnings and ends of transactions.
In summary, this portion of the study motivated the removal of state specifications from local variable declarations and provided initial evidence that static assertions are likely to be a convenient way for programmers to specify the states and permissions of local variables. We also observed that, with these changes, current Java users could understand static state assertions with little extra training.
5.5 Threats to validity
The studies share common threats to validity: our participants may not be representative of the population of blockchain programmers; we had limited numbers of participants in each trial; and our tasks may not reflect the reality of blockchain programming. We believe, however, that the population of likely language users is more skilled than our participant population, which mostly consisted of students, so if the students are successful in completing tasks, that aspect of the result is likely to generalize. We did not seek to identify all possible usability problems, but rather to identify the most common and severe ones so that we could try to address them.
6 Summative usability study
The previous studies focused on particular aspects of the design, in many cases by giving participants languages that were not precisely Obsidian. To evaluate our final design, we conducted a usability evaluation. Because Obsidian provides stronger safety guarantees than existing languages, e.g., Solidity, and because our prior experience showed that it would be very challenging to develop a linear type system that was usable at all, we focused on whether people could effectively complete tasks, not on whether they could complete tasks faster than in existing languages. We based our work on pilot studies conducted by Kambhatla et al. [Kambhatla2019:Pilot], which were presented at the PLATEAU workshop but not published in archival form.
We solicited experienced Java programmers to take a short online screening test, which took an average of about 9 minutes to complete. We accepted into the three-hour study only those who answered at least five of six basic Java questions correctly. Six participants (P35–P40) took part, and we compensated each with a $50 Amazon gift card.
Our initial approach to teaching Obsidian quickly was a textual tutorial. Unfortunately, some participants ignored or skimmed the text and were then unable to complete the programming tasks. We therefore split the tutorial into nine small parts and added programming exercises so that participants had to practice. With this approach, approximately 90 minutes sufficed for the most effective participants to learn the language. We also iterated to see to what extent participants needed help from an experimenter. We found that without any help, participants understood most aspects of the language, but that some help was needed for full comprehension. In any case, it may be unrealistic to expect people to learn a new programming language in only 90 minutes with no help from anyone; typically, people get help from others or at least use online Q&A sites such as Stack Overflow.
Our experimenter, therefore, simulated an informed colleague, who could answer questions about the language but who did not directly teach the language. The experimenter also gave low-level guidance, such as how to invoke the compiler. Finally, the experimenter provided assistance that simulated more mature tools. For example, when a participant attempted to debug an error that was reported on line 38 by examining line 33, the experimenter pointed out the discrepancy, since the IDE we provided did not highlight the appropriate line.
After completing the tutorial, which included seven programming exercises, participants received starter code for the three main tasks described below. Because the tasks were progressively more difficult, rather than time-limiting each one, we allowed participants as much time as they needed until their three-hour session ended. Although participants used the compiler, they were not given tests or a runtime environment, since the focus of our usability study was the type system (recall that Obsidian is designed to detect as many bugs as possible at compile time, since runtime detection may be too late to ensure safety).
The first task, Auction, simulated an English auction, with the additional constraint that bids had to come with Money so that every bid was guaranteed to be viable (a bidder could not place a bid and then fail to pay for the item). As a starter task, we asked participants to finish implementing createBid, which required them to invoke a constructor. They also needed to finish implementing makeBid, which records a new bid from a client. In makeBid, we were interested in whether participants initially wrote code that accidentally lost the previous bid, and with it the associated Money; a compiler error in that case would indicate that Obsidian’s typechecker had helped them avoid losing track of an asset.
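To make the Auction pitfall concrete, here is a hypothetical plain-Java sketch; the Money, Bid, and Auction classes and the methods makeBidBuggy, makeBid, and accountedFor are our own, not the study's starter code. The buggy version overwrites maxBid and thereby loses the Money it held, while the fixed version refunds the previous top bidder first. In Java, both versions compile, and the bug shows up only as money missing from the system, which the sketch detects by totaling what the auction still accounts for. Obsidian's compiler instead reports an error when an owned asset reference is overwritten.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical classes illustrating the Auction task's pitfall.
final class Money {
    final int amount;

    Money(int amount) {
        this.amount = amount;
    }
}

final class Bid {
    final String bidder;
    final Money money; // each bid carries its payment

    Bid(String bidder, Money money) {
        this.bidder = bidder;
        this.money = money;
    }
}

final class Auction {
    private Bid maxBid;
    // Money returned to bidders who do not (or no longer) hold the top bid.
    private final Map<String, Integer> refunds = new HashMap<>();

    // Buggy: overwriting maxBid silently drops the previous bid's Money.
    void makeBidBuggy(Bid bid) {
        if (maxBid == null || bid.money.amount > maxBid.money.amount) {
            maxBid = bid;
        } else {
            refund(bid);
        }
    }

    // Fixed: the previous top bid is refunded before it is replaced.
    void makeBid(Bid bid) {
        if (maxBid == null || bid.money.amount > maxBid.money.amount) {
            if (maxBid != null) {
                refund(maxBid);
            }
            maxBid = bid;
        } else {
            refund(bid);
        }
    }

    private void refund(Bid bid) {
        refunds.merge(bid.bidder, bid.money.amount, Integer::sum);
    }

    // Money the auction can still account for: the top bid plus all refunds.
    int accountedFor() {
        int total = (maxBid == null) ? 0 : maxBid.money.amount;
        for (int r : refunds.values()) {
            total += r;
        }
        return total;
    }
}
```

With bids of 10, 20, and 15, makeBid accounts for all 45 units, whereas makeBidBuggy given bids of 10 and 20 accounts for only 20.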
The second task, Prescription, corresponded to the medical records system in the permissions study described above; we were interested in whether our improvements enabled participants to reason more effectively about the code. We asked participants to fill in the type signatures of the consumeRefill and depositPrescription transactions and to complete the implementation of fillPrescription.
The Casino task was more open-ended and included directions and requirements for what operations should be supported, as well as low-level starter code, such as implementations of Money and Bet. It asked participants to implement a Casino that takes bets on games. When games are complete, the casino enables winners to collect their winnings. We were primarily interested in participants’ abilities to reason about ownership and typestate and to design architectures that could effectively use ownership.
6.3 Results and Discussion
Results for the exercises are summarized in Table 1. With P38, to assess to what extent the tutorial materials stood alone, the experimenter declined to answer Obsidian-related and debugging-related questions. However, this made the first task perhaps unrealistically difficult and lengthy, resulting in insufficient time for the other tasks.
Table 1: Task completion times (hours:minutes)
In the Auction exercise, two of the six participants accidentally introduced a bug in which an asset was lost: they overwrote maxBid, which held money. The compiler reported an error and they corrected their mistake; had they been using Solidity, its compiler would not have caught the bug. After P36, we slightly simplified the Auction exercise by removing a subtask and by refactoring to inline a TODO that had been placed in a helper transaction. The times in Table 1 are adjusted to remove the extra time P35 and P36 spent on the removed subtask (1 and 8 minutes, respectively).
Some participants seemed to think carefully about ownership and wrote correct code quickly. Others seemed to focus on satisfying the compiler, and their work took longer. For example, P38 got an error message after overwriting the owned maxBid reference and “fixed” it with disown. This choice may reflect weaker programming skills and the lack of help during the tutorial; P38 took the longest on the tutorial and was surprised not to be given a design diagram for the Auction starter code. We changed the tutorial to emphasize that disown throws away assets.
In the Prescription task, as in the other tasks, variance was large. For example, one reason for P38’s long completion time was that P38 had used Python most recently and, despite the tutorial, sometimes wrote Python-like syntax that did not parse (one such error took four minutes to fix). At the time, we hoped that participants would be able to complete the tasks entirely on their own, but in retrospect, we might have obtained more relevant results by carefully providing appropriate help (which we provided to all the other participants).
We were interested in participants’ ability to reason effectively about ownership. All participants who started Prescription completed it. P37 encountered some difficulties due to shortcomings in Obsidian’s support for dynamic state tests. Currently, Obsidian does not allow dynamic state tests to be used as arbitrary Boolean expressions, e.g., if (x in S && e), where e is an arbitrary Boolean expression. Likewise, if (x not is Owned) is not supported (perhaps this attempt was inspired by Python’s is operator). In the latter case, P37 developed some intuition: “Ownership doesn’t feel like something I should be using in this way…” and restructured the code to check if (maybeRecord in Full), which was correct. In another case, the compiler caught a bug in which the code assumed that a collection must contain a particular element; catching it is a benefit of not allowing null in the language.
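Java's pattern-matching instanceof (Java 16 and later) is a rough analog of Obsidian's dynamic state test if (x in S), with one difference relevant to P37's difficulty: Java allows the test to appear inside an arbitrary Boolean expression. The RxRecord states (Empty, Full) and the canRefill logic below are hypothetical, with state names chosen to echo the maybeRecord in Full example.

```java
// Hypothetical states for a prescription record, one type per state.
sealed interface RxRecord permits RxRecord.Empty, RxRecord.Full {
    record Empty() implements RxRecord {}

    record Full(String patient, int refillsLeft) implements RxRecord {}
}

final class StateTests {
    // Analogous to Obsidian's `if (maybeRecord in Full)`; unlike Obsidian,
    // Java lets the state test be combined with another Boolean condition.
    static boolean canRefill(RxRecord maybeRecord) {
        return maybeRecord instanceof RxRecord.Full full && full.refillsLeft() > 0;
    }
}
```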
The Casino task was substantially more open-ended than the other tasks and required more time, but participants who had a full hour for it were able to finish. Some participants defined states in the Casino contract (P35, P39), whereas others relied only on the states in the Game contract (P37, P40). Both approaches led to many dynamic state tests, since the Casino object had to check that the Game object was in an appropriate state. These checks could have been avoided if the different states of Casino had different typestate specifications for their references to the Game, an idea that occurred to P40 in retrospect. This observation suggests an opportunity for a future version of Obsidian in which the states of owning objects are coupled to the states of owned objects, reducing the need for dynamic checks.
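The coupling idea can be sketched with Java's sealed types (Java 17 or later). If each Casino state holds a Game reference whose static type already fixes the game's state, the casino never needs to test the game's state dynamically. The states TakingBets, PayingOut, Open, and Finished and their methods are hypothetical, not part of Obsidian or the study's starter code.

```java
// Hypothetical game states, one type per state.
sealed interface Game permits Game.Open, Game.Finished {
    record Open(String name) implements Game {
        Finished finish(String winner) {
            return new Finished(name, winner);
        }
    }

    record Finished(String name, String winner) implements Game {}
}

// Each casino state stores the game at a matching state type, so the
// operations below need no dynamic checks on the game's state.
sealed interface Casino permits Casino.TakingBets, Casino.PayingOut {
    record TakingBets(Game.Open game) implements Casino {
        PayingOut closeGame(String winner) {
            return new PayingOut(game.finish(winner));
        }
    }

    record PayingOut(Game.Finished game) implements Casino {
        // No state test needed: the field's type guarantees a finished game.
        String winner() {
            return game.winner();
        }
    }
}
```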
We noticed that participants who did better on the “advanced Java” portion of our screening test seemed to complete tasks faster. Those test scores were negatively correlated with completion times on the Auction task (, ), suggesting that much of the variance (91%) in Auction completion times is explained by prior programming background. We regard this result as tentative because there were minor differences among the trials; we defer a definitive conclusion to a future quantitative study. The correlation between the scores and the tutorial completion times was not significant, possibly because of the small number of participants. We hypothesize, then, that participants who have sufficient OOP understanding can learn Obsidian and use the language effectively in only about 90 minutes.
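Assuming the 91% figure is the coefficient of determination of a simple linear fit of Auction completion time on screening score, it relates to the correlation coefficient as follows. The sign is negative because higher scores went with shorter times. This is our own arithmetic from the reported percentage, not a value reported by the study.

```latex
R^2 = r^2 = 0.91
\quad\Longrightarrow\quad
|r| = \sqrt{0.91} \approx 0.954,
\qquad
r \approx -0.954
```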
7 Future work
The Obsidian compiler emits Java code, which runs on the Hyperledger Fabric platform. In the future, Obsidian could target other platforms, such as Ethereum. We plan to conduct a quantitative comparison between Obsidian and Solidity, hoping to show that the bugs Obsidian prevents occur frequently when standard languages are used and that the additional safety costs little in programmer time.
Obsidian would benefit from better tools, such as judicious use of type inference, compiler support fully integrated into the IDE, better error messages, assistance choosing in what order to address errors, and a full reference manual. Finally, in the future, we would like to better understand how the strong type system in Obsidian affects how programmers think and how they design their programs.
Obsidian reflects a new way of designing programming languages that integrates user-centered techniques into all stages of the design process. By incorporating feedback from users, we obtained insights that led to a language in which programmers can effectively obtain stronger safety guarantees than prior languages provide. We expect that this approach to language design is applicable to a wide variety of other problem-solving tools.