Convo: What does conversational programming need? An exploration of machine learning interface design

03/03/2020 ∙ by Jessica Van Brummelen, et al. ∙ MIT Harvard University

Vast improvements in natural language understanding and speech recognition have paved the way for conversational interaction with computers. While conversational agents have often been used for short goal-oriented dialog, we know little about agents for developing computer programs. To explore the utility of natural language for programming, we conducted a study (n=45) comparing different input methods to a conversational programming system we developed. Participants completed novice and advanced tasks using voice-based, text-based, and voice-or-text-based systems. We found that users appreciated aspects of each system (e.g., voice-input efficiency, text-input precision) and that novice users were more optimistic about programming using voice-input than advanced users. Our results show that future conversational programming tools should be tailored to users' programming experience and allow users to choose their preferred input mode. To reduce cognitive load, future interfaces can incorporate visualizations and possess custom natural language understanding and speech recognition models for programming.






I Introduction

With recent major advances in automatic speech recognition (ASR) and natural language processing (NLP) [4, 1, 6], interacting with technology has become as easy as having a conversation. Conversational agents have proliferated such that it would be shocking for a smartphone not to be able to transcribe and take action based on something someone said. Programmers have begun to automate simple, few-turn tasks using conversational artificial intelligence (AI), like turning on lights, as well as longer, more complex tasks, such as conversing with hairdressers to book clients’ appointments.


Fig. 1: Convo’s architecture. The VUI passes spoken or typed input to the NLU module, which recognizes the intent. Given the intent, the DM might pass a response like, “What do you want to call it?”, to the VUI to speak, or pass goals like, create procedure, to the PE.

With such advances, this technology is positioned to be leveraged in other spaces too. Specifically, it can increase technology accessibility through question-answering (QA), providing alternative input methods, and not requiring reading/writing skills for interaction. Computer programming could especially benefit from these advantages. Engaging learners in straightforward conversations without syntax requirements could lower the barrier to entry to programming and provide efficient, alternative input methods. Nonetheless, there are limitations to this technology. For example, ASR can be frustrating, especially for those with nontraditional accents, and NL is innately ambiguous, which could produce additional errors (e.g., does “say variable var” mean say the value of the variable var, or say the words “variable var”?).

Currently, little is known about the suitability of conversational agents for lowering the barrier to entry to programming. There has been some work in single-turn program synthesis, in which a NL utterance is converted to a program without conversation [28, 31]; syntax- or keyword-dependent, voice-based programming [3, 25]; conversational agents for controlling specific systems, such as Arduinos [18]; and for learning linear tasks, such as sending emails [21]. However, optimal interaction paradigms for NL, conversational systems remain largely unstudied. We do not know whether it would be best for end-users to conversationally interact through speech, text, or a combination of multiple modalities. We also do not know how programming conversationally affects cognitive load and learning for novice programmers.

This paper will address questions about the usability, feasibility, and cognitive load of a conversational programming system. We completed a study (n=45) with our voice- and text-based conversational programming tool, in which users completed novice and advanced programming tasks using only voice-input, only text-input, and voice-or-text systems. They then answered Likert-scale and short answer questions about usability, satisfaction, preference and overall experience. Furthermore, we collected cognitive load indicator data, such as time to completion, number of system resets, lengths of utterances, and number of times users asked for help. We analyzed these data through quantitative and thematic analyses. The results will inform the development of future interactive machine learning (ML) systems, especially conversational programming agents, and begin to address the following research questions:

RQ1: What is the preferred input modality for a conversational tool? Would multimodal input be useful?

RQ2: How do input modalities affect cognitive load?

RQ3: Do novice and advanced programmers’ conversational programming preferences differ?

RQ4: Are current ASR and NLU technologies adequate for conversational coding?

RQ5: Is it better to have constrained or unconstrained NL?

RQ6: Can conversational programming teach computational thinking skills? How does it compare to visual programming?

RQ7: How can a conversational programming system facilitate project creation and computational action [37]?

This study is the first in a series aimed at addressing these questions. It addresses voice- and text-input preferences (RQ1), cognitive load effects of voice- and text-input (RQ2), advanced and novice user preferences (RQ3), and out-of-the-box ASR and constrained NLU effectiveness (RQ4 and RQ5). Future studies with updated systems (e.g., unconstrained NLU system, system with visualizations) will address remaining questions. Ultimately, our goal is to leverage state-of-the-art ML technologies to empower learners to develop programs and solve problems in their communities. To do so, we will explore conversational AI design spaces with respect to lowering the barrier to entry to programming.

This paper presents the following main contributions:

  1. A formative study (n=45) examining cognitive load, input modalities, and advanced and novice programmers’ performance with a conversational programming agent.

  2. The system design of a conversational programming agent, Convo.

  3. Design considerations for future systems based on quantitative and thematic analyses of user feedback.

II Background and Related Work

Our conversational programming system, Convo, draws on a number of research areas, including conversational AI, voice-first human-computer interaction (HCI), natural language programming, and accessible programming.

II-A Conversational AI, Voice-First HCI, and NL Programming

To create an effective conversational programming system, we utilized conversational and voice-first design principles. Conversational AI design literature often references Grice’s “conversational maxims” [12], which can be decomposed into concise, correct, relevant, and natural principles [38]. For example, to adhere to the natural principle, Convo says “The name, name, has already been used” instead of “Class name already exists”.

Another important principle is NLU flexibility. Design guides suggest that conversational systems should understand synonyms, over-answering and subtextual meaning [14, 10]. However, since we wanted to study the effects of constraining NL in a conversational programming system (RQ5), we restricted Convo to understanding particular NL phrases, like “create a variable”, so that we can compare it to an unconstrained NL system in future studies. We hypothesize that constraining NL input will reduce the potential for ambiguity and increase the likelihood that Convo understands users. At the same time, reducing the total number of phrases Convo understands could force users to think about precise commands, thereby increasing their cognitive load.

Previous studies support this hypothesis and illustrate how unconstrained NL results in ambiguity and can cause mismatch in human perception and reality of the system’s understanding [13, 8]. Other research suggests, however, that such ambiguities can be eliminated through conversational QA and programming by demonstration [27, 22]. To the authors’ knowledge, however, there have been no studies directly comparing constrained and unconstrained NL programming systems with such ambiguity reduction methods. In this study, we investigate users’ perception of constrained NL with Convo.

Historically, voice-based programming systems have been developed for advanced programmers using syntactically-constrained vocabularies (largely due to limited ASR and NLU technology) [5, 29, 19, 25]. Recently, however, a number of voice-based, NL systems have appeared. TurtleTalk, for instance, allows children to program the movement of a turtle in a speech-based video game environment [15]. Other systems allow for mechatronic system control or database queries [36, 18, 7]. Nonetheless, each of these systems is limited in scope. Though domain-specific ASR systems constrain the vocabulary space, which generally results in better speech recognition, general-purpose ASR systems can provide the flexibility learning technology usually requires. This research aims to determine the feasibility of a general-purpose voice-or-text NL programming system using current state-of-the-art ASR from Google Cloud Speech-to-Text API [11].

II-B Accessibility and Cognitive Load

Conversational coding systems have the potential to increase programming accessibility in three main areas: (1) for those with visual- and motor-impairments, (2) for those who are unable to read or type, and (3) for those with little prior programming knowledge. Particularly salient design principles for accessible auditory programming environments include describing code, not syntax; expressing logical context (e.g., nested loop structure) over spatial context (e.g., line 484); providing localization, querying and navigation cues; and intentionally addressing voice-based ambiguity, like homonyms [2, 24, 33, 34, 32, 30]. These principles guided the development of Convo.

Despite these principles, voice-based programming systems can still induce high cognitive load, especially when the vocabulary or grammar deviates greatly from natural language [5, 2]. Cognitive theory suggests that auditory and visual information are processed via separate channels, and that channels have limited capacity [23]. Thus, a voice-based system without any scaffolding may overload the auditory channel, but a combination of the two (e.g., a voice-or-text-based system) may overcome the challenges of high cognitive load [39]. Recent research also shows that student learning is improved when students can respond verbally to a conversational system [20].

III System Design

Convo is a voice-based system allowing users to develop computer programs by conversing in natural language with a conversational agent. For the user study, the system was designed to support both voice- and text-based conversations. Currently, the system supports three main tasks—(1) program creation, (2) program editing, and (3) system feedback—through natural language. This is illustrated in the following system walk-through.

III-A System Walk-through

Currently, Convo is a constrained NL-based programming system in which commands have to be stated exactly for the system to understand. However, we are developing a less constrained version for future studies to analyze NL constraint effects on cognitive load. Here, we illustrate an example scenario and conversation using the current system with Lisa, a user, who wants to make a game for her little brother Chris to help him learn what sounds animals make.

Constrained NL System (Current). Lisa starts up Convo and says “Hey Convo, I want to make a game.” Since the system doesn’t recognize the word “game”, it responds with “I didn’t understand what you want to do. You can start making a program by saying ‘create a program’.” Lisa says, “Okay, create a program.” Convo responds by asking her, “What do you want to name the program?”. Lisa replies, “Animal Sounds.” She can now add actions to the program. Lisa wants to make a loop but does not know how. She sees an example phrase in the sidebar – “Create a loop” – and tries it out. The system asks for the halting condition, to which Lisa responds, “Until I say ‘stop’.” Now inside the loop, Lisa tells Convo to ask for user input by saying, “Get user input and save it as animal.” Next, Lisa makes conditionals for dog, cat, horse, and cow sounds. For the dog sound, Lisa says “If animal is dog, play the dog sound.” After going through each animal sound, Lisa closes the loop by saying, “Close loop”, and tells Convo she is finished by stating “Done.” Finally, Lisa and her little brother Chris run the program by telling Convo to “Play Animal Sounds.”

As illustrated by this scenario, Convo only recognizes exact commands like “create a program” or “make a program,” limiting the possible conversations that users can have with Convo. In addition, Convo has a constrained system for providing feedback or assistance to the user. Below, we illustrate the same scenario with a less constrained NL system.

Unconstrained NL System. Lisa starts up Convo and says “Hey Convo, can I make a game called Animal Sounds?” Convo does not know what a game is but is able to ask Lisa, “What’s a game? Is it like a procedure or a variable?”. Lisa responds “It is a procedure.” With that, Convo is able to equate “making a game” to making a procedure. In addition, Convo also recognizes that “Animal Sounds” is the name that Lisa wants to use. Convo proceeds to create a program called “Animal Sounds”. Lisa wants to make a loop, but she has never done it before with Convo. Instead of needing to consult documentation, she directly asks “Convo, how do I make a loop?”. Convo directs her through the process by responding, “First, you need to have a stopping condition. What do you want it to be?”. Lisa says, “Until I say ‘stop’,” and proceeds to add actions to the loop. She says “Set animal to user input”. Convo detects potential ambiguity and asks Lisa, “Do you mean get user input and set ‘animal’ to the value of the input, or set ‘animal’ to the words, ‘user input’?” Lisa indicates the former and proceeds to create the rest of the program.

An unconstrained version of Convo would be able to recognize and detect user intents from a larger variety of utterances. Because ambiguity exists in natural conversations, the unconstrained version would also need to detect ambiguity and ask for clarification. By implementing such a system, we would be able to determine the suitability of both constrained and unconstrained conversational NL for programming.

III-B Technical Implementation

Convo consists of four modules: the voice-user interface (VUI), natural language understanding (NLU) module, dialog manager (DM), and program editor (PE), as shown in Fig. 1.

Users interact with the system through the VUI. The VUI receives and transcribes voice input into text and sends it to the next module. The ASR is handled by Google’s Cloud Speech-To-Text API [11] service. The transcribed output is sent to a WebSocket server where the NLU and post-processing occurs, and the DM generates appropriate responses. Responses are voiced back to users using Google’s Speech Synthesis API [9].

The second module performs NLU on the input, extracting and recognizing users’ intents based on utterances provided by the VUI. The NLU module is syntactically-constrained and uses a regular-expression-based semantic parser to determine intent and extract semantic information. In future studies, we will implement customized NL and speech models, and compare this system with our current constrained system to determine usability and cognitive load effects.
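As a rough illustration of how a regular-expression-based semantic parser can map exact phrases to intents, consider the minimal Python sketch below. The pattern table, the intent names, and the `parse_utterance` helper are all hypothetical; the phrases are borrowed from the walk-through above, but Convo's actual rules are not given here.

```python
import re

# Hypothetical intent patterns, loosely based on phrases from the walk-through.
# These are NOT Convo's actual rules.
INTENT_PATTERNS = [
    ("create_program", re.compile(r"^(?:create|make) a program(?: called (?P<name>.+))?$")),
    ("create_variable", re.compile(r"^create a variable(?: called (?P<name>\w+))?$")),
    ("get_user_input", re.compile(r"^get user input and save it as (?P<variable>\w+)$")),
    ("create_loop", re.compile(r"^create a loop$")),
    ("close_loop", re.compile(r"^close loop$")),
]


def parse_utterance(utterance):
    """Return (intent, slots) for a recognized utterance, or (None, {}) otherwise."""
    # Normalize case and drop trailing punctuation before matching.
    text = utterance.strip().lower().rstrip(".!?")
    for intent, pattern in INTENT_PATTERNS:
        match = pattern.match(text)
        if match:
            # Keep only the named groups that actually captured something.
            slots = {k: v for k, v in match.groupdict().items() if v is not None}
            return intent, slots
    return None, {}
```

Under these assumed patterns, an utterance such as “I want to make a game” matches nothing and yields `(None, {})`, which corresponds to the “I didn’t understand” response in the constrained walk-through.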

The third module is the DM, which is responsible for keeping track of the conversation, goals between users and the system, and the agent state. Agent states include program creation, editing and execution states, and are managed by a finite-state machine.
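A finite-state machine over agent states could be sketched as follows. The state names follow the creation, editing, and execution states mentioned above; the idle `HOME` state, the event names, and the transition table are assumptions made for illustration only.

```python
from enum import Enum


class AgentState(Enum):
    HOME = "home"            # assumed idle state, not named in the paper
    CREATING = "creating"
    EDITING = "editing"
    EXECUTING = "executing"


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (AgentState.HOME, "create_program"): AgentState.CREATING,
    (AgentState.HOME, "edit_program"): AgentState.EDITING,
    (AgentState.HOME, "run_program"): AgentState.EXECUTING,
    (AgentState.CREATING, "done"): AgentState.HOME,
    (AgentState.EDITING, "done"): AgentState.HOME,
    (AgentState.EXECUTING, "finished"): AgentState.HOME,
}


class AgentStateMachine:
    def __init__(self):
        self.state = AgentState.HOME

    def handle(self, event):
        """Apply an event; return True if it caused a transition."""
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return False  # the event is not valid in the current state
        self.state = next_state
        return True
```

In this sketch, an event that is invalid in the current state (for example, asking to run a program while one is being created) leaves the state unchanged, so the dialog manager can respond with guidance instead.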

Given information extracted by the NLU, the DM creates a “user goal.” The goal contains the user’s intent (e.g. intent to create a variable) and necessary actions the system must take to complete the goal. These actions are referred to as “agent goals.” For example, when a user wants to create a variable, the DM will ask for a name and an initial value if they weren’t originally provided by the user. In this case, the user goal is to create a variable, while the agent goals are to ask for the name and initial value. The latter behavior is commonly known as slot-filling. Completing agent goals leads to various possible changes, including state changes and new additions to the program. The agent goals and state determine the specific response Convo provides after a given user command.
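The slot-filling behavior described above might look like the following minimal sketch. The class name, the slot names, and the second prompt are hypothetical; only the first prompt echoes the example response in the Fig. 1 caption.

```python
# Required slots and the question the agent asks when a slot is missing.
# The second prompt is invented for illustration.
REQUIRED_SLOTS = {
    "name": "What do you want to call it?",
    "value": "What should be the initial value?",
}


class CreateVariableGoal:
    """User goal: create a variable. Each missing slot becomes an agent goal."""

    def __init__(self, **slots):
        self.slots = dict(slots)

    def next_prompt(self):
        """Return (slot, prompt) for the first unfilled slot, or None if complete."""
        for slot, prompt in REQUIRED_SLOTS.items():
            if slot not in self.slots:
                return slot, prompt
        return None

    def fill(self, slot, value):
        self.slots[slot] = value

    @property
    def is_complete(self):
        return self.next_prompt() is None
```

If the user supplies the name up front, only the initial value remains to be asked for; once every slot is filled, the goal can be handed to the program editor.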

The fourth module is the PE, which is responsible for program-related tasks like editing and execution. The PE interacts with the DM, receiving program-related commands and actions and returning program context and state. During program creation, the editor keeps a program representation in memory. The representation is a list of actions that the agent performs when executing the program, and can be exported to other formats (e.g. JSON, Javascript, Python). During editing, the PE keeps track of the program state, which contains information such as defined variables and the current action.
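One way to picture a program kept in memory as a list of actions, together with a JSON export, is the sketch below. The `ProgramEditor` class, the action dictionaries, and the field names are assumptions for illustration, not Convo's actual data model.

```python
import json


class ProgramEditor:
    """Hypothetical in-memory program: an ordered list of action dictionaries."""

    def __init__(self, name):
        self.name = name
        self.actions = []        # actions executed in order when the program runs
        self.variables = set()   # program state: names of defined variables

    def add_action(self, kind, **params):
        action = {"type": kind, **params}
        if kind == "create_variable":
            self.variables.add(params["name"])
        self.actions.append(action)
        return action

    @property
    def current_action(self):
        """The most recently added action, or None for an empty program."""
        return self.actions[-1] if self.actions else None

    def to_json(self):
        # Export the representation; other formats would need their own writers.
        return json.dumps({"name": self.name, "actions": self.actions}, sort_keys=True)
```

For example, `ProgramEditor("Animal Sounds")` followed by a `create_variable` action and a `say` action produces a two-element action list that `to_json()` serializes directly.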

IV User Study

We conducted a user study to evaluate the effectiveness of Convo and to understand the user needs of a conversational programming environment.

IV-A Participants

We recruited 45 participants through university mailing lists and flyers, with 27 males, 17 females, and 1 unspecified. Participants ranged from local high school students to members of local universities to community members from elsewhere. The minimum age of participants was 14, maximum was 64, and mean 25.3. Based on survey results, 12 users self-identified as “novice” (users with little to no programming experience) and 33 users self-identified as “advanced” (users who had completed a programming course or had experience in object-oriented programming). All 45 participants completed at least one part of the user study. Participants were given a $20 or $30 Amazon gift card depending on whether they identified as novice or advanced.

IV-B Procedure

Upon arriving at the user study, participants were presented with an informed consent form, in which they agreed to be audio- and keystroke-recorded. Participants were asked to provide their own laptops, but were provided with earbuds for the task. Before starting the study, participants read detailed instructions about what to expect during the study, watched a video on how to use the text-input, voice-input, and voice-or-text systems, and filled out a demographics questionnaire.

Participants interacted with the Convo programming environment in three stages: the practice stage, the novice stage, and the advanced stage, with the advanced stage for advanced participants only. The practice stage was designed for participants to familiarize themselves with the environment. At each stage, participants interacted with three systems. Each system had a goal for the user to complete. Participants were shown what success looked like for each goal through a video prior to starting. Participants could not move on to the next goal until they successfully completed the current goal. After interacting with a system in a particular stage, participants filled out a questionnaire about their experience.

We performed a mixed between- and within-subject test, where the between-subject conditions were the novice and advanced stages and the within-subject condition was input modality type. We randomized the order of the systems and introduced slight variations to the goals to account for learning effects. Participants were given as much time as they needed to complete all tasks, and could raise their hand to ask for help if they had questions. Questions were addressed following a strict protocol to ensure participants received the same advice for specific technical issues. After completing the study, participants filled out a final questionnaire.

IV-C Study Tasks

At every practice, novice, and advanced stage, participants were given three tasks, or goals, to complete, one for each system (voice-input, text-input and voice-or-text systems). The goals varied slightly but were similar enough that participants would produce similar actions while interacting with different systems. Goals were randomly matched to systems. The set of tasks participants were asked to complete was:

  1. Practice Stage: Create a program where Convo says “hello world” in audio format. We varied the phrase Convo should say three times.

  2. Novice Stage: Create a program where Convo listens for user input and plays two different animal sounds (e.g. If I say “cat”, play “meow”). We varied the type of animal sounds required three times.

  3. Advanced Stage: Create a program where Convo continuously listens for user input for a set number of times and plays the corresponding animal sound. We varied the number of times Convo listens for user input and the required animal sounds three times. Only advanced participants completed the Advanced Stage.

The novice goals required participants to use conditionals and variables. The advanced goals required participants to build off of novice goals and use a loop to generate the specified animal sounds multiple times. The user interface for the study is shown in Fig. 2.

Fig. 2: The advanced stage of the study using the voice-or-text-based system, which shows both the record button and text box for input.

IV-D Data Collection

Each participant completed a questionnaire about their demographic and programming background. All text- and voice-input transcripts were recorded throughout the study. We also recorded the number of times participants asked for help and the number of resets they were given when they got stuck and wanted to redo a particular goal. After every completed task, we collected the following measures:

  • Time: The total time duration for a participant to complete a goal.

  • Usability: Participants responded to “I found it difficult to complete the goal with the [input]-based system.” and “I found programming with the [input]-based system difficult to use.” on a 5-point Likert scale.

  • Satisfaction: Participants responded to “I am satisfied programming with the [input]-based system.” on a 5-point Likert scale.

  • Efficiency: Participants responded to “I found programming with the [input]-based system efficient to use.” on a 5-point Likert scale.

  • Preferences: Participants answered free-form questions about what they liked and disliked in using the system.

  • Desired features: Participants answered free-form questions about what features they wished to add.

At the conclusion of the study, users filled out a questionnaire with 5-point Likert scale questions comparing the three systems and free-form questions asking participants what they would ask the conversational agent, challenges they ran into, and questions they had about the system.

We used analysis of variance (ANOVA) for between-subjects analyses and repeated measures ANOVA for within-subjects analyses.
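For readers unfamiliar with the test, the F-statistic of a one-way, between-subjects ANOVA can be computed as in the sketch below. This covers only the plain between-subjects case (the repeated-measures variant additionally removes per-subject variance), the function name is ours, and any numbers passed to it in an example are toy values, not data from this study.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F relative to its degrees of freedom indicates that the differences between group means are large compared with the variation within groups; the corresponding p-value comes from the F distribution.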

IV-E Results

IV-E1 Quantitative Analysis

The type of input modality had a significant effect on participants’ perception of the system. Our results show that both novice and advanced participants strongly preferred the text-based system over the voice-based system. Participants felt it was more difficult to complete the programming goals with the voice-based system, and were generally more satisfied with the text-based system.

Novice participants made significantly more incorrect utterances with the voice-based system (mean=17.38) compared to the text-based (mean=1.38, F=17.79, p=0.0001) and voice-or-text-based systems (mean=4.77, F=11.39, p=0.0017), whereas no significant difference was observed for advanced participants. In addition, novice participants were more satisfied with the voice-or-text-based system (mean=2.61) than the voice-based system (mean=3.44, F=15.90, p=0.0003), and found the voice-or-text-based system (mean=2.66) more efficient to use than the voice-based system (mean=3.47, F=14.18, p=0.0006). There was no significant difference in preference observed for advanced participants.

Advanced participants perceived the voice-or-text-based system (mean=2.94) to be more difficult to use compared to the text-based system (mean=3.5, F=6.36, p=0.02); there was no significant difference found for novice participants. Overall, novice participants found the voice interaction of the voice-based and voice-or-text-based systems to be useful and enjoyable, whereas advanced participants tended to disagree more with those statements (see Fig. 7).

Prior programming knowledge and gender did not have a significant effect on completion time for all participants. Novice participants and advanced participants completed the practice and novice stages in around the same time. There was also no significant difference between the number of voice utterances and text utterances during the novice stage. Advanced participants tended to use more text utterances than voice utterances during the advanced stage.

To investigate cognitive load effects, we examined the number of resets of the system (as participants mentioned they reset due to forgetting where they were in the program they were creating), time to goal completion, and number of times users asked for help. Note that we only analyzed the advanced stage for cognitive load, since the instructions were provided line-by-line in the novice stage (i.e., minimal cognitive load involved), whereas users needed to determine which steps to take next on their own in the advanced stage (i.e., significant cognitive load involved). There was no observed significant difference in the number of times asked for help with the voice-input, text-input, and voice-or-text system. The input modality also did not have a significant effect on the number of resets or time to goal completion during the advanced stage.

IV-E2 Qualitative Analysis

We organized the free-form responses and analyzed for patterns using an inductive approach [35] (i.e. open coding). We identified fourteen design themes, which fell into two main categories, positive feedback and recommendations (see Fig. 3).

Fig. 3: The theme hierarchy created during open coding.

We coded 651 occurrences of these themes and show representative quotations below. Themes for the positive feedback category follow:

Efficient (49/651): “I liked how quick it was. Having to just speak to program is far quicker than typing […]”

Usable (48/651): “I liked how straight-forward and logical it is because it translates the logic of the code into everyday speak.”

Accessible (32/651): “I liked the availability of the text option because it usually would take me a few attempts to get the voice working.”

Effective coding features (9/651): “I liked that it tried to catch cases like ‘not having a false condition’. I imagine this will be super useful in recommending base cases for recursion problems”

Interesting (6/651): “It feels cool to do this - I can imagine coding while driving or doing housework.”

Themes for the recommendations category regarding improving the agent’s output follow:

Increase agent interaction (91/651): “[I would add] a spellchecker, like if a word is spelled incorrectly it could say ‘You said ‘dune’, did you mean ‘done’?”

Add visualization (72/651): “[I would add] some sort of visualization of the function being built up as interaction progresses”

Improve efficiency (30/651): “It also seems quite inefficient to figure out the right way to express a statement in actual words that otherwise can be typed in a programming language […]”

Reduce cognitive load (12/651): “I can’t see my program and I have to remember what’s going on, that will become infeasible very quickly.”

Examples from the recommendation category regarding improving users’ understanding follow:

Increase transparency (25/651): “[I would ask] ‘How do you recognize the voices? Do you use any sort o [sic] machine learning to recognize the accents?’ ”

Reduce ambiguity (12/651): “I’m interested in how does the program differentiate similar commands.”

Convey system purpose (9/651): “Who is the intended audience and what sort of programs do you imagine them writing? […]”

Examples from the recommendation category regarding improving the agent’s recognition follow:

Improve speech-to-text (190/651): “Differentiating between voices and then telling the difference with accents [was a challenge for the system]”

Reduce NL constraints (66/651): “[…] Allowing more variability in what I can say to the agent to get it to do the same command would feel more natural.”

As shown in Fig. 4, six of the top seven themes for novice and advanced users were the same, including improve speech recognition, increase agent interaction, and add visualization. Novice users emphasized increasing transparency over efficiency, and vice versa for advanced users.

Fig. 4: Total number of occurrences for the top seven themes from advanced user responses and top seven from novice user responses. Novice responses emphasized transparency over efficiency. Note how the colors represent which user group(s) the theme came from (e.g., pink represents a top theme from novice users, dark blue represents a top theme from both novice and advanced users).

Among input modalities, participants emphasized improve speech recognition and increase agent interaction for all three systems (voice-input, text-input, and voice-or-text) (see Fig. 5). The voice-input system responses emphasized efficiency; the text-input system responses emphasized improving efficiency; and the voice-or-text system responses emphasized accessibility. Both the voice- and text-input system responses emphasized usability; both the voice-input and voice-or-text system responses emphasized adding a visualization; and both the text-input and voice-or-text systems emphasized reducing the NL constraint.

Fig. 5: Total number of occurrences for the top five themes from each system survey. The voice-input system responses emphasized efficiency; text-input, a need to improve efficiency; and voice-or-text, accessibility. Note how the bars’ colors represent which input system(s) the theme came from (e.g., pink represents text-input system, and green represents voice-input and voice-or-text systems).

V Design Recommendations

Through the quantitative and qualitative analyses, we identified six main design recommendations for future conversational programming systems.

Tailor to programming experience and task. Our results suggest that conversational programming systems should be tailored to their audiences due to differences in user preferences and abilities. We found that novice users generally found voice-input useful and enjoyable, whereas advanced users tended to view it less so, as shown in Fig. 6 and 7. Furthermore, although there was no significant difference between the overall number of voice- and text-inputs, in the advanced stage users tended to type rather than speak (p=0.003). Advanced users also perceived voice-or-text to be more difficult than text (p=0.02), but there was no significant difference for novice users. Thus, for an advanced audience, it may be more important to have a text-input option than for a novice audience, and for an introductory audience, a voice-input system may be more useful than for an advanced one.

Fig. 6: Novice user responses to Likert scale questions. Novices generally found voice useful and enjoyable.
Fig. 7: Advanced user responses to Likert scale questions. Advanced user responses tended to be less favorable towards voice than novice responses.

We also found that some advanced users found NL programming cumbersome, likely because they were used to syntax-restricted programming languages (e.g., “It also seems quite inefficient to figure out the right way to express a statement in actual words that otherwise can be typed in a programming language using very specialized characters.”), whereas novice users tended to praise the naturalness of the language (e.g., “I liked the simplicity of using the normal talk, as in not coding necessarily”). This is further reflected in how “improve efficiency” was among advanced users’ top seven themes, but not novice users’ (see Fig. 4). Thus, NL may be a better fit for an educational, introductory tool than an advanced tool.

Design a flexible, accessible system. Our results suggest conversational programming systems should be accessible through both voice- and text-input. Participants found value in both modalities, often citing voice as efficient (see Fig. 5) and text as accurate. Many participants had comments similar to, “I liked being able to use the voice for longer commands, and the text for shorter commands or misunderstood commands”. This was supported by the significant difference in the number of characters (p=0.004) and words (p=0.003) per utterance between voice and text (i.e., voice utterances were longer). Furthermore, when using the voice-or-text system, participants used both voice and text input, and there was no statistical evidence for a difference in how many times participants spoke versus typed.

From an accessibility standpoint, it makes sense to provide both input options, and allow each of them to stand alone (such that the system is completely accessible by voice-only and text-only). With current technologies, however, this may be difficult to achieve. The Google Cloud Speech-to-Text [11] ASR system we used—which is often recognized as the gold standard [17, 16]—did not seem sufficient for programming. Many participants commented on this (e.g., “Sometimes it had problems understanding my speech, so I resorted to typing things.”, “It seems like if speech recognition worked well, it would be a better choice, but having this [text option] is useful”) and we found that the most common theme was to improve speech recognition. Thus, until speech recognition systems improve, it may be infeasible to have a standalone voice-input system.

Design a transparent system. Many participants described how they would appreciate being able to ask the system how it works. Some questions included:

  • “What kind of nueral [sic] network do you run on?”

  • “How do you understand what I’m saying?”

  • “How do you map my phrases to commands?”

  • “What kind of voice recognition is used?”

  • “Why didn’t the agent understand me?”

  • “How do you register what I’m saying? Should I speak slower/faster? How can I make it easier for you to understand me?”

  • “Do you use any sort o [sic] machine learning to recognize the accents?”

Transparency was one of the top occurring themes for novice users (see Fig. 4) and is especially important when developing AI systems for education.

Design with visualizations. A common theme in the free-form responses was the desire for code visualizations. This was in the top seven commonly occurring themes for both novice and advanced users, and the top five themes for both the voice- and text-based systems. Specifically, users asked for ways to “visualize where [they] are in the program”, view a “representation of the code [they were] making”, “see […] variable names or the name of the procedure”, see “the current state of the program, or at least […] which level [they]’re at”, and visually “modify [their] previous lines that were misinterpreted”. This makes sense, as current technology focuses heavily on visual systems and computer screens, and voice-only systems can force high memorization requirements on users. Nonetheless, depending on a system’s intended audience, one may choose to avoid visualizations or make them non-essential to the system for accessibility reasons.

Design to reduce cognitive load. In the thematic analysis, some participants mentioned high cognitive load due to a lack of visualizations (e.g., “I found it quite challenging to figure out the logic of the program entirely in my head; […] it felt like I had to figure it all out before entering anything.”). In future studies, we will analyze cognitive load effects of integrating visualizations into Convo. We expect this will reduce the cognitive load for sighted users. Other design features to potentially reduce cognitive load include decreasing the constraint on the NL input such that users will no longer have to remember specific phrases, and improving the speech recognition model such that people don’t have to repeat phrases as often, and are more likely to remember where they are in the program.

For all cognitive load indicators (number of resets of the system, time to goal completion, and number of times users asked for help), we found no evidence for a significant difference between the voice-based, text-based, and voice-or-text-based systems; thus, voice-based, text-based and voice-or-text-based systems may be viable options when designing for cognitive load.
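As a rough illustration of how the three indicators above could be tallied per system before any statistical test, the following is a minimal sketch. The `Session` record, its field names, and the `summarize` helper are hypothetical and are not taken from Convo; any values passed in are placeholders rather than study data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Session:
    """One participant's run on one system (hypothetical log record)."""
    system: str             # e.g. "voice", "text", or "voice-or-text"
    resets: int             # number of times the system was reset
    seconds_to_goal: float  # time to goal completion
    help_requests: int      # number of times the user asked for help


def summarize(sessions: List[Session]) -> Dict[str, Dict[str, float]]:
    """Return the mean of each cognitive load indicator, grouped by system."""
    grouped: Dict[str, List[Session]] = defaultdict(list)
    for session in sessions:
        grouped[session.system].append(session)
    return {
        system: {
            "resets": mean([s.resets for s in group]),
            "seconds_to_goal": mean([s.seconds_to_goal for s in group]),
            "help_requests": mean([s.help_requests for s in group]),
        }
        for system, group in grouped.items()
    }
```

Per-system means like these only describe the logs; deciding whether the systems differ still requires a significance test on the underlying sessions.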

Improve ASR and NLU. The most common theme in the free-form responses was to improve speech recognition. As mentioned previously, for Convo we used the Google Cloud Speech-to-Text [11] ASR system, which is often recognized as the top online ASR [17, 16]. Evidently, current ASR systems are not sufficient for fully standalone voice-based, NL programming systems. One potential avenue for improvement is to develop a custom NL programming ASR model that incorporates common NL programming phrases, like “create a variable”, to ensure recognition of those phrases. Nonetheless, training on specific phrases may make the model less robust to new phrases, which would somewhat defeat the purpose of a generalizable NL system.
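A lighter-weight alternative to training a custom model is to bias an existing recognizer's output toward known phrases. The sketch below is one such approach, not the method used in Convo: it rescores an n-best list of (transcript, confidence) pairs, assuming the ASR service can return one. The phrase list (apart from “create a variable”), the `rescore` function, and the `boost` value are all illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical phrase list; only "create a variable" appears in the text,
# the other entries are assumptions for illustration.
KNOWN_PHRASES = ["create a variable", "create a procedure", "run the program"]


def rescore(hypotheses: List[Tuple[str, float]], boost: float = 0.2) -> str:
    """Return the transcript with the highest phrase-boosted score.

    Each known phrase found in a transcript adds `boost` to that
    transcript's ASR confidence before the best one is chosen.
    """
    def score(item: Tuple[str, float]) -> float:
        text, confidence = item
        lowered = text.lower()
        hits = sum(1 for phrase in KNOWN_PHRASES if phrase in lowered)
        return confidence + boost * hits

    best_text, _ = max(hypotheses, key=score)
    return best_text


# Made-up n-best list: the misheard top hypothesis loses after boosting.
n_best = [("great a variable", 0.62), ("create a variable", 0.55)]
best = rescore(n_best)
```

The same trade-off applies here as with a custom model: a large `boost` pulls recognition toward the listed phrases at the expense of utterances that legitimately fall outside them.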

In addition to improved speech recognition, participants desired reduced constraint on NL input (e.g., “It’s a very cool idea, and with expanding the dictionary it could work better.”, “I expect more natural-language input support such as ‘nope’, ‘no thanks’, etc. would be valuable as well.”). Reducing NL constraint was a top theme in both the text-based and voice-or-text-based systems, as well as in both novice and advanced users’ responses (see Fig. 4 and 5).
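One small step toward reducing the constraint is to map varied replies onto the canonical responses the agent already understands. The following is a minimal sketch under that assumption; the `SYNONYMS` table and `normalize` function are hypothetical, and only “nope” and “no thanks” come from the participant comment above.

```python
import re
from typing import Optional

# Hypothetical synonym table; "nope" and "no thanks" come from a
# participant comment, the remaining variants are assumptions.
SYNONYMS = {
    "no": {"no", "nope", "no thanks", "nah"},
    "yes": {"yes", "yeah", "yep", "sure"},
}


def normalize(utterance: str) -> Optional[str]:
    """Map a free-form reply to a canonical intent, or None if unknown."""
    # Lowercase, drop punctuation, and collapse repeated whitespace.
    cleaned = re.sub(r"[^a-z\s]", "", utterance.lower())
    cleaned = " ".join(cleaned.split())
    for intent, variants in SYNONYMS.items():
        if cleaned in variants:
            return intent
    return None
```

A lookup table like this only widens the accepted vocabulary; it does not handle paraphrases it has never seen, which is where a learned NLU model would be needed.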

We are currently developing an unconstrained NL version of Convo to understand whether this improves or reduces performance, as there has been research questioning the suitability of unconstrained NL for programming [13, 8]. Nevertheless, with additional ambiguity reduction techniques, such as conversational QA and immediate feedback from the agent, unconstrained NL may become suitable for introductory, educational NL programming, especially due to the positive feedback in this area from the free-form responses (e.g., “It gives feedback, which is really useful”, “The process is pretty interactive and fun. The idea of using natural language to code is great and the system reacts very fast.”, “Feedback is immediate.”).

VI Conclusions

In this study, we investigated the effectiveness of voice-based, text-based, and voice-or-text-based systems in a conversational programming environment. We analyzed the systems in terms of difficulty, efficiency, and cognitive load indicators through free-form responses, Likert scale questions, and user activity during programming task completion. Our results show a desire for and optimism about conversational programming, especially in introductory programming systems. Future conversational and interactive ML systems should consider the following six design recommendations: (1) Tailor to programming experience and task, (2) Design a flexible, accessible system, (3) Design a transparent system, (4) Design with visualizations, (5) Design to reduce cognitive load, and (6) Improve ASR and NLU. Future work on Convo will address how visualizations and reduced NL constraints affect usability and cognitive load, and how effective conversational programming is for learning computational thinking skills and taking computational action.

VII Acknowledgements

We would like to thank Hal Abelson, Marisol Diaz, Selim Tezel, and Ilaria Liccardi for their support, as well as the participants in our study for their time.


  • [1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In International conference on machine learning, pp. 173–182. Cited by: §I.
  • [2] A. Armaly, P. Rodeghero, and C. McMillan (2017) A comparison of program comprehension strategies by blind and sighted programmers. IEEE Transactions on Software Engineering 44 (8), pp. 712–724. Cited by: §II-B, §II-B.
  • [3] A. Begel and S. L. Graham (2006) An assessment of a speech-based programming environment. In Visual Languages and Human-Centric Computing (VL/HCC’06), pp. 116–120. Cited by: §I.
  • [4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Advances in neural information processing systems, pp. 577–585. Cited by: §I.
  • [5] A. Desilets (2001) VoiceGrip: a tool for programming-by-voice. International Journal of Speech Technology 4 (2), pp. 103–116. Cited by: §II-A, §II-B.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §I.
  • [7] J. E. Godinez and H. M. Jamil (2019) Meet cyrus: the query by voice mobile assistant for the tutoring and formative assessment of sql learners. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pp. 2461–2468. Cited by: §II-A.
  • [8] J. Good and K. Howland (2017) Programming language, natural language? supporting the diverse computational activities of novice programmers. Journal of Visual Languages & Computing 39, pp. 78–92. Cited by: §II-A, §V.
  • [9] Google (2014) Introduction to the speech synthesis api. Last accessed on 2020-02-24. Cited by: §III-B.
  • [10] Google (2019) Conversation design. Last accessed on 2020-02-24. Cited by: §II-A.
  • [11] Google (2020) Google cloud speech-to-text. Last accessed on 2020-02-24. Cited by: §II-A, §III-B, §V, §V.
  • [12] H. P. Grice (1975) Logic and conversation. In Speech acts, pp. 41–58. Cited by: §II-A.
  • [13] M. G. Helander (2014) Handbook of human-computer interaction. Elsevier. Cited by: §II-A, §V.
  • [14] A. Inc. (2019) Voice design guide. Last accessed on 2020-02-24. Cited by: §II-A.
  • [15] H. Jung, H. J. Kim, S. So, J. Kim, and C. Oh (2019) TurtleTalk: an educational programming game for children with voice user interface. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–6. Cited by: §II-A.
  • [16] V. Këpuska and G. Bohouta (2017) Comparing speech recognition systems (microsoft api, google api and cmu sphinx). Int. J. Eng. Res. Appl 7 (03), pp. 20–24. Cited by: §V, §V.
  • [17] J. Y. Kim, C. Liu, R. A. Calvo, K. McCabe, S. C. Taylor, B. W. Schuller, and K. Wu (2019) A comparison of online automatic speech recognition systems and the nonverbal responses to unintelligible speech. arXiv preprint arXiv:1904.12403. Cited by: §V, §V.
  • [18] Y. Kim, Y. Choi, D. Kang, M. Lee, T. Nam, and A. Bianchi (2019) HeyTeddy: conversational test-driven development for physical computing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3 (4), pp. 1–21. Cited by: §I, §II-A.
  • [19] J. Leggett and G. Williams (1984) An empirical investigation of voice as an input modality for computer programming. International Journal of Man-Machine Studies 21 (6), pp. 493–520. Cited by: §II-A.
  • [20] I. Lepadatu (2012) Use self-talking for learning progress. Procedia-Social and Behavioral Sciences 33, pp. 283–287. Cited by: §II-B.
  • [21] T. J. Li, I. Labutov, B. A. Myers, A. Azaria, A. I. Rudnicky, and T. M. Mitchell (2018) An end user development approach for failure handling in goal-oriented conversational agents. Studies in Conversational UX Design. Cited by: §I.
  • [22] T. J. Li, M. Radensky, J. Jia, K. Singarajah, T. M. Mitchell, and B. A. Myers (2019) PUMICE: a multi-modal agent that learns concepts and conditionals from natural language and demonstrations. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, pp. 577–589. Cited by: §II-A.
  • [23] R. E. Mayer (2003) The promise of multimedia learning: using the same instructional design methods across different media. Learning and instruction 13 (2), pp. 125–139. Cited by: §II-B.
  • [24] S. Mealin and E. Murphy-Hill (2012) An exploratory study of blind software developers. In 2012 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 71–74. Cited by: §II-B.
  • [25] A. Nowogrodzki (2018) Writing code out loud. Nature 559 (7712), pp. 141–142. Cited by: §I, §II-A.
  • [26] D. E. O’Leary (2019) GOOGLE’s duplex: pretending to be human. Intelligent Systems in Accounting, Finance and Management 26 (1), pp. 46–53. Cited by: §I.
  • [27] T. Quach (2019) Agent-based programming interfaces for children supporting blind children in creative computing through conversation. Master’s Thesis, Massachusetts Institute of Technology. Cited by: §II-A.
  • [28] M. Rabinovich, M. Stern, and D. Klein (2017) Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1139–1149. Cited by: §I.
  • [29] L. Rosenblatt (2017) VocalIDE: an ide for programming via speech recognition. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 417–418. Cited by: §II-A.
  • [30] E. Schanzer, S. Bahram, and S. Krishnamurthi (2019) Accessible AST-based programming for visually-impaired programmers. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pp. 773–779. Cited by: §II-B.
  • [31] E. C. Shin, M. Allamanis, M. Brockschmidt, and A. Polozov (2019) Program synthesis and semantic parsing with learned code idioms. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 10825–10835. Cited by: §I.
  • [32] A. Stefik (2008) On the design of program execution environments for non-sighted computer programmers. Ph.D. Thesis, Washington State University. Cited by: §II-B.
  • [33] A. Stefik, A. Haywood, S. Mansoor, B. Dunda, and D. Garcia (2009) SODBeans. In 2009 IEEE 17th International Conference on Program Comprehension, pp. 293–294. Cited by: §II-B.
  • [34] A. Stefik, C. Hundhausen, and R. Patterson (2011) An empirical investigation into the design of auditory cues to enhance computer program comprehension. International Journal of Human-Computer Studies 69 (12), pp. 820–838. Cited by: §II-B.
  • [35] D. R. Thomas (2006) A general inductive approach for analyzing qualitative evaluation data. American Journal of Evaluation 27 (2), pp. 237–246. Cited by: §IV-E2.
  • [36] J. Thomason, A. Padmakumar, J. Sinapov, N. Walker, Y. Jiang, H. Yedidsion, J. Hart, P. Stone, and R. J. Mooney (2020) Jointly improving parsing and perception for natural language commands through human-robot dialog. Journal of Artificial Intelligence Research 67, pp. 1–48. Cited by: §II-A.
  • [37] M. Tissenbaum, J. Sheldon, and H. Abelson (2019) From computational thinking to computational action. Communications of the ACM 62 (3), pp. 34–36. Cited by: §I.
  • [38] J. Van Brummelen (2019) Tools to create and democratize conversational artificial intelligence. Master’s Thesis, Massachusetts Institute of Technology. Cited by: §II-A.
  • [39] R. Winkler, S. Hobert, A. Salovaara, M. Söllner, and J. M. Leimeister (2020) Sara, the lecturer: improving learning in online education with a scaffolding-based conversational agent. In ACM CHI Conference on Human Factors in Computing Systems. Cited by: §II-B.