Block-based visual programming environments are increasingly used to introduce computing concepts to novice programmers, including children and students. Led by the success of environments like Scratch [resnick2009scratch], initiatives like Hour of Code by Code.org [hourofcode] (HOC), and online platforms like CodeHS.com [codehscom], block-based programming has become an integral part of introductory computer science education. Considering HOC alone, over one billion hours of block-based programming activity have been logged so far by over 600 million unique students worldwide [codehscom, wu2019zero].
The societal need for enhancing computing education has led to a surge of interest in developing AI-driven systems for pedagogy of block-based programming [wang2017learning, price2017position, price2017isnap, weintrop2017comparing, maloney2008programming]. Existing works have studied various aspects of intelligent support, including providing real-time next-step hints when a student is stuck solving a task [piech15las, yi2017feasibility, paassen2018continuous, marwan2019impact, efremov2020zeroshot], giving data-driven feedback about a student’s misconceptions [singh2013automated, DBLP:conf/icml/PiechHNPSG15, price2017evaluation, wu2019zero], and demonstrating a worked-out solution for a task when a student lacks the required programming concepts [zhi2019exploring]. An underlying assumption when providing such intelligent support is that afterwards the student can practice new similar tasks to finally learn the missing concepts. However, this assumption is far from reality in existing systems—the programming tasks are typically hand-curated by experts/tutors, and the available set of tasks is limited. Consider HOC’s Classic Maze challenge [hourofcode_maze], which provides a progression of tasks: Millions of students have attempted these tasks, yet when students fail to solve a task and receive assistance, they cannot practice similar tasks, hindering their ability to master the desired concepts. We seek to tackle this pedagogical challenge by developing techniques for synthesizing new programming tasks.
We formalize the problem of synthesizing visual programming tasks of the kind found in popular learning platforms like Code.org (see Fig. 1) and CodeHS.com (see Fig. 2). As input, we are given a reference task, specified as a visual puzzle, and its solution code. Our goal is to synthesize a set of new tasks along with their solution codes that are conceptually similar but visually dissimilar to the input. This is motivated by the need for practice tasks that exercise the same concepts while looking fresh enough to maintain student engagement.
When tackling the problem of synthesizing new tasks with the above desirable properties, three key challenges emerge. First, we are generating problems in a conceptual domain with no well-defined procedure that students follow to solve a task; consequently, existing work on educational problem generation in procedural domains does not apply in our setting [Andersen13, gulwani2014example]. Second, the mapping from the space of visual tasks to their solution codes is highly discontinuous; hence, template-based problem-generation techniques [singh2012, polozov2015] that rely on directly mutating the input to generate new tasks are ineffective (see Section 5, where we use this approach as a baseline). Furthermore, such a direct task-mutation approach would require access to an automated solution synthesizer; however, state-of-the-art program-synthesis techniques are not yet on par with experts and their minimal solutions [Bunel18, Devlin17]. Third, the space of possible tasks and their solutions is potentially unbounded, and thus any problem-generation technique that relies on exhaustive enumeration is intractable [singh2012, Ahmed13, Alvin14].
To overcome these challenges, we propose a novel methodology that operates by first mutating the solution code to obtain a set of codes, and then performing symbolic execution over each code to obtain a visual puzzle. Mutation is made efficient by creating an abstract representation of the solution code along with appropriate constraints and querying an SMT solver [BarrettTinelli2018]; any solution to this query is a mutated code. During symbolic execution, we use Monte Carlo Tree Search (MCTS) to guide the search over the (unbounded) symbolic-execution tree. We demonstrate the effectiveness of our methodology by performing an extensive empirical evaluation and user study on a set of reference tasks from the Hour of Code challenge by Code.org and the Intro to Programming with Karel course by CodeHS.com. In summary, our main contributions are:
2 Problem Formulation
The space of tasks. We define a task as a tuple , where denotes the visual puzzle, the available block types, and the maximum number of blocks allowed in the solution code. For instance, considering the task in Fig. 0(a), is illustrated in Fig. 0(a), , and .
The space of codes. The programming environment has a domain-specific language (DSL), which defines the set of valid codes and is shown in Fig. 3(a). A code is characterized by several properties, such as the set of block types in C, the number of blocks , the depth of the corresponding Abstract Syntax Tree (AST), and the nesting structure representing programming concepts exercised by C. For instance, considering the code in Fig. 0(b), , , , and .
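To make these code properties concrete, the following minimal sketch computes the set of block types, the number of blocks, and the AST depth for a code stored as a small tree. The `Node` class, the function names, and the counting conventions (every block counts once; a flat sequence has depth 1) are illustrative assumptions and may differ from the conventions used in the figures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One block of a code; `children` holds the nested body (hypothetical layout)."""
    block: str
    children: list = field(default_factory=list)


def block_types(nodes):
    """Set of block types that occur anywhere in the code."""
    found = set()
    for node in nodes:
        found.add(node.block)
        found |= block_types(node.children)
    return found


def size(nodes):
    """Number of blocks, counting each construct and each action once (assumed)."""
    return sum(1 + size(node.children) for node in nodes)


def depth(nodes):
    """Nesting depth: an empty body has depth 0, a flat sequence depth 1 (assumed)."""
    if not nodes:
        return 0
    return 1 + max(depth(node.children) for node in nodes)


# Example: RepeatUntil { move; If { turnLeft } }
code = [Node("RepeatUntil", [Node("move"), Node("If", [Node("turnLeft")])])]
print(block_types(code), size(code), depth(code))
```

Checking the block-type and size conditions of Definition 1 then reduces to a subset test and an integer comparison on these values; whether the code actually solves the puzzle still requires executing it.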
Next, we introduce three useful definitions relating to the task and code space.
Definition 1 (Solution code).
C is a solution code for T if the following holds: C successfully solves the visual puzzle , , and . denotes the set of all solution codes for T.
Definition 2 (Minimality of a task).
Given a solvable task T with and a threshold , the task is minimal if such that .
Definition 3 (Conceptual similarity of ).
Given a reference and a threshold , a task T along with a solution code C is conceptually similar to if the following holds: , , and .
Definition 4 (Conceptual similarity of ).
Given a reference and a threshold , a task T is conceptually similar to if the following holds: , , and .
Environment domain knowledge. We now formalize our domain knowledge about the block-based environment to measure visual dissimilarity of two tasks, and capture some notion of interestingness and quality of a task. Given tasks T and , we measure their visual dissimilarity by an environment-specific function . Moreover, we measure generic quality of a task with function . We provide specific instantiations of and in our evaluation.
Objective of task synthesis. Given a reference task and a solution code as input, we seek to generate a set of new tasks along with solution codes that are conceptually similar but visually dissimilar to the input. Formally, given parameters , our objective is to synthesize new tasks meeting the following conditions:
is conceptually similar to with threshold in Definition 3.
is visually dissimilar to with margin , i.e., .
has a quality score above threshold , i.e., .
3 Our Task Synthesis Algorithm
We now present the pipeline of our algorithm (see Fig. 3), which takes as input a reference task and its solution code , and generates a set of new tasks with their solution codes. The goal is for this set to be conceptually similar to , but for new tasks to be visually dissimilar to . This is achieved by two main stages: (1) mutation of to obtain a set , and (2) symbolic execution of each to create a task . The first stage, presented in Section 3.1, converts into an abstract representation restricted by a set of constraints (Fig. 3(a)), which must be satisfied by any generated (Fig. 3(b)). The second stage, described in Section 3.2, applies symbolic execution on each code to create a corresponding visual task (Fig. 3(c)) while using Monte Carlo Tree Search (MCTS) to guide the search in the symbolic-execution tree.
3.1 Code Mutation
This stage in our pipeline mutates the solution code of the reference task such that its conceptual elements are preserved. Our mutation procedure consists of three main steps. First, we generate an abstract representation of the code, called a sketch. Second, we restrict the sketch with constraints that describe the space of its concrete instantiations. Although this formulation is inspired by work on generating algebra problems [singh2012], we use it in the entirely different context of generating conceptually similar mutations of the code. This is achieved in the last step, where we use the sketch and its constraints to query an SMT solver [BarrettTinelli2018]; the query solutions are mutated codes that satisfy the code-related conditions of Definition 3.
Step 1: Sketch. The sketch of code C, denoted by Q, is an abstraction of C capturing its skeleton and generalizing C to the space of conceptually similar codes. Q, expressed in the language of Fig. 3(b), is generated from C with mapping . In particular, the map exploits the AST structure of the code: the AST is traversed in a depth-first manner, and all values are replaced with their corresponding sketch variables, i.e., action a, bool b, and iter x are replaced with A, B, and X, respectively. In the following, we also use mapping , which takes a sketch variable in Q and returns its value in C.
In addition to the above, we may extend a variable A to an action sequence , since any A is allowed to be empty (). We may also add an action sequence of length at the beginning and end of the obtained sketch. As an example, consider the code in Fig. 3(d) and the resulting sketch in Fig. 3(e). Notice that, while we add an action sequence at the beginning of the sketch (1), no action sequence is appended at the end because construct RepeatUntil renders any succeeding code unreachable.
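The core traversal of Step 1 can be sketched as follows, leaving out the optional extension by extra action sequences. The tuple encoding of codes, the numbering of sketch variables, and the choice to abstract every loop or branch condition into a B variable are assumptions made for illustration; the second return value plays the role of the mapping from a sketch variable back to its value in C.

```python
def to_sketch(code):
    """Depth-first abstraction of a code into a sketch (illustrative encoding).

    A code is a list of tuples:
      ("action", name), ("Repeat", count, body),
      ("If" | "While" | "RepeatUntil", condition, body),
      ("IfElse", condition, then_body, else_body).
    Returns the sketch and a dict mapping each sketch variable to its value in C.
    """
    counters = {"A": 0, "B": 0, "X": 0}
    values = {}

    def fresh(kind, value):
        # Create the next variable of the given kind and remember its value.
        counters[kind] += 1
        name = f"{kind}{counters[kind]}"
        values[name] = value
        return name

    def visit(block):
        sketch = []
        for node in block:
            kind = node[0]
            if kind == "action":
                sketch.append(("action", fresh("A", node[1])))
            elif kind == "Repeat":
                sketch.append(("Repeat", fresh("X", node[1]), visit(node[2])))
            elif kind in ("If", "While", "RepeatUntil"):
                sketch.append((kind, fresh("B", node[1]), visit(node[2])))
            elif kind == "IfElse":
                sketch.append(
                    ("IfElse", fresh("B", node[1]), visit(node[2]), visit(node[3]))
                )
            else:
                raise ValueError(f"unknown block: {kind}")
        return sketch

    return visit(code), values


code = [("action", "move"),
        ("Repeat", 4, [("action", "turnLeft"), ("action", "move")])]
sketch, values = to_sketch(code)
print(sketch)
print(values)
```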
Step 2: Sketch constraints. Sketch constraints restrict the possible concrete instantiations of a sketch by encoding the required semantics of the mutated codes. All constraint types are in Fig. 3(c).
In particular, restricts the size of the mutated code within . specifies the allowed mutations to an action sequence based on its value in the code, given by . For instance, this constraint could result in converting all turnLeft actions of a sequence to turnRight. restricts the possible values of the Repeat counter within threshold . ensures that the Repeat counter is optimal, i.e., action subsequences before and after this construct are not nested in it. specifies the possible values of the If condition based on its value in the code, given by . refers to constraints imposed on action sequences nested within conditionals. As an example, consider in Fig. 3(f), which states that if = pathLeft, then the nested action sequence must have at least one turnLeft action, and the first occurrence of this action must not be preceded by a move or turnRight, thus preventing invalid actions within the conditional. ensures minimality of an action sequence, i.e., optimality of the constituent actions to obtain the desired output. This constraint would, for instance, eliminate redundant sequences such as turnLeft, turnRight, which does not affect the output, or turnLeft, turnLeft, turnLeft, whose output could be achieved by a single turnRight. All employed elimination sequences can be found in the supplementary material. The entire list of constraints applied on the solution code in Fig. 3(d) is shown in Fig. 3(f).
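As a small illustration of the minimality constraint on action sequences, the sketch below rejects any sequence that contains one of the two redundant patterns named above. The list is deliberately limited to those two patterns; the full set of elimination sequences is in the supplementary material and is not reproduced here, and the function name is hypothetical.

```python
# Only the two patterns mentioned in the text; the complete list is an unknown here.
ELIMINATION_SEQUENCES = [
    ("turnLeft", "turnRight"),             # cancels out, no effect on the output
    ("turnLeft", "turnLeft", "turnLeft"),  # equivalent to a single turnRight
]


def is_minimal(actions):
    """Return False if any elimination sequence occurs as a contiguous run."""
    actions = tuple(actions)
    for pattern in ELIMINATION_SEQUENCES:
        width = len(pattern)
        for start in range(len(actions) - width + 1):
            if actions[start:start + width] == pattern:
                return False
    return True


print(is_minimal(["move", "turnLeft", "move"]))
print(is_minimal(["move", "turnLeft", "turnRight"]))
```

In the pipeline described here such conditions are encoded as solver constraints rather than checked after the fact; the explicit check only shows what the constraint rules out.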
Step 3: SMT query. For a sketch Q generated from code C and its constraints, we pose the following query to an SMT solver: (sketch Q, Q-constraints). As a result, the solver generates a set of instantiations, which are conceptually similar to C. In our implementation, we used the Z3 solver [deMouraBjorner2008]. For the code in Fig. 3(d), Z3 generated mutated codes in s from an exhaustive space of possible codes with . One such mutation is shown in Fig. 0(d).
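The query itself is answered by Z3. Purely to illustrate what a solution set looks like, the sketch below swaps the solver for brute-force enumeration over a tiny hypothetical sketch of the form Repeat(X){A1, ..., An}. The counter range, the lower bound of 2 on the counter, and the single "no cancelling turns" constraint are assumptions; enumeration of this kind does not scale to the spaces handled by the solver.

```python
from itertools import product

ACTIONS = ("move", "turnLeft", "turnRight")
# Adjacent turns that cancel each other (the second pair is assumed by symmetry).
CANCELLING = {("turnLeft", "turnRight"), ("turnRight", "turnLeft")}


def has_cancelling_turns(body):
    """True if two neighbouring actions undo each other."""
    return any(pair in CANCELLING for pair in zip(body, body[1:]))


def mutate_repeat(body_len, counter, delta_iter):
    """Enumerate instantiations (x, body) of the sketch Repeat(X){A1..An}.

    x stays within `delta_iter` of the original counter and is at least 2.
    """
    low = max(2, counter - delta_iter)
    for x in range(low, counter + delta_iter + 1):
        for body in product(ACTIONS, repeat=body_len):
            if not has_cancelling_turns(body):
                yield x, body


mutations = list(mutate_repeat(body_len=2, counter=4, delta_iter=1))
print(len(mutations))
```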
While this approach generates codes that are devoid of most semantic irregularities, it has its limitations. Certain irregularities continue to exist in some generated codes: one such code included the action sequence move, turnLeft, move, turnLeft, move, turnLeft, move, turnLeft, which results in the agent circling back to its initial location in the task space. This kind of undesirable behavior is eliminated in the symbolic execution stage of our pipeline.
3.2 Symbolic Execution
Symbolic execution [King1976] is an automated test-generation technique that symbolically explores execution paths in a program. During exploration of a path, it gathers symbolic constraints over program inputs from statements along the path. These constraints are then mutated (according to a search strategy), and an SMT solver is queried to generate new inputs that explore another path.
Obtaining visual tasks with symbolic execution. This stage in our pipeline applies symbolic execution on each generated code to obtain a suitable visual task . The program inputs of are the agent’s initial location/orientation and the status of the grid cells (unknown, free, blocked, marker, goal), which is initially unknown. Symbolic execution collects constraints over these from code statements. As in Fig. 5 for one path, symbolic execution generates a visual task for each path in .
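The idea of fixing unknown grid cells on demand can be sketched for a single path as follows. Every name here is hypothetical, the grid is unbounded, only a `pathAhead` branch is supported, and branch outcomes are supplied by the caller as a list of booleans instead of being chosen by a search strategy.

```python
# Headings in clockwise order: north, east, south, west (y grows downward).
DIRS = [(0, -1), (1, 0), (0, 1), (-1, 0)]


def symbolic_run(code, start, heading, decisions):
    """Execute `code` along one path, deciding unknown cells lazily.

    code: list of ("move",), ("turnLeft",), ("turnRight",),
          or ("IfElse", then_block, else_block) branching on pathAhead.
    decisions: booleans consumed whenever pathAhead meets an unknown cell.
    Returns (grid, final_position, final_heading); cells absent from `grid`
    remain unknown.
    """
    grid = {start: "free"}
    state = {"pos": start, "dir": heading}
    choices = iter(decisions)

    def cell_ahead():
        dx, dy = DIRS[state["dir"]]
        x, y = state["pos"]
        return (x + dx, y + dy)

    def path_ahead():
        cell = cell_ahead()
        if cell not in grid:  # unknown cell: this is a branch point
            grid[cell] = "free" if next(choices) else "blocked"
        return grid[cell] == "free"

    def run(block):
        for stmt in block:
            op = stmt[0]
            if op == "move":
                cell = cell_ahead()
                # Moving into an unknown cell constrains it to be free.
                if grid.setdefault(cell, "free") != "free":
                    raise RuntimeError("agent crashes into a wall")
                state["pos"] = cell
            elif op == "turnLeft":
                state["dir"] = (state["dir"] - 1) % 4
            elif op == "turnRight":
                state["dir"] = (state["dir"] + 1) % 4
            elif op == "IfElse":
                run(stmt[1] if path_ahead() else stmt[2])
            else:
                raise ValueError(f"unknown statement: {op}")

    run(code)
    return grid, state["pos"], state["dir"]


code = [("move",), ("IfElse", [("move",)], [("turnLeft",), ("move",)])]
grid, pos, heading = symbolic_run(code, start=(2, 2), heading=1, decisions=[False])
print(grid, pos, heading)
```

Each distinct list of decisions corresponds to one path and therefore to one candidate puzzle, which is why the number of candidates grows so quickly with loops and conditionals.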
However, not all of these tasks are suitable. For instance, if the goal is reached after the first move in Fig. 0(d), all other statements in are not covered, rendering the task less suitable for this code. Naïvely, symbolic execution could first enumerate all paths in and their corresponding tasks, and then rank them in terms of suitability. However, solution codes may have an unbounded number of paths, which leads to path explosion, that is, the inability to cover all paths with tractable resources.
Guiding symbolic execution using Monte Carlo Tree Search (MCTS). To address this issue, we use MCTS [kocsis2006bandit] as a search strategy in symbolic execution with the goal of generating more suitable tasks with fewer resources—we define task suitability next. Symbolic execution has been previously combined with MCTS in order to direct the exploration towards costly paths [luckow2018monte].
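For readers unfamiliar with MCTS, the selection step of the UCT variant from [kocsis2006bandit] can be sketched as below. This is the textbook rule, not a transcription of the implementation used here; the data layout and the use of the children's visit total as the parent count are simplifying assumptions.

```python
import math


def uct_select(children, exploration):
    """Pick the index of the child to descend into.

    children: list of (visits, total_reward) pairs, one per branch.
    exploration: the exploration constant weighting the bonus term.
    """
    # Try every branch once before comparing scores.
    for index, (visits, _) in enumerate(children):
        if visits == 0:
            return index

    parent_visits = sum(visits for visits, _ in children)

    def ucb(child):
        visits, total_reward = child
        mean = total_reward / visits
        bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
        return mean + bonus

    return max(range(len(children)), key=lambda i: ucb(children[i]))


stats = [(100, 90.0), (1, 0.0)]
print(uct_select(stats, exploration=0.0), uct_select(stats, exploration=2.0))
```

With no exploration bonus the well-scoring branch is always chosen; a larger constant diverts the search toward rarely visited branches of the symbolic-execution tree.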
As previously observed [kartal2016data], a critical component of effectively applying MCTS is to define an evaluation function that describes the desired properties of the output, i.e., the visual tasks. Tailoring the evaluation function to our unique setting is exactly what differentiates our approach from existing work. In particular, our evaluation function, , distinguishes suitable tasks by assigning a score () to them, which guides the MCTS search. A higher indicates a more suitable task. Its constituent components are: (i) , which evaluates to 1 in the event of complete coverage of code by task and 0 otherwise; (ii) , which evaluates the dissimilarity of to (see Section 2); (iii) , which evaluates the quality and validity of ; (iv) , which evaluates to 0 in case the agent crashes into a wall and 1 otherwise; and (v) , which evaluates to 0 if there is a shortcut sequence of actions (a) smaller than that solves Tout and 1 otherwise. and also resolve the limitations of our mutation stage by eliminating codes and tasks that lead to undesirable agent behavior. We instantiate in the next section.
|Task T||Block types||Size||Depth||Type: Source|
|H1||move, turnL, turnR|| || ||HOC: Maze 4 [hourofcode_maze]|
|H2||move, turnL, turnR, Repeat|| || ||HOC: Maze 7 [hourofcode_maze]|
|H3||move, turnL, turnR, Repeat|| || ||HOC: Maze 8 [hourofcode_maze]|
|H4||move, turnL, turnR, RepeatUntil||5||2||HOC: Maze 12 [hourofcode_maze]|
|H5||move, turnL, turnR, RepeatUntil, If|| || ||HOC: Maze 16 [hourofcode_maze]|
|H6||move, turnL, turnR, RepeatUntil, IfElse|| || ||HOC: Maze 18 [hourofcode_maze]|
|K7||move, turnL, turnR, pickM, putM|| || ||Karel: Our first [intro_to_karel_codehs]|
|K8||move, turnL, turnR, pickM, putM, Repeat|| || ||Karel: Square [intro_to_karel_codehs]|
|K9||move, turnL, turnR, pickM, putM, Repeat, IfElse|| || ||Karel: One ball in each spot [intro_to_karel_codehs]|
|K10||move, turnL, turnR, pickM, putM, While|| || ||Karel: Diagonal [intro_to_karel_codehs]|
4 Experimental Evaluation
In this section, we evaluate our task synthesis algorithm on HOC and Karel tasks. Our implementation will be released together with the final version of the paper. While we give an overview of key results, a detailed description of our setup and experiments can be found in the supplementary material.
4.1 Reference Tasks and Specifications
Reference tasks. We use a set of ten reference tasks from HOC and Karel, shown in Fig. 6. The HOC tasks were selected from the Hour of Code: Classic Maze challenge by Code.org [hourofcode_maze] and the Karel tasks from the Intro to Programming with Karel course by CodeHS.com [intro_to_karel_codehs]. The DSL of Fig. 3(a) is generic in that it includes both HOC and Karel codes, with the following differences: (i) construct While, marker-related actions putM, pickM, and conditions noPathA, noPathL, noPathR, marker, noMarker are specific to Karel only; (ii) construct RepeatUntil and goal are specific to HOC only. Furthermore, the puzzles for HOC and Karel are of different styles (see Fig. 1 and Fig. 2). For all tasks, the grid size of the puzzles is fixed to cells (grid-size parameter ).
Specification of task synthesis. was approximated as the sum of the normalized counts of ‘moves’, ‘turns’, ‘segments’, and ‘long segments’ in the grid; segments and long segments are sequences of and move actions. For Karel, additionally included the normalized counts of putM and pickM. was computed based on the dissimilarity of the agent’s initial location/orientation w.r.t. and the grid-cell status dissimilarity based on the Hamming distance between and . As per Section 2, we set the following thresholds for our algorithm: (i) , (ii) , and (iii) for codes with While or RepeatUntil, and otherwise.
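The grid-cell component of the dissimilarity measure can be illustrated with a normalized Hamming distance over equally sized grids. The character encoding of cells, the dictionary layout of a task, and in particular the equal-weight average of the three components are assumptions for this sketch; the exact combination is not specified in this section.

```python
def grid_hamming(grid_a, grid_b):
    """Fraction of cells whose status differs between two equally sized grids."""
    pairs = [(a, b)
             for row_a, row_b in zip(grid_a, grid_b)
             for a, b in zip(row_a, row_b)]
    return sum(a != b for a, b in pairs) / len(pairs)


def visual_dissimilarity(task_a, task_b):
    """Average of location, orientation, and grid differences (assumed weighting)."""
    location = float(task_a["start"] != task_b["start"])
    orientation = float(task_a["heading"] != task_b["heading"])
    cells = grid_hamming(task_a["grid"], task_b["grid"])
    return (location + orientation + cells) / 3


# '#' blocked, '.' free, 'G' goal (hypothetical encoding).
reference = {"start": (0, 2), "heading": "east", "grid": ["#..", "...", "..G"]}
candidate = {"start": (2, 0), "heading": "east", "grid": ["#..", ".#.", "G.."]}
print(grid_hamming(reference["grid"], candidate["grid"]))
print(visual_dissimilarity(reference, candidate))
```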
Specification of MCTS. We run MCTS times per code, with each run generating one task. We set the maximum iterations of a run to million (M) and the exploration constant to [kocsis2006bandit]. Even when considering a tree depth of , there are millions of leaves for difficult tasks H5 and H6, reflecting the complexity of task generation. We define as
where is an indicator function and constants are weights. For each code , we generated different visual tasks. To ensure sufficient diversity among the tasks generated for the same code, we introduced measure . This measure not only ensures visual task dissimilarity but also ensures sufficient diversity in entire symbolic paths during generation (for details, see supplementary material).
Performance of task synthesis algorithm. Fig. 7 shows the results of our algorithm. The second column illustrates the enormity of the unconstrained space of mutated codes, where we impose only the size constraint from Fig. 3(c). We then impose an additional constraint, resulting in a partially constrained space of mutated codes (column 3), and finally apply all constraints from Fig. 3(c) to obtain the final set of generated codes (column 4). This reflects the systematic reduction in the space of mutated codes by our constraints. Column 5 shows the total running time for generating the final codes, which denotes the time taken by Z3 to compute solutions to our mutation query. As discussed in Section 3.1, a few codes with semantic irregularities still remain after the mutation stage. The symbolic execution stage eliminates these to obtain the reduced set of valid codes (column 6). Column 7 shows the final number of generated tasks, and column 8 is the average time per output task (i.e., one MCTS run).
Fig. 7: Results of the task synthesis algorithm. Column groups: Task; Code Mutation; Symbolic Execution; Fraction of generated tasks with criteria.
Analyzing output tasks. We further analyze the generated tasks based on the objectives of Section 2. All tasks satisfy properties (I)–(III) by design. Objective (IV) is easily achieved by excluding generated tasks for which . For a random sample of up to of the generated tasks per reference task, we manually validated whether objectives (V) and (VI) are met. The fraction of tasks that satisfy these objectives is listed in the last three columns of Fig. 7. To validate task minimality (for two values of ), we apply delta debugging [DeltaDebugging]. We observe that the vast majority of tasks meet the objectives, even if not by design. For H6, the fraction of tasks satisfying (VI) is low because the corresponding codes are generic enough to solve several puzzles.
Deep dive into an MCTS run. To offer more insight into the task generation process, we take a closer look at an MCTS run for task H5, shown in Fig. 8. Fig. 7(a) illustrates the improvement in various components of the evaluation function as the number of MCTS iterations increases. Best tasks at different iterations are shown in Fig. 7(b), 7(c), 7(d). As expected, the tasks improve as the number of iterations grows.
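One plausible way to use delta debugging for a minimality check is to shrink a solution while it still solves the task and compare the size of the result with the original. The sketch below is a simplified, complement-only variant of the ddmin procedure over a flat action list; the predicate `still_solves` is a stand-in for actually executing a code on the puzzle, and none of this is taken from the implementation evaluated here.

```python
def ddmin(items, still_solves):
    """Shrink `items` while `still_solves(items)` stays True (complement-only ddmin)."""
    granularity = 2
    while len(items) >= 2:
        chunk = max(1, len(items) // granularity)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for skip in range(len(subsets)):
            # Drop one chunk and keep the rest.
            complement = [x for j, subset in enumerate(subsets)
                          if j != skip for x in subset]
            if still_solves(complement):
                items = complement
                granularity = max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if granularity >= len(items):
                break  # already at single-element granularity
            granularity = min(len(items), granularity * 2)
    return items


# Toy predicate: the "task" is solved as long as two moves remain.
def needs_two_moves(actions):
    return actions.count("move") >= 2


print(ddmin(["move", "turnLeft", "turnRight", "move"], needs_two_moves))
```

If the shrunken code is shorter than the original by more than the chosen threshold, the task would fail the minimality condition of Definition 2 under this reading.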
5 User Study and Comparison with Alternate Methods
In this section, we evaluate our task synthesis algorithm with a user study focusing on tasks H2, H4, H5, and H6. We developed an online app, which uses the publicly available toolkit of Blockly Games [googleblockly] and provides an interface for a participant to practice block-based programming tasks for HOC. Each “practice session” of the study involves three steps: (i) a reference task is shown to the participant along with its solution code , (ii) a new task is generated for which the participant has to provide a solution code, and (iii) a post-survey asks the participant to assess the visual dissimilarity of the two tasks on a -point Likert scale as used in [polozov2015]. Details on the app interface and questionnaire are provided in the supplementary material. Participants for the study were recruited through Amazon Mechanical Turk. We only selected four tasks due to the high cost involved in conducting the study (about USD per participant). However, we chose these tasks such that they sufficiently span the set of HOC programming concepts of varying degrees of difficulty. The number of participants and their performance are documented in Fig. 9.
Fig. 9: User study results. Columns: Method; Total participants; Fraction of tasks solved; Time spent in secs; Visual dissimilarity.
Baselines and methods evaluated. We evaluated four different methods, including three baselines (Same, Tutor, MutTask) and our algorithm (SynTask). Same generates tasks such that . Tutor produces tasks that are similar to and designed by an expert. For each task of the study, we picked similar problems from the set of Classic Maze challenge [hourofcode_maze] tasks exercising the same programming concepts: Maze 6, 9 for H2, Maze 11, 13 for H4, Maze 15, 17 for H5, and Maze 19 for H6.
MutTask generated tasks by directly mutating the grid-world of the original task, i.e., by moving the agent or goal by up to two cells and potentially changing the agent’s orientation. A total of , , , and tasks were generated for H2, H4, H5, and H6, respectively—the complete list is in the supplementary material. Fig. 10 shows two output tasks for H4 and illustrates the challenge in directly mutating the input task, given the high discontinuity in mapping from the space of tasks to their codes. For H4, a total of out of new tasks were structurally very different from the input.
SynTask uses our algorithm to generate tasks. We picked the generated tasks from three groups based on the size of the code mutations from which they were produced, differing from the reference solution code by for . For H2 and H4, we randomly selected tasks from each group, for a total of new tasks per reference task. For H5 and H6, we selected tasks from the first group () only, due to their complexity stemming from nested constructs in their codes. We observed that Tutor tasks for H5, H6 were also of , i.e., . All the generated tasks picked for SynTask adhere to properties (I)–(VI) in Section 2.
Results on task solving. In terms of successfully solving the generated tasks, Same performed best (mean success = ) in comparison to Tutor (mean = ), SynTask (mean = ), and MutTask (mean = )—this is expected given the tasks generated by Same. In comparison to Tutor, the performance of SynTask was not significantly different (); in comparison to MutTask, SynTask performed significantly better (). The complexity of the generated tasks is also reflected in the average time that participants spent on solving them. As shown in Fig. 9, they spent more time solving the tasks generated by MutTask.
Results on visual task dissimilarity. Visual dissimilarity was measured on a Likert scale ranging from 1–4, 1 being highly similar and 4 highly dissimilar. Comparing the dissimilarity of the generated tasks w.r.t. the reference task, we found that the performance of Same was worst (mean dissimilarity = ), while that of Tutor was best (mean = ). SynTask (mean = ) performed significantly better than MutTask (mean = ), yet slightly worse than Tutor. This is because Tutor generates tasks with additional distracting paths and noise, which can also be done by our algorithm (although not done for this study). Moreover, for H2, which had no conditionals, the resulting codes were somewhat similar, and so were the generated puzzles. When excluding H2 from the analysis, the difference between SynTask (mean = ) and Tutor (mean =) was not statistically significant. A detailed distribution of the responses can be found in the supplementary material.
We developed techniques for a critical aspect of pedagogy in block-based programming: automatically generating new tasks that exercise specific programming concepts while looking visually dissimilar to the input. We demonstrated the effectiveness of our methodology through an extensive empirical evaluation and user study on reference tasks from popular programming platforms. We believe our techniques have the potential to drastically improve the success of pedagogy in block-based visual programming environments by providing tutors and students with a substantial pool of new tasks.