Effects of Hints on Debugging Scratch Programs: An Empirical Study with Primary School Teachers in Training

by Luisa Greifenstein, et al.
Universität Passau

Bugs in learners' programs are often the result of fundamental misconceptions. Teachers frequently face the challenge of first having to understand such bugs, and then suggest ways to fix them. In order to enable teachers to do so effectively and efficiently, it is desirable to support them in recognising and fixing bugs. Misconceptions often lead to recurring patterns of similar bugs, enabling automated tools to provide this support in terms of hints on occurrences of common bug patterns. In this paper, we investigate to what extent the hints improve the effectiveness and efficiency of teachers in debugging learners' programs using a cohort of 163 primary school teachers in training, tasked to correct buggy Scratch programs, with and without hints on bug patterns. Our experiment suggests that automatically generated hints can reduce the effort of finding and fixing bugs from 8.66 to 5.24 minutes, while increasing the effectiveness by 34% more correct solutions. While this improvement is convincing, arguably teachers in training might first need to learn debugging "the hard way" to not miss the opportunity to learn by relying on tools. We therefore investigate whether the use of hints during training affects their ability to recognise and fix bugs without hints. Our experiment provides no significant evidence that either learning to debug with hints or learning to debug "the hard way" leads to better learning effects. Overall, this suggests that bug patterns might be a useful concept to include in the curriculum for teachers in training, while tool-support to recognise these patterns is desirable for teachers in practice.



1. Introduction

Block-based programming is frequently used as entry-point to computational thinking and to programming, since the block-based nature reduces complexity compared to text-based programming languages. Nevertheless, there are many concepts to comprehend, and misconceptions about these concepts may hamper progress in learning. These misconceptions often manifest in “buggy” programs, i.e., programs that contain mistakes and do not function correctly. It is then up to the teacher to help learners identify these bugs in their programs, and to fix or explain them in order to overcome the misconceptions that caused them. While subject teachers at secondary schools may be expected to be adequately educated in debugging, programming is increasingly also introduced at primary schools, where teachers are not specialised in all individual subjects they teach and thus may not have adequate training.

Teachers have the advantage that learners frequently have similar misconceptions, which result in recurring patterns of bugs in their programs. For example, a common misconception of learners is that an if-statement triggers each time the condition becomes true (sorva2018). Figure 3(a) shows a code snippet from a Scratch (maloney2010) program where this misconception manifests in a bug. A correct implementation (Figure 3(b)) would enclose the check inside a forever-loop to continuously check for the occurrence of the event. Bugs matching this pattern occur frequently: In a recent study, this Missing Loop Sensing bug pattern was found in 3,282 projects in a random dataset of 74,830 projects (fraedrich2020).

(a) Buggy version.
(b) Correct version.
Figure 3. An example of the Missing Loop Sensing bug pattern: Instead of continuously checking for collisions with a fish sprite, the buggy snippet performs just a single check immediately after initialising the variable points—this will scarcely suffice to detect all possible touching events.
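
The difference between the two versions in Figure 3 can be illustrated with a small Python analogy. This is only a sketch, not Scratch code: the list `touching_by_frame` is hypothetical per-frame sensor data, and we assume that the script increases `points` on each detected collision.

```python
# Python analogy of the Missing Loop Sensing pattern (not Scratch code).
# `touching_by_frame` is hypothetical data: True in every frame in which
# the player sprite touches the fish sprite.

def buggy_score(touching_by_frame):
    """Single check right after initialising points (as in Figure 3a)."""
    points = 0
    if touching_by_frame and touching_by_frame[0]:
        points += 1
    return points


def fixed_score(touching_by_frame):
    """Check in every frame, like an if inside a forever loop (Figure 3b)."""
    points = 0
    for touching in touching_by_frame:
        if touching:
            points += 1
    return points


frames = [False, False, True, False, True]
print(buggy_score(frames))  # 0: both collisions happen after the single check
print(fixed_score(frames))  # 2: every collision is sensed
```

The buggy variant only reacts if the condition already holds at the very first moment, which mirrors the misconception that an if-statement triggers whenever its condition becomes true.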

Instances of bug patterns like the Missing Loop Sensing example above can be detected automatically. The repetitive nature of these bug patterns also makes it possible to provide generic hints on how to fix the identified bugs, and what misconceptions may cause them. However, especially primary school students might be overwhelmed by the concepts and the technical terminology used in automatically generated hints. Therefore it might be advisable that the teacher deals with the hints and not the primary school student. The teachers can then consider which information they communicate depending, e.g., on the student’s skills. While one would expect that an appropriate hint pointing out the location and type of a bug simplifies debugging, the same information may hamper the learning of teachers in training: The use of direct instruction has been reported to have negative effects on transferring compared to using a more discovery based learning approach (kapur2012designing). Intuitively, the act of manually debugging a program is a valuable learning experience, and reliance on a tool may inhibit the teachers’ ability to spot similar bugs on their own, when no such hints are provided. In order to study the effects of hints on bug patterns, in this paper we therefore aim to investigate these two aspects:

  • How do hints on bug patterns influence the effectiveness and efficiency of teachers while debugging and fixing bugs? In order to support research on bug patterns and their detection, and to provide evidence for teachers to decide whether they should apply bug detection tools, we aim to determine empirically what the benefits of showing hints on bug patterns are for teachers.

  • How do hints on bug patterns influence the ability of teachers in training to recognise and fix bugs without hints? In order to support teacher training, we aim to determine empirically whether providing detailed hints on bug patterns during training affects whether the teachers in training can fix similar bugs when not given any hints.

To answer these questions, we conducted an experiment with 163 primary school teachers in training. We first compared the performance in debugging and fixing broken programs between a subgroup of our participants who were given hints on bug patterns (Group Trmt) versus a subgroup not receiving any hints (Group Ctrl). This allows us to quantify effects on their success in fixing the bugs as well as the time it took them to do so. In order to evaluate which learning experiences can be transferred better, we then experimentally determined whether the participants of Group Ctrl were able to spot similar bugs in other programs more easily without the hints than Group Trmt. This provides us with information whether hints can be transferred, or whether missing the learning experience of manually identifying the bugs hampers the debugging skills of teachers in training.

Our experiment confirms that giving hints on bug patterns improves the performance at correcting bugs described by the hints: In general, hints can reduce the effort of finding and fixing bugs in terms of time from 8.66 to 5.24 minutes, while increasing the effectiveness in terms of correct programs by 34% more correct solutions. This suggests that providing tool-support to recognise bug patterns is desirable for teachers in practice, and research on identifying bug patterns has the potential to improve classroom learning. We do, however, observe that the beneficial effects are larger for participants having prior programming experience. Although our experiment provides no significant evidence that either learning to debug with hints or learning to debug “the hard way” leads to better learning effects, the knowledge of the bug patterns that was gained by the hints seems to slightly improve performance on similar tasks without hints. This suggests that bug patterns might be a useful concept to include in the curriculum for teachers in training.

2. Related Work

Figure 4. Example hint provided by LitterBox to support recognising and fixing the Missing Clone Initialisation bug pattern present in Task 4.

2.1. Misconceptions and Bug Patterns

Learners of programming can have misconceptions about programming constructs (swidan2018). Such misconceptions may severely hamper their ability to progress, as they will lead to programs that do not work and are a source of frustration (hansen2007). In the context of beginner programming, Sorva (sorva2018) identified a catalogue of 41 such common misconceptions; for example, the misconception that variables can store multiple values and store the history of values assigned to it, or that a while loop’s condition is evaluated continuously and the loop exits the instant it becomes false. Common misconceptions may result in similarly erroneous (i.e., buggy) programs. Even though programs may be functionally very different, the bugs resulting from misconceptions may be very similar. Fraedrich et al. (fraedrich2020) introduced the notion of common patterns of such bugs. Generally speaking, a bug pattern in a block-based programming language is a composition of blocks that is typical of defective code (fraedrich2020). Such bug patterns can be observed frequently in practice: Fraedrich et al. (fraedrich2020) found that out of a sample of 74,830 publicly shared projects, 33,655 contained at least one instance of a bug pattern.

Of course, not all bugs result in bug patterns. As bug patterns do not deal with task-specific information but the code itself, bugs such as using the wrong variable, the wrong content in e.g. say blocks or forgetting to implement certain features cannot be detected. Such content-related bugs can be addressed with dynamic analysis tools for Scratch such as Whisker (stahlbauer2019testing). They run the code and check if specified events occur or not. This implies that the intended aim of the program must be known and respective tests must be available or created by the teacher. This is why dynamic testing cannot be easily applied to programs that result from open programming tasks. Moreover, without feeding the tool with further individualised information, no feedback on how to proceed can be provided (apart from the fact that the feature must be changed or implemented).

In contrast to that, static analysis tools enable teachers and students to analyse programs without having to specify their intended output. Furthermore, when such tools check if instances of bug patterns exist they can provide generic information on the detected bug patterns, such as knowledge on the associated misconception or on how to remove the respective bug pattern. This is done by the LitterBox (fraser2021) tool, as can be seen in the example hint in Fig. 4. Intuitively, such knowledge can aid in the process of determining whether code is correct, or in the process of debugging defective code: If an instance of a bug pattern has been spotted, this is a suitable point to start debugging. Indeed, bug patterns tend to lead to functionally defective programs: Out of 250 manually analysed projects 218 contained bug patterns that result in erroneous behaviour (fraedrich2020). Other existing code analysis tools for Scratch such as Hairball (boe2013), Quality Hound (techapaloku2017b) or Dr. Scratch (moreno2015) report general quality problems in terms of code smells (= unaesthetic or less readable code that does not harm the functionality of a program), rather than specific bug patterns resulting from misconceptions. Previous research has shown that these code smells in Scratch hamper novices’ learning processes (hermans2016b). Bug patterns are more severe than code smells, thus likely having an even worse impact on novice programmers; our investigations in this paper suggest a possible way of dealing with them.
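
To illustrate why a static check needs no knowledge of the intended behaviour, the following toy sketch looks for two message-related bug patterns used later in this study (Message Never Sent and Message Never Received, see Table 1). The program representation, a list of scripts made of `(block_type, message)` tuples, and the function name are hypothetical; this is not LitterBox's internal format or algorithm.

```python
# Toy static check on a hypothetical program representation.
# Each script is a list of (block_type, message) tuples; this is NOT
# the data model used by LitterBox.

def message_bug_patterns(scripts):
    """Return messages that are received but never sent, and vice versa."""
    sent = set()
    received = set()
    for script in scripts:
        for block_type, message in script:
            if block_type == "broadcast":
                sent.add(message)
            elif block_type == "when_i_receive":
                received.add(message)
    return {
        "message_never_sent": sorted(received - sent),
        "message_never_received": sorted(sent - received),
    }


program = [
    [("when_i_receive", "Game Over")],
    [("broadcast", "player touches money")],
    [("broadcast", "start")],
    [("when_i_receive", "start")],
]
print(message_bug_patterns(program))
```

The check only compares which messages occur in sending and receiving blocks, so it works on any program, but it cannot tell which of several possible fixes matches the intended behaviour.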

2.2. Feedback on Code

The ability to detect such bug patterns in student code makes it possible to provide learners and teachers with feedback. Kennedy et al. (kennedy2020) demonstrated—using peer feedback and discussions of coding assignments—that misconceptions can be cleared with feedback in principle. In this paper, we evaluate the effects of feedback that is generated automatically by static analysis. Past research on automated hint generation has mainly considered the problem of providing hints on what should be the next step in solving programming assignments (singh2012; price2017b; zimmerman2015; rivers2015; wang2020) or open ended programming tasks (price2016; paassen2017) and how novices seek help in these systems (price2017c; marwan2020; aleven2016help; marwan2019evaluation). An exception is the work by Gusukuma et al. (gusukuma2018), who showed that feedback delivery on mistakes that anticipate possible misconceptions generally leads to favourable results, and that showing such hints does not harm transfer to new tasks. In contrast to this work, the hints we evaluate are not tailored to our self-study material and thus might be more universally applicable.

2.3. Teachers as Debuggers

Not only students but also teachers—who are expected to support their students—struggle with debugging. Sentance and Csizmadia (sentance2017computing) examined the opinions of teachers in the UK about challenges when teaching computing. The most frequently mentioned challenge was the teacher’s subject knowledge, and in particular this was mentioned more often by primary school teachers than by secondary school teachers. Within this challenge, teachers were concerned if they were able to help their students with their problems when programming. Yadav et al. (yadav2016expanding) came to similar conclusions after interviewing high school computer science teachers, where they identified both content and pedagogical challenges. Supporting students when teaching programming is perceived as a main difficulty—not only because of the partially missing computer science background: The teachers explained this challenge also with (1) the teacher-student ratio and (2) the different programming approaches of the students that result in different needs. Teachers’ debugging performance has been studied explicitly by Kim et al. (kim2018debugging), who examined what kind of bugs early childhood preservice teachers produce and how they deal with given bugs: The teachers had difficulties debugging given programs, even though they tried different debugging strategies. Michaeli and Romeike (michaeli2019current) interviewed high school teachers on their strategies to cope with programming bugs of their students. Most teachers reported that they have to help efficiently to reach all students with problems and that there is no time for detailed explanations. Consequently, there is a need to examine how teachers—especially those with insufficient subject knowledge—can be supported in debugging. The aim of this paper is to generate an initial understanding of the effects of hints on bug patterns. 
Do such hints have a positive influence on the effectiveness, efficiency and learning opportunities of teachers inspecting their students’ programs?

3. Study Setup

Figure 5. Fictitious student addressing a study participant as part of Task 9.
| Bug Pattern | Tasks | LitterBox Hint |
|---|---|---|
| Message Never Sent | 1 and 8 | The message “Game Over”, that is to be received here, is never sent. Therefore, the adjacent script will never be triggered. If you want to receive a message, you have to select a message, that is already sent via a different script or you have to create and send a matching message inside a different script. |
| Missing Loop Sensing | 2 and 9 | The highlighted condition is checked only once. Thus, the script runs through too fast. Implement a continuous check by surrounding the conditional statement with a forever loop. |
| Comparing Literals | 3 and 10 | You used current year just as plain text. Your comparison will always return FALSE. Therefore, the code following wait until is never executed. Did you rather intend to use your already existing variable block current year instead of the plain text? You will find this block in the Variables toolbox. |
| Missing Clone Initialisation | 4 and 11 | When you clone a sprite that has no script beginning with When I start as a clone or When this sprite clicked, the clone is unable to do some work. You are using a delete this clone block in a script of Sprite Red. Consider using a different event handler block like When I start as a clone or When this sprite clicked at the beginning of this script. |
| Message Never Received | 5 and 12 | The message “player touches money” that is sent here is never received by a when I receive “player touches money” block. Therefore nothing will happen as a reaction to this message. When you send a message make sure that another script receives it. |
| Forever Inside Loop | 6 and 13 | The inner forever loop is never left. Therefore, all blocks in this script but outside of the forever loop are never executed again. Try omitting the inner forever loop. |
| Stuttering Movement | 7 and 14 | If you continuously press a key, you expect a smooth event processing. Unfortunately, a delay occurs between the first and the second processing round, resulting in stuttering movement. You can prevent this delay to happen by using the key right arrow pressed? block from the Sensing toolbox. To do that, you have to put the conditional statement if key right arrow pressed? then inside of a forever loop and use the event handler When green flag clicked instead of the event handler When key right arrow pressed. |

Table 1. Bug patterns and corresponding hints.

To evaluate the impact of hints on bug patterns on the performance of teachers, we aim to answer the following research questions:

RQ 1: How do hints on bug patterns influence the effectiveness and efficiency of teachers while debugging and fixing bugs?

RQ 2: How do hints on bug patterns influence the ability of teachers in training to recognise and fix bugs without hints?

To answer these research questions, we conducted an A/B study with primary school teachers in training, who were tasked to fix several buggy programs, with and without hints.

3.1. Study Participants

We implemented a three-week programming session into a course on mathematical didactics for primary school teachers in training with 242 students signed up at the University of Passau. We chose the participants of this course because primary school teachers have to teach almost every subject but they are not specialised in all of them. Thus, they may not have had adequate training and may lack adequate skills for debugging. Although computer science is not yet part of the Bavarian curriculum, programming is increasingly introduced in primary schools.

While course participants were recommended to take part in the programming sessions, it was not mandatory as they had a choice of which sessions of the overall course their assessment should be based on. Of the registered students, 189 actively participated, but when participants did not submit all tasks, submitted corrupted files, or submitted programs for the wrong tasks, we conservatively excluded all data of these participants. Overall, we excluded all data points of 26 participants. We excluded 5 participants of Group Ctrl and 3 participants of Group Trmt because some of their submitted programs were corrupted or because they submitted programs of previous tasks. Additionally, we excluded 11 participants of Group Ctrl and 7 of Group Trmt because they did not submit all tasks. This allows us to compare the performance across tasks, which would be challenging if the set of participants differed between tasks.

This results in a total of 163 participants, of which 142 were female, 21 were male, and they were between 18 and 45 years of age with a median of 20 years. According to our demographic pre-survey, many participants had no prior experience in programming at all, which means they neither programmed at school nor at university or anywhere else (Group Ctrl: 27%; Group Trmt: 32%). We will refer to these participants as inexperienced participants and to those with some prior experience as experienced participants.

To ensure all participants had sufficient basic knowledge for the experiment, we provided self-study material that can be completed in 90 minutes: Within one week, the participants were tasked to watch a 30-minute explanatory video and perform three exercises. In the video a program is built step by step, thereby covering all blocks and concepts that were needed to fix the buggy programs later in the experiment. Specific misconceptions and bug patterns were not discussed. The participants had to stop the video three times to perform an exercise. This approach is based on the Use-Modify-Create-Framework (lee2011computational): At first, they used the presented program by opening, running and interpreting it. In the second exercise, they expanded the functionality of one sprite. Finally, they created their own program in about 30 minutes.

3.2. Experiment Procedure

3.2.1. Overview

After introducing the participants to Scratch with the self-study material (described in Section 3.1), the main part of the experiment started: The participants were tasked to fix one example program (Task 0), which served to familiarise participants with the infrastructure, and 14 buggy programs (Tasks 1–14) within two weeks using their own digital devices. They were asked to spend about ten minutes on each task but were allowed to go on to the next task regardless of whether the current task was completed. However, once they had proceeded to the next task, they were not able to go back to previous tasks. The tasks were provided on a website created by the researchers. The participants downloaded the broken programs, edited them on the public Scratch website and then uploaded the edited programs to our website, even if they could not fix them. To obtain meaningful information about the time on task, the participants were asked to pause only at dedicated pages (after each task).

3.2.2. Scenario

For each task a fictitious primary school student describes his or her program. To ensure that our results cannot be attributed to differences in the descriptions, the 14 descriptions are structured and filled with content in the same way (an example is shown in Fig. 5). In the first two paragraphs, the student says hello and that he or she wants to create an animation or a game. In the following paragraphs the student explains which functions have already been correctly implemented. In the last paragraph the student describes what does not work yet and thus the desired final state. After having repaired and submitted the given program the participants wrote feedback on the bug to the fictitious primary school student. Giving feedback corresponds to the procedure in school, and we intended the teachers in training to reflect more on the bug: Participants should repair the program less by trial and error or by passively following the hint, and instead be strongly cognitively involved. When participants were not able to repair a program, they still uploaded their final result, and instead of feedback on the bug they were allowed to write feedback on a positive implementation in the student’s program.

3.2.3. A/B study

While course participants were recommended to take part in the programming sessions, it was not mandatory. We therefore assigned participants to Group Ctrl or Group Trmt on-the-fly through the web interface: Whenever a participant started with Task 1, they were assigned to the smaller of the two groups. The example task (Task 0) was identical for both groups. The first half of the following tasks (Tasks 1–7) differed between Group Ctrl and Group Trmt only in that Group Trmt received one hint for each task. The second half of the tasks (Tasks 8–14) did not differ between the groups: no hints were given for either group. In this way we can examine if there are any effects of using hints within the first tasks and then not providing them within the following tasks that contain the same bug patterns but no hints.
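
The assignment rule and the hint schedule described above can be sketched as follows. The function names are hypothetical, and breaking ties in favour of Group Ctrl is an assumption, since the text does not state how equally sized groups were handled.

```python
# Sketch of the balanced on-the-fly assignment and the hint schedule.
# Tie-breaking in favour of Group Ctrl is an assumption.

def assign_group(counts):
    """Assign the next participant to the currently smaller group."""
    group = "Trmt" if counts["Trmt"] < counts["Ctrl"] else "Ctrl"
    counts[group] += 1
    return group


def hint_shown(group, task):
    """Hints are shown only to Group Trmt, and only for Tasks 1 to 7."""
    return group == "Trmt" and 1 <= task <= 7


counts = {"Ctrl": 0, "Trmt": 0}
groups = [assign_group(counts) for _ in range(5)]
print(groups)  # ['Ctrl', 'Trmt', 'Ctrl', 'Trmt', 'Ctrl']
print(hint_shown("Trmt", 3), hint_shown("Trmt", 8), hint_shown("Ctrl", 3))
```

Assigning at the moment a participant starts Task 1 keeps the group sizes within one of each other, even though it is not known in advance how many registered students will actually take part.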

3.2.4. Bug Patterns, Hints and Tasks

| Task | Blocks | Scripts | Sprites | HD | HE | HL | HS | HV | ICC | WMC |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 81 | 10 | 3 | 27.4 | 19485.3 | 118 | 65 | 710 | 24 | 22 |
| 2 | 44 | 6 | 2 | 20.0 | 7657.1 | 71 | 42 | 382 | 13 | 11 |
| 3 | 34 | 5 | 2 | 14.7 | 4270.7 | 54 | 42 | 291 | 10 | 9 |
| 4 | 52 | 6 | 2 | 28.0 | 12609.4 | 79 | 52 | 450 | 15 | 14 |
| 5 | 47 | 7 | 4 | 14.9 | 6326.6 | 74 | 53 | 423 | 13 | 12 |
| 6 | 36 | 5 | 4 | 11.6 | 3669.6 | 60 | 39 | 317 | 12 | 10 |
| 7 | 62 | 9 | 4 | 20.2 | 12275.4 | 102 | 62 | 607 | 17 | 16 |
| 8 | 56 | 10 | 5 | 16.2 | 9808.4 | 107 | 50 | 603 | 14 | 15 |
| 9 | 49 | 6 | 4 | 15.2 | 5709.8 | 70 | 41 | 375 | 13 | 14 |
| 10 | 66 | 8 | 6 | 18.9 | 12093.2 | 109 | 58 | 638 | 19 | 19 |
| 11 | 77 | 13 | 4 | 18.3 | 13536.6 | 125 | 60 | 738 | 26 | 23 |
| 12 | 61 | 7 | 2 | 25.4 | 13302.1 | 95 | 46 | 524 | 15 | 17 |
| 13 | 38 | 3 | 2 | 15.8 | 4926.9 | 63 | 31 | 312 | 8 | 8 |
| 14 | 39 | 7 | 2 | 16.0 | 5543.9 | 65 | 40 | 345 | 13 | 11 |

HD: Halstead Difficulty, HE: Halstead Effort, HL: Halstead Length, HS: Halstead Size, HV: Halstead Volume (all halstead1977elements); ICC: Interprocedural Cyclomatic Complexity; WMC: Weighted Method Count.

Table 2. Complexity of the programs.

For each task, we created or adapted a program based on projects used previously for teaching. Table 2 shows the values of different complexity measures for each program. We modified each of these programs to contain one instance of a bug pattern.
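
As a reading aid for Table 2, the following sketch applies the standard Halstead definitions (volume V = N · log2(n), effort E = D · V). It assumes that HL is the program length N and HS the vocabulary size n; how the values in the table were computed for Scratch blocks is not described here, so this is only a plausibility check on the Task 1 row.

```python
import math


def halstead_volume(length, size):
    """Halstead volume V = N * log2(n) for length N and vocabulary size n."""
    return length * math.log2(size)


def halstead_effort(difficulty, volume):
    """Halstead effort E = D * V."""
    return difficulty * volume


# Task 1 row of Table 2: HD = 27.4, HL = 118, HS = 65.
volume = halstead_volume(118, 65)
effort = halstead_effort(27.4, volume)
print(round(volume, 1), round(effort, 1))
```

Under these assumptions the computed values lie close to the reported HV of 710 and HE of 19485.3 for Task 1; the small remaining difference is consistent with HD being rounded to one decimal place in the table.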

We decided to use the seven most common bug patterns described by Fraedrich et al. (fraedrich2020), who found each of these bug patterns in more than 1,500 projects of 33,655 publicly shared buggy projects. By choosing exactly seven bug patterns we made a compromise between a reasonable amount of time and the largest number of bug patterns and thus generalisability. We used the bug patterns in the order shown in Table 1. By this, we attempted to maximise the temporal distance between bug patterns which are related—such as Message Never Sent and Message Never Received or Missing Loop Sensing and Forever Inside Loop.

The hints given to Group Trmt for Tasks 1 to 7 were generated using LitterBox (http://scratch-litterbox.org/, last accessed 27.05.2021). As shown in Fig. 4, each hint contains both images and text. The text consists of (1) an explanation of the problem and a clarification of the underlying misconception and (2) a generic suggestion on how to remove the bug pattern. The images show the sprite and the script with the highlighted block, where the cause of the bug is located. Table 1 also lists the hints for Tasks 1 to 7. As LitterBox analyses the code statically and does not deal with content information about the programming task, some hints provide several options on how to proceed. For the bug pattern Missing Clone Initialisation (Fig. 4), e.g., two different event handlers could solve the bug pattern but only one of them might make sense in the associated program. Thus, the user has to decide which option to use to achieve the intended aim of the program. Consequently, it must be examined to what extent generic hints on bug patterns can support learners.

3.3. Data Analysis

We analysed the participants’ performance using the criteria (1) correct fixing of the broken functionalities and (2) time needed.

3.3.1. Effectiveness

We used a Chi-squared test to measure if the functionality differs significantly between Group Trmt and Group Ctrl and used the odds ratio (OR) to calculate the effect size. If OR = 1.0, there are no effects in favour of any group. Values below 1.0 indicate effects in favour of Group Ctrl and values over 1.0 indicate effects in favour of Group Trmt.
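
The two statistics can be sketched for a single task as follows. The counts are hypothetical and not taken from the study, the significance level of .05 is an assumption, and whether a continuity correction was applied in the original analysis is not stated; the sketch uses the plain Pearson statistic.

```python
def odds_ratio(trmt_correct, trmt_wrong, ctrl_correct, ctrl_wrong):
    """Odds of a correct solution in Group Trmt relative to Group Ctrl."""
    return (trmt_correct / trmt_wrong) / (ctrl_correct / ctrl_wrong)


def chi_squared_2x2(trmt_correct, trmt_wrong, ctrl_correct, ctrl_wrong):
    """Pearson chi-squared statistic for a 2x2 table (no continuity correction)."""
    a, b, c, d = trmt_correct, trmt_wrong, ctrl_correct, ctrl_wrong
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts for one task: 40 of 50 correct with hints, 25 of 50 without.
print(odds_ratio(40, 10, 25, 25))                 # 4.0, i.e. in favour of Group Trmt
print(round(chi_squared_2x2(40, 10, 25, 25), 2))  # 9.89

# Assumed threshold: with one degree of freedom, a statistic above 3.841
# corresponds to p < .05.
print(chi_squared_2x2(40, 10, 25, 25) > 3.841)    # True
```

Both functions divide by cell counts or marginal totals, so a task on which one group has no wrong (or no correct) solutions would need separate handling.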

To find out if a broken functionality was correctly fixed we mainly used automated Whisker (stahlbauer2019testing) tests. Whisker tests programs dynamically, which means that Whisker runs the programs and checks if the conditions given in the tests are true. We created between 4 and 17 tests for each task with a median of 7. These tests checked not only if the programs were correctly fixed, but also whether any other functionality of the program was broken. The number of test cases per task differs depending on attributes such as the number of sprites.

A submitted program was considered correct if (1) the broken functionality was fixed and (2) no other functionality was broken. For three tasks we had to supplement the tests due to limitations of Whisker: As stamps (Task 12) and the lack of delay (Tasks 7 and 14) could not be tracked yet with Whisker at the time of the experiment, we used LitterBox to analyse if the matching solution patterns were implemented and to check whether the bug pattern was removed.

During the creation of the first version of the Whisker tests (and for Tasks 7, 12 and 14 the LitterBox analyses), we checked with individual programs if the automated results are correct. Then, we rated 420 random programs (30 per task) manually and used the 13 deviations to refine the Whisker tests. After this refinement step we rated another subset of 210 randomly chosen programs (15 per task) manually: This procedure confirms a very good inter-rater reliability (κ = .96) and thus reliable automated results.
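
Assuming the reported agreement coefficient is Cohen's kappa, the comparison between automated and manual ratings can be sketched as below. The ratings in the example are hypothetical (1 = program rated correct, 0 = incorrect) and are not data from the study.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty item list")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    # Agreement expected by chance from the raters' marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: automated tests versus one manual rater.
automated = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
manual = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print(round(cohens_kappa(automated, manual), 2))  # 0.8
```

In this example the raters agree on 9 of 10 programs, while 50% agreement would be expected by chance, which gives a kappa of 0.8.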

3.3.2. Hint Evaluation

After submitting their solution for a task participants of Group Trmt evaluated the hint on a 5-point Likert scale regarding its support with debugging. Participants were then also asked to explain their rating. This provided us with qualitative data, and we analysed the comments using thematic analysis (bergman2010hermeneutic): Themes are first collected, then counted and in a final step again related to the original data and our research questions. To ensure inter-rater reliability two raters (one author and one assistant) classified a randomly chosen subcorpus of 35 statements (five per Task 1–7) and agreed on a coding scheme. Then, each rater rated half of the statements and additionally five random statements per task. The comparison of these 35 statements confirms a strong inter-rater agreement with κ = .75.

3.3.3. Efficiency

We used a Wilcoxon rank sum test to measure if the time differs significantly between Group Trmt and Group Ctrl and used the Vargha and Delaney measure (Â12) to calculate the effect size. If Â12 = .50, there are no effects in favour of any group. Values below .50 indicate effects in favour of Group Ctrl and values over .50 indicate effects in favour of Group Trmt. The time for completing a task was determined using the website on which we provided the tasks. For each task, we started measuring the time when a participant accessed the page of the respective task (where the task is described and the participants download and upload the program). Time measurement for the task was stopped when the participants submitted their final program. Time spent on giving feedback to the fictitious student, and for Group Trmt on rating the hint, was not included in the time we tracked.
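
The Vargha and Delaney effect size can be computed directly from its definition as a probability of superiority. In the sketch below the times are hypothetical, and the argument order is an assumption chosen to match the interpretation above: the function is called with the Group Ctrl times first, so that values over .50 mean Group Ctrl tends to need more time, which is an effect in favour of Group Trmt.

```python
def vargha_delaney_a12(first, second):
    """Probability that a value from `first` exceeds one from `second`.

    Ties count as half, following the usual definition of the measure.
    """
    if not first or not second:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in first for y in second if x > y)
    ties = sum(1 for x in first for y in second if x == y)
    return (greater + 0.5 * ties) / (len(first) * len(second))


# Hypothetical times on task in minutes (not data from the study).
ctrl_minutes = [9.0, 8.5, 7.0, 10.0]
trmt_minutes = [5.0, 6.0, 7.0, 4.5]
print(vargha_delaney_a12(ctrl_minutes, trmt_minutes))  # 0.96875
```

Swapping the arguments yields one minus this value, so the orientation has to be fixed once and reported together with the result.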

3.4. Threats to Validity

Threats to external validity result from our choice of participants, programs and bugs: Our participants consisted only of primary school teachers in training, so effects for teachers at other levels might be different. However, we believe that primary school teachers are one of the most important target groups. We selected a subset of bug patterns and other bug patterns might have other particularities. As we used common bug patterns, many instances of bugs are covered and as we used seven of them, the general insights might be applicable to other bug patterns. Furthermore, our created or adapted programs differ in their complexity. This is why we included the complexity of the programs as a possible explanation of certain results. Threats to internal validity might result from our experiment setup and technical infrastructure: The experiment was embedded in an online course (due to Covid-19 reasons), and participants did not work in a controlled environment. Our time measurement may be unreliable, since other events might interfere (e.g., browser sessions can be accidentally closed). Threats to construct validity arise since we cannot easily measure whether participants have, or can detect, misconceptions. However, we combined several measurements: In this paper, we will evaluate the functionality and time needed. As a next step we will evaluate the feedback the participants gave to the fictitious students.

4. Results

4.1. RQ1: Effects of Showing Hints

To answer RQ1 we consider Tasks 1 to 7, where participants of Group Ctrl had to work without hints and participants of Group Trmt were shown hints. Figure 10 and Table 7 deal with the hint evaluation: Figure 10 shows how the participants rated the hint on a five-point Likert scale. Participants were asked to explain their rating and Table 7 shows how often different aspects of the hints were mentioned.

Figure 6. Proportion of correct solutions for Tasks 1 to 7.
Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7
All 11.73 2.26 44.38 8.31 22.00
Experienced 16.45 91.88 .21 13.78 20.04
Inexperienced 8.98 3.82 27.50 4.67 37.50
Table 3. Effect sizes for the functionality of Tasks 1 to 7.
Effect sizes associated with significant p-values () are bold.

4.1.1. Effectiveness: Differences in Terms of Functionality

Figure 6 shows how many correct solutions were produced per task and per group, and Table 3 shows the effect sizes for each task for all, experienced and inexperienced participants. Overall, the share of correct solutions was 34 percentage points higher in Group Trmt than in Group Ctrl (Group Trmt: 82%; Group Ctrl: 48%). With an odds ratio of 13.05 on average over Tasks 1 to 7, there are medium effects in favour of the hints.
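To make the odds ratio concrete, the following sketch computes it from the two pooled proportions quoted above. It is an illustration under stated assumptions: the function name is hypothetical, and the pooled value it produces is not the reported 13.05, which is an average over the per-task odds ratios. An average of ratios can differ considerably from the ratio of pooled proportions.

```python
def odds_ratio(p_treatment, p_control):
    """Odds of a correct solution with hints divided by the odds without hints."""
    for p in (p_treatment, p_control):
        if not 0.0 < p < 1.0:
            raise ValueError("proportions must lie strictly between 0 and 1")
    odds_treatment = p_treatment / (1.0 - p_treatment)
    odds_control = p_control / (1.0 - p_control)
    return odds_treatment / odds_control


# Pooled proportions of correct solutions over Tasks 1 to 7, as reported.
share_trmt = 0.82
share_ctrl = 0.48

pooled_or = odds_ratio(share_trmt, share_ctrl)
difference_in_points = round((share_trmt - share_ctrl) * 100)

print(f"difference: {difference_in_points} percentage points")
print(f"odds ratio from pooled proportions: {pooled_or:.2f}")
```

An odds ratio of 1 would mean that hints make no difference, and values above 1 favour Group Trmt.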

We note substantial variation between the individual tasks (Fig. 6): The hints on Tasks 2, 3, 4, 6 and 7 lead to Group Trmt producing significantly more correct solutions () with Task 7 having the highest effect size (OR ). The differences for Tasks 1 and 5, however, are not significant. For Task 1, the difference in favour of the hints narrowly misses significance (, OR ). For Task 5, the percentage of correct solutions is even lower for Group Trmt (79%) than for Group Ctrl (88%), though not significantly (, OR ).

Table 3 suggests that showing hints generally leads to more correct programs for both experienced and inexperienced participants, but the effects are larger for experienced participants (experienced: OR ; inexperienced: OR ). Only for Tasks 3 and 5 do experienced participants benefit less than inexperienced participants. For Task 3, only inexperienced participants of Group Trmt submitted significantly more correct programs than those of Group Ctrl (, OR ). For Task 5 (the only task with slightly negative effects of hints when considering all participants), inexperienced participants of Group Trmt did not produce significantly more or fewer correct programs than those of Group Ctrl (, OR ), but the experienced participants of Group Trmt produced significantly fewer correct solutions than those of Group Ctrl (, OR ).

4.1.2. Efficiency: Differences in Terms of Time

Figure 7. Time needed to submit a program for Tasks 1 to 7.
Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7
All .59 .86 .72 .81 .82 .63
Experienced .84 .76 .83 .79 .64
Inexperienced .91 .68 .77 .87
Table 4. Effect sizes for the time needed for Tasks 1 to 7.
Effect sizes associated with significant p-values () are bold.

Figure 7 shows the time required by the participants per task and per group, and Table 4 shows the respective effect sizes for all, experienced and inexperienced participants. On average, Group Trmt needs 5.24 minutes to submit a program and Group Ctrl needs 8.66 minutes. Thus, the effort of debugging is reduced by 3.41 minutes. The effect size of $\hat{A}_{12} = .70$ on average over Tasks 1 to 7 indicates medium effects in favour of the hints.
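The label "medium" for a Vargha and Delaney effect size of .70 matches the thresholds commonly attributed to Vargha and Delaney (.56 small, .64 medium, .71 large, mirrored below .50). The text does not state which thresholds were applied, so the sketch below is an assumption about the convention rather than a description of the authors' procedure; the function name is hypothetical.

```python
def a12_magnitude(a12):
    """Label an A12 value using the commonly cited .56 / .64 / .71 thresholds.

    Values below .50 are mirrored, so the label describes the size of the
    effect regardless of which group it favours.
    """
    if not 0.0 <= a12 <= 1.0:
        raise ValueError("A12 must lie between 0 and 1")
    mirrored = max(a12, 1.0 - a12)
    if mirrored < 0.56:
        return "negligible"
    if mirrored < 0.64:
        return "small"
    if mirrored < 0.71:
        return "medium"
    return "large"


# Average over Tasks 1 to 7 and the two Task 1 / Task 5 values from the text.
for value in (0.70, 0.59, 0.48):
    print(value, a12_magnitude(value))
```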

Looking at individual tasks, results are similar to those regarding functionality: The hint on Task 5 again slightly impairs the performance, but again not significantly ($p = .74$, $\hat{A}_{12} = .48$). A slight difference to the results for functionality is that the effect of the hint on Task 1 is significant in terms of time ($p = .044$, $\hat{A}_{12} = .59$).

The effect sizes in terms of time are very similar for inexperienced and experienced participants (experienced: $\hat{A}_{12} = .71$; inexperienced: $\hat{A}_{12} = .70$). However, for Task 7, only the experienced participants of Group Trmt submitted their programs significantly earlier than those of Group Ctrl ($p = .013$, $\hat{A}_{12} = .64$).

Summary (RQ1) Hints on bug patterns improve the performance at debugging and fixing programs: In general, programs can be fixed significantly more often and faster. In our experiment, one hint significantly impeded the debugging performance regarding the effectiveness of experienced participants.

4.2. RQ2: Effects of Knowing Patterns

To answer RQ2 we consider Tasks 8 to 14, where participants of both groups had to work without hints.

4.2.1. Effectiveness: Differences in Terms of Functionality

Figure 8. Proportion of correct solutions for Tasks 8 to 14.
Task 8 Task 9 Task 10 Task 11 Task 12 Task 13 Task 14
All 2.13
Table 5. Effect sizes for the functionality of Tasks 8 to 14.
Effect sizes associated with significant p-values () are bold.

Figure 8 shows how many correct solutions were handed in per task and per group. Table 5 shows the effect sizes for each task and differentiates between experienced and inexperienced participants. Comparing both groups, we observe that Group Trmt has only a slight edge over Group Ctrl: Group Trmt submitted 54% and Group Ctrl 47% correct programs in total over Tasks 8 to 14, and we note an average odds ratio of 1.42.

Looking at the individual tasks (Fig. 8), Task 14 clearly stands out, because Group Trmt produced significantly more correct solutions for this task only (, OR ). A second task that stands out in Fig. 8 is Task 12 where Group Ctrl performs even slightly better than Group Trmt (, OR ).

Inexperienced participants of Group Trmt show a slightly larger improvement on average when given hints during training than experienced ones (experienced: OR ; inexperienced: OR ). However, for Tasks 10 and 11, the effects are noteworthy for experienced participants and barely visible for inexperienced participants (Table 5).

4.2.2. Efficiency: Differences in Terms of Time

Figure 9. Time needed to submit a program for Tasks 8 to 14.
Task 8 Task 9 Task 10 Task 11 Task 12 Task 13 Task 14
All .60 .60
Experienced .66
Inexperienced .69 .66
Table 6. Effect sizes for the time needed for Tasks 8 to 14.
Effect sizes associated with significant p-values () are bold.

Figure 9 shows the time required by participants per task and per group, and Table 6 shows the effect sizes for all, experienced and inexperienced participants. Overall, it took participants of Group Trmt 7.00 minutes to submit a program, while Group Ctrl needed 7.69 minutes. Thus, the effort of debugging is slightly reduced, with an effect size of $\hat{A}_{12} = .55$ on average over Tasks 8 to 14.

The largest effects can be seen in Tasks 8, 9, 10 and 14, although the difference is only significant for Tasks 8 ($p = .029$, $\hat{A}_{12} = .60$) and 10 ($p = .022$, $\hat{A}_{12} = .60$).

On average, the effects are very similar for experienced and inexperienced participants (experienced: $\hat{A}_{12} = .54$; inexperienced: $\hat{A}_{12} = .57$). For experienced participants of Group Trmt, the time-saving effects are again larger only for Tasks 10 and 11.

Summary (RQ2) When debugging without hints, there are no significant differences between Group Trmt and Group Ctrl.

5. Discussion

Figure 10. Rating of the hints.

5.1. General Insights

5.1.1. RQ 1: Showing Hints

The results on RQ 1 show that generic hints on bug patterns on average lead to programs being fixed more often and faster. The increased effectiveness is important, as teachers can only help students debug their programs if they understand how to deal with the bug themselves. The increased efficiency implies that primary school teachers can react earlier to their student’s problem, either by saying that they are not able to fix the bug or by trying to help the student to fix it. This is important as one teacher has to deal with many different problem-solving approaches at the same time (yadav2016expanding). When hints are available, it might be easier to deal with this challenge as less time is needed per bug. Still, when encountering hints for the first time, participants have to read, understand, and implement these hints, which also takes time. Further investigation is needed to find out whether seeing the same hints repeatedly in different programs, such that re-reading and comprehending does not need to be repeated, further reduces the amount of time needed.

We observed that more experienced participants benefit slightly more from being shown hints than inexperienced participants. Only for Tasks 3 and 5 do experienced participants benefit less in terms of effectiveness. The complexity of the program for Task 3 is very low (Table 2) and experienced participants might not need the hint to locate the bug. For Task 5, the task complexity does not stand out. However, its hint might harm the performance of experienced participants, as they would have known where to look for the bug but were misled by the hint. For the five other tasks, in contrast, experienced participants benefit more. This might be because basic knowledge is needed to understand the hints. Therefore, tool-support is especially useful for trained teachers in practice to face the challenge of multiple bugs at the same time.

5.1.2. RQ 2: Knowing Patterns

The results of RQ 2 do not indicate that what is learned with hints transfers significantly better or worse than what is learned without them. It seems that having repaired the corresponding bug pattern once does not suffice to transfer the hint’s explanation. Still, it is an open question whether repairing the same bug patterns more often would change the findings regarding the performance with vs. without hints. Another consideration is that the explanation is only displayed in the hint. Users could elaborate more on the hint if they additionally had to, e.g., describe it in their own words as suggested by Marwan et al. (marwan2019evaluation). This might enhance the effects of hints on known patterns.

Our results suggest that inexperienced participants benefit slightly more from learning the patterns through hints, apart from Tasks 10 and 11. This might be related to the relatively high complexity of these two programs (Table 2), possibly overstraining inexperienced participants of both Group Ctrl and Group Trmt. This suggests that knowledge of bug patterns can only support repairing programs correctly and save time when the programming experience suffices to understand a program of a given complexity level. Indeed, for all other tasks inexperienced participants benefit more. This might be because they need more direct instruction (e.g., provided by hints) to transfer their gained experiences than experienced participants. Consequently, bug patterns might be a useful concept to include in the curriculum for teachers in training.

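| Aspect | T1 | T2 | T3 | T4 | T5 | T6 | T7 |
|---|---|---|---|---|---|---|---|
| Problemsolving | 17 | 34 | 38 | 29 | 21 | 33 | 26 |
| Clear hint | 8 | 14 | 13 | 16 | 16 | 19 | 20 |
| Localisation | 20 | 14 | 12 | 9 | 10 | 11 | 5 |
| Single components | 17 | 25 | 8 | 14 | 5 | 7 | 2 |
| Attention drawing | 12 | 13 | 7 | 7 | 8 | 9 | 7 |
| General assistance | 7 | 6 | 5 | 7 | 9 | 4 | 6 |
| Importance | 6 | 7 | 7 | 6 | 0 | 1 | 10 |
| Formulating feedback | 0 | 2 | 1 | 2 | 1 | 2 | 1 |
| TOTAL | 87 | 115 | 91 | 90 | 70 | 86 | 77 |
| Cognitive activation | 3 | 4 | 2 | 5 | 11 | 3 | 5 |
| Partial usefulness | 0 | 1 | 1 | 2 | 0 | 0 | 1 |
| TOTAL | 3 | 5 | 3 | 7 | 11 | 3 | 6 |
| Misleading | 3 | 1 | 0 | 6 | 24 | 1 | 3 |
| Comprehension problems | 3 | 2 | 7 | 6 | 8 | 2 | 7 |
| Insufficient assistance | 5 | 1 | 3 | 3 | 3 | 1 | 6 |
| No problem solving | 0 | 1 | 3 | 3 | 3 | 1 | 7 |
| Time | 2 | 0 | 5 | 2 | 1 | 0 | 1 |
| No deeper understanding | 0 | 3 | 0 | 1 | 0 | 2 | 3 |
| TOTAL | 13 | 8 | 18 | 21 | 39 | 7 | 27 |
| Generally unnecessary | 3 | 3 | 4 | 3 | 4 | 1 | 1 |
| Independent solving | 3 | 1 | 1 | 0 | 4 | 2 | 1 |
| Alternative solution | 0 | 0 | 3 | 0 | 0 | 1 | 0 |
| TOTAL | 6 | 4 | 8 | 3 | 8 | 4 | 2 |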
Table 7. Classification of the comments on the hints.
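The counts in Table 7 can be put into proportion to see how strongly one aspect stands out for a given task. The sketch below is an illustrative calculation that is not part of the original analysis: it divides the "Misleading" counts by the sum of the four TOTAL rows per task. Variable names are hypothetical, and the shares refer to coded mentions rather than to participants, since one comment may touch on several aspects.

```python
TASKS = ["T1", "T2", "T3", "T4", "T5", "T6", "T7"]

# "Misleading" row of Table 7.
misleading = [3, 1, 0, 6, 24, 1, 3]

# The four TOTAL rows of Table 7, in the order in which they appear.
group_totals = [
    [87, 115, 91, 90, 70, 86, 77],
    [3, 5, 3, 7, 11, 3, 6],
    [13, 8, 18, 21, 39, 7, 27],
    [6, 4, 8, 3, 8, 4, 2],
]

# Sum the totals column by column to get all coded mentions per task.
all_mentions = [sum(column) for column in zip(*group_totals)]

misleading_share = {
    task: count / total
    for task, count, total in zip(TASKS, misleading, all_mentions)
}
most_misleading = max(misleading_share, key=misleading_share.get)

for task in TASKS:
    print(f"{task}: {misleading_share[task]:.1%} of coded mentions")
print("highest share:", most_misleading)
```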

5.2. Insights on Hint Generation

5.2.1. Location of the Bug

For both Tasks 5 and 12 the hint impairs effective and efficient debugging. They deal with the bug pattern Message Never Received, which has a peculiarity: While the hint is attached to the block sending a message, the fix requires modification of a different script, which is often located in a different sprite, where that message should be handled. This is because static analysis only detects the pattern (e.g., that a message is sent but nowhere received) but it does not analyse content information (and thus does not return where the message should be received). This seems to provide too little help, which matches the relatively bad rating of this hint (Fig. 10): The participants perceive the hint as misleading and cognitively activating (Table 7), which can also take time: “You had to think a little bit about it to finally get the correct result.” (P72). Furthermore, the hint appears to affect the performance for the similar bug in Task 12 negatively. Thus, hints generally need to make it very clear where the bug is located and where it can be repaired. Consequently, when changes have to be made in a script, or even a sprite, other than the one shown, this has to be explicitly highlighted in the hint to avoid misunderstandings.
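The limitation described above can be illustrated with a toy static check. This is a hypothetical sketch and not LitterBox's implementation: the `Script` class, the function names, and the hint wording are all assumptions made for illustration. The check knows which script sends an unanswered message, but nothing in the program tells it which sprite was meant to receive it, so the hint can only state that the fix lies elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class Script:
    """Simplified stand-in for a Scratch script (hypothetical model)."""
    sprite: str
    broadcasts: list = field(default_factory=list)  # messages this script sends
    receives: list = field(default_factory=list)    # "when I receive" triggers


def messages_never_received(scripts):
    """Return (sending sprite, message) pairs for messages nobody receives."""
    received = {message for script in scripts for message in script.receives}
    return [
        (script.sprite, message)
        for script in scripts
        for message in script.broadcasts
        if message not in received
    ]


def hint_text(sprite, message):
    """Build a hint that states explicitly that the fix is not at the sender."""
    return (
        f'The message "{message}" is sent in sprite {sprite} but never received. '
        "The sending block is not where the fix belongs: add a script that "
        f'starts with "when I receive {message}", possibly in another sprite.'
    )


program = [
    Script("Cat", broadcasts=["start"]),
    Script("Cat", broadcasts=["game over"]),
    Script("Dog", receives=["start"]),
]

findings = messages_never_received(program)
hints = [hint_text(sprite, message) for sprite, message in findings]
for hint in hints:
    print(hint)
```

The analysis returns only the sender's location, which is why the hint text itself has to carry the information that the repair happens in a different script.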

5.2.2. Provision of Several Solution Approaches

Showing hints also led to no significant improvement in effectiveness for Task 1 that deals with the bug pattern Message Never Sent. Even though relatively many participants stated that the hint helped them to draw attention to important areas and to locate the bug, relatively few participants highlighted the support with problem solving (Table 7). This might be explained by an obvious difference to other hints that misled some participants: The hint explains two possible solutions: (1) A message that is already sent in another script has to be selected or (2) a new message has to be sent in another script. Again, the automatic static analysis tools cannot detect whether an appropriate message is available (solution 1) or not (solution 2). The only other hint that explains two possible solutions is the hint on the bug pattern Missing Clone Initialisation. However, the two provided solutions only differ in the block that is inserted and may therefore not be crucial. Consequently, attention must be paid if the results of the static analysis lead to several quite different solutions: Hints should then either provide information about which solution should be used in which context or at least mention that all solutions should be tried out until the intended output is reached.

5.2.3. Degree of Detail

We observed the highest effects regarding the effectiveness of hints not only in Task 7 but also in the corresponding transfer Task 14. This is interesting, as the bug pattern used, Stuttering Movement, is certainly more open to automatic refactoring than all other bug patterns of this study. Consequently, the LitterBox hint is quite customised and describes in great detail what should be done to fix the bug (Table 1). The classification of hint comments confirms that the hint was perceived as clear, important and least unnecessary (Table 7). Other bug patterns might not be so easily spotted, grasped and fixed by learners per se, due to a greater variety of bug pattern instances and/or a greater variety of appropriate fixing pathways and fixing outcomes. They consequently might need far more attention during learning than just trying to solve an individual task. Compared to the effectiveness, however, the hint on the bug pattern Stuttering Movement is slightly less helpful in terms of efficiency. This might be because five blocks have to be changed, instead of one for the other bug patterns. Thus, reading or remembering the relatively long hint might take a while. In conclusion, hints need to provide clear information about how to remove the bug pattern.

6. Conclusions

Teachers are faced with bugs and have to be able to identify and repair them. In this paper, we empirically evaluated to what extent hints can support primary school teachers in training. Our experiment confirmed that hints can help repair bugs in students’ programs, and we found no evidence that the hints would impede the learning effects gained from debugging. While this result is encouraging for research on hint generation, the observations and the participants’ evaluation of the hints provided valuable insights on how to further improve hints related to misconceptions and automated hint generation in future research.

While we focused on teachers as target for the hints, in the future we also plan to investigate creating hints specifically for younger children. These hints would have to be kept very simple and yet point at an aspect of the underlying problem in an adequate manner. However, hints aimed at teachers have the potential to indirectly reach a greater variety of students, since the insight encapsulated in these hints can be transformed and tailored by the teachers to their students’ needs. As a next step we plan to analyse the output of this transformation in a fictitious student setting. A further important aspect of automatically generated hints is that tools may produce false positives, and not all bugs in programs will be instances of bug patterns. We plan to investigate the effects of these issues on the debugging performance, and hope that insights will also help to improve hint generation techniques.

This work is supported by the Federal Ministry of Education and Research through project “primary::programming” (01JA2021) as part of the “Qualitätsoffensive Lehrerbildung”, a joint initiative of the Federal Government and the Länder. The authors are responsible for the content of this publication.