I Introduction
Prompted by new information technology and research on active learning [1, 2], many have called for engineering education to be restructured so that it incorporates such results more systematically (see e.g. [3]). One way of doing this is to flip the classroom. The idea of a flipped (or inverted) classroom does not, by itself, imply the use of modern technology. Instead, it is more closely related to active learning and can be defined as the idea that “events that have traditionally taken place inside the classroom now take place outside the classroom and vice versa” [4]. Therefore, providing students with an article in paper form, asking them to read it, and then discussing it with them in class is also a flipped classroom, and it is a technique that was applied long before computers were invented. On the other hand, blended learning can be defined as “the thoughtful integration of classroom face-to-face learning experiences with online learning experiences” [5], which implies the use of IT in combination with in-class activities. According to Garrison et al. [5], blended learning is in line with the values of traditional higher education institutions and can “enhance both the effectiveness and efficiency of meaningful learning experiences” [5].
In the present study, online and in-class components were always mixed, and therefore the following definition of a flipped classroom is used: “an educational technique that consists of two parts: interactive group learning activities inside the classroom, and direct computer-based individual instruction outside the classroom” [6].
The background of this change in teaching methodology is closely connected to the academic life of the author of the present paper. When he started as a PhD student in 2013, he had 20% teaching duties. In the second year of his studies, he became more involved in teaching the Empirical Software Engineering course at the Master’s level. The course was to include more statistical methods for doing empirical software engineering research in order to better prepare students for applying a quantitative method in their thesis work. Up until 2013, the course was mostly theoretical, which did not prepare the students well for writing their theses using quantitative research methods. In the fall of 2014, the author also took over some introductory statistics lectures from another teacher and obtained lower scores in the student evaluation compared to previous years (from a mean of 4 on the students’ overall impression of the course to around 3). Figure 1 shows the students’ overall impression of the course across years, starting from 2013 (when the author was first involved in adding some more practical components to the course) up until 2017. When the results from the student evaluation came in 2014, he had a feeling that there must be a better way of teaching hard topics to university students than lecturing, and that the poor result was not only due to him being a novice lecturer. The following January (2015), he attended a pedagogical conference at Chalmers University of Technology (https://www.chalmers.se/en/conference/KUL/). By chance, one of his old classmates from engineering knew two people who gave a short presentation on the flipped classroom, which the classmate promoted. The author of this paper and the speakers spoke afterwards and decided to try a flipped classroom approach in the course, provided that the author received funding to change the course (which he did shortly after). From the beginning, the idea was to add to the body of knowledge on flipped classrooms in higher education, since such studies were scarce in 2015. In particular, using online components was assessed as an easy change in methodology since students of software engineering are very accustomed to IT. The largest difficulty and concern was that the teaching team needed a set of new skills. In order to increase the probability of a successful implementation, pedagogical experts on the flipped classroom were hired during the first two years the course was flipped. Furthermore, the fact that around a fifth of the students dropped out of the course in 2014 also made it clear that the course needed to be improved.
In recent years (since 2015), there has been an increased interest in the flipped methodology both in science and practice. A quick overview of the results thus far is presented in the next section.
The rest of the paper is structured as follows: In the next section (Section II), previous work from within and outside the software engineering education domain is presented. Section III provides an overview of the course given, both in general and in relation to what was changed when the course was flipped. In Section IV, the statistical analysis of both the exam grades and the course evaluations is presented in detail. In Section V, the results are discussed, and in the final section (Section VI) conclusions are drawn and future work is suggested.
II Previous Work
There are numerous studies in educational research and in engineering education on the effectiveness of active learning [1, 2], and the evidence is clear since these were large secondary studies with very large sample sizes. Freeman et al. [1], for example, concluded that active learning increases student performance based on 225 primary studies. The evidence for the flipped classroom as a whole is less clear. In a secondary study, Bishop et al. [6] conclude that most existing studies up until 2013 look at student perceptions and only include single-group study designs. Student perceptions of the flipped classroom were found to be somewhat mixed, but positive overall. The authors also reported that students tend to prefer in-person lectures over videos, but that they also preferred active learning over traditional lecturing. Bishop et al. [6] also concluded, in 2013, that there is only anecdotal evidence of improvements in student learning in the flipped context, and recommended that future work study the effects more objectively by using experimental or quasi-experimental designs.
Among more recent studies, Kerr [9] conducted a short survey of the research and found an increase in studies on this topic in engineering education in 2015. The studies show high student satisfaction and increased performance using the flipped classroom methodology in engineering education, and the research methods included discourse analysis, quasi-experimental designs, and mixed methods. However, she concludes that many studies neither include statistical analysis of the data nor provide enough detail about the context of the instruction.
In a very recent review (published in 2018), Karabulut-Ilgu et al. [10] analyzed papers up until May 2015 and concluded that research on the flipped classroom in engineering education focuses on documenting the design and only has preliminary findings. The authors call for more studies with sound theoretical frameworks and evaluation methods, which are still needed in order to establish the methodology in engineering education.
In the software engineering context, there has also been an increase in studies in the last couple of years. Paez et al. [8] obtained positive results from flipping a software engineering course, but had a small sample and no control group. Lin [11] conducted an experimental study, published in 2019, on using the flipped classroom approach with software engineering students and concludes that there were improvements in a range of aspects, such as the students’ learning achievement, learning motivation, learning attitude, and problem-solving ability. However, this study was conducted in one course offering as a quasi-experiment. Another recent study is by Erdogmus et al. [12], who shared their experiences of flipping a software engineering course (also a single course offering, in 2014). They summarize some of their challenges and note that they underestimated the number of teaching assistants needed for a flipped approach. Over the duration of the course, they also offered students more possibilities to share their learning with peers. In the present study, we recruited three pedagogical experts on the flipped classroom, hired four student assistants, and monitored the student outcome across four years.
All in all, the previous work motivates a longitudinal study of the effects of flipping the classroom for software engineering students, using rigorous statistical methods and measuring changes in both grades and course evaluations.
III Course Implementation
The effect of flipping the Empirical Software Engineering (ESE) course (compulsory in the first year of the Software Engineering Master’s Program) was evaluated by comparing four fall semester offerings, of which the first one (given in 2014) was a traditional lecturing course and the following three were flipped courses (2015, 2016, and 2017). The teacher team was constant over the first three years, except for the teaching assistants who corrected assignments and participated in the in-class activities when the course was flipped. In the final year (2017), the flipped course was taught by two new staff members (Ph.D. students), while the previous teacher of the flipped course only participated in a few classes and mostly acted as support for the new teachers. The teachers were changed in the last year (2017) because we wanted to test whether the effect on grades was kept even if the teachers changed.
In addition, parts of the course were run at the University of Zambia in order to evaluate how teaching a flipped version of empirical software engineering differs in a different culture. However, grades were not collected from that short pilot course, only feedback in the form of a course evaluation questionnaire. For the course given in Sweden, both grades and course evaluation questionnaires were collected for all four years the course was given.
The course studied in this paper comprises 7.5 credits, which is equivalent to 20 hours of expected work per week for the students, and the course was given in November and December each year. The course aims at teaching the basics of empirical software engineering, with a focus on applied statistics for the commonly used methods in the software engineering research field. The specific areas taught in the course and the student learning objectives can be found on the Chalmers University course web page (https://student.portal.chalmers.se/en/chalmersstudies/courseinformation/Pages/SearchCourse.aspx?course_id=28866&parsergrp=3).
The course was organized around 14 lectures consisting of 2x45 minutes each with a 15-minute break in between, and three laboratory assignments of the same length in which the students used statistical software to solve an assignment in groups of around 4 students. The first lab comprised using statistical software on real software engineering data (or data taken from the course book on experimentation in software engineering) to output different types of descriptive statistics and to explain what they mean in relation to the data collected. In the second laboratory assignment, the students were also given data sets from real examples, but were instead instructed to use inferential statistics and interpret the results. The third laboratory assignment was disconnected from software engineering and consisted of the Paper Helicopter Experiment (http://www.paperhelicopterexperiment.com/) in order to give a more hands-on experience of factorial experiments through more active learning. The groups handed in compulsory lab reports for all three labs, which were graded Pass or Fail. Since an overwhelming majority of the student groups pass the laboratory assignments after a couple of iterations with the teachers, the labs were left out when assessing the effects of the flipped classroom approach.
III-A Traditional Course
In the first year of this study (2014), the entire course was given using classical lecturing for 50 students, who served as a control group. Every lecture was taught using PowerPoint slides while the students took notes. Occasionally, the teacher drew on the blackboards in order to explain further if questions were raised by the students. The lectures were based on two textbooks, and a schedule of lectures and their connections to chapters of the books was given to the students. For some lectures, a set of research articles was also included and made available on the course web page. This page only included the syllabus, schedule, and reading material. The three labs were the same across the four years and began with a brief introduction by the teacher stating that the lab instructions were online. After this short introduction the student groups worked independently, and the teacher and the teaching assistants (former Master’s students who had taken the course the previous year) walked around in the two available classrooms offering to answer questions and help out.
III-B Flipped Course
The second, third, and fourth years served as the experimental group, with some variation in the fourth year. The flipped version of the course was organized around 11 active learning classes consisting of 2x45 minutes each with a 15-minute break in between. Five of these were completely flipped, meaning that the entire 90 minutes were devoted to material that the students were supposed to have gone through beforehand.
III-B1 Pre-Class Activities
The material was on an online platform and consisted of video lectures of around 10 to 20 minutes each, packaged in the form of slides with text wrapped around the embedded videos. Only two video lectures were recorded by the teacher and the other ones were taken from YouTube, since good quality videos on the topic were possible to find: introducing basic applied statistics (as in the first class) is a general subject that exists in most disciplines. However, others have reported that good quality videos are hard to find in some disciplines [13]. The online material also had 4 larger quizzes, one after each online component connected to an active lecture. A quiz contained around 30 multiple choice questions with four alternatives each. Two examples of quiz questions are: “Helena is a software engineer at a car manufacturer. She is developing a car software component for a self-driving car that predicts sensor failure. She thinks that she needs to take Latency in data transfer, Signal strength, and Noise (error in data) into account in her model. She plans to test the car and collect failure data in number of failures and the other data on two levels: Latency (20 and 40), Signal strength (10 and 80), Noise (3 and 5). How many factors does Helena have in her experiment? a) 8, b) 3, c) 4, or d) 2.” and “If Helena wanted to reduce the number of unitary experiments she has to conduct because she wants to save time and money, which of the following approaches should she use? a) Two-Way ANOVA, b) One-Way ANOVA, c) Fractional Factorial Design, or d) Full Factorial Design.”
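To make the two quiz questions above concrete, the following minimal Python sketch enumerates the runs of Helena's experiment. It is only an illustration: the function names are hypothetical, and the half fraction assumes a -1/+1 coding of the two levels with the third factor aliased with the interaction of the first two (defining relation C = AB), a choice that is not part of the course material.

```python
from itertools import product

# Factor levels (low, high) as stated in the quiz question.
FACTORS = {
    "latency": (20, 40),
    "signal_strength": (10, 80),
    "noise": (3, 5),
}


def full_factorial(factors):
    """Return every combination of factor levels (2^k runs for k two-level factors)."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def half_fraction(factors):
    """Return a half fraction (2^(3-1) runs) for three two-level factors.

    Assumption for illustration: levels are coded -1/+1 and the third
    factor is set to the product of the first two (C = AB).
    """
    names = list(factors)
    runs = []
    for a, b in product((-1, 1), repeat=2):
        coded = (a, b, a * b)
        run = {}
        for name, code in zip(names, coded):
            low, high = factors[name]
            run[name] = high if code == 1 else low
        runs.append(run)
    return runs


full = full_factorial(FACTORS)
half = half_fraction(FACTORS)
print(f"{len(FACTORS)} factors, {len(full)} runs in the full design, "
      f"{len(half)} runs in the half fraction")
for run in half:
    print(run)
```

The sketch separates the number of factors (three) from the number of runs in the full design (eight), which is the distinction the first quiz question probes, and shows how a fractional design reduces the number of runs at the cost of confounding one factor with an interaction.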
After each video the students were also asked to solve multiple choice questions in relation to the videos, and were given the opportunity to provide open feedback about what was difficult and what they wanted the in-class discussions to focus on. Three of the active learning classes were labs and were the same as before the course was flipped, and three of the active learning classes were repetition lectures, meaning that students were given the opportunity to decide what should be discussed and further explained in class.
III-B2 In-Class Activities
After having put all the lecturing material online, teachers need to fill all the lectures that are now empty. It is also challenging to plan the active lectures since they need to be planned the night before or on the same day as the lecture. Students tend to postpone watching videos and providing feedback, and it takes a lot of resources from the teacher team to plan well, often only hours before the actual lecture. However, the up-to-date student feedback is essential, and the beginning of each lecture typically comprised the teachers’ comments on and explanations of the difficulties that students had reported when watching the videos. In the first two years of flipping the classroom, pedagogical experts on the subject were hired who helped out with creating the online components, with planning in-class activities, and with actually participating during class and providing feedback.
All the active lectures were a mixture of the following components: (1) Introduction with a 5–10 min discussion in pairs on the corresponding online component, i.e. what was it about? what was difficult? etc. (2) The online components often include an open question where the students are asked to write about what they just saw. Therefore, discussions were held with the whole class, in groups, or in pairs about which descriptions were troublesome and how they could be improved and made more accurate. (3) Around 5 min of administrative information regarding labs and lab reports. (4) Discussion in groups about an online video, where the students were asked to reflect on it from new perspectives introduced by the teachers. (5) The teachers showed a provocative but accurate statement about an aspect of the subject and the students did think-pair-share [14]. (6) An example calculated by hand on the blackboards, where before each new formula was needed the teachers wrote three similar but different formulas and the students used clickers on their electronic devices to vote for an option. The distribution of the answers was used to pair students up with someone of a different view, if possible, to discuss why they voted differently. The whole example calculation was divided into such parts. (7) An example of a result was shown on a slide and the students did think-pair-share in relation to possible explanations of the result. (8) In relation to statements about how to investigate some phenomenon, the students were asked to design such a study in relation to the context they are in. One example was that students had been introduced to the view that software engineering research should include more experiments. The teacher then asked students to design an experiment to measure “goodness of lecture” because the teacher and the university wanted to know. The students worked in groups using the experimentation planning protocol obtained from the course book and immediately started to list issues regarding definitions of concepts, obtaining accurate measurements, confounding factors, replication, and so on. The students then discussed how they should be careful, and skeptical, in relation to experimentation in complex adaptive systems (i.e. software development organizations). (9) Sometimes the examples were more technical and the students were asked to provide bits and pieces of an experiment in class, one example being an experiment investigating latency in relation to different server programming languages.
III-B3 Regular Lectures in Parallel
In parallel, 6 regular lectures were held that were disconnected from the other classes, both in content and in who taught them. Students are known to differ between years in both motivation and prior knowledge of the topic, and in order to compare within the same year some topics were kept in the classical lecturing format. These lectures were given by a separate teacher who was not involved in the flipped classroom project. On average, two active lectures and one traditional lecture were given each week. In the fourth year (2017), the online platform and the main teaching team were changed to see if the improvement remained.
The mini-version of the course given in Zambia only comprised 3 flipped classes and 2 labs because it was only given during a 2-week period. However, the rest of the course during the semester was given as regular lectures, so a comparison between a flipped and a regular part of the course was still possible.
IV Evaluation
The evaluation of this pedagogical experiment was twofold: (1) the differences in student grades across the 4 years were statistically tested, and (2) extra questions were added and followed up on the course evaluation questionnaires that the students fill in after every finished course. Questions of comparisons within the years between the flipped part of the course and the regular lectures were also included.
IV-A Comparison of the Exam Grades
The course is given at two universities in Gothenburg simultaneously. At Chalmers University of Technology the grades Fail, 3, 4, and 5 are given (5 is the highest grade possible), and at the University of Gothenburg (GU) the grades Fail, Pass, and Pass with Distinction are given. In order to compare exam results between years, we used the corresponding “Chalmers grade” for all students. We conducted a statistical test that is based on ranks (the Kruskal-Wallis test) since we also wanted to include the Fail grades, which incorporate all the exam results below grade 3. The grades given to students were based on the following intervals: Maximum points: 35, Pass: at least 17, Grade 3: 17–24, Grade 4: 25–30, Grade 5: 31–35. Furthermore, the students’ group lab reports had to be graded Pass for a student to pass the entire course (the labs were only graded Pass or Fail).
The typical exam included: (1) A multiple choice question about, e.g., the parametric assumption of data or heteroscedasticity. (2) At least one open question about an important statistical concept, e.g. the relationship between types of statistical error in hypothesis testing or distinguishing between a sample and a population. (3) At least two questions on other types of research aspects like sampling, supervised or unsupervised survey research, ethics, or the like. (4) At least one question about assumptions for different statistical tests (like the t-test or linear regression). (5) A question about how to design experiments in the software engineering context. (6) A question in relation to interpreting, or setting up, hypotheses or what conclusions can be drawn from statistical software output. (7) One larger calculation of a research question that can be solved through the use of analysis of variance (ANOVA). An example of such an exam question is:
“A software development company wants to test 3 different software testing techniques (exploratory testing, unit testing, and integration testing) to see if it will affect amounts of post-release defects. The company has 10 software testers and they want you (the experimenter) to block the effect of the different levels of experience among these testers. In other words, they are only interested in differences between the testing techniques. In your design, each tester will test the same part of the system by applying only one of the testing techniques at the time, however, all testers will apply all techniques. The following sum of squares were obtained: the block (testers) = 280, the treatment (testing technique) = 90. The total sum of squares was 500. a) State the null and alternative hypotheses. (1p), b) Set up an ANOVA table and analyze the effect of the treatment (alpha = 0.05). (2p), c) Calculate the effect size where relevant. What is an effect size? (1p), d) Interpret the results in words. (2p), e) What did they gain by using a block design? Based on the results, do you think they did well in their decision to use a block design? (1p), f) What would the results have been if you would not have used blocking? Set up a new ANOVA table and interpret the results. (2p), g) Could you have carried out multiple t-tests instead of the ANOVA? (1p)”
As an example, the final exam question was corrected using the following protocol: a) 0.5p for the null hypothesis and 0.5p for the alternative one. b) 1p for and and 1p for values. c) 0.5p for calculating the effect size and 0.5p for the definition. d) 2p for the correct interpretation. e) 0.5p for each of the two questions. f) 1p for ANOVA and 1p for correct interpretation. g) 1p for stating alpha inflation.
The distribution of grades from each year is shown in Table I. The author created the first exam in 2014, but in the following years the author instructed the TAs or the new teachers to create the exam questions at the same level of difficulty in order to avoid bias in either direction, i.e. to make the exam neither easier nor harder than in the first year.
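The arithmetic behind parts b), c), and f) of the exam question quoted above can be sketched in a few lines of Python. This is a hedged illustration rather than the course's official solution: the function names are hypothetical, the choice of eta squared and partial eta squared as effect sizes is an assumption, and the critical values are taken from a standard F table at alpha = 0.05.

```python
# Sums of squares and design sizes as given in the exam question.
SS_BLOCK, SS_TREATMENT, SS_TOTAL = 280.0, 90.0, 500.0
N_TESTERS, N_TECHNIQUES = 10, 3

# Critical F values at alpha = 0.05 from a standard F table (assumed source).
F_CRIT_2_18 = 3.55
F_CRIT_2_27 = 3.35


def anova(blocked):
    """Return the ANOVA quantities with or without the tester block."""
    df_treatment = N_TECHNIQUES - 1
    df_block = N_TESTERS - 1
    df_total = N_TESTERS * N_TECHNIQUES - 1
    if blocked:
        # Randomized block design: block variation is removed from the error.
        ss_error = SS_TOTAL - SS_TREATMENT - SS_BLOCK
        df_error = df_total - df_treatment - df_block
    else:
        # Without blocking, tester variation ends up in the error term.
        ss_error = SS_TOTAL - SS_TREATMENT
        df_error = df_total - df_treatment
    ms_treatment = SS_TREATMENT / df_treatment
    ms_error = ss_error / df_error
    return {
        "df_treatment": df_treatment,
        "df_error": df_error,
        "ms_treatment": ms_treatment,
        "ms_error": ms_error,
        "F": ms_treatment / ms_error,
    }


with_block = anova(blocked=True)
without_block = anova(blocked=False)

# Effect sizes (assumed definitions): share of total, and share excluding blocks.
eta_squared = SS_TREATMENT / SS_TOTAL
partial_eta_squared = SS_TREATMENT / (SS_TOTAL - SS_BLOCK)

print("blocked:  ", with_block, "significant:", with_block["F"] > F_CRIT_2_18)
print("unblocked:", without_block, "significant:", without_block["F"] > F_CRIT_2_27)
print("eta^2 =", round(eta_squared, 3), "partial eta^2 =", round(partial_eta_squared, 3))
```

Under these assumptions the treatment effect exceeds the critical value only when the tester block is included, which is the point that parts e) and f) of the question ask the student to articulate.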
Table I: Distribution of exam grades per year, given as number of students (percentage).

| Grade | 2014 (non-flipped) | 2015 | 2016 | 2017 | Grand total |
|---|---|---|---|---|---|
| F | 10 (20%) | 9 (17%) | 7 (12%) | 14 (25%) | 40 |
| 3 | 28 (56%) | 31 (60%) | 14 (23%) | 8 (14%) | 81 |
| 4 | 10 (20%) | 11 (21%) | 22 (37%) | 30 (53%) | 73 |
| 5 | 2 (4%) | 1 (2%) | 17 (28%) | 5 (9%) | 25 |
| Year total | 50 (100%) | 52 (100%) | 60 (100%) | 57 (100%) | 219 |
As mentioned, the nonparametric Independent-Samples Kruskal-Wallis Test was used, which is based on ranks. This test is an overall test of any differences across all years, and its (2-sided) result was significant, meaning that there are differences between years; in order to know between which years, post hoc pairwise comparison tests are needed. The mean rank for each year was 89.34 (2014), 89.24 (2015), 137.88 (2016), and 117.72 (2017), respectively. Two pairwise comparisons were significant, namely between 2015 and 2016, and between 2014 and 2016. The non-significant comparisons were between 2014 and 2015, 2015 and 2017, 2014 and 2017, and 2016 and 2017.
The result means that there was no significant improvement in grades in the first year of the flipped approach, i.e. there was no significant change between 2014 and 2015. However, in the third year (2016), the grades were significantly different from the first (2014) and second year (2015). The third year of the flipped approach (2017) was not significantly different from any other year, i.e. the positive change from the second year (2016) of the flipped approach did not remain in 2017 when the teaching team changed. This change cannot be explained by the number of dropouts: out of 68 students registered on the course in 2016, only 8 dropped out (12%), while in 2017, out of 94 registered students, 37 dropped out (39%). In other words, it was not the case that more students followed the entire course in 2017 and thereby produced more Fail grades; if anything, the opposite was true.
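The rank-based analysis described above can be retraced from the grade counts in Table I. The Python sketch below is an illustration under stated assumptions, not the analysis script used in the study: it uses the textbook tie-corrected Kruskal-Wallis statistic, assumes Dunn's z-test with a Bonferroni adjustment for the post hoc pairwise comparisons, and all function names are hypothetical.

```python
import math
from itertools import combinations

# Grade counts per year from Table I, ordered Fail, 3, 4, 5.
COUNTS = {
    2014: [10, 28, 10, 2],
    2015: [9, 31, 11, 1],
    2016: [7, 14, 22, 17],
    2017: [14, 8, 30, 5],
}


def kruskal_wallis(counts):
    """Tie-corrected Kruskal-Wallis H and mean ranks from grouped ordinal counts."""
    n_grades = len(next(iter(counts.values())))
    totals = [sum(c[g] for c in counts.values()) for g in range(n_grades)]
    n = sum(totals)
    # All students with the same grade share the average (mid) rank.
    ranks, start = [], 0
    for t in totals:
        ranks.append(start + (t + 1) / 2)
        start += t
    sizes = {y: sum(c) for y, c in counts.items()}
    rank_sums = {y: sum(k * r for k, r in zip(c, ranks)) for y, c in counts.items()}
    mean_ranks = {y: rank_sums[y] / sizes[y] for y in counts}
    h = 12 / (n * (n + 1)) * sum(rank_sums[y] ** 2 / sizes[y] for y in counts)
    h -= 3 * (n + 1)
    ties = sum(t ** 3 - t for t in totals)
    h /= 1 - ties / (n ** 3 - n)  # correction for tied ranks
    return h, mean_ranks, sizes, n, ties


def chi2_sf_df3(x):
    """Upper tail of the chi-square distribution with 3 df (four groups)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def dunn_bonferroni(mean_ranks, sizes, n, ties):
    """Bonferroni-adjusted two-sided p-values for all pairwise comparisons."""
    variance = n * (n + 1) / 12 - ties / (12 * (n - 1))
    pairs = list(combinations(sorted(mean_ranks), 2))
    adjusted = {}
    for a, b in pairs:
        se = math.sqrt(variance * (1 / sizes[a] + 1 / sizes[b]))
        z = (mean_ranks[a] - mean_ranks[b]) / se
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        adjusted[(a, b)] = min(1.0, p * len(pairs))
    return adjusted


h, mean_ranks, sizes, n, ties = kruskal_wallis(COUNTS)
p_overall = chi2_sf_df3(h)
adjusted = dunn_bonferroni(mean_ranks, sizes, n, ties)
significant = sorted(pair for pair, p in adjusted.items() if p < 0.05)

print("mean ranks:", {y: round(r, 2) for y, r in mean_ranks.items()})
print("H =", round(h, 3), "overall p =", p_overall)
print("significant pairs:", significant)
```

Run on the Table I counts, the sketch yields mean ranks that agree with those reported in the text. Whether its post hoc procedure matches the one applied by the statistical software used in the study is an assumption.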
IV-B Comparison of the Student Course Evaluation Questionnaires
The second part of the evaluation was the student course evaluation questionnaire filled out by the students after each course was given. The response rates for each survey were: 2014 (response rate 50%), 2015 (response rate 45%), 2016 (response rate 26%), 2017 (response rate 36%), and Zambia 2017 (response rate 23%). The lower response rates are due to the fact that the university administration distributes the course evaluation questionnaires via email to all students when the course has finished. The first part of the questionnaire consists of questions that were included in all the years of the study, and it therefore makes sense to test them statistically over time in order to further investigate the effects of flipping the classroom.
The following questions were the ones relevant for studying the effects of flipping the classroom, both in relation to the non-flipped version of the course (2014) and for investigating the temporal perspective of the effects:
1. What is your overall impression of the course? Rated from 1 (very poor) to 5 (excellent).
2. The teaching worked well. Rated from 1 (disagree completely) to 5 (agree completely).
3. The course literature (including other course material) supported the learning well. Rated from 1 (disagree completely) to 5 (agree completely).
4. The course workload as related to the number of credits was… Rated from 1 (too low) to 5 (too high).
The same statistical test as for the grades above was used, and the list below shows which of the Independent-Samples Kruskal-Wallis Tests were significant (p < 0.05). Where applicable, the significant pairwise comparisons between years are also shown, with the corresponding mean ranks in parentheses (all pairwise comparisons are adjusted for multiple tests by using the Bonferroni correction).
1. Overall impression: the null hypothesis was rejected (Test Statistic = 18.797). Significant difference between 2017 (48.75) and 2015 (72.15).
2. Teaching: the null hypothesis was rejected (Test Statistic = 15.288). Significant differences between 2014 (50.33) and 2016 (81.00), and between 2017 (51.21) and 2016 (81.00).
3. Course literature: the null hypothesis was not rejected (Test Statistic = 5.909).
4. Course workload: the null hypothesis was not rejected (Test Statistic = 3.040).
In summary, the only significant difference in the overall impression of the course was between the first year of the flipped course (2015) and the third (2017), meaning that the students rated their overall impression higher in 2015 than in 2017. The only parts that were changed were the main teachers and the platform for the online components. Perhaps the answer lies in the implementation and motivation of the flipped approach. The same introduction on why the classroom was flipped was given in both versions of the course, but in the open comments section of the survey there were more comments about students not seeing the point of the flipped classroom in 2017 than in 2015. Maybe the students were more tolerant in 2015 since it was clearly described that this was the first time parts of the course would be flipped. However, during both years, some students also commented that they really liked the flipped approach. Another explanation could of course be that the new teachers in 2017 were less well liked by the students, possibly because the new teaching team was new to the flipped approach. One should also note that this difference did not hold for a comparison between 2016 (when the grades were significantly higher) and 2017. There were no changes in the planned activities in the classroom; however, the teacher during 2015 and 2016 increasingly facilitated the students’ discussions in class, moving from only asking questions to the whole class to physically walking around, listening in on the discussions held in small peer groups, and asking student groups directly to share aspects that the teacher had overheard. One example would be: “I heard you [pointing at a specific group] lifted the very interesting aspect [X], would you mind sharing that aspect with the whole class?” Such encouragement increased the student participation in class to a large extent, and was most likely used to a lesser extent in 2017.
As a side note, an auditorium is not an optimal lecture hall to use for the flipped classroom because students sit in rows and cannot move around easily. Fitting all students into a better designed lecture hall was impossible since none existed on the university campus. Across the years, some students also stated that it would have been easier to have all courses flipped, since they need to act very differently depending on the teaching methodology used.
The students thought that the teaching worked better in 2016 (the second version of the flipped course) than in the non-flipped version in 2014 and in the third year of the flipped course in 2017. In 2017 and 2015, though, the students praised and disliked the flipped approach in about equal measure. In 2014 the students complained a lot about how inappropriate it was to use lecturing with slides for teaching the subject. Something essential seems to have happened in 2016, since the grades changed significantly from 2015. The likely explanation for this change is that 2016 was the second year of teaching the flipped course for the same teacher, which apparently meant that students learned better using the flipped approach and also earned higher grades. However, this increase in grades did not remain in 2017, when the teaching was rated lower along with the exam results, which might have been due to the new teachers. This highlights the importance of extensively coaching new teachers in the flipped approach since it is much more dependent on teacher facilitation in class and short-term planning before each active lecture. Below are some descriptive statistics on specific questions in relation to a comparison between the flipped and the traditional approach to teaching.
In order to assess how well the flipped classroom approach was implemented across years, the following two questions were included in the questionnaire:

I was engaged when participating in classroom activities. Rated from 1 (never) to 5 (always).

The instructor made meaningful connections between the topics in the prerecorded lecture and the class activity. Rated from 1 (disagree completely) to 5 (agree completely).
The descriptive statistics for the first question are shown in Figure 2 and show that students considered themselves engaged overall. There was no statistically significant difference between the years (i.e. we fail to reject the null hypothesis, Test Statistic = 2.791), which means that differences in opinions or grades cannot be explained by the students’ level of engagement. For the second question, the null hypothesis was rejected (Test Statistic = 23.053), and significant differences were found between 2017 (33.53) and 2016 (62.61), and between 2017 (33.53) and Zambia 2017 (72.93). These numbers match the differences in student grades but not the students’ overall impression of the course. This aspect could therefore explain why 2016 was a successful year, and making meaningful connections between the online material and the active lectures could be an important skill for teachers to train in this context in order to increase the students’ academic success. In doing so, the course content will seem coherent, and explicitly connecting the different aspects that are introduced could increase the number of “Aha!” moments among students. The course given in 2016 and the Zambian pilot course were both given by the same teacher. In 2016, the teaching was also rated as working very well (see above). The 2017 version of the course was given by two teachers without any previous knowledge of flipping a classroom.
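As an illustration of how a rank-based comparison of questionnaire ratings across course years could be computed, the following is a minimal sketch. It assumes that the reported test statistic and the per-year values in parentheses come from a Kruskal-Wallis test with mean ranks, which is not stated explicitly in this section. The function name `kruskal_wallis` and the ratings below are hypothetical and are not the data of this study.

```python
def kruskal_wallis(groups):
    """Return (mean rank per group, tie-corrected H statistic).

    groups: dict mapping a group label (e.g. a course year) to a list
    of ordinal ratings (e.g. Likert scores from 1 to 5).
    """
    # Pool all observations and sort them by rating.
    pooled = sorted(
        (value, label) for label, values in groups.items() for value in values
    )
    n = len(pooled)
    rank_sums = {label: 0.0 for label in groups}
    tie_term = 0
    i = 0
    while i < n:
        # Find the block of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of the ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        ties = j - i
        tie_term += ties**3 - ties
        i = j
    mean_ranks = {
        label: rank_sums[label] / len(groups[label]) for label in groups
    }
    h = 12 / (n * (n + 1)) * sum(
        rank_sums[label] ** 2 / len(groups[label]) for label in groups
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return mean_ranks, h / correction


# Hypothetical ratings (1 = disagree completely, 5 = agree completely).
example = {
    "year A": [2, 3, 3, 2, 4, 3],
    "year B": [4, 5, 4, 5, 3, 5],
}
ranks, h_statistic = kruskal_wallis(example)
print(ranks, round(h_statistic, 3))
```

Under these assumptions, a higher mean rank for a year indicates that its ratings tend to be higher than those of the other years. To obtain a p-value, H would be compared against a chi-square distribution with one degree of freedom fewer than the number of groups, which is left out of this sketch.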
Finally, the students were also asked to compare the two approaches each year when a part of the course was flipped. They were asked the following questions:

Please compare the two approaches in the course (the flipped and the regular lectures). Rated from 1 (regular is much better) to 5 (flipped is much better).

Please compare the flipped approach to having the same material as regular lectures (i.e. the statistics). Rated from 1 (regular is much better) to 5 (flipped is much better).
The descriptive statistics for all years are shown in Figure 3 and Figure 4. The figures essentially show a two-peaked distribution: the students are split between those who think that the flipped approach is better and those who think that a traditional lecturing approach would be better, and few students saw the two as equal. In other words, the students either like or dislike the flipped approach (a majority actually tends to like it, as can be seen in Figure 3), but the classes are split. In the small sample from Zambia 2017, all students preferred the flipped approach. Looking at the grades, the changes made towards more active learning through flipping the classroom did result in higher grades, so student preference for a pedagogical method does not seem to overlap entirely with actual learning outcomes. Previous studies have also shown that student course evaluations are prone to many biases [15], several of which are discussed next.
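The split described above can be made concrete by tallying the 1 to 5 comparison ratings. The sketch below is only an illustration: the function name `summarize_preferences`, the rule used to flag a split class (fewer neutral answers than answers on either side), and the ratings are assumptions, not part of the study.

```python
from collections import Counter


def summarize_preferences(ratings):
    """Summarize 1-5 ratings (1 = regular much better, 5 = flipped much better)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings)
    total = len(ratings)
    prefer_regular = counts[1] + counts[2]
    neutral = counts[3]
    prefer_flipped = counts[4] + counts[5]
    return {
        # Number of answers for each score from 1 to 5.
        "counts": [counts[score] for score in range(1, 6)],
        "share_regular": prefer_regular / total,
        "share_neutral": neutral / total,
        "share_flipped": prefer_flipped / total,
        # Assumed rule: the class is "split" when the neutral middle
        # is smaller than both the regular and the flipped side.
        "split_class": neutral < prefer_regular and neutral < prefer_flipped,
    }


# Hypothetical answers from one course year.
summary = summarize_preferences([1, 2, 2, 3, 4, 4, 5, 5, 5, 5])
print(summary)
```

A summary of this kind separates two questions that the figures answer together: whether the class is divided, and which side holds the majority.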
V Discussion
Overall, the results from flipping the classroom for an empirical software engineering course were promising. The results from this study are important since few studies have investigated the flipped approach to software engineering education, and also because there is a lack of larger longitudinal studies on flipping the classroom in higher education in general [6]. This implies that the flipped classroom seems promising for teaching more software engineering topics, but only with extensive pedagogical training for teachers. Perhaps software engineering students are extra susceptible to this type of teaching due to their deep knowledge of IT systems, but this remains to be further investigated.
The diversity of active components in class and the student preparation before class (watching videos online, being asked to reflect often while watching the videos, and doing larger quizzes on the material) showed an increase in grades when the teacher had pedagogical training and experience from flipping a classroom. However, the effect on grades only surfaced in the second year of the flipped approach, which means that teachers implementing a flipped approach should not despair if the effects are not shown immediately. In fact, changing the pedagogical approach is not just a large change for the students but also for the teachers, and mastering active learning classes takes time. The evidence on the effects of introducing active learning components, however, is clear [1], and flipping the classroom is a way of buying more time for active learning classes without adding more hours to the course schedule when students and teachers need to co-locate. This study shows that the initial effort of putting good material online and creating good discussion topics in class is not enough. The real challenge starts when teachers need to facilitate active lectures in a way they do not usually do.
In this study, the improved effectiveness of the flipped course was confirmed through the significant increase in exam grades, but this effect did not remain when the main teachers of the course were changed together with the online platform used. This could be interpreted as meaning that the choice of online learning platform affects student learning, but for software engineering students, this is assessed as somewhat unlikely. Building new teaching skills for this new pedagogical approach and preparing students for the pedagogical change, however, were shown to be important in the present study, since the grades were lower after changing the teaching team. The fact that pedagogical expertise was hired during the first two years was definitely a key to success in creating both the pre-class and the in-class activities, and onboarding new teachers without external expertise was shown to be difficult. This study also shows that the exam results do not fully correspond to the students’ perceptions of their learning experience. However, the majority of the students liked the flipped approach more overall when asked to compare the two approaches instead of only rating their overall impression of the course each year. The results suggest that students’ academic success, but maybe not their subjective overall liking of a course, can be enhanced by introducing a flipped classroom and thereby more active learning in class.
It is also worth discussing the fact that out of 60–90 enrolled students, often only around 50 finished the course every year. An expected effect of flipping the classroom could be fewer students dropping out. This was not achieved by flipping the course, but there was an observed increase in how active the students who finished the course were. From previously having around five active students, the estimate for the flipped approach was around 15–20. The students explained the high number of dropouts as an effect of having two courses with a high workload during the same study period. When asked, students explained that they chose to focus on the other course to a large extent and were planning on passing the current course during re-examination periods.
It is important to discuss validity threats to educational research in relation to quasi-experimentation. This study only looked at changes in grades over a four-year period, of which the first year comprised a traditional approach to teaching and the following three years were partially flipped. Even if the exam was neither created nor corrected by the teacher team when the experiment started, it is very difficult to control aspects like how teachers grade and the level of exam question difficulty. The student exam grades are nevertheless the best available option for investigating effects across many years, but that only holds given the assumption that a written exam is a good measurement of student learning. The exam grades, at least, capture some aspects of student learning even if there are additional aspects of learning not measured by a written exam.
For student course evaluations, the relation to student learning is even more complex. In a meta-study, Spooren et al. [15] conclude that research on the topic is far from having provided clear answers to critical questions, and they present studies showing that course evaluations are affected by many different aspects. In relation to the present study, the most prominent confounding factors that have a negative effect on the student course evaluations are:

Class attendance [16]: there was always a drop in class attendance over the course (both flipped and non-flipped) and, as an example, out of 90 registered students only around 50 attended the classes.

Pre-course interest [17]: learning about research methods and applied statistics has been described as irrelevant for many students’ careers as software engineers in the comments section of the course evaluations.

Interest change during the course [17]: the dropout could be explained by the fact that the course is given in parallel with another course that is also perceived as time-consuming.

Instructor’s tenure [18]: the classic lecturing given alongside the flipped course was given by a professor and not a Ph.D. student, which was the case for the flipped part of the course.

Class size [19]: the classes have mostly been very large (from 60 to 90 registered students each year).

Course difficulty [20]: many students do not have the background in basic statistics that is needed for the course.

Course discipline [21]: the course is a natural science course.

Elective vs. required courses [22]: the course is compulsory for the Master’s program in which the vast majority of the students are enrolled.

General education vs. specific education [22]: the course is broad and comprises many different aspects of empiricism in research.
Therefore, the results of this study should be seen as an indication of a trend and there are, apparently, too many confounding factors to draw wide conclusions, especially from the student course evaluations. The overall trend across the four years of this study is, though, that software engineering students do learn this difficult subject better when more active learning is introduced through flipping the classroom.
VI Conclusion
This paper set out to investigate the effects of flipping the classroom on exam grades and student course evaluations across four years. Through a statistical analysis of data collected between 2014 and 2017, flipping the classroom was found to increase the students’ exam grades, but a clear effect on the students’ perception of the course was not found. Furthermore, making relevant connections between online material and in-class discussions was found to be a key to student learning, but it requires extensive training and a new skill set from teachers. Students did rate flipping the classroom as better overall when asked to compare the two pedagogical approaches, but their overall impression of the course each year gave less clear results in connection to exam grades. Overall, we conclude that flipping the classroom increased student learning and recommend that the approach be tested in teaching more software engineering topics. These findings are important contributions to software engineering education, but also to educational research in general, since few studies contain such extensive data over more than two years.
In terms of future research, we recommend more studies that use exam grades corrected by teachers other than those involved in the course, in order to control for exam variations. The students’ course evaluations should also be adjusted for bias before they are used as a source for teacher evaluation or in research [18].
Acknowledgment
The course development was funded by Chalmers University of Technology in the form of Quality Funding (Dnr C 20141712) given by the Swedish Higher Education Authority in 2015.
Teaching parts of the course at the University of Zambia was funded by International Staff Mobility (Dnr E2016/598) in 2017 by the International Centre at the University of Gothenburg.
The author would like to thank all the people that have been involved in the course during the years of this study: Richard Torkar, Ivar Thorvaldsson, Henrik Marklund, CarlAdam Hellqvist, Johan Svensson, Mukelabai Mukelabai, Francisco Gomes de Oliveira Neto, Jackson Phiri, David Issa Mattos, Katja Tuma, Mohammad Haghshenas, and Christos Charalampous.
References
 [1] S. Freeman, S. L. Eddy, M. McDonough, M. K. Smith, N. Okoroafor, H. Jordt, and M. P. Wenderoth, “Active learning increases student performance in science, engineering, and mathematics,” Proceedings of the National Academy of Sciences, vol. 111, no. 23, pp. 8410–8415, 2014.
 [2] M. Prince, “Does active learning work? a review of the research,” Journal of engineering education, vol. 93, no. 3, pp. 223–231, 2004.
 [3] R. Adams, D. Evangelou, L. English, A. D. Figueiredo, N. Mousoulides, A. L. Pawley, C. Schiefellite, R. Stevens, M. Svinicki, J. M. Trenor et al., “Multiple perspectives on engaging future engineers,” Journal of Engineering Education, vol. 100, no. 1, pp. 48–88, 2011.
 [4] M. J. Lage, G. J. Platt, and M. Treglia, “Inverting the classroom: A gateway to creating an inclusive learning environment,” The Journal of Economic Education, vol. 31, no. 1, pp. 30–43, 2000.
 [5] D. R. Garrison and H. Kanuka, “Blended learning: Uncovering its transformative potential in higher education,” The internet and higher education, vol. 7, no. 2, pp. 95–105, 2004.
 [6] J. L. Bishop and M. A. Verleger, “The flipped classroom: A survey of the research,” in ASEE National Conference Proceedings, Atlanta, GA, vol. 30, no. 9, 2013, pp. 1–18.
 [7] P. N. Kiat and Y. T. Kwong, “The flipped classroom experience,” in IEEE 27th Conference on Software Engineering Education and Training (CSEE&T). IEEE, 2014, pp. 39–43.
 [8] N. M. Paez, “A flipped classroom experience teaching software engineering,” in Proceedings of the 1st International Workshop on Software Engineering Curricula for Millennials. IEEE Press, 2017, pp. 16–20.
 [9] B. Kerr, “The flipped classroom in engineering education: A survey of the research,” in 2015 International Conference on Interactive Collaborative Learning (ICL). IEEE, 2015, pp. 815–818.
 [10] A. Karabulut-Ilgu, N. Jaramillo Cherrez, and C. T. Jahren, “A systematic review of research on the flipped learning method in engineering education,” British Journal of Educational Technology, vol. 49, no. 3, pp. 398–411, 2018.
 [11] Y.T. Lin, “Impacts of a flipped classroom with a smart learning diagnosis system on students’ learning performance, perception, and problem solving ability in a software engineering course,” Computers in Human Behavior, vol. 95, pp. 187–196, 2019.
 [12] H. Erdogmus and C. Péraire, “Flipping a graduate-level software engineering foundations course,” in Proceedings of the 39th International Conference on Software Engineering: Software Engineering and Education Track. IEEE Press, 2017, pp. 23–32.
 [13] C. F. Herreid and N. A. Schiller, “Case studies and the flipped classroom,” Journal of College Science Teaching, vol. 42, no. 5, pp. 62–66, 2013.
 [14] A. Kothiyal, R. Majumdar, S. Murthy, and S. Iyer, “Effect of think-pair-share in a large CS1 class: 83% sustained engagement,” in Proceedings of the ninth annual international ACM conference on International computing education research. ACM, 2013, pp. 137–144.
 [15] P. Spooren, B. Brockx, and D. Mortelmans, “On the validity of student evaluation of teaching: The state of the art,” Review of Educational Research, vol. 83, no. 4, pp. 598–642, 2013.
 [16] P. Spooren, “On the credibility of the judge: A cross-classified multilevel analysis on students’ evaluation of teaching,” Studies in educational evaluation, vol. 36, no. 4, pp. 121–131, 2010.
 [17] O. J. Olivares, “Student interest, grading leniency, and teacher ratings: A conceptual analysis,” Contemporary Educational Psychology, vol. 26, no. 3, pp. 382–399, 2001.
 [18] M. A. McPherson and R. T. Jewell, “Leveling the playing field: Should student evaluation scores be adjusted?” Social Science Quarterly, vol. 88, no. 3, pp. 868–881, 2007.
 [19] K. Bedard and P. Kuhn, “Where class size really matters: Class size and student ratings of instructor effectiveness,” Economics of Education Review, vol. 27, no. 3, pp. 253–265, 2008.
 [20] R. Remedios and D. A. Lieberman, “I liked your course because you taught me well: The influence of grades, workload, expectations and goals on students’ evaluations of teaching,” British Educational Research Journal, vol. 34, no. 1, pp. 91–115, 2008.
 [21] S. A. Basow and S. Montgomery, “Student ratings and professor self-ratings of college teaching: Effects of gender and divisional affiliation,” Journal of Personnel Evaluation in Education, vol. 18, no. 2, pp. 91–106, 2005.
 [22] K.f. Ting, “A multilevel perspective on student ratings of instruction: Lessons from the chinese experience,” Research in Higher Education, vol. 41, no. 5, pp. 637–661, 2000.