Supervised machine learning (SML), with its capabilities to support—or even replace—human workers in their daily tasks, is omnipresent in current discussions. While research investigates the capabilities of SML in a broad range of areas, for example image classification (He et al., 2016) or speech recognition (Hinton et al., 2012), the number of tasks in which machine learning models outperform humans is increasing (Grace et al., 2018). For instance, in the field of autonomous driving, replacing human drivers is expected to happen sooner or later, although autonomous vehicles are not (yet) able to completely substitute humans in this task (Casner et al., 2016). In the long run, several studies foresee humans being completely substituted by machines in many tasks, including their work processes (Müller and Bostrom, 2016; Makridakis, 2017). But can this development be observed across all tasks? Current examples of (supervised) machine learning models outperforming humans stem mainly from areas where a large amount of training data is available, for example billions of played “Go” games (Chang et al., 2016) or millions of labeled images (Russakovsky et al., 2015).
In real life, however, often only limited “training” data is available (Baier et al., 2019)—sometimes just a single instance (Lee et al., 2015). In this article, we are especially interested in how humans and machines learn patterns from data with only a few instances of a given task. While there is theoretical work in the field of inductive programming (Muggleton, 1991; Olsson, 1995), which aims to design techniques capable of capturing patterns from few examples, empirical work comparing such techniques to human learning is still rare. The research stream of cognitive science has been investigating the learning processes of humans (and recently also machines) (Favela and Martin, 2017), especially with a focus on understanding and mimicking the human brain (Thagard, 2005; Dupoux, 2018). As a subfield, computational cognitive science has been studying the similarities and differences between human and machine learning for the past two decades (Kogler and Pessoa, 2017; Tenenbaum et al., 2011; Lake et al., 2015; Griffiths et al., 2007, 2010; Lake et al., 2017; Lucas et al., 2015).
However, the learning curves of humans in comparison to SML models, meaning the relation between the number of required training samples and the resulting performance (Perlich et al., 2003), have not yet been investigated. This investigation is of major importance when deciding whether humans or machines should perform a task, especially in future, SML-based applications. It provides more insight into the question of which entity learns more efficiently (Hernández-Orallo, 2017b). For instance, in the field of healthcare, data labeling by physicians is extremely costly. From an economic perspective, it might be questionable to use supervised machine learning models in healthcare because the labeling cost can exceed the machine’s saving potential (Raghupathi and Raghupathi, 2014). To give first insights into comparing the learning curves of humans and machines with limited training data, we phrase our general research question (GRQ) as follows:
GRQ: How does the learning performance of humans and supervised machine learning models differ with limited training data?
In academia, direct comparisons between humans and supervised machine learning models performing the same task are still rare (Hernández-Orallo, 2017a). Besides the aspect of limited training data, it must be of fundamental interest for researchers to gain a better understanding of which tasks can possibly be undertaken by a supervised machine learning model, as well as the precise conditions that apply (Witten et al., 1994; Adler and Schuckers, 2007; Marcus, 2018). As this work reinforces, there are infinite possibilities of tasks and task characteristics. To provide a starting point for research endeavors in the supervised field, we explore one special scenario. The chosen task for humans as well as machines is identifying patterns with limited training samples (5 to 50 instances). To measure the human performance on this task, we conduct a lab experiment with 44 participants where four different patterns need to be identified. We then apply different machine learning algorithms on the same patterns and subsequently evaluate and compare the results.
The remainder of this work is structured as follows: In the next Section, we present the necessary fundamentals and related work in the fields of human and machine learning (Section 2). Next, we define the overall task characteristics (Section 3.1) and elaborate on our methodological focus for the experiment design (Section 3.2). In Section 4, we report the isolated results (humans and machines respectively) of the task performance and subsequently introduce the comparison. Finally, we discuss implications (Section 5) and conclude the study (Section 6).
2 Fundamentals and related work
This article assesses and analyzes learning performances of humans and machines. To sketch the foundation for this endeavor, we first give an overview of current research in learning, segmented into learning of humans (Section 2.1), machines (Section 2.2), and research on their comparison (Section 2.3).
2.1 Human learning
Scientific research on human learning started in the second half of the 19th century. In one of the first books about learning, Ebbinghaus (1885) postulated the concept of a learning curve, as the subject group’s learning progress flattened over time. Kotovsky and Simon (1973) started analyzing how humans learn patterns on a large scale. Human learning is currently divided into three main learning theories: cognitive psychology, social cognitive theory, and sociocultural theory (Ormrod and Davis, 2004).
For this article’s research topic—analyzing human learning with small sample sizes and comparing it to SML—the most relevant research field is cognitive psychology. We leverage phenomena from this area to provide possible explanations for human learning patterns. Cognitive psychology is “the study of how people perceive, learn, remember, and think about information” (Sternberg and Sternberg, 2016, p. 3). The research on cognitive psychology includes studying mental phenomena, such as visual perception, object recognition, attention, memorization, knowledge, speech perception, judgment, and reasoning. To explain such phenomena, cognitive psychology has recourse to neuroscience and its knowledge of brain functioning (Eysenck and Keane, 2015).
In turn, social cognitive theory (Rosenthal and Zimmerman, 1978) includes many ideas from cognitive psychology, but focuses on how humans learn from other human beings through watching and imitating their behavior. The theory suggests that humans can control their own learning. This differs from behaviorism, a now dated theory which led to social cognitive theory and in which learning is solely the result of stimulus–response relationships (Ormrod and Davis, 2004). Learning from others also has the benefit of learning quicker by making fewer mistakes compared to learning from own experiences (Bandura, 1986).
Sociocultural theory stresses the importance of society and culture in learning. Learning a sociocultural tool like a language is not only useful for communication, but also supports humans in their thinking development (Vygotsky, 1964). In contrast to social cognitive theory, humans do not only learn from each other but also work together towards goals that cannot be achieved by individuals. The research focuses on the interaction of children and parents. Children’s individual development of capabilities is usually related to interactions with their parents. Additionally, caregivers like parents can broaden a child’s problem-solving abilities and stimulate cognitive growth by assisting them in solving more difficult tasks than they would otherwise be able to accomplish (Vygotsky, 1980).
A few studies have investigated how humans learn from a small number of instances and have also built computational models to understand human few-shot learning (Vul et al., 2014; Lieder et al., 2012). However, their aim was to re-engineer the human learning process, while we aim to show empirically how the two entities perform in a direct comparison on limited training data.
2.2 Machine learning
The capabilities of machines have been discussed from various perspectives, including their abilities to capture knowledge (Lieto et al., 2018), think (Hoffmann, 2010), feel (Velik, 2010; O’Regan, 2012; Osuna et al., 2020), be creative (Veale et al., 2010) and making morally good decisions (Tavani, 2011; Yilmaz et al., 2017). The process of how machines obtain their knowledge in the first place is addressed in the area of machine learning. Machine learning describes a set of techniques commonly used to solve a variety of real-world tasks with the help of computer systems that can learn to solve a task instead of being explicitly programmed to do so (Koza et al., 1996). In general, we differentiate between unsupervised, reinforcement, and supervised machine learning (Jordan and Mitchell, 2015).
Unsupervised machine learning comprises methods and algorithms that reveal previously unknown data patterns. Consequently, unsupervised learning tasks do not necessarily have a “correct” solution, because there is no ground truth (Wang et al., 2009). In the area of reinforcement learning, rewards and punishments allow the model to learn continuously over time with many learning instances. The focus is on a trade-off between exploring an uncharted environment and exploiting the existing knowledge base (Kaelbling et al., 1996).
In this study, we mainly focus on supervised machine learning, because the most widely used methods are supervised (Jordan and Mitchell, 2015). It therefore seems to be a promising starting point. With respect to supervised machine learning, learning means that a series of examples (“past experience”) is used to build knowledge about a given task (Dietterich, 1996). Although statistical methods are used during the learning process, manual adjustment and rule or strategy programming to solve a task are not required. In more detail, (supervised) machine learning techniques aim to build a model by applying an algorithm to a set of known data points to gain insight into an unknown set of data (Hastie et al., 2017). Typically, supervised machine learning models rely on large amounts of data to work properly. First techniques, not all directly related to SML, aim to reduce the required amount of data, namely inductive programming (Olsson, 1995; Schmid and Kitzelmann, 2011), combinations of different techniques (Rhee et al., 2017), external memories (Vinyals et al., 2016), or one-trial learning (Feng and Sun, 2019).
In terms of a supervised machine learning model’s “creation” procedures, the proposed processes vary slightly in their definition of the phases, but generally employ the three main phases: model initiation, performance estimation, and deployment (Hirt et al., 2017). During the model initiation phase, a task is defined, the data is prepared and processed, and a suitable machine learning algorithm is chosen. During performance estimation, various parameter permutations describing the algorithm are validated and a suitable configuration is selected based on its performance when solving a specific task. Lastly, the model is deployed and put into practice to solve a task related to previously unseen data.
2.3 Human vs. machine learning
When it comes to the comparison of human and SML, Hernández-Orallo (2017b) motivates the comparison of natural and artificial intelligence in the first place. The field of neuroscience (Florez, 2015; Rajalingham et al., 2018; Hutto and Kirchhoff, 2015) aims to understand the learning of humans and its facilitation with machines on a theoretical level. The precise capturing of the related learning curves has been analyzed theoretically and empirically for humans and machine learning techniques separately in different domains, e.g., creativity tests (Olteţeanu et al., 2016), music prediction (Witten et al., 1994), or cognitive research (Marcus, 2018). In the field of computer vision, multiple comparisons of humans and machines have been made (Elsayed et al., 2018; Zhou and Firestone, 2019; Eckstein et al., 2017; Peterson et al., 2018).
Apart from these specific domains, and more closely related to our study, is the idea of building computer models capable of solving IQ tests (Hernández-Orallo et al., 2016). While not using supervised machine learning, Insa-Cabrera et al. (2011) aim to compare reinforcement and human learning; however, they only consider a small sample of observations.
Human learning can be compared to machine learning based on various aspects. Dubey et al. (2018) focus on human priors for playing video games. In their experiment, they use an unknown video game that a human solves quite easily by using priors on semantics, gravity, and objects. When semantics are reversed and affordances masked, human performance decreases drastically. Machines, represented by reinforcement learning algorithms, perform significantly better under the same conditions. Humans’ prior knowledge is thus important when it comes to solving new problems quickly.
Kim et al. (2019) investigate psychophysical phenomena—which can be found in human learning—in trained machine learning models. Gestalt phenomena are a part of human visual perception in which humans realize that the whole differs from the sum of its parts (Köhler, 1967). They show that some neural networks are able to exhibit one type of Gestalt phenomenon under the proper circumstances.
In hybrid intelligence (Dellermann et al., 2019), humans’ complementary strengths, like flexibility and common sense, are combined with those of machines, for example consistency and speed. This sociotechnological ensemble can overcome the current limitations humans and machines have. Another way to combine human and machine abilities is to treat machines as teammates (Burr et al., 2018; Smart, 2018; Seeber et al., 2019). This could increase work speed and lead to better decision-making by detecting negative cognitive biases.
In conclusion, a direct comparison of human learning and supervised machine learning for the same task with limited training data availability still remains a research gap and is addressed in this work.
3 Task and experiment design

3.1 Task
Before discussing the design of the experiment comparing human and machine learning, we need to set the prerequisites for the task to be solved in this experiment. As we require a controllable task with precise benchmarks for performance evaluation, a supervised learning setting is a suitable candidate, which we utilize for this article. When it comes to choosing a meaningful task in the area of SML, there are many possible characteristics to describe it. To deduce the possibilities and justify our selection, we look at corresponding task characteristics (Section 3.1.1) and subsequently outline our implementation of the chosen task (Section 3.1.2).
3.1.1 Task characteristics
A learning curve depicts task performance based on experience. In our case, experience is measured by the amount of training data, more precisely by the number of training instances. Task performance is influenced by two main factors: the characteristics of the entity performing the task (humans or machines) and those of the task itself. Depicting a general learning curve for every type of task characteristic exceeds the scope of this article, so we limit our scope to an interesting selection of all possible tasks. For our supervised machine learning task, four task characteristics are of importance: input, output, instances, and features.
The input describes the data the task is based on. It can differ by data type (e.g., numeric or binary) and by data representation (e.g., table, picture, or audio).
A task also differs in the demanded output. Two types of output are relevant in this case: classification and regression. A classification determines whether each instance belongs to one of the predetermined classes, whereas the result of a regression is a continuous number.
The instances characteristic describes the number of instances that are available for the learning process.
The instances of a task are described by a fixed number of distinct features.
To start the research endeavor, we select a task with a binary input, a binary classification as output, a small set of training instances, and a limited number of features. An overview of all task characteristics of interest and their implementation in this work can be found in Table 1. To conclude, we update the general research question to our research question (RQ):
RQ: How and when do learning curves differ between humans and supervised machine learning models for small sample sizes, using a binary classification with limited binary features?
Additionally, we define the following requirements for our task: it should not require prior knowledge, should use a balanced data set (same number of true and false instances), and should be solvable in a reasonable timeframe. The task should be represented in a way suitable for humans and machines, and it should be possible to depict the results in a learning curve.
Table 1: Task characteristics of interest and their implementation in this work.

| Task characteristic | Attributes | This work |
| --- | --- | --- |
| Input: data type | e.g. numeric data, binary data | binary data |
| Input: data representation | e.g. table, picture, audio | picture (humans), table (machines) |
| Output | classification, regression | binary classification |
| Instances | number of instances | 5 to 50 |
| Features | number of features | 9 features |
3.1.2 Implementation of task characteristics
As a last prerequisite, we have to agree on a task that satisfies our set of task characteristics and also complies with the additional requirements defined in Section 3.1.1.
We use two suggestions in the field of intelligence tests as a foundation for our task, namely minimum intelligent signal tests (MISTs) and Raven’s progressive matrices (RPMs): MISTs are binary questions that are used to quantify humanness (McKinstry, 1997; Łupkowski and Jurowska, 2019). Compared to other intelligence tests, these questions do not require a complex answer, but only a simple yes or no, which satisfies our limitation on a binary output. However, the input is natural speech and not a set of a few, binary features. RPM (Raven, 2000) is a test of visual geometric objects, designed by a rule. The task is to complete the set of visual geometric objects by selecting an object out of six or eight options. Only one of the selectable objects matches the rule. RPMs have a graphical representation that can be reduced to a set of instances with a few binary features to get standardized instances. However, they lack a binary output.
By combining these two tests, we define the following task: To have the same number of features, we use only 3x3 matrices with 9 elements (= 9 features). Every feature is binary. Accordingly, we have a set of 2^9 = 512 different matrices. These matrices can be displayed as a picture with elements of black and white (for humans) or as a list of numbers with features of 1 and 0 (for machines). Figure 1 shows an example of how the same instance is represented for humans and machines respectively. Based on a rule regarding the feature values, we can classify the matrices: Some instances (matrices) fulfill the rule and are therefore labeled as true, whereas all other instances do not fulfill the rule and are labeled as false. We define four basic patterns as the four rules that define our classification task.
Matrices that fulfill the diagonal rule have at least one diagonal line that is labeled black, either starting in the upper left block and continuing to the lower right block, or starting in the lower left block and ending in the upper right block.
Matrices that fulfill the horizontal rule have at least one horizontal row of only black elements.
The numbers rule is satisfied if five elements in total are labeled black.
Symmetry describes axis symmetry, either to the middle column or the middle row of the matrix.
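The four rules above can be expressed as simple predicates over a 3x3 binary matrix. The following sketch (the function names and the enumeration are our own illustration, not the experiment software) implements each rule and enumerates all 512 matrices:

```python
import itertools

# A matrix is a list of three rows, each a list of three binary features
# (1 = black, 0 = white).

def diagonal(m):
    # At least one fully black diagonal (upper-left to lower-right,
    # or lower-left to upper-right).
    return (all(m[i][i] == 1 for i in range(3))
            or all(m[2 - i][i] == 1 for i in range(3)))

def horizontal(m):
    # At least one horizontal row of only black elements.
    return any(all(cell == 1 for cell in row) for row in m)

def numbers(m):
    # Five elements in total are labeled black.
    return sum(sum(row) for row in m) == 5

def symmetry(m):
    # Axis symmetry with respect to the middle column or the middle row.
    return (all(m[r][0] == m[r][2] for r in range(3))
            or all(m[0][c] == m[2][c] for c in range(3)))

# Enumerate all 2**9 = 512 matrices (the machine view is the flat 0/1 list).
all_matrices = [[list(bits[0:3]), list(bits[3:6]), list(bits[6:9])]
                for bits in itertools.product([0, 1], repeat=9)]
positives = {rule.__name__: sum(rule(m) for m in all_matrices)
             for rule in (diagonal, horizontal, numbers, symmetry)}
```

Counting the positives per rule also shows why the experiment draws label-balanced samples: each rule is fulfilled by far fewer than half of the 512 matrices (for example, 126 of 512 for the numbers rule), so unbiased sampling would yield imbalanced labels.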
3.2 Experiment design
To compare the learning performance of humans and machines, both solve the same task (introduced in Section 3.1.2) in an experimental setting. In the end, the results should render a learning curve. To generate a learning curve for a specific rule, we define a game with multiple rounds. During the game, the rule does not change. At the start, the player receives access to five labeled instances (training data). We ensure that the probability of each instance being labeled positive is 50% (and accordingly 50% negative) to account for imbalances of positive and negative labeled instances in the data set based on the selected rule. Additionally, the player receives five unlabeled instances (test data) that have to be labeled based on the knowledge derived from the labeled training instances. The probability for each instance to be labeled positive remains 50%, as explained before. We then measure the performance on the test data with the accuracy metric, which is defined as the number of correctly labeled instances divided by the total number of labeled instances. As labeling is only a binary decision in our work, an accuracy of “1” corresponds to 100% correct labeling, whereas an accuracy of “0.5” is equivalent to a random guess where labels are randomly assigned. The accuracy of the labeling of the five instances represents the performance in the first round.
An instance consists of nine elements and a binary label that indicates if the instance fulfills the rule or not.
In every round, humans and machines get five (additional) labeled instances and five new instances to label.
A game has either 10 (humans) or 20 (machines) rounds.
An experiment is finished when four games with four different rules are played.
In the second round, the previously labeled instances disappear and five new, unlabeled instances are displayed (new test instances). Five additional labeled instances are shown, leading to a total of 10 labeled instances available for training. The labeling of the five new unlabeled instances in the second round determines the performance in Round 2. Evidently, additional rounds follow the same pattern. This is depicted in Figure 3. The order of labeled and unlabeled instances is randomized in every game. However, one matrix (instance) will only be part of either the training or testing data, not both. The learning curve is generated based on the performances in each round.
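The round structure described above can be sketched as follows. The names (`split_game`, `accuracy`), the flat 9-tuple encoding, and the example rule are illustrative assumptions, not the original experiment code:

```python
import itertools
import random

# Instances are flat 9-tuples of 0/1 features (the machine representation).
ALL = list(itertools.product([0, 1], repeat=9))

def horizontal(m):
    # Example rule: at least one all-black row in the 3x3 matrix.
    return any(all(m[3 * r + c] == 1 for c in range(3)) for r in range(3))

def split_game(rule, rounds=10, per_round=5, seed=0):
    """Per round, add `per_round` new labeled training instances and draw
    `per_round` fresh test instances; training and test never overlap,
    and each drawn instance is positive with probability 50%."""
    rng = random.Random(seed)
    pos = [m for m in ALL if rule(m)]
    neg = [m for m in ALL if not rule(m)]
    rng.shuffle(pos)
    rng.shuffle(neg)

    def draw(n):
        # Each instance is positive with probability 0.5; popping from
        # the pools guarantees an instance is never reused.
        return [(pos if rng.random() < 0.5 else neg).pop() for _ in range(n)]

    train, schedule = [], []
    for _ in range(rounds):
        train = train + draw(per_round)   # labeled training data accumulates
        test = draw(per_round)            # test instances are fresh each round
        schedule.append((train, test))
    return schedule

def accuracy(predict, rule, test):
    # Correctly labeled instances divided by all labeled instances;
    # 0.5 corresponds to random guessing on balanced data.
    return sum(predict(m) == rule(m) for m in test) / len(test)
```

Because instances are popped from shuffled pools, a matrix appears either in the training or the test data, never in both, matching the setup described above.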
3.2.1 Experiment with humans
The experiment with humans is conducted with study participants in different sessions. They participate in the experiment individually and without any prior knowledge. In advance, they receive a standardized introduction covering the general aim of the experiment, the layout of the user interface, and some abstract examples. This introduction is available before and during the experiment in printed form, and participants can use scrap paper and a pencil to make notes.
Every participant has the possibility to play all four rules, leading to four games in total. The total number of rounds per game is limited to 10, which means that the participants see 50 labeled instances in total and have the opportunity to label 50 instances during one game. After finishing one game, a participant does not receive any feedback about his/her performance. This ensures independent games, as a participant is not influenced regarding the following games. The order of rules is randomized for each participant. Figure 4 shows an example of the GUI of the experiment with humans for the rule symmetry.
The experiment is organized and recruited with the software hroot (Bock et al., 2014). In total, 44 people participate in two sessions, with 19 people in the first session and 25 in the second one. There are 20 female and 24 male participants, with an average age of 26 years (SD = 9.6). Most experimentees (91%) are currently enrolled at a German university, majoring in 17 different disciplines, mainly Industrial Engineering and Management (10 students) and Computer Science (seven students).
Before each session, the experimentees are given instructions on how the experiment works and about their tasks. These instructions are also available on the screen before every game and in printed form during a game. Every person conducts the experiment individually in soundproof cubicles, using a computer. The time limit to complete the experiment with all four rules is set to one hour. The participants are incentivized by an individual payment (Kvaløy et al., 2015) which is based on their performance relative to that of all other participants in the same session and which ranges from €16 (best performance) to €7 (worst performance).
3.2.2 Experiment with machines
The experiment with machines is conducted with three different supervised machine learning algorithms, namely a logistic regression (linear), a decision tree (propositional), and a neural network algorithm (non-linear). We do not perform extensive parameter tuning for the algorithms, because this would require a large number of instances (which our task deliberately limits). This is consistent with the human experiments, because the study participants have no option of playing the game in advance to gain additional knowledge that facilitates completing the task. To increase comparability, every algorithm is applied to every game, with the same number of resulting models as the number of humans who played the game. Since we are not limited by resources like time, money, or room availability as in the experiment with humans, we can double the number of rounds to 20, which leads to 100 labeled training instances. While our main focus is on the comparison of the first 10 rounds between humans and machines, we are curious about how a machine’s performance develops with additional samples. The algorithm is instantiated for one game only and terminated after every game, so that knowledge from previous games is not used.
4 Results
After presenting the experiment design in the previous section, we conduct the experiment with humans and the different machine learning algorithms. In this section, we evaluate the experiment conducted by humans (Section 4.1) and by the machine learning models (Section 4.2) in detail. In a follow-up step, we compare the human results to those of the machine learning models (Section 4.3).
4.1 Experiment with humans
The experiment was conducted in two sessions, as described in Section 3.2.1. Of the participants, 91% (40 out of 44) finished all four rules within the time limit of one hour. The order of the games is random. We therefore have 42 datasets for the rules diagonal and horizontal, and 43 for the rules numbers and symmetry.
We use analysis of variance (ANOVA) (Girden, 1992) to analyze the dataset in a first step (Table 2). A one-way ANOVA across the two sessions shows no significant difference in performance, which indicates that the means of the two sessions do not differ statistically. We can therefore analyze all sessions together. To determine the influence of the rules and the number of training instances, we use two-way ANOVAs with replication, since we have a set of paired data where one individual has played several games as well as 10 rounds (= 10 data points) within a game. Since two-way ANOVAs with replication require an equal set of paired data per individual, we exclude the four participants who did not play all four games. Rules and instances show a high statistical significance in performance, as expected in our research question. We account for this later by looking at each rule independently and using learning curves that display the performance for each number of training instances separately, without aggregation.
The performance of the order of games is also statistically significant. This could indicate something like a learning effect between the games, but analyzing this finding is beyond the scope of this article and can be investigated in future research.
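The one-way session check can be reproduced in a few lines with scipy; the accuracy scores below are synthetic placeholders (the per-participant data is not reproduced here), so only the procedure, not the result, mirrors the analysis above:

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic per-participant mean accuracies (placeholder values, not the
# actual experimental data): 19 participants in session 1, 25 in session 2.
rng = np.random.default_rng(0)
session_1 = rng.normal(loc=0.75, scale=0.10, size=19).clip(0, 1)
session_2 = rng.normal(loc=0.75, scale=0.10, size=25).clip(0, 1)

# One-way ANOVA: a large p-value indicates no significant difference
# between session means, which licenses pooling both sessions.
f_stat, p_value = f_oneway(session_1, session_2)
```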
4.2 Experiment with supervised machine learning models
Corresponding to the number of human experimentees who played one rule, we respectively use 42 or 43 machine learning models for each of our three types of machine learning algorithms—regression, decision trees, and neural networks—to play a game for a certain rule. The games are played in the same way that the humans conduct the experiment, seeing five labeled training instances in the first round, and the performance is determined by labeling five instances. As regression algorithm, we choose a logistic regression (Pedregosa et al., 2011) with an L-BFGS solver (Liu and Nocedal, 1989). The decision tree algorithm used is a DecisionTreeClassifier (Breiman et al., 1984; Pedregosa et al., 2011), and we choose a Multilayer Perceptron (MLP) (Glorot and Bengio, 2010; Pedregosa et al., 2011) with an L-BFGS solver as our neural network algorithm. In the following, we use the terms MLP, decision tree, and logistic regression, knowing that we always compare the aggregate of 42 or 43 individual performances of these machine learning algorithms.
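The three model families can be instantiated with scikit-learn as cited above (Pedregosa et al., 2011). Hyperparameters beyond the L-BFGS solver are left at their defaults, which is an assumption on our part, and the tiny hand-made round below (horizontal rule) is for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def make_models():
    # Fresh, untuned models; re-instantiated per game so that no knowledge
    # carries over between games.
    return {
        "logistic regression": LogisticRegression(solver="lbfgs"),
        "decision tree": DecisionTreeClassifier(random_state=0),
        "MLP": MLPClassifier(solver="lbfgs", random_state=0, max_iter=500),
    }

# One illustrative round for the rule horizontal: rows of 9 binary features,
# label 1 if any 3-element row of the 3x3 matrix is all black.
X_train = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 1, 0, 1, 1, 1, 0, 1, 0],
                    [1, 0, 1, 0, 1, 0, 1, 0, 1]])
y_train = np.array([1, 0, 1, 0])
X_test = np.array([[0, 0, 0, 1, 1, 1, 0, 0, 0],
                   [1, 0, 0, 1, 0, 0, 1, 0, 0]])
y_test = np.array([1, 0])

scores = {}
for name, model in make_models().items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on fresh test data
```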
To compare the results of the three machine learning algorithms with the human performance, we analyze each rule through the learning curves our experiment generated. Figure 5 depicts the results for the rule diagonal. The number of training instances is displayed on the x-axis. The left y-axis belongs to the line charts and shows the average accuracy of all experimentees with the given number of training instances.
The right y-axis belongs to the bar chart, which shows three different levels of statistical significance (p < 0.05, p < 0.01, p < 0.001) between the performance of the machine learning models and that of humans. From 55 training instances onward, there is no corresponding human data, and the significance refers to the performance difference between machine learning models and humans with 50 training instances. The statistical significance of the performance difference is calculated by a two-sided t-test for unequal variances (Yuen, 1974). As this results in multiple t-tests on the same dataset, we control the false discovery rate (FDR) with the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995).
Regarding the rule diagonal (Figure 5), the decision tree outperforms all other machine learning models and the human participants. Within the first 50 training instances, there is only one case in which the decision tree does not perform significantly better than the humans. Beginning with 55 training samples, the decision tree performs significantly better than humans with 50 instances. In contrast, the MLP and the logistic regression show accuracy similar to the humans’ and do not improve significantly in later rounds. These machine learning models therefore do not significantly outperform humans with 50 training instances.
Figure 6 displays the results for the rule horizontal. In contrast to the previous chart, humans significantly outperform the machine learning models in the first 50 training instances. However, the statistical significance of the performance difference decreases as the machine learning models receive more training instances. Beginning with 55 training instances, the performance of humans with 50 instances and machines with 55 instances no longer differs significantly. Among the machine learning models, accuracy is on a comparable level; only toward the end does the performance of the logistic regression deteriorate slightly.
The rule numbers (Figure 7) yields the highest human accuracy across all four rules. Starting with 15 training instances, the performance is always at or above 90%. The accuracy of the three machine learning models shows no improvement and remains at around “0.5” for the entire 100 training instances. The performance difference between humans and machine learning models is therefore significant for the experiment across all rounds.
The results for the rule symmetry are depicted in Figure 8. Similar to the rule numbers, humans outperform the machine learning models. With five training instances, the human performance is already significantly better than that of the MLP, the decision tree, and the logistic regression. Afterwards, the human performance improves more than the machine performance and the differences become highly significant. However, the human performance reaches its maximum accuracy just below 0.9 after 20 instances and remains on this level, whereas the accuracy of the MLP and the decision tree improves slightly from round to round. After 50 training instances, the significance level decreases.
The results of the experiments provide grounds for interpretation. We discuss possible explanations of the observed values. This is by no means a full explanation of the shape of every learning curve; however, we give insights into how different theories in computer science and cognitive psychology can explain some of the outlined results. We start with explanations of the human performance (Section 5.1) and continue with the machine experiment (Section 5.2).
5.1 Experiment with humans
Given our experimental setup and its results, we employ theories from the area of cognitive psychology, as introduced in Section 2.1, to interpret our results. Other areas of human learning, such as social cognitive theory and sociocultural theory, are less applicable, as the experiment is performed individually.
The human performance shows two key characteristics across all four rules: high accuracy when labeling the first five instances (never below 60%, which outperforms the supervised machine learning models in three of the four rules) and only small performance improvements after learning with 20 or more training instances.
An explanation for the first observation is grounded in the concept of one-shot learning (Lee et al., 2015). Besides incremental learning, where humans learn step by step through trial and error (Thorndike, 1913), humans are also capable of one-shot learning, that is, learning from a single instance. When a child touches a hot stove plate, they will immediately learn not to do it again; this single training instance illustrates one-shot learning. In object recognition, one-shot learning enables humans to recognize objects after one instance by relating the newly seen object to prior knowledge (Fei-Fei et al., 2006). Although the patterns used in the experiment are highly abstract, humans can still connect them to known shapes.
The second finding can be explained by cognitive load theory (CLT) (Sweller et al., 1998; Shaffer, 2017; Fan et al., 2010). CLT describes the learning process as the combination of three loads: the intrinsic cognitive load, arising from the difficulty and complexity of the learning subject; the extraneous cognitive load, which originates in the presentation of the learning subject; and the germane cognitive load, which describes the learning capacity humans need to understand the learning subject (Paas et al., 2003). The working memory, where the cognitive process takes place, is essential for learning. Working memory itself is limited (Ayres and Paas, 2009; Van Merrienboer and Sweller, 2005), as everyone can experience when playing the board game "Memory" and being unable to memorize every card that has been revealed. Our interpretation of the data suggests that 20 training instances are the limit for humans' working memory; more training instances only lead to cognitive overload (Moreno and Mayer, 1999), not to improved performance. An additional explanation is the fatigue effect: the longer a participant plays the same rule, the more their performance is negatively affected by fatigue (Gonzalez et al., 2011), which counteracts the positive effect of enlarging the data basis by seeing more instances.
Regarding the human performance per rule, the learning curve for the rule diagonal is unique: the accuracy in the first round is the second lowest, the maximum performance is the lowest of all four rules, and the machine learning models reach similar accuracy or even outperform humans. All these findings indicate that the human performance is particularly poor compared to the other rules. When looking at instances fulfilling the rule diagonal, as shown in Figure 9, one can see that the elements forming the diagonal line are not joined on their sides but only linked via their corners. Similar to an optical illusion (Coren et al., 1978), such as the well-known rabbit-duck illustration in which some people see a rabbit and others a duck, the diagonal lines can "disappear" in some instances while other possible rules come into view. It therefore becomes harder for humans to see the diagonal line as a potential rule.
Humans show the best performance for the rule numbers, which implies that this rule benefits human learning the most. Cognitive fit theory (CFT) (Vessey, 1991) posits that matching the type of presentation to the task leads to superior task performance; for example, finding the maximum value in a numerical dataset is completed more quickly by humans when the data are plotted as a graph rather than listed in a table. In our experiment, the matrix representation of the data favors counting the number of true elements as well as comparing the counts between training instances.
Although the mentioned theories explain many findings from our experiments with humans, we have to keep in mind the statistical significance of the performance differences between the different games. In future research, the results of every game should be analyzed individually, which may surface further theories explaining other aspects of human performance.
5.2 Experiment with machines
There are two general findings regarding the machine learning models across all four rules: the performance after five training instances is similar to or lower than the human performance, and the machine learning models' performance correlates negatively with the complexity of the individual rules. (Here, complexity can be understood as the number of constraints or basic components needed to describe the pattern; the more basic components, the more complex the problem.)
The first finding relates to the one-shot learning (Lee et al., 2015) discussed in Section 5.1. In contrast to humans, all three machine learning models are limited to incremental learning: they require a certain amount of training instances to perform properly. However, there are machine learning algorithms specifically designed for one-shot learning, which is an interesting topic for future work; the Bayesian one-shot algorithm (Fei-Fei et al., 2006), for example, is able to learn from a single instance.
Regarding the second finding, the complexity of each rule can be defined by its number of basic components. The rule diagonal consists of two basic components: either a diagonal line starting in the upper left corner and continuing to the lower right corner, or one starting in the lower left corner and ending in the upper right corner. This rule leads to the best accuracy numbers of all rules, even outperforming humans. The rule horizontal is the combination of three basic components: a line in the first, the second, or the third row. With one additional basic component, the performance is slightly lower and is outperformed for the same number of training instances. The learning curve for the rule symmetry shows the third-best performance. Because we use an uneven number of rows and columns in our matrix, the axis of symmetry lies on three elements, so six elements of our matrix are used for the rule. To fulfill the rule symmetry, one has to check pairwise whether two elements on opposite sides have the same value, which leads to three pairwise comparisons; with two different symmetry axes, horizontal and vertical, this rule consists of six basic components. The rule numbers cannot simply be broken down into a number of basic components at all, which may explain the worst performance of the machine learning models across all four rules.
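To make the notion of basic components concrete, the four rules can be written as short predicates over a boolean matrix. The 3x3 grid size and the exact encodings below are our illustrative reconstruction rather than the original task definition; in particular, the criterion for rule numbers is a hypothetical placeholder, since that rule resists a component-wise decomposition:

```python
import numpy as np

def rule_diagonal(m):
    # 2 basic components: main diagonal or anti-diagonal fully true
    return bool(np.diag(m).all() or np.diag(np.fliplr(m)).all())

def rule_horizontal(m):
    # 3 basic components: a fully true line in row 0, 1, or 2
    return any(bool(m[r].all()) for r in range(3))

def rule_symmetry(m):
    # 6 basic components: 3 mirror pairs per axis, for two axes
    return bool(np.array_equal(m, np.fliplr(m)) or np.array_equal(m, np.flipud(m)))

def rule_numbers(m, target=5):
    # no fixed decomposition: depends on a global count of true cells
    # (the concrete counting criterion here is an assumption)
    return int(np.sum(m)) == target
```

Under this encoding, both diagonals pass through the central cell of an odd-sized matrix, which is the shortcut a decision tree can exploit for the rule diagonal.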
Analyzing the machine learning model performances for each rule individually, the decision tree shows remarkable accuracy for the rule diagonal, even outperforming humans. On the one hand, this relates to the comparatively poor performance of humans discussed in Section 5.1. On the other hand, the rule diagonal is unique: the feature referring to the central cell of the matrix is true irrespective of the direction of the diagonal line. This circumstance is easy to detect via a decision tree and is a good indication of whether an instance follows the rule or not.
The rules numbers and symmetry require the combination of several features: either counting features (numbers) or comparing features irrespective of their binary status (symmetry). A logistic regression only looks at each feature individually and fails to detect both rules correctly. In machine learning, the process of feature engineering is therefore utilized frequently; the machine learning model is trained with additional, often human-generated features that are combinations of other (original) features (Yu et al., 2010). For example, without feature engineering, a decision tree is not able to find the feature count that is necessary for the rule numbers. Furthermore, it has problems with detecting differences and ratios, which are essential for the rule symmetry (Heaton, 2016). This may explain the poor performance of the decision tree for the rule symmetry and its failure to learn the rule numbers. In contrast, a neural network like the MLP used here can generate complex features such as counts through its layer structure (Heaton, 2016); the better performance of the MLP compared to the other machine learning models is visible for the rule symmetry. However, the MLP also fails to learn the rule numbers without feature engineering. This may be due to the low number of training instances or an unsuitable default configuration of layers and neurons for the rule numbers.
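This feature-engineering effect can be sketched with a hypothetical counting rule over nine binary features (a flattened matrix; the rule, sizes, and threshold below are our assumptions, not the original experiment). A decision tree trained on the raw cells merely memorizes the training data, while the same learner given a single engineered count feature can generalize from the same 50 instances:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Hypothetical counting rule: an instance is positive iff exactly
# TARGET of its nine binary cells are true.
TARGET = 5
X = rng.integers(0, 2, size=(50, 9))
y = (X.sum(axis=1) == TARGET).astype(int)

# Raw cells only: the tree must approximate counting via many splits
# and overfits the 50 training instances.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)

# Engineered feature: the count of true cells. Two thresholds
# (count <= 4.5 and count <= 5.5) now separate the classes exactly.
count = X.sum(axis=1, keepdims=True)
tree_count = DecisionTreeClassifier(random_state=0).fit(count, y)
```

On held-out instances, the count-based tree classifies perfectly, whereas the raw-feature tree typically does not, mirroring the failure of our models on the rule numbers.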
In accordance with Vapnik et al. (1994), future work could analyze the number of examples needed, depending on the class of learner and the class of pattern.
This article provides first insights into how learning performance differs between humans and SML models when training data is limited, comparing humans with three different types of SML models. The results of our experiment show a high dependency between performance and the underlying rules of the task. Whereas humans perform relatively similarly across all rules, SML models show large differences between the various patterns. Overall, as expected, humans seem to learn more from a small number of instances than machines do. Interestingly, we observe large differences in the learning curves of our SML models for the different rules applied in our experiment. For half of the rules, SML models reach the same level as humans or even outperform them; for the other half, SML models struggle to learn the respective patterns, as those require a deeper understanding that could only be gained through a more complex combination of input features, i.e., feature engineering. After 20 training instances, human performance does not improve anymore in our experiment, arguably due to cognitive overload. Machines learn more slowly and need more training instances than humans.
Our experiment design comes with several limitations. The number of experiment participants could be increased, which would lead to more statistically significant results. In addition, we chose three supervised machine learning algorithms out of hundreds of possible algorithms and parameter combinations; our selection can only provide a hint of how SML performs in general. The task characteristics have likewise been selected out of a whole set of possibilities. In the future, other task combinations need to be examined to answer the question of how learning performance differs between humans and machines in general.
This work shows that further research on the application of supervised machine learning is needed. It is crucial for the application of SML, e.g., as part of autonomous agents, to gain a reliable understanding of which tasks and task characteristics are suitable for automation with SML models. From a business perspective, more research is required on the cost-benefit ratio of replacing human tasks with SML models: automation may come with lower task performance, but provides its own benefits. We also stress the need for supervised machine learning algorithms specifically designed for limited training data, apart from inductive and genetic programming. Continuing the outlined road map of task characteristics and examining other combinations of task characteristics in the future, the individual results of each combination will form a more general understanding of the differences between human learning and SML.
- Comparing human and automatic face recognition performance. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37 (5), pp. 1248–1255. Cited by: §1, §2.3.
- Interdisciplinary perspectives inspiring a new generation of cognitive load research. Educational Psychology Review 21 (1), pp. 1–9. Cited by: §5.1.
- Challenges in the deployment and operation of machine learning in practice. Proceedings of the 27th European Conference on Information Systems. Cited by: §1.
- Social foundations of thought and action. Englewood Cliffs, NJ, 1986. Cited by: §2.1.
- Genetic programming. Springer. Cited by: §2.2.
- Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1), pp. 289–300. Cited by: §4.3.
- Hroot: hamburg registration and organization online tool. European Economic Review 71, pp. 117–120. Cited by: §3.2.1.
- Classification and regression trees. The Wadsworth statistics/probability series, Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, CA. Cited by: §4.2.
- An analysis of the interaction between intelligent software agents and human users. Minds and machines 28 (4), pp. 735–774. Cited by: §2.3.
- The challenges of partially automated driving. Communications of the ACM. Cited by: §1.
- Google deep mind’s alphago. OR/MS Today 43 (5), pp. 24–29. Cited by: §1.
- The effect of optical blur on visual-geometric illusions. Bulletin of the Psychonomic Society 11 (6), pp. 390–392. Cited by: §5.1.
- Hybrid intelligence. Business & Information Systems Engineering, pp. 1–7. Cited by: §2.3.
- Machine learning. ACM Comput. Surv. 28 (4es). External Links: Cited by: §2.2.
- Investigating human priors for playing video games. In International Conference on Machine Learning, pp. 1348–1356. Cited by: §2.3.
- Cognitive science in the era of artificial intelligence: a roadmap for reverse-engineering the infant language-learner. Cognition 173, pp. 43–59. Cited by: §1.
- Über das gedächtnis: untersuchungen zur experimentellen psychologie. Duncker & Humblot. Cited by: §2.1.
- Humans, but not deep neural networks, often miss giant targets in scenes. Current Biology 27 (18), pp. 2827–2832. Cited by: §2.3.
- Adversarial examples that fool both computer vision and time-limited humans. In Advances in Neural Information Processing Systems, pp. 3910–3920. Cited by: §2.3.
- Cognitive psychology: a student’s handbook. Psychology Press. Cited by: §2.1.
- Learning hmm-based cognitive load models for supporting human-agent teamwork. Cognitive Systems Research 11 (1), pp. 108–119. Cited by: §5.1.
- “Cognition” and dynamical cognitive science. Minds and Machines 27 (2), pp. 331–355. Cited by: §1.
- One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4), pp. 594–611. Cited by: §5.1, §5.2.
- On simulating one-trial learning using morphological neural networks. Cognitive Systems Research 53, pp. 61–70. Cited by: §2.2.
- Michael S. Gazzaniga, George R. Mangun (Eds.): the cognitive neurosciences, 5th edition. Springer. Cited by: §2.3.
- ANOVA: repeated measures. Sage. Cited by: §4.1.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §4.2.
- A cognitive modeling account of simultaneous learning and fatigue effects. Cognitive Systems Research 12 (1), pp. 19–32. Cited by: §5.1.
- When will ai exceed human performance? evidence from ai experts. Journal of Artificial Intelligence Research 62, pp. 729–754. Cited by: §1.
- Probabilistic models of cognition: exploring representations and inductive biases. Trends in cognitive sciences 14 (8), pp. 357–364. Cited by: §1.
- Google and the mind: predicting fluency with pagerank. Psychological Science 18 (12), pp. 1069–1076. Cited by: §1.
- The elements of statistical learning: data mining, inference and prediction. Vol. 9, Springer. Cited by: §2.2.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
- An empirical analysis of feature engineering for predictive modeling. In SoutheastCon 2016, pp. 1–6. Cited by: §5.2.
- Computer models solving intelligence test problems: progress and implications. Artificial Intelligence 230, pp. 74–107. Cited by: §2.3.
- Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement. Artificial Intelligence Review 48 (3), pp. 397–447. Cited by: §1.
- The measure of all minds: evaluating natural and artificial intelligence. Cambridge University Press. Cited by: §1, §2.3.
- Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine 29. Cited by: §1.
- An end-to-end process model for supervised machine learning classification: from problem to deployment in information systems. In Designing the Digital Transformation: DESRIST 2017 Research in Progress Proceedings of the 12th International Conference on Design Science Research in Information Systems and Technology. Karlsruhe, Germany. 30 May-1 Jun., pp. 55–63. Cited by: §2.2.
- Can machines think? an old question reformulated. Minds and Machines 20 (2), pp. 203–212. Cited by: §2.2.
- Looking beyond the brain: social neuroscience meets narrative practice. Cognitive Systems Research 34, pp. 5–17. Cited by: §2.3.
- Comparing humans and ai agents. In International Conference on Artificial General Intelligence, pp. 122–132. Cited by: §2.3.
- Machine learning: trends, perspectives, and prospects. Science 349 (6245), pp. 255–260. Cited by: §2.2, §2.2.
- Reinforcement learning: A survey. Journal of artificial intelligence research 4, pp. 237–285. Cited by: §2.2.
- Do neural networks show gestalt phenomena? an exploration of the law of closure. arXiv preprint arXiv:1903.01069. Cited by: §2.3.
- Celebration of twenty years promoting cognitive science. Cognitive Systems Research 100 (43), pp. 125–127. Cited by: §1.
- Gestalt psychology. Psychological research 31 (1), pp. XVIII–XXX. Cited by: §2.3.
- Empirical tests of a theory of human acquisition of concepts for sequential patterns. Cognitive Psychology 4 (3), pp. 399–424. Cited by: §2.1.
- Automated design of both the topology and sizing of analog electrical circuits using genetic programming. In Artificial Intelligence in Design ’96, pp. 151–170. External Links: Cited by: §2.2.
- Hidden benefits of reward: a field experiment on motivation and monetary incentives. European Economic Review 76, pp. 188–199. Cited by: §3.2.1.
- Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §1.
- Building machines that learn and think like people. Behavioral and brain sciences 40. Cited by: §1.
- Neural computations mediating one-shot learning in the human brain. PLoS biology 13 (4). Cited by: §1, §5.1, §5.2.
- Burn-in, bias, and the rationality of anchoring. In Advances in neural information processing systems, pp. 2690–2798. Cited by: §2.1.
- The knowledge level in cognitive architectures: current limitations and possible developments. Cognitive Systems Research 48, pp. 39–55. Cited by: §2.2.
- Semi-supervised classification of network data using very few labels. In 2010 International Conference on Advances in Social Networks Analysis and Mining, pp. 192–199. Cited by: §2.2.
- On the limited memory bfgs method for large scale optimization. Mathematical programming 45 (1-3), pp. 503–528. Cited by: §4.2.
- A rational model of function learning. Psychonomic bulletin & review 22 (5), pp. 1193–1215. Cited by: §1.
- The minimum intelligent signal test (mist) as an alternative to the turing test. Diametros 16 (59), pp. 35–47. Cited by: §3.1.2.
- The forthcoming artificial intelligence (ai) revolution: its impact on society and firms. Futures 90, pp. 46–60. Cited by: §1.
- The algebraic mind: integrating connectionism and cognitive science. MIT press. Cited by: §1, §2.3.
- Minimum intelligent signal test: an objective turing test. Canadian Artificial Intelligence, pp. 17–18. Cited by: §3.1.2.
- Visual presentations in multimedia learning: conditions that overload visual working memory. In International Conference on Advances in Visual Information Systems, pp. 798–805. Cited by: §5.1.
- Inductive logic programming. New generation computing 8 (4), pp. 295–318. Cited by: §1.
- Future progress in artificial intelligence: a survey of expert opinion. In Fundamental issues of artificial intelligence, pp. 555–572. Cited by: §1.
- How to build a robot that is conscious and feels. Minds and Machines 22 (2), pp. 117–136. Cited by: §2.2.
- Inductive functional programming using incremental program transformation. Artificial intelligence 74 (1), pp. 55–81. Cited by: §1, §2.2.
- Artificial cognitive systems that can answer human creativity tests: an approach and two case studies. IEEE Transactions on Cognitive and Developmental Systems 10 (2), pp. 469–475. Cited by: §2.3.
- Human learning. Merrill London. Cited by: §2.1, §2.1.
- Development of computational models of emotions: a software engineering perspective. Cognitive Systems Research 60, pp. 1–19. Cited by: §2.2.
- Cognitive load theory and instructional design: recent developments. Educational Psychologist 38 (1), pp. 1–4. Cited by: §5.1.
- Scikit-learn: machine learning in python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.2.
- Tree induction vs. logistic regression: a learning-curve analysis. Journal of Machine Learning Research 4 (Jun), pp. 211–255. Cited by: §1.
- Evaluating (and improving) the correspondence between deep neural networks and human representations. Cognitive science 42 (8), pp. 2648–2669. Cited by: §2.3.
- Big data analytics in healthcare: promise and potential. Health information science and systems 2 (1), pp. 3. Cited by: §1.
- Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience 38 (33), pp. 7255–7269. Cited by: §2.3.
- The raven’s progressive matrices: change and stability over culture and time. Cognitive Psychology 41 (1), pp. 1–48. Cited by: §3.1.2.
- Active and semi-supervised learning for object detection with imperfect data. Cognitive Systems Research 45, pp. 109–123. Cited by: §2.2.
- Social learning and cognition. Academic Press. Cited by: §2.1.
- Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §1.
- Inductive rule learning on the knowledge level. Cognitive Systems Research 12 (3-4), pp. 237–248. Cited by: §2.2.
- Machines as teammates: a research agenda on ai in team collaboration. Information & Management. Cited by: §2.3.
- Active learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §2.2.
- Cognitive load and issue engagement in congressional discourse. Cognitive Systems Research 44, pp. 89–99. Cited by: §5.1.
- Human-extended machine cognition. Cognitive Systems Research 49, pp. 9–23. Cited by: §2.3.
- Cognitive psychology. Nelson Education. Cited by: §2.1.
- Cognitive architecture and instructional design. Educational Psychology Review 10 (3), pp. 251–296. Cited by: §5.1.
- Can we develop artificial agents capable of making good moral decisions?. Minds and Machines 21 (3), pp. 465–474. Cited by: §2.2.
- How to grow a mind: statistics, structure, and abstraction. science 331 (6022), pp. 1279–1285. Cited by: §1.
- Mind: introduction to cognitive science. Vol. 17, MIT press Cambridge, MA. Cited by: §1.
- The psychology of learning. Vol. 2, Teachers College, Columbia University. Cited by: §5.1.
- Cognitive load theory and complex learning: recent developments and future directions. Educational psychology review 17 (2), pp. 147–177. Cited by: §5.1.
- Measuring the vc-dimension of a learning machine. Neural computation 6 (5), pp. 851–876. Cited by: §5.2.
- Computational creativity: a continuing journey. Minds and Machines 20 (4), pp. 483–487. Cited by: §2.2.
- Why machines cannot feel. Minds and Machines 20 (1), pp. 1–18. Cited by: §2.2.
- Cognitive fit: a theory-based analysis of the graphs versus tables literature. Decision Sciences 22 (2), pp. 219–240. Cited by: §5.1.
- Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §2.2.
- One and done? optimal decisions from very few samples. Cognitive science 38 (4), pp. 599–637. Cited by: §2.1.
- Thought and language. Annals of Dyslexia 14 (1), pp. 97–98. Cited by: §2.1.
- Mind in society: the development of higher psychological processes. Harvard University Press. Cited by: §2.1.
- CVAP: validation for cluster analyses. Data Science Journal. Cited by: §2.2.
- Comparing human and computational models of music prediction. Computer Music Journal 18 (1), pp. 70–80. Cited by: §1, §2.3.
- Computational models of ethical decision-making: a coherence-driven reflective equilibrium model. Cognitive Systems Research 46, pp. 61–74. Cited by: §2.2.
- Feature engineering and classifier ensemble for kdd cup 2010. In KDD Cup, pp. 1–16. Cited by: §5.2.
- The two-sample trimmed t for unequal population variances. Biometrika 61 (1), pp. 165–170. Cited by: §4.3.
- Humans can decipher adversarial images. Nature communications 10 (1), pp. 1–9. Cited by: §2.3.