Can you hear me now? Sensitive comparisons of human and machine perception

03/27/2020
by Michael A. Lepori, et al.
Johns Hopkins University

The rise of sophisticated machine-recognition systems has brought with it a rise in comparisons between human and machine perception. But such comparisons face an asymmetry: Whereas machine perception of some stimulus can often be probed through direct and explicit measures, much of human perceptual knowledge is latent, incomplete, or embedded in unconscious mental processes that may not be available for explicit report. Here, we show how this asymmetry can cause such comparisons to underestimate the overlap in human and machine perception. As a case study, we consider human perception of adversarial speech – synthetic audio commands that are recognized as valid messages by automated speech-recognition systems but that human listeners reportedly hear as meaningless noise. In five experiments, we adapt task designs from the human psychophysics literature to show that even when subjects cannot freely transcribe adversarial speech (the previous benchmark for human understanding), they nevertheless can discriminate adversarial speech from closely matched non-speech (Experiments 1–2), finish common phrases begun in adversarial speech (Experiments 3–4), and solve simple math problems posed in adversarial speech (Experiment 5) – even for stimuli previously described as "unintelligible to human listeners". We recommend the adoption of sensitive tests of human and machine perception, and discuss the broader consequences of this approach for comparing natural and artificial intelligence.


Introduction

How can we know when a machine and a human perceive a stimulus the same way? Machine-recognition systems increasingly rival human performance on a wide array of tasks and applications, such as classifying images (Krizhevsky:2012wl, szegedy2016rethinking), transcribing speech (chan2016listen, lamere2003cmu), and diagnosing mechanical or biological anomalies (cha2017deep, Lakhani:2017ep). Such advances often call for comparisons between human and machine perception, in which researchers collect responses from human subjects and machine-recognition systems and then ask how similar or different those responses are. In some cases, these comparisons serve to establish benchmarks for determining when a machine-recognition system has reached “human-level” performance (e.g., by recording average human accuracy rates on visual and auditory classification tasks). In other cases, the purpose of such comparisons is subtler, as in work that explores which aspects of human perception are shared with various machine-learning systems (Schrimpf:2018dp). For example, recent work asks whether humans and machines that demonstrate comparable overall accuracy nevertheless show different patterns of errors (Eckstein:2017gz, Rajalingham:2018kv), exhibit different biases (Baker:2018jp, buolamwini:2018gs), or are susceptible to different illusions and “attacks” (Elsayed:2018ni, jacob2019deep, Ward:2019bi, Zhou:2019jxa).

Knowing more than you can say

Regardless of their goals, all comparisons between human and machine perception face a deep and persistent challenge: Whereas machine-recognition systems are often evaluated by direct and explicit measures (since their classification decisions and other outputs are openly available to an experimenter), such measures are widely understood to be inadequate tests of human perception. This is because human perception almost always involves partial knowledge or unconscious processes that may not be cognitively accessible to the perceiver, such that assessing what someone knows or perceives is rarely as straightforward as asking them to label what they see, hear, or feel.

Indeed, decades of research on human perception and cognition have revealed knowledge and abilities that were not initially evident from tasks in which subjects freely describe what they know or experience. For example, even when subjects cannot explicitly predict how a swinging ball will travel when released from a pendular trajectory, they may accurately place a cup to catch the ball — suggesting that they possessed the relevant physical knowledge all along but were able to access it only through more implicit processes (Smith:2018ff). Similarly, even when subjects incorrectly report the locations of remembered objects, they may perform reliably above chance if asked to take a second or third guess (Wu:2018iu; see also Vul:2008cz). Even when subjects report no awareness of objects that are masked or appear outside the focus of attention, they may nevertheless show priming effects for the unnoticed stimuli, suggesting that those stimuli were processed unconsciously or below the threshold for explicit report (kouider2007levels, Mack:2003hn; for critical discussion, see phillips2018unconscious). Even when subjects fail to notice statistical regularities in visual displays, they may still apply those regularities on subsequent trials, implying that they learned and incorporated such regularities implicitly (chun2000contextual, chun1998contextual). And in an especially dramatic case, patients with cortical blindness who fail to freely report features of the objects they are looking at (e.g., being unable to answer questions like “what orientation is the line in front of you?”) may nevertheless succeed under forced-choice conditions (e.g., being able to correctly answer questions like “is the line in front of you horizontal or vertical?”; weiskrantz1986blindsight, weiskrantz1996blindsight).

Sensitive tests in human-machine comparisons

The above examples involve what we will refer to here as sensitive tests of human perception and cognition. Sensitive tests are tasks or measures that go beyond simply asking someone to describe what they see, hear, or know. Such techniques include making subjects act on a piece of information (rather than report it), exploring downstream consequences for other behaviors (as in priming studies), collecting additional responses (such as ranking various options rather than giving a single answer), or using some piece of knowledge to make a discrimination (rather than trying to report that knowledge directly).

How might this apply to comparisons between humans and machines? The fact that human perceptual knowledge can be partial, incomplete, or buried beneath layers of unconscious mental processing creates a challenge for comparisons of human and machine perception. In particular, whenever such comparisons rely mostly or only on explicit descriptions of sensory stimuli, there is a risk that these measures may underestimate what the human subjects really know about the stimuli they perceive, and thereby misestimate the overlap between human and machine perceptual processing.[1]

[1] Whether similar considerations also apply to tests of machine perceptual knowledge is an interesting open question. For relevant work on this issue, see (Zoran_2015_ICCV, Ritter:2017uk).

Here, we suggest that these considerations matter in concrete and measurable ways. In particular, we argue that some apparent gaps or disconnects between human perceptual processing and the processing of various machine-perception systems can be explained in part by insufficiently sensitive tests of human perceptual knowledge. To demonstrate this, we explore an empirical “case study” of how using more sensitive tests can reveal a perceptual similarity when previous studies seemed to show a deep dissimilarity. Accordingly, we recommend that comparisons of human and machine perception adopt such sensitive tests before drawing conclusions about how their perceptual processing differs.

A case study: Adversarial misclassification

Of all the human-machine differences reported in the growing literature that conducts such comparisons, the most striking is surely the gap implied by adversarial misclassification (Szegedy:2013vw). Adversarial examples are inputs designed to cause high-confidence misclassifications in machine-recognition systems, and they may crudely be divided into two types. The first type is sometimes called a “fooling” example, in which a stimulus that would otherwise be classified as meaningless or nonsensical (e.g., patterns of image static, or auditory noise) is recognized as a familiar or valid input by a machine (e.g., a dog, or the phrase “OK Google, take a picture”; Nguyen:2015cv, Carlini:vl). The second type is a “perturbed” example, in which a stimulus that would normally be classified in one way (e.g., as an orange, or a piece of music) can be very slightly altered to make a machine classify it in a completely different way (e.g., as a missile, or the command “Call 911 now”) — even when such perturbations seem irrelevant (or are not even noticeable) to human observers (Athalye:2018ml, Szegedy:2013vw).

Adversarial misclassifications are significant for at least two kinds of reasons. First, and more practically, they expose a major vulnerability in the security of machine-perception systems: If machines can be made to misclassify stimuli in ways that humans would not notice, then it may be possible to attack such systems in their applied settings (e.g., causing an autonomous vehicle to misread a traffic sign, or making a smartphone navigate to a dangerous website) — a worry that may only intensify as such technologies become more widely adopted (Hutson:2018ks). Second, and more theoretically, the fact that the machine’s classifications seem so alien to a human perceiver suggests “an astonishing difference in the information processing of humans and machines” (brendel2020adversarial) and so threatens to undermine the recent excitement that such machines could serve as models of human perception and cognition (Kell:2019hn, kell2018task, Kriegeskorte:2015dg, Majaj:2018ha, Schrimpf:2018dp, Serre:2019cm).

Crucially, the reason adversarial misclassifications carry such important and interesting consequences is the very strong and intuitive sense that machines do perceive these stimuli in ways that humans do not. And indeed, a growing literature has sought to demonstrate this empirically, by asking human subjects to classify such stimuli and noting similarities and differences in their classification decisions (Carlini:vl, chandrasekaran2017takes, Harding:2018vx, Zhou:2019jxa; see also Baker:2018jp, feather2019metamers, golan2019controversial). But might some of these discrepancies arise in part because of the means of comparison themselves?

Can people understand adversarial speech?

As a “case study” of this possibility, we consider here the example of adversarial speech. A recent and highly influential research program shows that it is possible to generate audio signals that are recognized as familiar voice commands by automated speech-recognition systems but that human listeners hear as meaningless noise (Carlini:vl; Figure 1). In short (though see below for more detail), a normal voice command can be “mangled” by removing audio features not used by the speech-recognition system, such that it remains perfectly intelligible to that system but becomes incomprehensible to a human listener.[2] To verify that such stimuli are truly heard as meaningless non-speech, this work included an empirical comparison between the relevant speech-recognition system and a cohort of human subjects, who were played the audio clips and asked to make judgments about them. In particular, the comparison asked the subjects to transcribe the audio files, and found that no subjects were able to transcribe the adversarial speech clips into the underlying messages from which the files were produced. These and similar tests led the authors to conclude that “0%” of subjects heard the hidden messages in the files, that subjects “believed our audio was not speech”, and more generally that these commands were “unintelligible to human listeners”.

[2] According to some restrictive definitions of “adversarial examples”, stimuli only get to count as adversarial if they very closely resemble the original stimuli from which they were created. By those standards, the present case involves a misclassification, but perhaps not an adversarial one. By contrast, here we assume the broader and more popular definition given by Goodfellow et al. (goodfellow_papernot_huang_duan_abbeel_clark_2017; see also Szegedy:2013vw): “Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake”. By this definition, the case we explore here easily fits the bill.

Figure 1: Schematic illustration of adversarial speech. Whereas human listeners and automated speech-recognition systems may transcribe normal human speech with comparable accuracy, adversarial speech involves synthetic audio clips that human listeners hear as meaningless noise but that are recognized as valid commands by automated speech-recognition systems. Procedure and results drawn from (Carlini:vl).

However, accurate transcription is an almost paradigmatically “insensitive” test — an extremely high bar for studying comprehension of this sort, including for the practical and theoretical issues raised above. For example, even in an applied setting, it could be valuable to know whether a human listener can tell that a hidden command was just played, even if the listener doesn’t know the precise nature of the command (so that, e.g., the user could monitor or disable their device if they suspect it is being attacked). Similarly, for at least some purposes, understanding even part of a message might be nearly as good as understanding the whole thing. For example, a smartphone user who heard mostly nonsense in an adversarial speech clip — but managed to pick up the words “9-1-1” — might understand what their phone is about to do, even if they couldn’t make out all or even most of the full message (e.g., “OK Google, please dial 911”). This is perhaps especially likely if they have a sense of the adversary’s intentions or probable messages (e.g., the kinds of commands one would give to a phone in the first place). Finally, beyond such applied concerns, discovering that humans could extract these subtler patterns of meaning from adversarial speech would imply greater overlap in “information processing” than previous tests seemed to reveal. But, to our knowledge, none of these possibilities has been explored in any study of adversarial misclassification. Might more sensitive tests show that this is the case?

The present experiments: Sensitive tests of adversarial speech comprehension

Here, we use perhaps the simplest of such sensitive tests — forced-choice classification — to explore how differences between human and machine perception may be overestimated by insensitive tests of human perceptual knowledge. We generate adversarial speech commands using the same method as in (Carlini:vl). But rather than ask subjects to freely transcribe these messages, we probe their understanding by asking them to make discriminations based on the commands. Experiments 1 and 2 ask whether subjects can discriminate adversarial messages that contain English speech from messages that do not contain speech but that are otherwise closely matched on many low-level properties. Experiments 3 and 4 ask whether subjects can supply the last word in a familiar phrase begun in adversarial speech (e.g., “It is no use crying over spilled ___”), even without any advance knowledge of what words will be played. Finally, Experiment 5 asks whether subjects can answer simple math problems posed in adversarial speech (e.g., “six minus two”). To foreshadow the key results, all five experiments show that subjects who cannot easily transcribe such speech clips can demonstrate having comprehended them when tested under more sensitive conditions. Our particular results thus point to a much more general lesson: Sensitive tests can reveal latent or incomplete perceptual knowledge that less sensitive tests can miss, and in ways that directly inform comparisons between human and machine perception.

General Methods

Though the five experiments reported here involve different hypotheses, designs, and research questions, all five proceeded in a very similar way and shared many methodological characteristics. What follows is thus a general methods section applying to all of the experiments below, followed by shorter methods sections for each individual experiment covering those key factors that differed.

Open Science Practices

The hypotheses, experimental designs, analysis plans, exclusion criteria, and sample sizes were determined in advance — and formally pre-registered — for all five experiments. The data, materials, code, and pre-registrations for all experiments reported here are available at https://osf.io/x2ahn/.

Participants

Each experiment recruited 100 subjects (i.e., 500 subjects total across all five experiments) from Amazon Mechanical Turk (https://mturk.com). For discussion of this subject pool’s reliability, see (Crump:2013fn), who use this platform to replicate several classic findings from cognitive and perceptual psychology. All subjects consented to participate in the study and were monetarily compensated for their participation.

All experiments applied exclusion criteria that required successful performance on attention-check trials and tests of the subjects’ audio quality. These criteria varied slightly across experiments, and were always pre-registered. Across all experiments, an average of 77% of subjects passed all exclusion criteria and were thus included in subsequent analyses. However, we note that no result reported here depended in any way on these exclusions; in other words, all of the results reported below remained statistically significant, in the same direction, even without excluding any subjects at all.

Generating Audio Commands

All experiments involved the presentation of adversarial speech. To generate these speech stimuli, we used the white-box method described in (Carlini:vl).[3] We chose the white-box method (so named because it requires knowledge of the speech-recognition system of interest [i.e., not a “black box”]) because it was identified by the original authors as “significantly better than the black-box attack” at fooling human subjects. For this attack (and only for this attack), it was claimed that “0%” of subjects understood the speech clips, and even that subjects “believed our audio was not speech”. To maximize the difficulty of comprehending these hidden audio commands (so that it will be especially clear how sensitive tests can reveal such comprehension), we also chose the version of this attack that does not correct for difficulties being heard “over-the-air”.

[3] We thank Nicholas Carlini for generously sharing the code to create these stimuli.

The attack proceeds by running the output of a text-to-speech system through an “audio mangler”, which removes audio features not used by the speech-recognition system. This erases many features that humans rely on in order to understand an audio clip. Additionally, the audio mangler attempts to render phonemes in as few frames as possible in order to further hinder human understanding.[4] All audio files used here, as well as the code for generating them, are available in our archive of data and materials.

[4] Note that the speech-recognition system attacked by (Carlini:vl), CMU Sphinx (lamere2003cmu), has a non-neural architecture (based on a Hidden Markov Model) and so is in many ways unlike the systems that are much discussed today as sharing deeper aspects of human perception and cognition. But this fact only serves to strengthen any positive results in our experiments: If humans can understand messages produced for a machine-perception system that is not typically thought to share architectural similarities with the human mind and brain, then any resulting overlap between human and machine processing should seem all the more impressive. (Moreover, it is not actually clear that HMMs and other non-neural architectures are necessarily ‘worse’ as models of human perception and cognition; indeed, they were regularly used for just that purpose in a previous generation of computational cognitive science; kaplan2008overview, miller1952finite.)

Experiment 1: Which One is Speech?

Can more sensitive tests reveal deeper human understanding of adversarial speech? Experiment 1 first asked whether forced-choice conditions could allow human subjects to distinguish adversarial speech from closely matched non-speech (even without requiring that subjects report the content of the speech; see Experiments 3–5 for tests of such contentful comprehension). We synthesized several dozen adversarial speech commands that previous work suggested should be “unintelligible to human listeners” and even “believed [to be] not speech” (Carlini:vl), and then played these commands to subjects either forwards or backwards (Figure 2A). We predicted that subjects would hear the forwards-played audio as more speech-like than the backwards-played audio, even though the two kinds of clips were matched on nearly all possible low-level features (since these two trial types involved the very same audio clips — the only difference was whether they were played forwards or backwards). If so, this would suggest that subjects do hear such audio clips as speech after all, in ways that would suggest a greater overlap in how such audio is processed by human listeners and the relevant speech-recognition systems.

Methods

Stimuli

We generated 54 hidden audio commands using the procedure described above. To select the content of the speech commands, we chose common idioms, quotes from history and media, or natural sequences — for example, “laughter is the best medicine”, “we have nothing to fear but fear itself”, and “1 2 3 4 5 6 7 8 9 10 11 12”. (See materials archive for the full list of phrases.) We chose such phrases instead of completely arbitrary collections of words so that we could roughly match the familiarity of the phrases used in (Carlini:vl) — which included, for example, the phrases “take a picture” and “text 12345”. Thus, in both our study and in past studies, the stimuli included words and phrases that a typical listener would have heard before, even though the subjects had no advance knowledge of which particular words would appear. Notably, this is also the case for the likely words that a malicious attacker might transmit to a smartphone or home assistant (e.g., messages involving key words or phrases such as “call”, “browse”, “unlock”, etc., as well as small numbers).

In addition to the 54 audio clips containing these messages, we also generated a corresponding set of 54 audio files that simply played those very same clips in reverse. This process ensured that these two sets of stimuli were minimal pairs, perfectly matched for many low-level auditory characteristics, including average length, frequency, intensity, and so on (as well as the variance in such characteristics). Thus, these pairs differed mostly or only in whether they followed the patterns characteristic of human speech.

Finally, we generated one audio file containing a simple tone, to be used as a “catch” trial to ensure that subjects were engaged in the task and paying attention (see below). This file appeared 5 times in the experiment.

There were thus 109 audio files (54 Speech + 54 Non-Speech + 1 Catch), and 59 trials (54 experimental trials containing one forwards clip and its backwards counterpart, and 5 catch trials). All clips were generated, stored, and played in .wav format.
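For concreteness, the reversal step can be illustrated with the minimal Python sketch below; the file names are hypothetical placeholders, and the actual stimulus-generation code is included in our archive of data and materials.

```python
# Illustrative sketch (hypothetical file names): create a backwards-played counterpart
# of an adversarial speech clip, yielding a non-speech stimulus matched on length,
# frequency content, and intensity.
from scipy.io import wavfile

def reverse_clip(in_path: str, out_path: str) -> None:
    sample_rate, samples = wavfile.read(in_path)         # load the forwards-played clip
    wavfile.write(out_path, sample_rate, samples[::-1])  # write the time-reversed samples

reverse_clip("command_01_forwards.wav", "command_01_backwards.wav")
```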

Procedure

Subjects were told a brief story to motivate the experiment:

A robot has hidden English messages in some of the following audio clips. Can you help us figure out which clips contain the robot’s messages? We know that about half of the messages have hidden English and half of them don’t; we want your help figuring out which ones are English and which are not.

The experiment proceeded in a self-paced manner, with subjects triggering the playing of the clips. On each trial, subjects completed a two-alternative forced-choice task (2AFC). Two embedded audio players appeared onscreen, each loaded with a single clip that was played when the subject hit a “play” button. The two clips were always forwards and backwards versions of the same adversarial audio command (with left-right position on the display always randomized for each trial). After the subject played each clip at least once, they could select whether the left or right clip was the one that contained English speech. Subjects could play each clip additional times if they chose to before responding. After making their selection, the next trial began and proceeded in the same way. The command played on each trial was always randomly chosen (without replacement) from the 59 total trials (54 Experimental and 5 Catch), each of which was played for every subject.

Note that, even though these clips were of fairly well-known phrases, subjects had no advance knowledge of the particular words they should look for in the clips. As has been noted previously (Carlini:vl), adversarial speech can sometimes be easier to decipher when one knows (or is “primed” by) what the message is supposed to be; but, as in previous work, no such knowledge or priming was possible here (beyond knowledge that the message might be drawn from the extremely broad class of messages that includes all vaguely familiar phrases, idioms, and sequences in English).

To ensure that subjects were paying attention and that the audio interface was working properly, subjects were instructed about how to behave on Catch trials: “A few times in the experiment, instead of hearing some sounds from the robot, you will instead hear a simple beep or tone; whenever that happens, make sure to click ‘Right’.” No data from Catch trials was included in our analysis, except as criteria for exclusion. Additionally, before beginning the experiment, a “sound test” was performed in which a single audio clip was played (Bach’s Cello Suite No. 1), and subjects had to say what kind of audio the clip contained (beeping, clapping, conversation, traffic, music, or ocean). Only if subjects selected “music” could they proceed to the experimental trials; any other option ended the experiment without collecting data.

We excluded subjects based on two criteria. First, any subject who failed to provide a complete dataset was excluded from our analysis. Second, any subject who failed to follow instructions on any one of the Catch trials (i.e., who selected “Left” when they should have selected “Right”) was excluded entirely. This was done to ensure that all subjects had read the instructions and were focused on the experiment. These exclusion criteria, along with the rest of the design and analysis plan, were pre-registered.
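In practice, these pre-registered exclusions amount to the following sketch, which assumes a hypothetical trial-level table with subject, trial_type, and correct columns; the actual analysis code is available in our archive.

```python
# Illustrative sketch only (column names are assumptions): drop subjects with an
# incomplete dataset or any incorrect response on a Catch trial.
import pandas as pd

def apply_exclusions(trials: pd.DataFrame, n_trials_expected: int = 59) -> pd.DataFrame:
    def keep_subject(df: pd.DataFrame) -> bool:
        complete = len(df) == n_trials_expected                       # criterion 1: complete dataset
        catch_ok = df.loc[df.trial_type == "catch", "correct"].all()  # criterion 2: all Catch trials correct
        return bool(complete and catch_ok)

    kept = [subject for subject, df in trials.groupby("subject") if keep_subject(df)]
    return trials[trials.subject.isin(kept)]
```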

Readers can experience this task for themselves at https://perceptionresearch.org/adversarialSpeech/E1.

Figure 2: Design and Results of Experiment 1. (A) On each trial, subjects heard two adversarial speech clips, one forwards-played and one backwards-played version of the same adversarial audio command; their task was just to say which contained English speech. (B) Subjects correctly discriminated adversarial speech from these closely matched non-speech clips (leftmost graph, in blue). The vast majority of subjects classified the clips correctly more often than they classified them incorrectly (middle graph, in yellow), and the vast majority of speech clips were classified correctly more often than they were classified incorrectly (rightmost graph, in red). (C) This performance can be visualized as a “beeswarm” plot, where each point represents performance on the discrimination task either for one subject (yellow) or for one audio command (red).

Results and Discussion

Subjects demonstrated an ability to discriminate adversarial speech from closely matched non-speech. Average accuracy on experimental trials was 70.6%, which significantly differed from chance accuracy (50%) (Figure 2B, leftmost graph in blue).[5] Thus, subjects were able to reliably determine whether an adversarial audio clip contained speech.

[5] A direct replication of Experiment 1, with extra reminders for how to behave on catch trials, excluded fewer subjects and produced a similar pattern of results: 63.8% accuracy.

In addition to the pre-registered analysis we report above, we also conducted two exploratory analyses whose logic closely followed previous work (Zhou:2019jxa). Beyond the “raw” accuracy across all subjects and all trials (i.e., the proportion of trials in which subjects correctly classified a clip as Speech or Non-Speech), it may also be informative to know (a) the proportion of subjects who tended to classify such stimuli correctly (collapsing across all speech clips), as well as (b) the proportion of speech clips that tended to be classified correctly (collapsing across all subjects).
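These three measures (raw accuracy, proportion of subjects above chance, and proportion of clips above chance) can be computed along the lines of the sketch below; the column names are assumptions about how the trial data are organized, and the pre-registered analysis scripts are available in our archive.

```python
# Illustrative sketch (column names are assumptions): the raw accuracy and the two
# exploratory summary measures, computed over experimental trials only.
import pandas as pd

def summarize(trials: pd.DataFrame) -> dict:
    by_subject = trials.groupby("subject")["correct"].mean()
    by_clip = trials.groupby("clip")["correct"].mean()
    return {
        "raw_accuracy": trials["correct"].mean(),            # accuracy across all subjects and trials
        "subjects_above_chance": (by_subject > 0.5).mean(),  # proportion of subjects correct more often than not
        "clips_above_chance": (by_clip > 0.5).mean(),        # proportion of clips classified correctly more often than not
    }
```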

In fact, collapsing across all speech clips, 92.9% of subjects showed classification performance that was numerically above chance (Figure 2B, middle graph in yellow). In other words, the vast majority of subjects tended to classify such clips correctly rather than incorrectly, suggesting that the ability to hear adversarial speech as speech was quite widespread. Additionally, collapsing across all subjects, 94.4% of the 54 speech clips were classified correctly more often than they were classified incorrectly (Figure 2B, right graph in red, and Figure 2C).[6]

[6] Note that this doesn’t mean that each of these subjects performed significantly above chance, or that each of these clips was identified as speech significantly above chance — only that 92.9% of subjects got the right answer more often than they got the wrong answer (whereas chance performance would predict that only 50% of subjects would do so), and that 94.4% of speech clips were classified correctly more often than they were classified incorrectly (whereas chance performance would predict that only 50% of clips would be correctly identified in this way). As stated in the main text, mean accuracy across trials was 70.6%.

In other words, whereas previous methods involving speech transcription suggested that “0%” of subjects heard the hidden messages in the files, or that subjects “believed our audio was not speech”, our approach here suggested that a large majority of subjects heard the speech clips as speech more often than not, and that nearly all clips were heard as speech more often than not. These results thus provided initial evidence that a more sensitive approach (here, using forced-choice classification) could reveal perceptual knowledge and abilities that less sensitive tests were unable to detect.

Experiment 2: Speech or Non-Speech?

The previous experiment revealed an ability to discriminate adversarial speech from closely matched non-speech, when subjects were tested under sensitive forced-choice conditions. However, the psychophysically powerful 2AFC design may have given subjects an undue advantage, since success on this task requires only that subjects decide which clip sounds more speechlike, regardless of whether either of the clips actually sounds like speech. In other words, it’s possible that subjects felt that neither clip sounded much like speech at all, but were still able to succeed on the test as long as they could tell which clip better resembled ordinary speech. Though this is, in many ways, the entire purpose of 2AFC task designs, people in the real world (i.e., where such adversarial attacks might actually be deployed) may not be in the position of subjects in Experiment 1. In that case, a natural question is whether subjects could identify adversarial speech as speech under looser conditions in which they must actively label a single clip as speech, rather than compare two clips.

Experiment 2 investigated this question by asking subjects to label single clips as speech or non-speech. The design was extremely similar to Experiment 1, except that instead of 54 experimental trials each containing two speech clips (one forwards and one backwards), there were 108 experimental trials each containing one clip (either the forwards or backwards version of the 54 audio commands). Subjects’ task was now to classify each clip as speech or non-speech, rather than to decide which of two clips was more speechlike. Could subjects succeed even under these conditions? (Readers can experience this task for themselves at https://perceptionresearch.org/adversarialSpeech/E2.)

Results and Discussion

Subjects demonstrated an ability to identify adversarial speech clips as speech. Average accuracy on experimental trials was 62.2%, which significantly differed from chance accuracy (50%). Performance was comparable on forwards trials (63.0%) and backwards trials (61.5%). Thus, subjects were able to reliably determine whether an adversarial audio clip contained speech, by identifying adversarial speech as speech and closely matched non-speech as non-speech.

We also performed the same two exploratory analyses as described in Experiment 1. Collapsing across all speech clips, 91.0% of subjects showed classification performance numerically above chance. And collapsing across all subjects, 84.3% of the clips were classified correctly more often than they were classified incorrectly.

Thus, in addition to being able to tell the difference between adversarial speech and adversarial non-speech, subjects could also identify a given adversarial speech clip as speech.

Experiment 3: Fill in the Blank

The previous experiments suggested that human subjects can identify and discriminate adversarial speech from closely matched non-speech. But this result says little or nothing about subjects’ ability to understand the content of that speech. Experiment 3 thus asked whether forced-choice conditions could allow human subjects to display knowledge of the content of adversarial speech, by asking them to identify the next word in an adversarial phrase.

We took a subset of the speech clips from Experiments 1–2 and simply removed the last word from the phrases (so that, e.g., “It is no use crying over spilled milk” became “It is no use crying over spilled”), and then asked subjects to supply the final word under forced-response conditions. If they can do so, this would suggest that subjects can engage in deeper and more contentful processing of adversarial speech, beyond knowing which clips are speech and which are not.

Methods

Experiment 3 proceeded in the same way as Experiments 1–2, except as noted below.

Stimuli

To generate the stimuli, we analyzed performance on the 54 adversarial speech clips from Experiment 2, and selected the 20 phrases with the highest classification accuracy in that experiment. Using the same procedure described earlier, we generated new adversarial speech clips from these 20 phrases, shortened by one word. For example, “laughter is the best medicine” became “laughter is the best”. (See materials archive for the full list of phrases.) Importantly, all of these missing “last words” (e.g., “medicine” above) were unique to a given adversarial speech clip, and none of these words were spoken in any of the other adversarial speech clips. These clips served as the experimental stimuli.

We also generated 3 files containing non-mangled speech using an online text-to-speech system. These commands contained an uncorrupted human voice reciting portions of the alphabet (e.g., “A, B, C, D, E…”). These were used as catch trials to ensure that subjects were engaged in the task and paying attention.

Procedure

Subjects were told a modified version of the story from Experiments 1–2:

A robot has hidden English messages in some audio transmissions that we’ve recovered. However, these audio clips have been “corrupted”, in two ways. First, most of the transmissions sound very strange and garbled; the robot’s “voice” is very different than a human voice. Second, they’ve all been cut short by at least one word; for example, a message that was supposed to be “O say can you see by the dawn’s early light” might actually come across as “O say can you see by the dawn’s early”.

After the subjects played the clip on a given trial (e.g., “It is no use crying over spilled”), the two buttons that appeared either contained the correct next word in the current phrase (e.g., “milk”), or a word that would complete a different experimental phrase (e.g., “medicine”). The incorrect option was randomly selected from the pool of last words for other phrases in the experiment (without replacement), such that each last word appeared twice in the experiment, once as the correct option and once as the incorrect option. The pairs of options were randomly generated for each subject, as was the order in which the clips were shown. Note that, even though these clips were of fairly famous phrases, subjects had no advance knowledge of the particular kinds of words they should look for in the clips, just as in Experiments 1–2. The only property uniting these phrases was that they were likely to be vaguely familiar (rather than, say, being likely to be about food, sports, or some other theme).
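This foil assignment amounts to a simple derangement of the last words, along the lines of the sketch below; the word list shown is illustrative, and the full materials are in our archive.

```python
# Illustrative sketch: pair each phrase's last word with a foil drawn from the other
# phrases, so that every word appears exactly once as the correct option and once as
# the incorrect option, and never as its own foil.
import random

def assign_foils(last_words: list[str]) -> dict[str, str]:
    while True:
        shuffled = random.sample(last_words, k=len(last_words))
        if all(word != foil for word, foil in zip(last_words, shuffled)):  # reject self-pairings
            return dict(zip(last_words, shuffled))

print(assign_foils(["medicine", "milk", "light", "words"]))  # e.g., {'medicine': 'milk', ...}
```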

As in previous experiments, we excluded any subject who failed to provide a complete dataset, as well as any subject who answered any of the Catch trials incorrectly.

Readers can experience this task for themselves at https://perceptionresearch.org/adversarialSpeech/E3.

Figure 3: Design and Results of Experiment 3. (A) On each trial, subjects heard an adversarial speech clip that was missing its last word; their task was to identify the word that should come next. (B) Subjects correctly identified the next word in adversarial speech clips (leftmost graph, in blue). A majority of subjects identified these last words correctly more often than not (middle graph, in yellow), and the vast majority of speech clips had their last words identified correctly more often than not (rightmost graph, in red). (C) This performance can be visualized as a “beeswarm” plot for subjects (yellow) and clips (red).

Results and Discussion

Subjects demonstrated an ability to identify the next word of a phrase contained in an adversarial speech command. Average accuracy on experimental trials was 67.6%, which significantly differed from chance (50%) (Figure 3B, leftmost graph in blue). Thus, subjects could reliably understand at least some of the content of the adversarial speech commands, and could apply this comprehension to identify the next word of the hidden phrase. Collapsing across all speech clips, 79.3% of subjects showed classification performance numerically above chance (Figure 3B, middle graph in yellow). Thus, most subjects were able to correctly identify the next word in the hidden phrase. Collapsing across all subjects, 19 of the 20 clips were completed correctly more often than they were completed incorrectly (and 1 was completed correctly exactly half of the time).

These results suggest that subjects can not only hear adversarial speech as speech, but can also decipher at least some of the content of such speech. Whereas the task in Experiments 1–2 could have been solved by attending to phonetic or phonological features (such as prosody, intonation, or picking up one or two phonemes), Experiment 3 required some understanding at the level of words. This ability thus goes far beyond what was previously shown in transcription tasks, further showing how sensitive tests can reveal surprisingly robust perceptual knowledge in ways that less sensitive tests cannot.

Experiment 4: Fill in the Blank, After a Single Play

Experiment 3 suggested that subjects can not only hear adversarial speech as speech, but can also comprehend the content of that speech. However, it is perhaps possible that the forced-choice options themselves assisted subjects in parsing the speech, and in a way that could potentially undermine our interpretation of that experiment. When one is given a clue about what to hear in obfuscated speech, it may be easier to hear that speech in line with one’s expectations (suggesting that speech processing is subject to top-down influence; see especially Remez:1981ub, as well as discussion in firestone2016cognition, firestone2016seeing, vinson2016perception). In that case, one could imagine the following sequence of events: First, subjects heard an adversarial speech clip (e.g., “laughter is the best ___”) and initially found it completely incomprehensible; second, they noticed the possible answers (e.g., “milk” vs. “medicine”); third, it occurred to them that not many well-known phrases end in “medicine”, such that one of the possible phrases might be “laughter is the best medicine” (or “a taste of your own medicine”, but perhaps few others); fourth, and finally, they re-played the adversarial speech clip while paying special attention to whether it might be the phrase “laughter is the best”, and indeed were able to hear it that way.

Though this interpretation would still attribute to subjects some ability to comprehend adversarial speech, it is perhaps less impressive than subjects exhibiting this ability even without any such hints. So, to find evidence of this more impressive ability, Experiment 4 repeated Experiment 3 with two small changes. First, the “last word” options on a given trial were revealed only after the adversarial speech clip had completely finished playing, such that every subject first heard the clip without any keywords that might tell them what to attend to. Second, though we still gave subjects the ability to play the clips multiple times, we recorded the number of plays a given subject made on a given trial, and pre-registered a follow-up analysis including only those trials in which the subject chose not to re-play the clip (i.e., those trials on which the first, clue-less play was the only play). If subjects can still identify the missing word in a clip even without any hint in advance about which particular words might be in the clip, and even on only a single clue-less play of the clip, this would be especially compelling evidence that subjects can and do comprehend aspects of the content of adversarial speech.

Readers can experience this task for themselves at https://perceptionresearch.org/adversarialSpeech/E4.

Results and Discussion

In fact, this is exactly what we observed. First, considering all trials (including multi-play trials), subjects anticipated the next word in the phrases with an accuracy rate of 67.8%, which significantly differed from chance (50%); this result replicates Experiment 3. But second, and more tellingly, even on trials in which subjects played the clip only a single time (such that they knew the last-word options only after hearing the adversarial speech clip), subjects retained the ability to anticipate the next word in the adversarial speech phrases, with an accuracy rate on such trials of 69.8%, which also significantly differed from chance (50%). Thus, subjects can comprehend the content of adversarial speech, even without advance knowledge of which words to hear in such clips.

Experiment 5: Math Problems

Even though the results of Experiments 3–4 suggest that human listeners can hear meaning in adversarial speech commands, the task could still have been completed with a very minimal understanding. For example, a subject who was played “It is no use crying over spilled” could fail to comprehend almost the entire message, but still correctly guess “milk” over relevant foils if they just heard one telling keyword (e.g., “crying”). Could subjects instead complete a task that required parsing an entire message spoken in adversarial speech?

Experiment 5 asked whether subjects could correctly answer simple arithmetic problems that are contained in adversarial speech commands. We generated 20 audio commands containing these arithmetic problems (e.g., “six minus two”), and then asked subjects to select the correct answer from two options. On one hand, this approach constrains the space of possible messages from the extremely broad space explored earlier (i.e., the space of words that appear in familiar phrases) to a more limited space involving permutations of the numbers 0 through 9 and the operations of addition and subtraction. On the other hand, this task remains especially challenging, because mis-hearing even one word of the problem would make it difficult or impossible to answer the question correctly. So if subjects can succeed at this task, this would suggest that they can essentially understand every word of an adversarial speech command when tested under more constrained conditions.

Methods

Stimuli

We generated 20 arithmetic problems using the procedure described above. The problems contained two digits (from 0-9) and one instance of either the addition or subtraction operation. The problems were generated so that each digit (0-9) was the correct solution to two problems: one addition problem and one subtraction problem.
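One simple way to construct such a set of problems is sketched below; the sampling scheme is illustrative rather than a description of our exact generation script, which is available in our archive.

```python
# Illustrative sketch: build 20 spoken arithmetic problems with single-digit operands,
# such that each digit 0-9 is the answer to exactly one addition and one subtraction problem.
import random

def build_problems(seed: int = 0) -> list[tuple[str, int]]:
    rng = random.Random(seed)
    problems = []
    for answer in range(10):
        a = rng.randint(0, answer)          # addition: a + (answer - a) = answer, both operands 0-9
        problems.append((f"{a} plus {answer - a}", answer))
        b = rng.randint(0, 9 - answer)      # subtraction: (answer + b) - b = answer, both operands 0-9
        problems.append((f"{answer + b} minus {b}", answer))
    return problems                         # 20 (spoken problem, correct answer) pairs
```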

As in Experiments 3–4, 3 files contained arithmetic problems spoken by an uncorrupted human voice; these served as catch trials to ensure that subjects were engaged in the task and paying attention.

Procedure

Subjects were told a modified version of the story from previous experiments:

A robot has hidden simple math problems in some audio transmissions that we’ve recovered. However, these audio clips have been “corrupted”. Most of the transmissions sound very strange and garbled; the robot’s “voice” is very different than a human voice.

We want you to help us by solving the math problems. On each trial, you will listen to a short audio clip. These clips contain simple addition or subtraction problems, containing two numbers from zero to nine. After you play the clip, you will be presented with two possible answers to the problem. Your job is just to select the correct answer. For example, the robot voice might say “5 plus 3”. If that happens, you should select “8”.

This experiment proceeded in a very similar way to Experiments 3–4. After the subject played the clip, one button showed the correct answer, and one showed an incorrect answer. These pairs of options were randomly generated for each subject, as was the order in which the clips were shown.

Experiment 5 also contained Catch trials of the same form as Experiments 3–4, and used the same exclusion criteria.

Readers can experience this task for themselves at https://perceptionresearch.org/adversarialSpeech/E5.

Figure 4: Design and Results of Experiment 5. (A) On each trial, subjects heard an adversarial speech clip expressing a simple arithmetic problem; their task was to supply the answer. (B) Subjects correctly answered the adversarial arithmetic problems (leftmost graph, in blue). A majority of subjects gave more correct answers than incorrect answers, and the vast majority of problems were answered correctly more often than incorrectly (rightmost graph, in red). (C) This performance can be visualized as a “beeswarm” plot for subjects (yellow) and clips (red).

Results and Discussion

Subjects demonstrated an ability to correctly answer arithmetic problems posed in adversarial speech. Average accuracy on experimental trials was 64.6%, which significantly differed from chance (50%) (Figure 4B, leftmost graph in blue). Thus, subjects could reliably comprehend most or all of the content of these audio commands, since such knowledge was necessary to answer the problems correctly.

Moreover, collapsing across all speech clips, 81.9% of subjects showed classification performance numerically above chance (Figure 4B, middle graph in yellow). Collapsing across all subjects, 18 out of 20 arithmetic problems were answered correctly more often than they were answered incorrectly (and 1 was answered correctly exactly half of the time) (Figure 4B, right graph in red).

Whereas Experiments 3–4 could have been solved by understanding just one or two salient words, these results show that subjects displayed the ability to correctly decipher most or all of the adversarial speech commands. Noticing the word “spilled” could cue a subject to select “milk” as opposed to “time” when played the phrase “It is no use crying over spilled”. However, here, every word must be understood in order to solve these arithmetic problems, since even a single misheard word would make it nearly impossible to answer correctly. Thus, subjects can demonstrate an ability to understand whole phrases of adversarial speech, when tested in more sensitive ways.

One potential concern about this result is that it may not, after all, demonstrate comprehension of full speech commands, because subjects who heard only two words (e.g., “7 plus ***”) or even just the operation itself (e.g., “*** minus ***”) could still use this knowledge to perform above chance, even without comprehending the whole arithmetic problem. For example, consider a trial in which “9 minus 5” was played, but you heard only “*** minus ***”; if you then noticed that the two options for that trial were “9” and “4”, you might make the educated guess that “4” is more likely than “9” to be the correct answer, because there is only a single problem that could have the word “minus” in it while still having the answer “9” (i.e., the problem “9 minus 0”), whereas many more problems containing “minus” could have the answer “4” (including, e.g., “9 minus 5”, “8 minus 4”, “7 minus 3”, and so on). Since we have shown only that subjects perform above chance (and not, e.g., that they perform perfectly), it is possible that this above-chance performance merely reflects strategic responding based on partial knowledge, rather than complete understanding of adversarial speech commands.

However, this concern can be overcome by examining only those trials in which the correct answer was the less probable one given the logic above. For example, suppose “9 minus 0” was played, and the options were again “9” and “4”. If subjects on such trials correctly answer “9” at rates above chance, then that above-chance performance could not be explained by strategic responding after hearing “*** minus ***” (or even “9 minus ***”), since the optimal strategy in such cases of partial hearing would be not to answer “9”. More generally, consider all trials in which either (a) the problem included “plus” and the correct answer was the lesser of the two options, or (b) the problem included “minus” and the correct answer was the greater of the two options. These trials are ones in which hearing only “plus” or “minus” would lead you away from the correct answer — and so above-chance performance on such trials couldn’t be explained by such strategic responding.
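Concretely, this control analysis can be sketched as follows, assuming a hypothetical trial-level table containing the problem text, the two response options, and whether the subject answered correctly; the actual analysis code is in our archive.

```python
# Illustrative sketch (column names are assumptions): restrict to trials where an
# operation-only heuristic would point away from the correct answer, then compute accuracy.
import pandas as pd

def heuristic_inconsistent_accuracy(trials: pd.DataFrame) -> float:
    is_plus = trials["problem_text"].str.contains("plus")
    flagged = (
        (is_plus & (trials["correct_answer"] < trials["other_option"])) |   # "plus" but correct answer is the lesser option
        (~is_plus & (trials["correct_answer"] > trials["other_option"]))    # "minus" but correct answer is the greater option
    )
    return trials.loc[flagged, "correct"].mean()
```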

In fact, when we carried out this analysis, it revealed that even on those trials in which strategic responding based on partial knowledge would produce the incorrect answer, subjects still performed far above chance, at 67.8%. This analysis provides especially strong evidence that successful performance on math problems posed in adversarial speech goes beyond mere strategic responding, and implies a fuller and more complete understanding of the messages contained in such audio signals.

General Discussion

What does it take to demonstrate that a human does not perceive a stimulus in a way that a machine does? Whereas previous work had identified a class of stimuli that machines comprehend but humans reportedly do not, here we showed that human subjects could display quite reliable and sophisticated understanding when tested in more sensitive ways. By taking adversarial speech as a case study, we showed that even when humans could not easily transcribe an adversarial speech stimulus (the previous benchmark for human understanding), they nevertheless could discriminate adversarial speech from closely matched non-speech (Experiments 1–2), finish common phrases started in adversarial speech (Experiments 3–4), and solve simple math problems posed in adversarial speech (Experiment 5) — even though such stimuli have been previously described as “unintelligible to human listeners”. Collectively, these results show how sensitive tests can reveal perceptual and cognitive abilities that were previously obfuscated by relatively “insensitive” tests, and in ways that directly inform comparisons between human and machine perception.

Increasing sensitivity

In Experiments 1–2, subjects reliably discriminated adversarial speech clips from the same clips played backwards. Whereas previous work suggested that subjects “believed our audio was not speech”, our tests showed that subjects not only can identify such clips as speech, but can do so even when compared to audio signals that are matched on relevant low-level properties (since they were just the very same clips played in reverse). At the very least, these experiments suggest that subjects can attend to the phonetic or phonological cues that minimally distinguish speech from non-speech, even when those cues are obscured by the adversarial-speech-generation process.

Experiments 3–4 showed that subjects can not only hear adversarial speech as speech, but can also comprehend the content of that speech. Whereas a subject could succeed in Experiments 1–2 simply by picking up patterns of prosody or segmentation, Experiments 3–4 asked subjects to fill in the last word of a phrase, and so required at least some comprehension of the adversarial speech clip. Moreover, this result couldn’t be explained by straightforward forms of “priming”. For example, Carlini et al. (Carlini:vl) rightly note that, if one is told in advance what to hear in an adversarial speech clip, it is surprisingly easy to hear that message. But in our Experiments 3–4, the subjects were given no advance knowledge about which words they would encounter. Indeed, even though subjects knew that the phrases would likely be familiar, simply being told that one will hear a familiar phrase says little about which words will appear in that phrase. Instead, it tells you only that, if you can make out one or two salient words (e.g., “picture” and “thousand”), you might be likely to correctly guess what will come next (e.g., “words”). But even this situation first requires understanding some aspects of the initially played speech clip, and so demonstrates a kind of comprehension that previous studies failed to reveal.

Finally, Experiment 5 created conditions that demonstrate complete (or nearly complete) comprehension of adversarial speech, by playing subjects spoken arithmetic problems and asking them to select the answer. For these problems (e.g., “six minus two”), every word of the problem was crucial to completing the task, since even a single misheard word would undermine one’s ability to answer it. In a sense, this task is nearly equivalent to the free transcription task from previous work (since it required near-perfect comprehension to succeed), except that this experiment involved a more constrained space of word possibilities (since the subjects knew to expect the numbers 0 through 9, and the words “plus” and “minus”). Nevertheless, there was still considerable uncertainty for any message: Even under these constraints, there were 110 possible problems that could have appeared on any given trial, such that subjects did not have a straightforward or trivial way to know what they were “supposed to hear” in the message.
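The figure of 110 follows from restricting both operands and answers to the digits 0 through 9, as the stimulus construction implies; a quick enumeration under that assumption:

```python
# Count the possible problems, assuming operands and answers are all digits 0-9.
additions = [(a, b) for a in range(10) for b in range(10) if a + b <= 9]     # 55 problems
subtractions = [(a, b) for a in range(10) for b in range(10) if a - b >= 0]  # 55 problems
print(len(additions) + len(subtractions))  # 110
```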

Psychophysically inspired

Our approach here was motivated by a central insight from the human psychophysics literature: simply asking people to describe what they see, hear, feel, or know can lead researchers to severely underestimate the nature and extent of such knowledge. Of the many reasons for this, one is that subjects may adopt conservative or idiosyncratic criteria when generating such explicit reports. We perceptually experience more than we can hope to include in any report of such experiences, and so when faced with very unconstrained tasks (such as freely describing what we hear in an audio clip), we must make choices about which aspects of our experience to describe, and in how much detail. These choices will be influenced by a host of factors, including which aspects of our experience stand out as remarkable, which we think the experimenter wants to hear about, and even just how engaged and motivated we are by the task. By contrast, more constrained tasks of the sort we explore here can zero in on exactly those aspects of the subject’s knowledge that we are interested in, and in ways that require the subject to make very few (if any) choices about the relevance of various kinds of information. (Indeed, similar considerations may even apply to tests of machine perceptual knowledge itself; for relevant work on this issue, see Zoran_2015_ICCV, Ritter:2017uk.)

In fact, strictly speaking, what we have explored here is not literally a task that is somehow more sensitive, but rather a task that renders adequate a measure that was previously inadequate. Whereas previous work adopted “% correct” as the standard for comprehension of adversarial speech, this measure turns out to have been inadequate to reveal such comprehension when applied to a free transcription task. Perhaps some other measure (e.g., more advanced text mining) could, when applied to free transcription, capture the higher levels of comprehension that we observe here. But for our two-alternative forced-choice and forced-response tasks, the “% correct” measure was indeed adequate, because those tasks constrained subjects’ responses to the variables and dimensions of interest, thereby revealing knowledge that was hidden by previous approaches.
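For readers who prefer an explicit sensitivity index over raw “% correct”, one standard option from signal detection theory (offered here as an illustration, not as an analysis reported in this paper) is to convert two-alternative forced-choice accuracy into d′ via the inverse normal CDF.

```python
# Illustration (not an analysis from this paper): converting 2AFC proportion
# correct into the signal-detection sensitivity index d' = sqrt(2) * Phi^{-1}(PC).
from math import sqrt
from scipy.stats import norm

def two_afc_dprime(prop_correct: float) -> float:
    """Sensitivity (d') implied by a given 2AFC proportion correct."""
    return sqrt(2) * norm.ppf(prop_correct)

print(two_afc_dprime(0.66))  # hypothetical accuracy; chance (0.5) gives d' = 0
```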

Consequences for human-machine comparisons

Increased understanding of the sort revealed here matters for at least two kinds of reasons:

First, and more practically, a major source of interest in adversarial attacks derives from the security concerns they raise about the vulnerability of various machine-learning applications. For example, adversarial images could be used to fool autonomous vehicles into misreading street signs, and adversarial speech could be used to attack smartphones and home assistants without a human supervisor knowing. But the present results suggest that, at least in some circumstances, humans may be more aware of such attacks than was previously evident. Indeed, at least for the adversarial speech attack explored here, our Experiments 1–2 suggest that humans may well know that they’re being attacked, even without knowing the precise way in which they’re being attacked. And our Experiments 3–5 suggest that they may even have more sophisticated knowledge of the content of such attacks. Indeed, the constraint in Experiment 5, involving numbers (“0”, “1”, “2”, …, “9”) and a few keywords (“plus”, “minus”), is closely analogous to certain kinds of attacks that a malicious actor might actually deliver to a phone (e.g., “dial 911”).

But second, and more theoretically, human understanding of stimuli that fool machines is of interest to work in cognitive science (including both psychology and artificial intelligence) that explores similarities and differences between human perception and the analogous processes in various machine-learning systems. Adversarial attacks in particular are frequently invoked as evidence of “an astonishing difference in the information processing of humans and machines” (brendel2020adversarial; see also HendrycksG16b, Sabour:2015vd, Serre:2019cm, buckner2019comparative, ilyas2019adversarial, Zhou:2019jxa). But while it seems clear that humans and machines don’t perceive such stimuli identically, it is still of interest to know just how similar or different their perception of such stimuli is. As we show here, the answer to this question can be surprisingly subtle: Humans who cannot freely report the content of such stimuli may nevertheless decipher them under certain conditions, in ways directly relevant to claims about overlapping or non-overlapping “information processing” across such systems.

Of course, none of the present experiments imply that all adversarial attacks will be comprehensible to humans in this way (even though there are indeed other audio adversarial attacks that subjectively sound speechlike; Abdullah:2019uh). For example, some audio adversarial attacks involve ultrasonic frequencies beyond the range of human hearing (Zhang:2017kl); these attacks will certainly be incomprehensible to people. At the same time, that very fact can make them “weaker” or less threatening as attacks. For example, the developers of this attack note that an automated speech-recognition system could “defend” against it by restricting its processing to the frequency bands of human hearing (e.g., by modifying a microphone to “suppress any acoustic signals whose frequencies are in the ultrasound range”). Relatedly, it has also been shown that adversarial speech can be embedded in clips of normal human speech (carlini:v2, Qin:2019vz). These attacks seem particularly difficult to decipher, but at the same time the original authors note here too that they are audible if you “listen closely” (carlini:online), or more generally that they “can still be differentiated from the clean audio”, in a way that could still make them detectable as attacks. All these cases, then, show just how much nuance is required to make valid comparisons across human and machine perception.
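As a concrete illustration of the kind of band-limiting defense described above (our own sketch, not the cited authors’ implementation), one could low-pass filter incoming audio so that content above the range of human hearing is suppressed before it reaches the recognizer; the sample rate and cutoff below are hypothetical.

```python
# Sketch of a band-limiting "defense" against ultrasonic attacks: suppress
# energy above the range of human hearing before speech recognition.
# Sample rate and cutoff are hypothetical; this is not the cited authors' code.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 96_000         # hypothetical sample rate of an ultrasound-capable microphone
cutoff_hz = 20_000  # approximate upper limit of human hearing

sos = butter(8, cutoff_hz, btype="lowpass", fs=fs, output="sos")

def band_limit(audio: np.ndarray) -> np.ndarray:
    """Remove ultrasonic content so only audible frequencies reach the recognizer."""
    return sosfiltfilt(sos, audio)

# Example: filter one second of (random) audio.
audio = np.random.randn(fs)
filtered = band_limit(audio)
```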

Broader lessons

Though the present experiments explore sensitive comparisons in a case study involving adversarial speech, this approach could apply much more broadly to nearly any comparison of human and machine perception. Indeed, even recent work that does not use the language of “sensitive tests” may still be considered within this framework. For example, it had previously been claimed that it is possible to generate bizarre visual images that machines recognize as familiar objects but which are “totally unrecognizable to human eyes” (Nguyen:2015cv). However, follow-up work using forced-choice responding showed that human observers can actually anticipate machine classifications when given relevant alternatives to choose from (Zhou:2019jxa).

More generally, the approach we advocate here — involving the use of sensitive comparisons — could apply to much broader questions about the overlap of human and machine processing. Even beyond adversarial misclassification, recent work has shown that natural language inference systems often rely on surface-level heuristics to make judgments about whether one sentence logically entails another (mccoy2019right). That work also includes a human-machine comparison, concluding that “human errors are unlikely to be driven by the heuristics targeted”. This comparison could also benefit from a more sensitive test, such as one that eliminates the ability to reread the sentences, or perhaps introduces a time constraint. Under such strict constraints, it is possible that human judgments would be more informed by surface-level heuristics, revealing that some aspects of human cognition are reflected in the mistakes of natural language inference systems.

Another human-machine difference that could benefit from more sensitive tests is the finding that deep convolutional neural networks tend to classify images based on texture rather than shape (Baker:2018jp), whereas human subjects tend to classify based on shape rather than texture (Landau:1988gs). But the human subjects in (Baker:2018jp) were almost completely unconstrained in their testing conditions, being able to view the images and consider their labels for as long as they liked. Perhaps forcing subjects to classify quickly, and after only a brief presentation, could bring the human and machine classification judgments into better alignment.

Whereas the present work has used sensitive tests to reveal machine-like capabilities on a difficult task, these two examples demonstrate ways in which sensitive testing can be used to reveal machine-like deficiencies on fairly straightforward tasks.

Finally, even though the example we explore here shows how sensitive tests can reveal similarities where there previously seemed to be dissimilarities, the opposite pattern of results is possible as well. First, for some future class of stimuli, sensitive tests could well reveal that humans cannot comprehend, perceive, or process them the way a machine does. In that case, researchers could become especially confident in a given human-machine difference that survives even a sensitive test. Second, sensitive tests could also isolate very specific differences in how a human and a machine classify a stimulus. For example, if a human views the “baseball” image from (Nguyen:2015cv) and consistently prefers a specific alternative label (e.g., “chainlink fence”), this would suggest all the more strongly that the two systems represent this image differently.

In conclusion

People experience more than they freely report. Though this is a familiar and often-studied problem in cognitive psychology and perception research, it is also relevant to research comparing human perception and cognition to the analogous processes in machines. Here, we have shown how lessons from human perception research can directly inform and advance such comparisons, including in ways that reveal latent or implicit knowledge that was not evident from initial (and perhaps insensitive) comparisons. We thus advocate the adoption of more sensitive tests of human and machine perception, so that we can better explore when humans and machines do — or don’t — perceive the world the same way.

Acknowledgments

For helpful discussion and/or comments on previous drafts, we thank Tom McCoy, Ian Phillips, and members of the JHU Perception & Mind Lab. For assistance with beeswarm plots, we thank Stefan Uddenberg. For resources relating to the production of adversarial speech commands, we thank Nicholas Carlini. This work was supported by a JHU ASPIRE Grant (M.L.) and the JHU Science of Learning Institute (C.F.).

Author Contributions

M.L. and C.F. designed the experiments and wrote the paper. M.L. ran the experiments and analyzed the data, under the supervision of C.F.

Data Availability

The data, materials, code, and pre-registrations supporting all of the above experiments are available at https://osf.io/x2ahn/.

References