Pythia: Grammar-Based Fuzzing of REST APIs with Coverage-guided Feedback and Learning-based Mutations

05/23/2020 · by Vaggelis Atlidakis, et al.

This paper introduces Pythia, the first fuzzer that augments grammar-based fuzzing with coverage-guided feedback and a learning-based mutation strategy for stateful REST API fuzzing. Pythia uses a statistical model to learn common usage patterns of a target REST API from structurally valid seed inputs. It then generates learning-based mutations by injecting a small amount of noise deviating from common usage patterns while still maintaining syntactic validity. Pythia's mutation strategy helps generate grammatically valid test cases, and coverage-guided feedback helps prioritize the test cases that are more likely to find bugs. We present an experimental evaluation on three production-scale, open-source cloud services showing that Pythia outperforms prior approaches both in code coverage and in new bugs found. Using Pythia, we found 29 new bugs, which we are in the process of reporting to the respective service owners.




I Introduction

Fuzzing [fuzzing-book] is a popular approach to find bugs in software. It involves generating new test inputs and feeding them to a target application which is continuously monitored for errors. Due to its simplicity, fuzzing has been widely adopted and has found numerous security and reliability bugs in many real-world applications. At a high level, there are three main approaches to fuzzing [God20]: blackbox random fuzzing, grammar-based fuzzing, and whitebox fuzzing.

Blackbox random fuzzing simply randomly mutates well-formed program inputs and then runs the program with those mutated inputs in the hope of triggering bugs. This process can be guided by code-coverage feedback, which favors the mutations of test inputs that exercise new program statements [AFL]. Whitebox fuzzing [SAGE] can further improve test-generation precision by leveraging more sophisticated techniques like dynamic symbolic execution, constraint generation, and constraint solving, but at a higher engineering cost. All these blackbox, greybox, or whitebox mutation-based fuzzing techniques work well when fuzzing applications with relatively simple binary input formats, such as audio, image, or video processing applications [jpeg, mp3, mp4], ELF parsers [elf], and other binary utilities [binutils].

However, when fuzzing applications with complex structured non-binary input formats, such as XML parsers [xml], language compilers or interpreters [clang, gcc, python], and cloud service APIs [azure-apis], the effectiveness of these techniques is typically limited, and grammar-based fuzzing is then a better alternative. With this approach, the user provides an input grammar specifying the input format, and may also specify what input parts are to be fuzzed and how [Peach, SPIKE, boofuzz, burp]. From such an input grammar, a grammar-based fuzzer then generates many new inputs, each satisfying the constraints encoded by the grammar. Such new inputs can reach deeper application states and find bugs beyond syntactic lexers and semantic checkers.

Grammar-based fuzzing has recently been automated in the domain of REST APIs by RESTler [restler]. Most production-scale cloud services are programmatically accessed through REST APIs that are documented using API specifications, such as OpenAPI [swagger]. Given such a REST API specification, RESTler automatically generates a fuzzing grammar for REST API testing. RESTler performs a lightweight static analysis of the API specification in order to infer dependencies among request types, and then automatically generates an input grammar that encodes sequences of requests (instead of single requests) in order to exercise the service behind the API more deeply, in a stateful manner. However, the generated grammar rules usually include few values for each primitive type, like strings and numeric values, in order to limit an inevitable combinatorial explosion in the number of possible fuzzing rules and values. These primitive-type values are either obtained from the API specification itself or from a user-defined dictionary of values. All these values remain static over time, and are not prioritized in any way. These limitations (fuzzing rules with predefined sets of values and lack of feedback) are typical in grammar-based fuzzing in general, beyond REST API fuzzing.

To address these limitations, we introduce Pythia (named after the ancient Greek priestess, commonly known as the Oracle of Delphi, who served as an oracle and was credited with various prophecies), a new fuzzer that augments grammar-based fuzzing with coverage-guided feedback and a learning-based mutation strategy for stateful REST API fuzzing. Pythia's mutation strategy helps generate grammatically valid test cases, and coverage-guided feedback helps prioritize the test cases that are more likely to find bugs. This paper makes the following contributions:

  • We introduce Pythia, a new fuzzer that augments grammar-based fuzzing with coverage-guided feedback.

  • We implement a learning-based mutation strategy for stateful REST API fuzzing.

  • We present experimental evidence showing that by combining its learning-based mutation strategy and coverage-guided feedback, Pythia significantly outperforms prior approaches.

  • We use Pythia to test three production-scale, open-source cloud services (namely GitLab, Mastodon, and Spree) with REST APIs specifying more than 200 request types.

  • We discover new bugs in all three services tested so far. In total, we found 29 new bugs and we discuss several of these.

The rest of the paper is organized as follows. Section II presents background information on REST API fuzzing and the motivation for this work. Section III presents the design of Pythia. Sections IV and V present the experimental setup and results on three production-scale, open-source cloud services. Section VI discusses new bugs found by Pythia. Sections VII and IX discuss related work and conclusions.

II Background and Motivation

This paper aims at testing cloud services accessible through REpresentational State Transfer (REST) Application Programming Interfaces (APIs) [rest]. A REST API is a finite set of requests, where a request is a tuple $(t, p, h, b)$, as shown below.

Field Description
Request Type (t) One of POST (create), PUT (create or update), GET (read), DELETE (delete), and PATCH (update).
Resource Path (p) A string identifying a cloud resource and its parent hierarchy with the respective resource types and their names.
Header (h) Auxiliary information about the requested entity.
Body (b) Optional dictionary of data for the request to be executed successfully.

Consecutive REST API requests often have inter-dependencies w.r.t. some resources. For example, a request whose execution creates a new resource of type $R$ is called a producer of $R$, and a request which requires $R$ in its path or body is called a consumer of $R$. A producer-consumer relationship between two requests is called a dependency. The goal of our fuzzer, which is a client program, is to test a target service through its APIs. The fuzzer automatically generates and executes (i.e., sends) various API requests with the hope of triggering unexpected, erroneous behaviours. We use the term test case to refer to a sequence of API requests and the respective responses.

Example REST API test case and detected bug. Figure 1 shows a sample Pythia test case for GitLab [gitlab], an open-source cloud service for self-hosted repository management. The test case contains three request-response pairs and exercises functionality related to version control commit operations. The first request is a POST that creates a new GitLab project. It has a path without any resources and a body with a dictionary containing a non-optional parameter specifying the desired name of the requested project (“21a8fa”). In response, it receives back metadata describing the newly created project, including its unique id. The second request, also of type POST, creates a repository branch in an existing project. It has a path specifying the previously created resource of type “project” with id “1243”, and a body with a parameter specifying the branch name (“feature1”), so that the branch is created within the previously created project. In response, it receives back a dictionary of metadata describing the newly created branch, including its designated name. Finally, the last request uses the latest branch (in its path) as well as the unique project id (in its body) and attempts to create a new commit. The body of this request contains a set of parameters specifying the name of the existing target branch, the desired commit message (“testString”), and the actions related to the new commit (i.e., creation of a file). However, the relative path of the target file contains an unexpected value “admin\”, which triggers a 500 Internal Server Error because an unhandled unicode character trips the Ruby library trying to detokenize and parse the relative file path. We treat “500 Internal Server Errors” as bugs.
To generate new similar test cases with unexpected values, one has to decide which requests of a test case to mutate, what parts of that request to mutate, and what new values to inject in those parts.

Fig. 1: Pythia test case and bug found. The test case is a sequence of three API requests testing commit operations on GitLab. After creating a new project (first request) and a new branch (second request), issuing a commit with an invalid file path triggers an unhandled exception.
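The three dependent requests of Figure 1 can be sketched as plain (method, path, body) tuples. This is a minimal illustration, not Pythia's internals: the endpoint paths and parameter names follow GitLab's public REST API, while the concrete ids and values are taken from the example.

```python
# Sketch of the Figure 1 test case as a sequence of (method, path, body)
# tuples. The resource ids flowing between requests illustrate the
# producer-consumer dependencies described in the text.

def build_test_case(project_name, project_id, branch, file_path):
    """Return the three dependent requests of the commit test case."""
    return [
        # Producer: creates a project; its id is consumed below.
        ("POST", "/api/v4/projects", {"name": project_name}),
        # Consumer of the project id, producer of the branch.
        ("POST", f"/api/v4/projects/{project_id}/repository/branches",
         {"branch": branch, "ref": "master"}),
        # Consumer of both the project id and the branch name.
        ("POST", f"/api/v4/projects/{project_id}/repository/commits",
         {"branch": branch, "commit_message": "testString",
          "actions": [{"action": "create", "file_path": file_path,
                       "content": ""}]}),
    ]

test_case = build_test_case("21a8fa", 1243, "feature1", "admin\\")
```

Mutating the `file_path` argument, as in the example, is what triggers the 500 Internal Server Error.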

Complexity of REST API testing. The example of Figure 1 shows the sequence of events that need to take place before uncovering an error. It highlights the complexity of REST API testing due to the highly-structured, typed format of each API request and because of producer-consumer dependencies between API requests. For example, the second request in Figure 1 must include a structured body payload and also properly use the project id “1243” created by the first request. Similarly, the third request must include a body payload and properly use resources produced by the two preceding requests (one in its path and one in its body). Syntactic and semantic validity must be preserved within and across requests of a REST API test case. Each test case is a stateful sequence of requests, since resources produced by preceding requests may be used by subsequent requests.

Fig. 2: Pythia architecture.

Existing stateful REST API fuzzing. Stateful REST API fuzzing, introduced by RESTler [restler], is a grammar-based fuzzing approach that statically analyzes the documentation of a REST API (given in an API specification language, such as OpenAPI [swagger]) and generates a fuzzing grammar for testing a target service through its REST API. A RESTler fuzzing grammar contains rules describing (i) how to fuzz each individual API request; (ii) what the dependencies are across API requests and how they can be combined in order to produce longer and longer test cases; and (iii) how to parse each response and retrieve ids of resources created by preceding requests in order to make them available to subsequent requests. During fuzzing, each request is executed with various value combinations depending on its primitive types, and the values available for each primitive type are specified in a user-provided fuzzing dictionary. In the example of Figure 1, the value of the field “action” in the body of the last request will be one of “create”, “delete”, “move”, “update”, and “chmod” (i.e., the available mutations for this enum type), and the value of the field “commit_message” will be one of “testString” or “nil” (the default available mutations for string types). By contrast, the value of the field “branch,” which is a producer-consumer dependency, will always have the value “feature1” created by the previous request. Thus, the set of grammar rules driving stateful REST API fuzzing leads to syntactically and semantically valid mutations.

However, RESTler, and more broadly this type of grammar-based fuzzing, has two main limitations. First, the available mutation values per primitive type are limited to a small number in order to limit an inevitable combinatorial explosion in the number of possible fuzzing rules and values. Second, these static values remain constant over time and are not prioritized in any way.

Our contribution. To address the above limitations, in the next section, we introduce Pythia, a new fuzzer that augments grammar-based fuzzing with coverage-guided feedback and a learning-based mutation strategy for stateful REST API fuzzing. Pythia’s mutation fuzzing strategy generates many new grammatically-valid test cases, while coverage-guided feedback is used to prioritize test cases that are more likely to find new bugs.

III Pythia

 $S = sequence$
 $\Sigma = \Sigma_{http-methods} \cup ~\Sigma_{resource-ids} \cup ~\Sigma_{enum}$
  $\cup ~\Sigma_{bool} \cup ~\Sigma_{string} \cup ~\Sigma_{int} \cup ~\Sigma_{static} $
 $N = \{request,~ method,~ path,~ header,~ body,~ \beta_1,~\beta_2,~ \beta_3,$
  $~ producer,~ consumer, ~ fuzzable,~ enum,$
  $~ bool,~ string,~ int, ~ static\}$
 $R = \{sequence \rightarrow  request + sequence ~ | ~ \varepsilon $,
  $ request \rightarrow method + path + header + body$,
  $ method \rightarrow \Sigma_{http-methods}$ , $path \rightarrow  \beta_1 + path ~ | ~ \varepsilon$,
  $ header \rightarrow  \beta_1 + header ~ | ~ \varepsilon$, $ body \rightarrow  \beta_1 + body ~ | ~ \varepsilon$,
  $ \beta_1 \rightarrow  \beta_2 ~ | ~ \beta_3$, $ \beta_2 \rightarrow  producer ~ | ~ consumer$,
  $ producer \rightarrow  \Sigma_{resource-ids}$, $ consumer \rightarrow  \Sigma_{resource-ids}$,
  $ \beta_3 \rightarrow  static ~ | ~ fuzzable$, $ static \rightarrow \Sigma_{static} $,
  $ fuzzable \rightarrow string ~ | ~ int ~ | ~ bool ~ | ~ enum ~ | ~ uuid  $,
  $ string \rightarrow \Sigma_{string}, \dots \}$
Fig. 3: Regular Grammar (RG) with tail recursion for REST API test case generation. The production rules $R$ with non-terminal symbols $N$ capture the properties of any REST API specification, while the alphabet $\Sigma$ of terminal symbols is API-specific, since different APIs may contain different values for strings, integers, enums, and so on.
Fig. 4: RESTler seed test case & Pythia parse tree following the grammar of Figure 3

Pythia is a grammar-based fuzzing engine to fuzz cloud services through their REST APIs. Since these APIs are highly structured (see Section II), generating meaningful test cases (a.k.a. mutants) is a non-trivial task—the mutants should be structurally valid to bypass initial syntactic checks, yet must contain some erroneous inputs to trigger unhandled HTTP errors. Randomly mutating seed inputs often results in invalid structures, as we will see in Section IV-C2. One potential solution could be to sample the mutants from the large space of structurally valid inputs and inject errors into them. However, for complex grammars, like those defined for REST APIs, exhaustively enumerating all the valid structures is infeasible. As a workaround, Pythia first uses a statistical model to learn the common usage of a REST API from seed inputs, which are all structurally valid. It then injects a small amount of random noise to deviate from common usage patterns while still maintaining syntactic validity.

Figure 2 presents a high-level overview of Pythia. It operates in three phases: parsing, learning-based mutation, and execution. First, the parsing phase (Section III-A) parses the input test cases using a regular grammar and outputs the corresponding abstract syntax trees (ASTs). Input test cases can be generated either by using RESTler to fuzz the target service or by using actual production traffic of the target service. The next phase, learning-based mutation (Section III-B), operates on these ASTs. Here, Pythia trains a sequence-to-sequence (seq2seq) autoencoder [seq2seq1, seq2seq2] in order to learn the common structure of the seed test cases. This includes the structure of API requests (i.e., primitive types and values) and the dependencies across requests for a given test case. The mutation engine then mutates the seed test cases such that the mutations deviate from the common usage, yet obey the structural dependencies. The mutated test cases are then executed by the execution engine. A coverage monitor tracks the test case executions on the target service and measures code coverage. Pythia uses the coverage feedback to select the test cases with unique code paths for further mutations.
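The three-phase feedback loop described above can be sketched as follows. This is an illustrative skeleton only; the function names (`parse`, `train`, `mutate`, `execute`) are stand-ins for Pythia's actual components, not its real API.

```python
# High-level sketch of Pythia's coverage-guided fuzzing loop.
# All component functions are illustrative placeholders.

def fuzz(seeds, parse, train, mutate, execute, budget):
    """Parse seeds, train a model, then mutate/execute with coverage feedback."""
    asts = [parse(seed) for seed in seeds]       # parsing phase
    model = train(asts)                          # learning phase
    corpus, seen_paths = list(asts), set()
    for _ in range(budget):                      # mutation + execution phases
        for ast in list(corpus):
            mutant = mutate(model, ast)
            path = execute(mutant)               # identifier of the code path hit
            if path not in seen_paths:           # coverage feedback: keep only
                seen_paths.add(path)             # mutants reaching new paths
                corpus.append(mutant)
    return corpus
```

The key design point reflected here is that only mutants exercising previously unseen code paths are fed back into the corpus for further mutation.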

Fig. 5: Mutations with new values that are *not* in the original test cases. The first mutation changes the request type from POST to GET and further pollutes it with random bytes. This leads to an unhandled HTTP request type GTK. The second mutation changes “branch” using the value “developers_can_merge” from a completely different request definition. The latter is further polluted with random bytes that turn it into “devexf1opers_can_merge”.

III-A Parsing Phase

In this phase, Pythia infers the syntax of the seed inputs by parsing them with a user-provided Regular Grammar (RG) with tail recursion. Such an RG is defined by a 4-tuple $G = (N, \Sigma, R, S)$, where $N$ is a set of non-terminal symbols, $\Sigma$ is a set of terminal symbols, $R$ is a finite set of production rules, and $S$ is a distinguished start symbol. The syntactic definition of $G$ looks like a Context-Free Grammar, but because recursion is only allowed on the right-most non-terminal and no other cycles are allowed, the grammar is actually regular. Figure 3 shows a template $G$ for REST API test case generation. A test case that belongs to the language defined by $G$ is derived by starting with the start symbol sequence and applying a succession of production rules in $R$ over non-terminal symbols in $N$ until only terminal symbols in $\Sigma$ remain.

Figure 4 shows how seed RESTler test cases are parsed by Pythia’s parsing engine. Successions of production rules in $R$ (see LHS of Figure 4) are applied to infer the corresponding Abstract Syntax Trees (ASTs) (see RHS); the internal tree nodes are non-terminals in $N$, and the leaves are terminals in $\Sigma$. Pythia traverses each tree in Depth First Search (DFS) order, which yields a sequence of grammar rules. For example, a simple test case X=‘‘GET /projects/1243/repo/branches" will be represented as the sequence of grammar production rules shown in the figure. Thus, given a set of seed inputs, the output of this phase is a set of abstracted test cases, which is passed to the training and mutation engines.
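A heavily simplified sketch of this abstraction step is shown below. The rule names mirror Figure 3 but are flattened to strings; the heuristic of treating numeric path segments as consumer resource ids is an illustrative assumption, not Pythia's actual dependency analysis.

```python
# Minimal sketch of the parsing phase: map a seed request to the DFS
# sequence of grammar-rule leaves of Figure 3 (names simplified).

def abstract_request(request):
    """Return the request as a flat sequence of (rule, terminal) pairs,
    i.e., the DFS order of its parse-tree leaves."""
    method, path = request.split(" ", 1)
    rules = [("method -> Sigma_http-methods", method)]
    for part in path.strip("/").split("/"):
        if part.isdigit():  # looks like a resource id: a consumer leaf
            rules.append(("consumer -> Sigma_resource-ids", part))
        else:               # otherwise a static leaf
            rules.append(("static -> Sigma_static", part))
    return rules

seq = abstract_request("GET /projects/1243/repo/branches")
```

The resulting sequence of rules is what the training and mutation engines operate on.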

Fig. 6: Mutations with values available for the primitive types of the original test case. The value “password” is mutated using another value available in the same seed and is further polluted with random bytes.

III-B Learning-based Mutation Phase

The goal of this phase is first to learn the common structural patterns of the target APIs from the seed inputs (see Section III-B1), and then to mutate those structures (see Section III-B2) and generate new test cases. To learn the structural patterns from the existing test cases, Pythia uses an autoencoder model, which is trained with the ASTs of the seed inputs. An autoencoder consists of an encoder $E$ and a decoder $D$ (see Figure 7). $E$ maps an abstracted test case $\hat{x}$ to an embedded feature vector $h = E(\hat{x})$, which captures the latent dependencies of $\hat{x}$. $D$ decodes $h$ back to $\hat{x}$. To generate structurally valid mutants, Pythia then minimally perturbs the embedded feature $h$ and decodes it back to the original space, obtaining $\hat{x}' = D(h + \delta)$. Our key insight is that since the decoder is trained to learn the grammar, the output of the decoder from the perturbed hidden state will still be syntactically valid. Thus, $\hat{x}'$ will be a syntactically valid mutant. This section describes this design in detail.

III-B1 Training Engine

Given the abstracted test cases, the training engine learns their vector representations (i.e., encodings) using an autoencoder type of neural network [hinton2006autoencoders]. Pythia realizes the autoencoder with a simple seq2seq model trained over the abstracted test cases. Usually, a seq2seq model is trained to map variable-length sequences of one domain to another (e.g., English to French). By contrast, we train on sequences of a single domain such that the model captures the latent characteristics of the test cases.

A typical seq2seq model consists of two Recurrent Neural Networks (RNNs): an encoder RNN and a decoder RNN. The encoder RNN maintains a hidden state $h_t$ and an optional output $y_t$, and operates on a variable-length input sequence $x = (x_1, \dots, x_T)$. At each time step $t$ (which can be thought of as a position in the sequence), the encoder reads the next symbol $x_t$ of the input, updates its hidden state by $h_t = f(h_{t-1}, x_t)$, where $f$ is a non-linear activation function, such as a simple Long Short-Term Memory (LSTM) unit [lstm], and calculates the output by $y_t = g(h_t)$, where $g$ is an activation function producing valid probabilities. At the end of the input sequence, the hidden state of the encoder is a summary $c$ of the whole sequence. Conversely, the decoder RNN generates an output sequence $y = (y_1, \dots, y_{T'})$ by predicting the next symbol $y_t$ given its hidden state $h'_t$, where both $y_t$ and $h'_t$ are conditioned on the previously emitted symbol $y_{t-1}$ and on the summary $c$ of the input sequence. Hence, the hidden state of the decoder at time $t$ is computed by $h'_t = f(h'_{t-1}, y_{t-1}, c)$, and the conditional distribution of the next symbol is computed by $P(y_t \mid y_{t-1}, \dots, y_1, c) = g(h'_t, y_{t-1}, c)$ for given activation functions $f$ and $g$.

We jointly train the seq2seq model to maximize the conditional log-likelihood $\max_\theta \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y_n \mid x_n)$, where $\theta$ is the set of the learnt model parameters and each $(x_n, y_n)$ is a pair of input and output sequences. As explained earlier, the model is trained on sequences of a single domain (i.e., $y_n = x_n$) and is then given as input to the mutation engine.
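The encoder recurrence above can be made concrete with a toy NumPy sketch. The weights, dimensions, and choice of $f = \tanh$ are illustrative assumptions; the paper's implementation uses GRU cells in TensorFlow.

```python
import numpy as np

# Toy sketch of the encoder recurrence h_t = f(h_{t-1}, x_t) with
# f = tanh; the final hidden state is the summary c of the sequence.
# Dimensions and random weights are illustrative only.

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W = rng.normal(scale=0.1, size=(d_hid, d_in))    # input-to-hidden weights
U = rng.normal(scale=0.1, size=(d_hid, d_hid))   # recurrent weights
b = np.zeros(d_hid)

def encode(xs):
    """Run the RNN over a sequence of input vectors; return the summary c."""
    h = np.zeros(d_hid)
    for x in xs:                        # read one symbol per time step
        h = np.tanh(W @ x + U @ h + b)  # h_t = f(h_{t-1}, x_t)
    return h                            # c = h_T

c = encode([rng.normal(size=d_in) for _ in range(5)])
```

The decoder mirrors this recurrence, additionally conditioning each step on the previously emitted symbol and on the summary `c`.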

III-B2 Mutation Engine

Input: seeds $\hat{X}$, RG grammar $G$, model $M$, batch size $B$
1  while time_budget do
2    foreach seed $\hat{x}$ in $\hat{X}$ do
3      // Perturbation: exponential search on random noise scale
4      for $k \leftarrow 1$ to $K$ do
5        draw noise from a normal distribution
6        bound and scale the random noise
7        add the noise to the decoder's starting state
8      end for
9      select the prediction $\hat{x}'$ obtained with the smallest noise scale
10     // Case 1: grammar rules not seen in the current seed
11     foreach index in get_common_leafs($\hat{x}$, $\hat{x}'$) do
12       foreach rule in rules do
13         apply rule at index of $\hat{x}$
14       end foreach
15     end foreach
16     // Case 2: grammar rules from the new decoder's prediction
17     foreach index in get_different_leafs($\hat{x}$, $\hat{x}'$) do
18       foreach rule in rules do
19         apply rule at index of $\hat{x}$
20       end foreach
21     end foreach
22   end foreach
23 end while
Algorithm 1 Learning-based Pythia mutations
Fig. 7: Overview of Pythia Mutation Engine

For each test case $\hat{x}$, the mutation engine decides with what values to mutate each input location of $\hat{x}$. Since $\hat{x}$ is a sequence of grammar rules (see Figure 4), the mutation strategy determines how to mutate each rule: whether to use alternative rules with different values not present in the current test case (example of Figure 5) or to use rules available in the original seed test case (example of Figure 6). To mutate a seed test case $\hat{x}$, Pythia first perturbs its embedded vector representation by adding minimal random noise, and decodes it back to a new test case $\hat{x}'$. The perturbation added by Pythia may create differences between $\hat{x}$ and $\hat{x}'$. These differences determine the mutation strategy at each location of the seed test case:

  • Locations where $\hat{x}$ and $\hat{x}'$ are the same after perturbation indicate that the model has not seen many variations during training, and mutations with new rules, not in the original input/output sequences, should be used (see the example of Figure 5).

  • Locations where $\hat{x}$ and $\hat{x}'$ differ indicate that the model has seen more variance during training, and mutations with the rules seen by the model should be used (see the example of Figure 6). In fact, these rules are auto-selected from the decoder.
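The two helper routines of Algorithm 1, `get_common_leafs` and `get_different_leafs`, can be sketched as a simple position-wise comparison of the two rule sequences. The rule names below are illustrative.

```python
# Sketch of the leaf comparison driving the two mutation strategies.
# A test case is abstracted as a sequence of grammar-rule names.

def get_common_leafs(seed, decoded):
    """Positions where the perturbed decoding agrees with the seed
    (Case 1: mutate with rules NOT in the seed)."""
    return [i for i, (a, b) in enumerate(zip(seed, decoded)) if a == b]

def get_different_leafs(seed, decoded):
    """Positions where the decoding diverged from the seed
    (Case 2: reuse the rules the decoder itself predicted)."""
    return [i for i, (a, b) in enumerate(zip(seed, decoded)) if a != b]

seed    = ["method->GET", "static->projects", "consumer->id", "static->branches"]
decoded = ["method->GET", "static->projects", "consumer->id", "static->commits"]
```

Here the last leaf diverged, so it would be mutated with decoder-predicted rules, while the first three would receive rules not present in the seed.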

Algorithm 1 presents the mutation strategy in detail and Figure 7 illustrates it pictorially. The algorithm takes a set of abstracted test cases, a regular grammar, a trained autoencoder model, and its batch size as inputs, and continuously iterates over the seed test cases until the time budget expires. At a high level, the mutation engine has two steps: identifying the mutation type appropriate for each location and applying the changes to those locations.

Perturbation: For each test case $\hat{x}$, the encoder of the model obtains its embedding $h$, and the embedded vector is then perturbed with random noise. In particular, Pythia draws noise values from a normal distribution, bounded by the norm of $h$ and scaled exponentially over $K$ steps. The noise values are used to perturb $h$ independently $K$ times and obtain $K$ different perturbed vectors, which serve as different starting states of the decoder. In turn, they lead to $K$ different output sequences for the input $\hat{x}$. From these new outputs, Pythia selects the one, $\hat{x}'$, which differs from $\hat{x}$ and is obtained by the smallest (in norm) perturbation on $h$.

The $K$-step exponential search performed in order to find the smallest perturbation that leads to a new prediction helps avoid overly pervasive changes that would completely destroy the embedded structure of $\hat{x}$. Generally, norm-bounded perturbations are a common approach in the adversarial machine learning literature [biggio2013evasion, goodfellow2014explaining, carlini2017towards] where, given a classification model $C$ and an input sample $x$ originally classified to class $C(x)$, the goal is to find a small perturbation $\delta$ that will change the original class of $x$, such that $C(x + \delta) \neq C(x)$. Our use of perturbations in Algorithm 1 is different in two ways. First, the perturbations are random, as opposed to typical adversarial perturbations that are guided by the gradients of the classification model. Second, the seq2seq model is not a classification model, but rather an autoencoder. The purpose of applying perturbations to the initial state of the decoder is, given a seed test case $\hat{x}$, to leverage the knowledge learnt from the seeds and generate a new test case $\hat{x}'$ that is marginally different from the original one. We then compare $\hat{x}$ and $\hat{x}'$ to determine the mutation strategy at each location of the seed test case.

Comparison & Mutation Strategies: The result of the comparison between $\hat{x}$ and $\hat{x}'$ determines the mutation strategy followed at each location of the seed test case. The two groups of nested for-loops in Algorithm 1 implement the two different mutation strategies explained earlier. The first group of nested for-loops targets leaf locations where $\hat{x}$ and $\hat{x}'$ are the same (Case 1). For such positions of $\hat{x}$, new mutations are generated by iteratively applying grammar rules with terminal symbols originally not in $\hat{x}$. The second group of nested for-loops targets leaf locations where $\hat{x}$ and $\hat{x}'$ differ (Case 2). For such positions of $\hat{x}$, new mutations are generated by iteratively applying grammar rules with terminal symbols in $\hat{x}'$. In both cases, the new grammar rules are further augmented with random byte alterations on the byte representation of each rule’s terminal symbols. This augmentation with auxiliary payload mutations helps avoid repeatedly exercising identical rules.

III-C Execution Phase

In this step, the execution engine takes as input the new test cases generated by the mutation engine and executes them against the target service, which is continuously monitored by the coverage monitor. Executing a test case includes sending its requests to the target service over HTTP(S) and receiving back the respective responses. Before testing, we statically analyze the source code of the target service, extract basic block locations, and configure it to produce code coverage information. During testing, the coverage monitor collects the code coverage information produced by the target service and matches it with the respective test cases executed by the execution engine. Then, given the statically extracted basic blocks, each test case is mapped to a bitmap of basic blocks describing the respective code path activated. This helps distinguish test cases that reach new code paths and ultimately minimize an initially large corpus of seed test cases to a smaller one with test cases that reach unique code paths.
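The bitmap mapping and the corpus minimization described above can be sketched as follows. The `run` callable and the block identifiers are illustrative; in Pythia, the block list comes from static analysis of the service's source code.

```python
# Sketch of the coverage monitor: map executed code locations to a
# basic-block bitmap, and distill seeds by unique code paths.

def to_bitmap(executed, blocks):
    """One bit per statically-extracted basic block."""
    return tuple(1 if b in executed else 0 for b in blocks)

def distill(test_cases, run, blocks):
    """Keep only test cases that activate a previously unseen code path."""
    kept, seen = [], set()
    for tc in test_cases:
        bitmap = to_bitmap(run(tc), blocks)  # run returns executed blocks
        if bitmap not in seen:
            seen.add(bitmap)
            kept.append(tc)
    return kept
```

Distillation keeps the corpus small while preserving one representative per observed code path.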

III-D Implementation

We use an off-the-shelf seq2seq RNN with input embedding, implemented in TensorFlow [tensorflow]. The model has one layer of Gated Recurrent Unit (GRU) cells in both the encoder and the decoder. Dynamic input unrolling is performed using the tf.nn.dynamic_rnn API, and the encoder is initialized with a zero state. We train the model by minimizing the weighted cross-entropy loss for sequences of logits using the Adam optimizer [adam]. We train with batches of sequences over a fixed number of training steps, using a constant learning rate and an initial embedding layer. The vocabulary of the model depends on the number of production rules in the fuzzing grammar of each API family and is on the order of a few hundred production rules. Similarly, the length of each sequence depends on the specific API. Training such a model configuration on a CPU-only machine takes no more than two hours. All the experiments discussed in our evaluation were run on Ubuntu 18.04 Google Cloud VMs [google-cloud] with 8 logical CPU cores and 52GB of physical memory. Each fuzzing client is used to test a target service deployment running on the same machine.

Fig. 8: RQ1. Comparison of Pythia mutation strategies w.r.t. other baselines. Seed collection: Run RESTler on each API to generate seed test cases. The seed collection time is the same for all APIs except for “issues”, for which the respective time was extended. Within this time, RESTler reached a plateau in all cases. Fuzzing: Use the seed corpus to perform three individual fuzzing sessions per API and let RESTler also run for additional hours. Comparison: Measure the number of new lines executed after the initial seed collection. Note that although RESTler runs longer in total, no new lines are discovered. Pythia performs best w.r.t. all the baselines.

IV Experimental Setup

IV-A Study Subjects

Table I summarizes the APIs tested by Pythia. In total, we tested 6 APIs of GitLab [gitlab-doc], 2 APIs of Mastodon [mastodon], and 1 API of Spree [spree]. First, we test GitLab enterprise edition stable version 11-11 through its REST APIs related to common version control operations. GitLab is an open-source web service for self-hosted Git; its back-end is written in over 376K lines of Ruby code using Ruby on Rails, and its functionality is exposed through a REST API. It is used by many organizations, has millions of users, and currently has a 2/3 market share of the self-hosted Git market [gitlab-statistics]. We configure GitLab to use the Nginx HTTP web server and Unicorn Rails workers limited to a fixed amount of physical memory. We use PostgreSQL for persistent storage configured with a pool of 20 workers, and use the default GitLab configuration for sidekiq queues and redis workers. According to GitLab’s deployment recommendations, such a configuration should scale up to 4,000 concurrent users [gitlab-requirements]. Second, we test Mastodon, an open-source, self-hosted social networking service with millions of users [mastodon-statistics]. We follow the same configuration as GitLab regarding Unicorn Rails workers and persistent storage. Third, we test Spree, an open-source e-commerce platform for Rails 6 with millions of downloads [spree].

Table I shows the characteristics of the target service APIs under test. All target APIs are related to common operations that users of the corresponding services may perform. In principle, the total number of requests in each API family, along with the average number of available primitive value combinations for each request, indicates the size of the state space that needs to be tested. Furthermore, the existence of path or body dependencies (or both) among request types captures another qualitative property, indicative of how difficult it is to generate sequences of request combinations.

IV-B Monitoring Framework & Initial Seed Generation

We statically analyze the source code of each target service to extract basic block locations and configure each service, using Ruby's Class:TracePoint hooks, to produce stack traces of the lines of code executed during testing. During testing, all target services are monitored by Pythia's coverage monitor, which converts stack traces to bitmaps of basic block activations corresponding to the test cases executed. To perform test suite minimization (seed distillation), we statically analyze the source code of each target service and extract basic blocks for GitLab, basic blocks for Mastodon, and basic blocks for Spree.
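The conversion from stack traces to coverage bitmaps can be sketched as follows. This is a hypothetical Python illustration, not Pythia's actual implementation: the block-indexing scheme, the bitmap size, and all function names are assumptions made for the example.

```python
# Sketch: convert a stack trace (file:line pairs) into a fixed-size bitmap
# of basic-block activations, and check it against the global coverage map.
# The hashing scheme and bitmap size below are assumptions.
import hashlib

BITMAP_SIZE = 1 << 16  # 64K slots, an assumed size

def block_index(path: str, line: int) -> int:
    """Map a (file, line) location to a stable bitmap slot."""
    digest = hashlib.md5(f"{path}:{line}".encode()).digest()
    return int.from_bytes(digest[:4], "little") % BITMAP_SIZE

def trace_to_bitmap(trace):
    """trace: iterable of (file_path, line_number) executed during one test."""
    bitmap = bytearray(BITMAP_SIZE)
    for path, line in trace:
        bitmap[block_index(path, line)] = 1
    return bitmap

def new_coverage(bitmap, global_map):
    """Return True (and update global_map) if the test hit any new block."""
    fresh = False
    for i, hit in enumerate(bitmap):
        if hit and not global_map[i]:
            global_map[i] = 1
            fresh = True
    return fresh
```

A coverage-guided fuzzer would keep only the test cases for which `new_coverage` returns True, which is what makes the feedback loop prioritize coverage-increasing mutations.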

Pythia starts fuzzing using an initial corpus of seeds generated by RESTler, an existing, stateful REST API fuzzer [restler]. To produce these initial seeds, we run RESTler for a custom amount of time on each individual API family of each target service using its default fuzzing mode (i.e., Breadth First Search), its default fuzzing dictionary (i.e., two values for each primitive type), and by turning off its Garbage Collector (GC) to obtain more deterministic results.

IV-C Evaluating Pythia

IV-C1 Baselines.

We evaluate Pythia against three baselines.

(i) RESTler. We use RESTler both for seed test case generation and for comparison. On each target API, we run RESTler for two days. The first day (the seed collection phase) is used to generate seed test cases; the second day (the fuzzing phase) is used for comparison. We compare the incremental coverage achieved by RESTler versus Pythia over the coverage achieved by the initial seed test cases.

(ii) Random byte-level mutations. This is the simplest form of mutation. As their name suggests, byte-level mutations are random alterations of the bytes of each seed test case. To produce a byte-level mutation, the mutation engine selects a random target position within the seed sequence and a random byte value (in the range 0–255), and updates the target position to the random byte value. Naturally, such mutations are usually neither syntactically nor semantically valid (defined in Section II).
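The byte-level baseline can be sketched in a few lines. This is an illustrative assumption of how such a mutator looks, not the paper's actual code; the function name is hypothetical.

```python
# Minimal sketch of the random byte-level mutation baseline: pick a random
# position in the serialized seed and overwrite it with a random byte (0-255).
import random

def byte_level_mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite one random position of the seed with a random byte value."""
    mutated = bytearray(seed)
    pos = rng.randrange(len(mutated))
    mutated[pos] = rng.randrange(256)
    return bytes(mutated)
```

Because the mutation is blind to the request grammar, it can just as easily corrupt an HTTP verb or a JSON brace as a fuzzable field value, which is why most of its outputs are rejected early by the service's parsers.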

(iii) Random tree-level mutations. To produce a random tree-level mutation, the mutation engine selects a random leaf of the respective tree representation and a random rule with a terminal symbol, and replaces the target leaf using the random rule. The mutations are performed exclusively on the tree leaves, and not on internal nodes, in order to maintain the syntactic validity of each test case. However, since the target leaves and the new rules (mutations) are selected at random for each test case, the target state space for mutations on realistic test cases is quite large. For example, the test case shown in Fig. 1 is represented as a tree consisting of 73 leaf nodes, and the RG used to produce it has 66 rules with terminal symbols. This defines a state space of almost 5,000 feasible mutations for a single seed, let alone the total size of the state space defined by the complete corpus of seeds. Next, we evaluate Pythia's learning-based mutation strategy, which considers the intrinsic structure of each test case and significantly prunes the size of the search space.
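A minimal sketch of this baseline, under the assumption that a seed's parse tree is represented by its list of leaf strings and the grammar by a list of candidate terminals (both representations and function names are hypothetical):

```python
# Sketch of the random tree-level baseline: replace one random leaf with the
# terminal produced by a random grammar rule, leaving internal nodes intact.
import random

def tree_level_mutate(leaves, terminal_rules, rng):
    """Replace one random leaf value with the terminal of a random rule."""
    mutated = list(leaves)
    mutated[rng.randrange(len(mutated))] = rng.choice(terminal_rules)
    return mutated

def mutation_space(num_leaves, num_terminal_rules):
    """Each (leaf, rule) pair is one feasible single-step mutation."""
    return num_leaves * num_terminal_rules
```

With the numbers cited above, 73 leaves and 66 terminal rules yield 73 × 66 = 4,818 single-step mutations per seed, which matches the "almost 5,000" figure.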

Target Service | API Family       | Total Requests | Request Dependencies
GitLab         | Commits          | 15 (*11)       | Path, Body
GitLab         | Branches         | 8 (*2)         | Path
GitLab         | Issues & Notes   | 25 (*20)       | Path
GitLab         | User Groups      | 53 (*2)        | Path
GitLab         | Projects         | 54 (*5)        | Path
GitLab         | Repos & Files    | 12 (*22)       | Path
Mastodon       | Accounts & Lists | 26 (*3)        | Path, Body
Mastodon       | Statuses         | 18 (*19)       | Path
Spree          | Storefront Cart  | 8 (*11)        | Path
TABLE I: Target service APIs. Shows the number of distinct request types in each API family, (*) the average number of primitive value combinations available for each request type, and the respective request dependencies.

IV-C2 Evaluation

We answer the following questions:

  1. How do the three baselines compare with Pythia in terms of code coverage increase over time? (Section V-A)

  2. How does initial seed selection impact the code coverage achieved by Pythia? (Section V-B)

  3. What is the impact of seed distillation (test suite minimization) on code-coverage? (Section V-C)

  4. Can Pythia detect bugs across all three services? (Section V-D)

V Results

Fig. 9: Impact of initial seed collection. Seed collection: run RESTler for h on each API. Fuzzing: use each corpus to perform three individual h guided tree-level Pythia mutation sessions; additionally, let RESTler run for h more (h in total). Comparison: show the number of new lines executed after the initial h of seed collection.

V-A RQ1. Code Coverage Achieved by Pythia

In this RQ, we investigate Pythia's impact on the total line coverage achieved across all the APIs shown in Table I. We compare Pythia against the three baseline fuzzers introduced in Section IV-C1. In particular, we check whether Pythia can find new lines once RESTler reaches a plateau. We run RESTler for h per API, except for "Issues & Notes", for which the seed collection phase is extended to h due to a late plateau (explained in Section V-B). We train Pythia with these seeds and then fuzz with the newly generated test inputs for an additional h. The other two baselines also use the RESTler-generated seeds, mutate them using their own strategies, and fuzz the target program for h. Figure 8 shows the results for the GitLab APIs.

First, we observe that for all the APIs, Pythia exercised unique new lines of code during fuzzing. Since RESTler has plateaued after the initial h of seed collection (h for "Issues & Notes"), no new lines are discovered by RESTler during the latter h of fuzzing. This type of plateau, which is common in fuzzing, is expected in the case of RESTler because it has to explore an exponential state space as the number of requests in a test case increases. For example, after the first h in the "Commits" API, RESTler has to explore a state space defined by sequences of length five, with feasible renderings each on average, before moving on to sequences of length six. This state-space explosion is similar across all APIs. Moreover, while stuck searching a large search space, RESTler repeatedly uses the same fuzzing values, generating likely-redundant mutations.

Further, across all APIs, both Pythia and the two random baselines discover new lines of code that were never executed by RESTler. This demonstrates the value of continuously attempting new mutation values instead of repeatedly applying a fixed set of values in different combinations. Even the trivial random byte-level mutation strategy finds at least additional lines, on top of those discovered by RESTler, in all cases. Compared to all baselines, Pythia always increases line coverage the most, ranging from additional lines (in the "Groups & Member" APIs) to extra lines (in "Commits").

We also observe that across all APIs the relative ordering of Pythia and the three baselines remains consistent over time: Pythia performs better than the random tree-level baseline, which, in turn, performs better than the random byte-level baseline. Such an ordering is expected. As explained in Section V-C, and also motivated with a concrete example in Fig. 1, raw byte-level mutations tend to violate both the semantic and syntactic validity of the seed test cases and consequently underperform compared to tree-level mutations, which obey syntactic validity. Although the latter produce syntactically valid mutations, they mutate without any guidance and thus cannot direct the mutation effort where it could have a larger impact on code coverage. In contrast, Pythia learns potential mutation locations and values from the existing seed corpus and thus increases line coverage faster and higher.

We ran the same experiments across the APIs of Mastodon and Spree and observed that the relative comparison between Pythia and RESTler always yields the same conclusion: overall, Pythia always finds test cases that execute new, additional lines of code not executed by RESTler. Specifically, after h of fuzzing, Pythia finds new lines in "Accounts & Lists" and new lines in "Statuses" of Mastodon, and new lines in Spree's "Storefront Cart".

V-B RQ2. Impact of Seed Selection

Previously, we saw that well after RESTler plateaus, Pythia still discovers test cases that increase code coverage. However, it is unclear how the two tools compare before RESTler plateaus. This raises the question of how initial seed selection impacts the line coverage achieved by Pythia. We select the "Issues & Notes" API, which takes the longest time to plateau among all the APIs (i.e., after h), and examine three configurations: initial seeds collected after h, h, and h of running RESTler. Fig. 9 shows the results.

In h and h settings, although Pythia started achieving better coverage, RESTler took off after a few hours of fuzzing. However, once RESTler plateaus after h, Pythia keeps on finding new code. Fig. 9 shows the union and intersection of the lines discovered by RESTler and Pythia to understand whether the two tools are converging or orthogonal in terms of discovering new lines. We observe across all the plots of Fig. 9 that the intersection remains constant while the union increases. This means that the two tools discover diverging sets of lines.

As explained earlier, Pythia finds new lines because it performs mutations with new values, whereas RESTler constantly uses a predefined set of values. In addition, it is now also clear that Pythia cannot cover some lines covered by RESTler's test cases. This is because, by construction, the mutations generated by Pythia exercise new rules and lead to syntactically, and largely semantically, valid test cases, but they are not designed to mutate the request sequence semantics. In other words, no new request sequence combinations will be attempted by Pythia. Instead, Pythia focuses on mutations of the primitive types within individual requests. This limitation becomes evident when, before the last plateau, RESTler increases the sequence length (and covers new lines by producing longer test cases), whereas Pythia has no means of deriving such test cases. The conclusions drawn by investigating the impact of the initial seed volume in "Issues & Notes" generalize across all APIs tested so far.

V-C RQ3. Impact of Seed Distillation

Fig. 10: Impact of distillation (test suite minimization). Seed collection: run RESTler for h on each API in order to collect seed corpora. Fuzzing: use each corpus to perform two individual 24h guided tree-level Pythia mutation sessions, one with test suite minimization (distillation) and one without. Comparison: show the number of new lines executed after the initial seed collection.

All target services are monitored by Pythia's coverage monitor in order to perform test suite minimization, referred to as seed distillation. To investigate the impact of seed distillation, we perform two independent experiments, with and without distillation, on all GitLab APIs, using the same initial seeds and fuzzing for h. Figure 10 shows the total number of additional new lines executed during the fuzzing phase with and without distillation.
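Seed distillation can be sketched with the classic greedy coverage-based minimization heuristic. The following Python is an illustrative assumption, not Pythia's actual implementation; the function name and the seed representation are hypothetical.

```python
# Sketch of seed distillation (test suite minimization): keep a seed only if
# it covers at least one basic block not covered by the seeds kept so far.
def distill(seeds):
    """seeds: list of (seed_id, set_of_covered_blocks). Returns kept ids."""
    covered, kept = set(), []
    # Consider seeds with the largest coverage first (a common heuristic).
    for seed_id, blocks in sorted(seeds, key=lambda s: -len(s[1])):
        if blocks - covered:  # contributes at least one new block
            kept.append(seed_id)
            covered |= blocks
    return kept
```

The distilled corpus preserves the total block coverage of the original corpus while containing fewer, less redundant seeds, so the mutation budget is spent on more diverse inputs.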

We observe that for most APIs, seed distillation helps execute more new lines, and faster. The best incremental benefit is observed in "Projects", while the worst (no benefit at all) is observed for "Branches". The "Branches" APIs are relatively simple, with total requests and primitive values; in such a simple case, distillation does not offer any benefit. Distillation also offers no benefit in "Commits": although the setting with distillation outperforms the one without between the fourth and the sixteenth hour, the two settings ultimately converge on the same coverage.

V-D RQ4. Number of Bugs Found

Although code coverage is an indicative proxy for the effectiveness of bug-finding tools, the ultimate metric is the total number of bugs found. Pythia found new bugs across every API and every service tested so far. In total, Pythia found 29 new bugs.

While fuzzing with Pythia, a high number of "500 Internal Server Errors" is received, and different instances of the same bugs are reported. These "500 Internal Server Errors" are potential server state corruptions that may have unknown consequences for the target service's health. Since all the bugs found have to be manually inspected, it is desirable to report unique instances of each bug and avoid duplication. To this end, we use the code coverage information and group bugs using the following rule: out of all test cases triggering a "500 Internal Server Error", we report as bugs those that are generated by exercising unique code paths. According to this rule, Table II shows the bugs found across all services tested. Over the same number of fuzzing hours, Pythia and RESTler generate the same order of magnitude of test cases. The test cases of both tools have similar execution times in the target services because the total number of requests per test case remains similar. (Pythia does not attempt new request sequence combinations.) However, Pythia's learning-based mutations trigger many more 500s, which lead to more unique bugs. Pythia operates on seed test cases generated by RESTler, which naturally trigger all bugs found by RESTler. We do not count bugs found by RESTler in the results reported for Pythia. Next, in Section VI, we conduct case studies on an indicative subset of the bugs found by Pythia.
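The deduplication rule above can be sketched as follows. This is a hedged illustration of the described idea, not the paper's code: the path fingerprint and function name are assumptions.

```python
# Sketch of the bug-deduplication rule: among all test cases that triggered
# a "500 Internal Server Error", report one bug per unique code path.
def unique_bugs(failures):
    """failures: list of (test_case, tuple_of_executed_blocks).
    Returns one representative test case per distinct code path."""
    seen, reported = set(), []
    for test_case, path in failures:
        if path not in seen:  # the executed path is the bug's fingerprint
            seen.add(path)
            reported.append(test_case)
    return reported
```

Grouping by executed path rather than by error message avoids counting the same root cause many times, since one server-side bug typically produces the same stack trace however many inputs trigger it.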

                 |       RESTler       |        Pythia
Target APIs      | Tests | 500s | Bugs | Tests | 500s | Bugs
Commits          | K     |      | 0    | K     |      | 3
Branches         | K     |      | 0    | K     |      | 4
Issues           | K     |      | 0    | K     |      | 5
User Groups      | K     |      | 0    | K     |      | 4
Projects         | K     |      | 0    | K     |      | 4
Repos & Files    | K     |      | 0    | K     |      | 2
Accounts & Lists |       |      | 0    | K     |      | 3
Statuses         |       |      | 1    | K     |      | 1
Storefront Cart  |       |      | 1    | K     |      | 3
Total            | -     | -    | 2    | -     | -    | 29
TABLE II: Number of test cases generated, "500 Internal Server Errors" triggered, and unique bugs found by RESTler and Pythia after h of fuzzing.

VI New Bugs Found

During our experiments with Pythia on local GitLab, Mastodon, and Spree deployments, we found new bugs. All bugs were easily reproducible, and we are in the process of reporting them to the respective service owners. We describe a subset of these bugs to give a flavor of what they look like and which test cases uncovered them.

Example 1: Bug in Storefront Cart. One of the bugs found by Pythia in Spree is triggered when a user tries to add a product to the storefront cart using a malformed request path ‘‘/storefront/|add_item?include=line_items’’. Due to erroneous input sanitization, the character ‘‘|’’ is not stripped from the intermediate path parts. Instead, it reaches the function split of the library uri.rfc3986_parser.rb, which treats it as a delimiter of the path string. This leads to an unhandled InvalidURIError exception in the calling library actionpack and causes a “500 Internal Server Error”, preventing the application from handling the request and returning the proper error, i.e., “400 Bad Request”. This bug can be reproduced with a test case of two requests: (1) creating a user token and (2) adding a product to the cart using the malformed request path. Bugs related to improper input sanitization and unhandled values passed across multiple layers of software libraries are commonly found by fuzzing. Pythia found bugs due to malformed request paths in all the services tested.

Example 2: Bug in Issues & Notes. Another bug found by Pythia, in GitLab's Issues & Notes APIs, is triggered when a user attempts to open an issue on an existing project using a malformed request body. The body of this request includes multiple primitive types and multiple key-value pairs, including due_date, description, confidentiality, title, assignee_id, state_event, and others. A user can create an issue using a malformed value for the field title, such as {"title":"DELE\xa2"}, which leads to a “500 Internal Server Error”. The malformed title value is not sanitized before reaching the function create of <class:Issues>, which creates new issues. This leads to an unhandled ArgumentError exception due to an invalid UTF-8 byte sequence. This bug can be reproduced by (1) creating a project and (2) trying to post an issue with a malformed title in the project created in step (1).
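Why this particular title value crashes the service can be illustrated in isolation: the trailing byte 0xa2 is a UTF-8 continuation byte with no leading byte, so decoding the raw value fails. The Python check below mirrors the idea; it is an illustration, not GitLab's actual code, which raises Ruby's ArgumentError for the same reason.

```python
# The malformed title ends with byte 0xa2, a UTF-8 continuation byte that
# appears without a leading byte, so strict UTF-8 decoding fails. A service
# should run a check like this before passing the value to deeper layers.
def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A well-formed title such as b"DELETE" passes the check, while the fuzzed b"DELE\xa2" fails it; rejecting the latter up front would turn the 500 into the proper 400 response.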

Interestingly, adding malformed values to other fields of the request body does not necessarily lead to errors. For instance, the fields confidentiality and state_event belong to different primitive types (boolean and integer) that are properly parsed and sanitized. Furthermore, mutations that break the JSON structure of the request body, or that do not use existing project IDs, also do not lead to such errors. Brute-forcing all possible ways to break similar REST API request sequences is infeasible. Instead, Pythia learns common usage patterns of the target service APIs and then applies learning-based mutations that break these common usage patterns while still maintaining syntactic validity. Pythia found such input sanitization bugs, due to malformed request bodies, in all services tested. Similar bugs are shown in Figures 5, 6 and 1.

Other examples of unhandled errors found by Pythia are due to malformed headers and request types. All the bugs found in this work are currently being reported to the service owners.

VII Related Work

Our work aims at testing cloud services with REST APIs and relates to work across three broad domains: (i) blackbox grammar-based fuzzing, (ii) coverage-guided and fully-whitebox fuzzing approaches, and (iii) learning-based fuzzing approaches.

In blackbox grammar-based approaches, the user provides an input grammar specifying the input format, what input parts are to be fuzzed and how [boofuzz, burp, Peach, SPIKE]. A grammar-based fuzzer then generates new inputs, satisfying the constraints encoded by the grammar. These new inputs reach deep application states and find bugs beyond syntactic lexers and semantic checkers. Grammar-based fuzzing has recently been automated in the domain of REST APIs by RESTler [restler]. RESTler performs a lightweight static analysis of the API specification in order to infer dependencies among request types, and then automatically generates an input grammar that encodes sequences of requests in order to exercise the service more deeply, in a stateful manner. RESTler inherits two of the typical limitations of grammar-based fuzzing, namely fuzzing rules with predefined sets of values and lack of coverage feedback. Pythia addresses these limitations and augments blackbox grammar-based fuzzing with coverage-guided feedback and a learning-based mutation strategy in the domain of stateful REST API fuzzing.

Fuzzing approaches based on code-coverage feedback [AFL] are particularly effective in domains with simple input formats but struggle in domains with complex input formats. Fully whitebox approaches can be used to improve test-generation precision by leveraging sophisticated program analysis techniques like symbolic execution, constraint generation and solving [DART, EXE, klee, SAGE, anand2007jpf, avgerinos2014enhancing, chipounov2012s2e], but still fall short of grammar-based fuzzing when required to generate syntactically and semantically valid inputs. As an alternative, coverage-guided feedback has been combined with domain-specific heuristics that assert the generation of semantically valid inputs [padhye2019semantic, pham2019smart]. More heavy-weight whitebox fuzzing techniques [MX07, GKL08, rawat2017vuzzer] have also been combined with grammar-based fuzzing. None of these approaches is learning-based. In contrast, Pythia uses a learning-based approach and utilizes initial seeds to learn common usage patterns, which are then mutated while maintaining syntactic validity.

Learning-based approaches have recently been used in fuzzing for statistical modeling of test inputs  [godefroid2017learn, wang2017skyfire] and for generating regular or context-free input grammars [bastani2017synthesizing, autogram, wang2017skyfire]. These approaches do not utilize any coverage feedback during fuzzing. Other learning-based approaches aim at modeling the branching behavior of the target program [neuzz, rajpal2017not, bottinger2018deep] using a Neural Network (NN) model. The trained NNs can then be combined with a coverage-guided fuzzer [AFL] and used as a classifier to help avoid executing test inputs that are unlikely to increase code coverage [rajpal2017not]. Alternatively, the gradients of the trained NNs can be used to infer which input bytes should be mutated in order to cover specific branches [neuzz]. However, the trained NNs approximate only a small subset of all possible program behaviours, and these approaches have been applied only to domains with relatively simple input structures.

VIII Threats to Validity

Several threats affect the validity of our study. First, the success of Pythia depends on the choice of hyperparameters for training the seq2seq autoencoder. We empirically determine the parameters that maximize edge coverage. Second, the static analysis of the source code of the target services to extract basic blocks is imprecise. Since we analyze interpreted code, we miss a subset of basic blocks that are defined within complex if-else list comprehension structures. Yet, we track a significant number of basic blocks in all targets (e.g., in GitLab). Finally, we only studied three target programs with nine APIs. Yet, we targeted complex, production-scale services with hundreds of API requests (see Table I), and we believe our results will generalize across cloud services with REST APIs.

IX Conclusion

Pythia is the first fuzzer that augments grammar-based fuzzing with coverage-guided feedback and a learning-based mutation strategy for stateful REST API fuzzing. Pythia uses a statistical model to learn common usage patterns of a REST API from seed inputs, which are all structurally valid. It then generates learning-based mutations by injecting a small amount of noise deviating from common usage patterns while still maintaining syntactic validity. Pythia's learning-based mutation strategy helps generate grammatically valid test cases, and coverage-guided feedback helps prioritize the test cases that are more likely to find bugs. We presented detailed experimental evidence, collected across three production-scale, open-source cloud services, showing that Pythia outperforms prior approaches both in code coverage achieved and, most crucially, in new bugs found. Pythia found new bugs in all APIs and all services tested so far. In total, Pythia found 29 bugs, which we are in the process of reporting to the respective service owners.