Smart contracts are deployed every day for a wide variety of applications ranging from financial services to the tracking of unique, high-value physical goods and intellectual property.
Because they represent essentially autonomous, difficult-to-modify systems with the authority to potentially allocate considerable financial resources, the security and correctness of smart contracts are extremely important. There is a large ongoing effort to describe smart contract vulnerabilities and produce tools that can find flaws in smart contracts. However, the research literature and larger discourse are largely based on a relatively small number of actually performed and sometimes highly visible [25, 21] exploits. Analyses based on all contracts on the blockchain, or even on “popular” contracts that are not intended to be highly secure, tend to include many contracts that lack any real effort at correctness and security. These analyses may misrepresent the problems that will face the truly critical contracts likely to become the “infrastructure” of a smart-contract future. Unfortunately, it is very hard for academic researchers to dig deeper into the specific flaws present in such contracts during their development.
However, there is a source of precisely such information: high-profile smart contracts are often the target of paid security audits. Such audits, as a basis for understanding smart contract flaws, have key advantages. First, the audits are likely to be of high quality; it is difficult to find repeat customers for low-quality paid audits, especially if the results are made public to warn off other potential customers. Second, because the audits are paid, the contracts audited are likely to be “serious” contracts in which flaws have real consequences, and at least some significant effort has already been applied to producing correct code.
This paper presents an analysis of the types of flaws detected in 23 Solidity/Ethereum [5, 32] smart contract audits performed by Trail of Bits (https://trailofbits.com), a leading company in the field. These results provide an empirical basis for smart contract analysis efforts, and support a set of recommendations for improving contract security.
2 Related Work
To our knowledge, no previous work reports flaws detected in paid security audits of important smart contracts. We have not even found any manual examination of large numbers of smart contracts with reasonable criteria for removing uninteresting contracts (which would ensure quality analysis). However, there are other important efforts to classify or describe smart contract flaws. Atzei, Bartoletti, and Cimoli produced a taxonomy of possible attacks on smart contracts, with examples of actual exploit code. Their categories have some overlap with those used in this paper, but are more focused on specific exploit patterns and exclude some types of flaws that are not tied to a specific attack. We believe that every category present in their taxonomy is also represented by at least one finding in our set. Their purpose is largely orthogonal to ours and presents a useful alternative view of the topic, but one based more on speculation about exploits than on concrete data about the prevalence and seriousness of flaws in real contracts. Mense and Flatscher combine a summary of known vulnerability types with a simple comparison of then-available tools, while Saad et al. expand the scope of analysis to general blockchain attack surfaces, but provide a similar categorization of smart contract vulnerabilities. Dika’s thesis provides another, earlier summary of vulnerability types, analyses, and tools. In general, the types of flaws discussed in these works are a subset of those we discuss below.
Perez and Livshits provide a (provocatively titled) analysis of actual executed exploits on 21K contracts reported in various academic papers, which provides a useful additional perspective, but they use a very different data set with purposes almost completely unrelated to ours. They find that, while reentrancy is the most dangerous category of problem (over 65% of actual exploits in the wild), even reentrancy exploits have resulted in a loss of less than 10K Ether to date. The relatively small size of exploits to date vs. potential future losses affirms that information about undetected flaws in audited high-value, high-visibility contracts is important to the community.
Smart contract analysis/verification research often touches on the topic of expected vulnerabilities [10, 31, 23, 4, 12, 7, 14, 15, 9, 11, 12, 19], but this research is, to our knowledge, always based on author perceptions of threats, not statistical inference from close examinations of high-quality/critical contracts.
3 Summary of Findings
The results below are based on 23 audits performed by Trail of Bits. Of these, all but five are public, and the reports are available online. The number of findings per audit ranged from 2 to 22, with a median and mean of 10 findings. Reports range in size from just under 2K words to nearly 13K words, with a total size of over 180K words.
3.1 Audit Timespan, Authorship, and Tool Use
The time allotted for audits ranged from one person-week to twelve person-weeks, with a mean of six person-weeks and a median of four person-weeks. The audits were prepared by a total of 24 different auditors, with most audits prepared by multiple individuals (up to five). The mean number of authors was 2.6, and the median was three. The most audits in which a single author participated was 12, the mean was 3.2; the median was only two audits. In general, while these audits are all the product of a single company, there is considerable diversity in the set of experts involved.
Most of these assessments used static and dynamic analysis tools in addition to manual analysis of code, but the primary source of findings was manual. In particular, a version of the Slither static analyzer, which included a number of detectors not available in the public version, was applied to many of the contracts. In some cases, property-based testing with Echidna and symbolic analysis with Manticore [18, 26] were also applied to detect some problems. Only two audits did not use automated tools. Sixteen of the audits made use of Slither, sixteen made use of Manticore, and thirteen made use of Echidna. However, when Slither was used in audits, it was usually used much more extensively than Manticore or Echidna, which were typically restricted to a few chosen properties of high interest. Only four findings are explicitly noted in the reports as produced by a tool, all by Slither. However, other findings may have resulted from automated analyses in a less explicit fashion.
3.2 Smart Contract Findings
Our analysis is based on 246 total findings. Tables 1 and 2 summarize information on these findings, in terms of the severity (potential impact of an exploit) and difficulty (ease with which an exploit can be developed and carried out) of the finding, plus potential for automated detection (Table 2). The findings categories are sorted by the frequency of severity counts; ties in the high-severity findings count are broken by counting medium-severity findings, and further ties are broken by low-severity findings. Appendix A provides the exact counts for categories and severities/difficulties. Our raw data is available online.
The categories in these tables are generally the categories used in the audit reports submitted to clients, but in some cases we have corrected obviously incorrect categories, and we have introduced a few new categories in cases where findings were clearly placed in a category of dubious relevance due to the lack of a suitable category. The most significant systematic change is that we separated race conditions and front-running from all other timing issues, due to 1) the large number of race conditions relative to other timing issues; 2) the general qualitative difference between race conditions and other timing-based exploits (e.g., there is a large literature addressing the detection and mitigation of race conditions specifically, that does not apply to more general timing problems); and 3) the specific relevance of front-running to smart contracts. Our analysis calls special attention to findings classified as high-low, that is high severity and low difficulty. These offer attackers an easy way to inflict potentially severe harm. There were 27 high-low findings, all classified as one of eight categories: data validation, access controls, numerics, undefined behavior, patching, denial of service, authentication, or timing.
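| Category | % of findings | High-low | Severity: high | Severity: medium | Severity: low | Severity: informational | Severity: undetermined | Difficulty: high | Difficulty: medium | Difficulty: low | Difficulty: undetermined |
|---|---|---|---|---|---|---|---|---|---|---|---|
| denial of service | 4% | 10% | 20% | 30% | 30% | 20% | 0% | 50% | 0% | 40% | 10% |
| auditing and logging | 4% | 0% | 0% | 0% | 33% | 44% | 22% | 33% | 0% | 56% | 11% |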
| Category | % Dynamic | % Static |
|---|---|---|
| access controls | 50% | 4% |
| patching | 17% | 33% |
| denial of service | 40% | 0% |
| error reporting | 29% | 14% |
| data exposure | 0% | 0% |
| auditing and logging | 0% | 38% |
| missing-logic | 67% | 0% |
| API inconsistency | 0% | 0% |
3.2.1 Data Validation
Data validation covers the large class of findings in which the core problem is that input received from an untrusted source (e.g., arguments to a public function of a contract) is not properly vetted, with potentially harmful consequences (the type of harm varies widely). Not only is this a frequently appearing problem, with more than three times as many findings as the next most common category, it is also a serious issue in many cases, with the largest absolute number of high-low findings (10), and a fairly high percentage of high-low findings (11%). Data validation problems can sometimes be detected statically, by using taint to track unchecked user input to a dangerous operation (e.g., a user supplies an index to an array dereference), but in many cases the consequences are not obviously problematic unless one understands the purpose of the contract. Ironically, the safer execution semantics of Solidity/EVM make some problems that would clearly be security flaws in a C or C++ program harder to detect using automated tools. In a Solidity program, it is not always incorrect to allow a user to provide an array index: if the index is wrong, in many cases, the call will simply revert, and there is no general rule that contract code should never revert. From the point of view of a fuzzer or static analysis tool, distinguishing bad reverts from intended ones is difficult without guidance. Automated static or dynamic analysis to detect many of the instances of missing/incorrect data validation identified in the audits would require some user annotations, either in the form of properties or at least annotating some functions or statements as not expected to revert, but given that information would likely prove effective.
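The role of a "not expected to revert" annotation can be sketched with a toy model in Python. Everything named here (`Registry`, `Revert`, `fuzz_no_revert`) is a hypothetical illustration, not code from any audited contract or from Echidna or Manticore; the sketch only assumes that an out-of-range index causes a revert and that the user has declared `deposit` as a function that should never revert.

```python
import random


class Revert(Exception):
    """Models an EVM revert in this toy setting."""


class Registry:
    """Hypothetical contract model holding a fixed-size list of balances."""

    def __init__(self, size):
        self.balances = [0] * size

    def deposit(self, index, amount):
        # An out-of-range array index would revert in Solidity; model that here.
        # The caller-supplied index is never validated before use.
        if not 0 <= index < len(self.balances):
            raise Revert("index out of range")
        self.balances[index] += amount


def fuzz_no_revert(call, gen_args, trials=1000, seed=0):
    """Return the first generated argument tuple that makes `call` revert, else None.

    Passing a function to this helper plays the role of the user annotation
    "this function is not expected to revert".
    """
    rng = random.Random(seed)
    for _ in range(trials):
        args = gen_args(rng)
        try:
            call(*args)
        except Revert:
            return args
    return None


registry = Registry(size=4)
counterexample = fuzz_no_revert(
    registry.deposit,
    lambda rng: (rng.randrange(0, 16), rng.randrange(1, 100)),
)
```

Without the annotation, the same revert would be indistinguishable from intended behavior; with it, any reverting input becomes a reportable counterexample.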
3.2.2 Access Controls
Access control findings describe cases where use of a legitimate operation of a contract should be restricted to certain callers (the owner, minters, etc.), but access control is either faulty or not implemented at all. Most often, access control findings are cases where access control is too permissive, but nearly a third of the findings in this category are cases where the access control is overly restrictive. While there are three times as many data validation findings as access control findings, there are nearly as many high-low findings for access control as for data validation. One in four access control findings is high-low, and 42% of access control findings are high severity. In general, automatic detection of access control problems without additional specification is often plausible. In four of our findings from audits, it would suffice to check standard ERC20 token semantics, or enforce the paused state for a contract, or assume that only certain users should be able to cause a contract to self-destruct. Cases where access controls are too restrictive would require additional specification, but, given that effort, are also often likely to be handled well by property-based testing.
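As a minimal sketch of an off-the-shelf check of this kind, the following Python model tests whether a paused state is enforced. The `Token` class and the `paused_state_is_enforced` property are hypothetical, and the missing pause check in `transfer` is planted deliberately for illustration.

```python
class Token:
    """Hypothetical pausable token model; `transfer` forgets the pause check."""

    def __init__(self, owner, supply):
        self.owner = owner
        self.paused = False
        self.balances = {owner: supply}

    def pause(self, caller):
        # Correctly restricted: only the owner may pause.
        if caller != self.owner:
            raise PermissionError("only owner")
        self.paused = True

    def transfer(self, caller, to, amount):
        # Planted flaw: self.paused is never consulted.
        if self.balances.get(caller, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[caller] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount


def paused_state_is_enforced(token, caller, to, amount):
    """Property: while the token is paused, a transfer must not change any balance."""
    before = dict(token.balances)
    try:
        token.transfer(caller, to, amount)
    except (PermissionError, ValueError):
        pass  # a rejected call is acceptable
    return (not token.paused) or token.balances == before


token = Token(owner="alice", supply=100)
token.pause("alice")
property_holds = paused_state_is_enforced(token, "alice", "bob", 10)
```

The property needs no knowledge of the contract's purpose, which is why checks like this can plausibly be applied without additional specification.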
3.2.3 Race Condition
Race conditions are cases in which the behavior of a contract depends (in a clearly unintended way) on an improperly restricted ordering of operations or events. Often, the consequence of one particular unexpected ordering is clearly incorrect. The race condition category had zero high-low findings, but was responsible for seven of the 60 total high-severity findings across all audits. The top three categories (data validation, access controls, and race conditions) make up over half of the total high-severity findings. A full 41% of race conditions are high severity. Nearly half (nine) of the race condition findings concern a known issue with ERC20. These, at least, could certainly be identified automatically by a static analysis tool. Due to the nature of many blockchain race conditions, understanding the impact of the race would, in many cases, be hard for a dynamic approach.
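To make the ordering dependence concrete, the sketch below replays one widely discussed ERC20 race, in which an owner lowers an allowance while the spender's withdrawal of the old allowance is still pending. We assume, without confirmation from the text above, that this is the kind of known ERC20 issue meant; the `Allowances` model and the amounts are hypothetical.

```python
class Allowances:
    """Hypothetical model of ERC20-style approve/transferFrom bookkeeping."""

    def __init__(self):
        self.allowance = 0
        self.spent = 0

    def approve(self, amount):
        # Overwrites the allowance regardless of what was already spent.
        self.allowance = amount

    def transfer_from(self, amount):
        if amount > self.allowance:
            raise ValueError("allowance exceeded")
        self.allowance -= amount
        self.spent += amount


def total_spent(order):
    """Replay the owner's re-approval and the spender's two withdrawals in `order`."""
    state = Allowances()
    state.approve(100)  # initial approval of 100 tokens
    steps = {
        "reapprove": lambda: state.approve(50),  # owner lowers the allowance to 50
        "spend_old": lambda: state.transfer_from(100),
        "spend_new": lambda: state.transfer_from(50),
    }
    for name in order:
        try:
            steps[name]()
        except ValueError:
            pass  # a rejected transfer models a revert
    return state.spent


intended = total_spent(["reapprove", "spend_old", "spend_new"])  # 50
raced = total_spent(["spend_old", "reapprove", "spend_new"])     # 150
```

The same three transactions yield 50 or 150 tokens spent depending only on their order, which is the pattern a static detector can flag from the `approve` interface alone.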
3.2.4 Numerics
Numerics findings involve the semantics of Solidity arithmetic: Most are overflow errors, some are underflow errors, and a few involve precision losses. These findings also include cases where a “safe math” library or function is used, so there is no actual overflow/underflow resulting in an incorrect value, but the resulting revert causes problems. Three of the numerics findings are high-low (23%), and 31% are high severity. Rounding or precision (six findings) and overflow (three findings) are the most common types of numerics errors. Many rounding and overflow problems can likely be flagged for investigation using static analysis, but to determine whether the behavior is genuinely problematic would require custom properties.
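The three kinds of numerics problems can be illustrated with a short Python sketch that models 256-bit unsigned arithmetic. The function names and the fee computation in basis points are hypothetical examples, not taken from the audited contracts.

```python
UINT256_MAX = 2**256 - 1


def unchecked_add(a, b):
    """Wrapping addition modulo 2**256, as performed by the EVM ADD instruction."""
    return (a + b) & UINT256_MAX


def safe_add(a, b):
    """Addition that raises instead of wrapping, modeling a "safe math" revert."""
    result = a + b
    if result > UINT256_MAX:
        raise OverflowError("addition overflow")
    return result


def fee_divide_first(amount, rate_bps):
    # Integer division truncates before the multiplication, losing precision.
    return amount // 10_000 * rate_bps


def fee_multiply_first(amount, rate_bps):
    # Multiplying first keeps the intermediate value exact.
    return amount * rate_bps // 10_000


wrapped = unchecked_add(UINT256_MAX, 1)       # 0: silent overflow
low_fee = fee_divide_first(19_999, 30)        # 30
full_fee = fee_multiply_first(19_999, 30)     # 59
```

Note that `safe_add` removes the incorrect value but introduces a revert, which is itself a finding when the caller cannot tolerate one.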
3.2.5 Undefined Behavior
The undefined behavior category includes cases where a contract relies on unspecified or under-specified semantics of the Solidity language or the EVM, so the actual semantic intent of the contract is either currently unclear or may become unreliable in the future. Three (23%) of the undefined behavior findings are high-low, and 31% of undefined behavior findings are high severity. Undefined behavior can be simple to statically detect, when it involves actual abuse of the Solidity semantics.
3.2.6 Patching
Patching findings concern flaws in the process to upgrade or change contract behavior. The immutability of contract code on the blockchain requires the use of complex, hard-to-get-right methods to allow changes. Two (11%) of the patching findings are high-low, and 17% of patching problems are high severity. Many patching issues are complex environmental problems that would likely require human expertise to detect, though some common patterns of bad upgrade logic might be amenable to static detection, and a dynamic analysis could detect that a contract is broken after a faulty update.
3.2.7 Denial of Service
Denial of service covers findings that are not well described by another class (e.g., if lack of data validation causes denial of service, we classified that as a data validation finding), and where the consequence of a flaw is either complete shut-down of a contract or at least significant operational inefficiency. If we included all cases where denial of service is an important potential consequence of a flaw, or even the only important consequence, the category would be larger. One denial of service finding was high-low, and 20% of findings were high severity. Few denial of service findings seem easily amenable to anything less than fairly complex custom properties specifying system behavior, in part because “simple” denial of service due to some less complex cause falls under another category.
3.2.8 Authentication
Authentication findings specifically concern cases where the mechanism used to determine identity or authorization is flawed, as opposed to cases where the access rules are incorrect (which would be access control findings). That is, in authentication problems, the logic of who is allowed to do what is correct, but the determination of “who” is flawed. While only one authentication finding is high-low, fully half of all authentication problems are high severity; in fact, authentication is tied with the infamous reentrancy problem in terms of having the greatest percentage of high-severity issues. Three of the observed authentication problems are highly idiosyncratic, and may not even be automatically detectable with complex custom properties. However, the remaining problem should be dynamically detectable using “off-the-shelf” ERC20 token semantics properties.
3.2.9 Reentrancy
Reentrancy is a widely discussed and investigated flaw in Ethereum smart contracts. In a reentrancy attack, a contract calls an external contract before “internal work” (primarily state changes) is finished. Through some route, the external contract re-enters code that expected the internal work to be complete. No reentrancy problems detected in audits were high-low, but 50% of the findings were high severity. Reentrancy is a serious problem, but, due to its well-defined structure, is usually amenable to static and dynamic detection. In particular, static detection with relatively few false positives is probably already possible using Slither, for most important reentrancies.
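The structure of the attack can be simulated in a few lines of Python. The `Vault` and `Attacker` classes below are hypothetical models in which a callback stands in for the external call; they are a sketch of the pattern, not of any specific audited contract.

```python
class Vault:
    """Hypothetical vault that pays out before updating its books."""

    def __init__(self):
        self.balances = {}
        self.funds = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.funds += amount

    def withdraw(self, who, on_receive):
        amount = self.balances.get(who, 0)
        if amount == 0 or self.funds < amount:
            return
        self.funds -= amount
        on_receive(amount)      # external call happens first
        self.balances[who] = 0  # internal work finishes too late


class Attacker:
    """Re-enters `withdraw` from its receive callback, up to `max_depth` times."""

    def __init__(self, vault, max_depth):
        self.vault = vault
        self.max_depth = max_depth
        self.received = 0
        self.depth = 0

    def on_receive(self, amount):
        self.received += amount
        if self.depth < self.max_depth:
            self.depth += 1
            self.vault.withdraw("attacker", self.on_receive)


vault = Vault()
vault.deposit("victim", 90)
vault.deposit("attacker", 10)
attacker = Attacker(vault, max_depth=5)
vault.withdraw("attacker", attacker.on_receive)
```

After the call, the attacker has received 60 units against a deposit of 10. Moving the balance update above the external call removes the flaw, and the fixed call-before-update shape is what makes the pattern recognizable to static detectors.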
3.2.10 Error Reporting
Error reporting findings involve cases in which a contract does not properly report, propagate, or handle error conditions. There are no high-low error reporting findings in the audits, but 29% of error reporting findings are high severity. In some cases error reporting is a difficult category to capture without further specification, and specifying that errors should be reported or handled in a certain way generally requires the same understanding that would have produced correct code in the first place. However, ERC20 semantics make some error reporting problems easy to automatically detect. Incorrect error propagation is also usually statically detectable; however, this was not the type of error reporting problem discovered in audits.
3.2.11 Configuration
Configuration findings generally describe cases in which a faulty choice of configuration may lead to bad behavior even when the contract code itself is correct. In smart contracts, this is often related to financial effects, essentially bad market/pricing parameters. Again, there are no high-low findings in this category, but 40% of findings are high severity. Configuration problems are usually fairly subtle, or even economic/financial in nature, and detection is likely to rely on manual analysis.
3.2.12 Logic
Logic findings describe incorrect protocols or business logic, where the implementation is as intended, but the reasoning behind the intention itself is incorrect. Somewhat surprisingly, this category has no high-low findings, and only three fundamental logic flaws were described in the audits. One of the three logic flaws described was high severity, however. Based on the small number of findings it is hard to guess how often custom properties might allow dynamic detection of logic flaws. If the bad logic often leads to a violation of the expected invariants of a contract, then it can be detected, but if the fault is in the understanding of desirable invariants (which may often be the case), manual inspection by another set of expert eyes may be the only plausible detection method.
3.2.13 Data Exposure
Data exposure findings are those in which information that should not be public is made public. There are no high-low data exposure findings, but 33% of data exposure issues are high severity. Most data exposure problems are not likely to be amenable to automatic detection.
3.2.14 Timing
Timing findings concern cases (that are neither race conditions nor front-running) where manipulation of timing has negative consequences. For the most part, these findings involved assuming intervals between events (especially blocks) that may not hold in practice. One of the four timing findings (the only high severity one) was high-low. Timing problems can be amenable to automated detection in that static or dynamic analysis can certainly recognize when code depends on, for instance, the block timestamp.
3.2.15 Coding-Bug
Coding-bug is a catch-all category for problems that, whatever their consequences, amount to a “typo” in code, rather than a likely intentional error on a developer’s part. Off-by-one loop bounds that do not traverse an entire array are a simple example. There were no high-low or high-severity coding bugs in the smart contracts audited, which suggests that the worst simple coding problems may be detected by existing unit tests or human inspection of code, in the relatively small code bases of even larger smart contracts. On the other hand, 67% of coding bugs were medium severity, the second-highest rate for that severity; only one other class exceeded 41% medium-severity findings.
3.2.16 Front-Running
Front-running generalizes the financial market concept of front-running, where a trader uses advance non-public knowledge of a pending transaction to “predict” future prices and/or buy or sell before the pending state change. In smart contracts, this means that a contract 1) exposes information about future state changes (especially to a “market”) and 2) allows transactions that take advantage of this knowledge. It is both a timing and data exposure problem, but it is assigned its own category because the remedy is not always to be found in the typical approaches to these categories. Front-running is a well-known concern in smart contracts, but in fact no high-low or even high-severity front-running problems were detected in our audits. On the other hand, front-running had the largest percent of medium-severity findings (80%), so it is not an insignificant problem. Front-running, by its nature, is probably hard to detect dynamically, and very hard to detect statically.
3.2.17 Auditing and Logging
Auditing and logging findings describe inadequate or incorrect logging; in most cases these findings concern incorrect or missing contract events. There were no high-low, high-severity, or medium-severity auditing or logging findings. If explicit checks for events are included in (automated) testing, such problems can easily be detected, but if such checks are included, the important events are also likely to be present and correct, so in general this is not a great fit for dynamic analysis. On the other hand, it is often easy to statically note when an important state change is made in code, and no event is associated with the state change.
3.2.18 Missing-Logic
Missing-logic findings are cases in which—rather than incorrect logic for handling a particular set of inputs, or missing validation to exclude those inputs—there is a correct way to handle the inputs, but it is missing. Structurally, missing-logic means that code should add another branch to handle a special case of input. Interestingly, while this seems like a potentially serious issue, there were no high-low or even medium-severity missing-logic findings. The ease of detecting missing logic with custom properties depends entirely on the consequences of not handling the inputs correctly; static analysis seems very unlikely to be useful for true missing-logic problems.
3.2.19 Cryptography
Cryptography findings concern cases where incorrect or insufficient cryptography is used. In our smart contract audits, the one (low severity, high difficulty) cryptography finding concerned use of an improper pseudo-random number generator, which is something a static analysis tool can often flag in the blockchain context, where bad sources of randomness are fairly limited.
3.2.20 Documentation
Documentation findings describe cases where the contract code is not incorrect, but there is missing or erroneous documentation. As you would expect, this is never a high- or even medium-severity issue, and is not amenable to automated detection.
3.2.21 API Inconsistency
API inconsistencies are cases in which a contract’s individual functions are correct, but the calling pattern or semantics of related functionalities differs in a way likely to mislead a user and produce incorrect code calling the contract. All of these issues were informational, and while it is conceivable that machine learning approaches could identify API inconsistencies, it is not a low-hanging fruit for automated detection.
3.2.22 Code Quality
Finally, code quality issues have no semantic impact, but involve code that is hard to read or maintain. As expected, such issues are purely informational. Code quality problems in general would seem to be highly amenable to static analysis, but not to dynamic analysis, since there are no runtime consequences.
3.3 Comparison to Non-Smart-Contract Audits
It is interesting to compare the distribution of finding types for smart contract audits to that of other security audits performed by the same company. Table 3 compares smart contract audit frequencies with those for a random sample of 15 non-smart-contract audits, with categories never present in smart contract audits or only present in smart contract audits removed.
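| Category | Non-smart-contract findings | % | Change in smart contracts |
|---|---|---|---|
| denial of service | 23 | 30% | -26% |
| access controls | 14 | 18% | -8% |
| undefined behavior | 7 | 9% | -4% |
| authentication | 5 | 6% | -4% |
| auditing and logging | 2 | 3% | +1% |
| error reporting | 1 | 1% | +2% |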
The largest changes are categories of findings that are common in other audits, but not common in smart contracts. One of these, denial of service, may be primarily due to the re-categorization of denial of service findings with a clear relevance to another category in the smart contract findings. Changing the five findings whose type was clarified back to denial of service still leaves a significant gap, however. This is likely due to the different nature of interactions with the network in non-smart-contract code; in a sense, many denial of service problems and solutions are delegated to the general Ethereum blockchain, so individual contracts have less responsibility and thus fewer problems.
A more general version of the same difference likely explains why configuration problems are far less prevalent in smart contract code. At heart, smart contracts are more specialized and focused, and live in a simple environment (e.g., no OS/network interactions), so the footprint of configurations, and thus possible mis-configurations, is smaller. Similarly, the temptation to roll your own cryptography in a smart contract is much smaller. For one thing, implementing any custom cryptography in Solidity would be impractical enough to daunt even those unwise enough to attempt it, and gas costs would be prohibitive. Data validation is also easier in a world where, for the most part, transactions are the only inputs. Data exposure problems are probably less common because it is well understood that information on the blockchain is public, so the amount of data that is presumed unexposed is much smaller, or, in many cases, non-existent.
4 Threats to Validity
Contracts submitted for audit varied in their level of maturity; some assessments were performed on contracts essentially ready for release (or already released) that reflected the final stage of internal quality control processes. Others were performed on much more preliminary implementations and designs. This does not invalidate the findings, but some flaw types may be more prevalent in less polished contracts. Of course, the primary threat to validity is that the data is all drawn from a set of 23 audits performed by one company over a period of about two years. We address this concern in Section 7.
5 Discussion: How to Find Flaws in Smart Contracts
5.1 Property-Based Testing and Symbolic Execution
Property-based testing [6, 16, 13] involves 1) a user defining custom properties (usually, in practice, reachability properties declaring certain system states or function return values as “bad”), and then 2) using either fuzzing or symbolic execution to attempt to find inputs or call sequences violating the properties. Some variant of property-based testing is a popular approach to smart contract analysis. Automated testing with custom properties is both a significant low-hanging fruit and anything but a panacea. Of the 246 findings, only 91 could possibly be labeled as detectable with user-defined properties, or with automated testing for standard semantics of ERC20 tokens and other off-the-shelf dynamic checks. On the other hand, 17 of the 27 most important findings (high severity, low difficulty) were plausibly detectable using such properties. While not effective for some classes of problems, analysis using custom properties (and thus, likely, dynamic rather than static analysis) might have detected over 60% of the most important findings. This mismatch between the overall (37%) and high-low (63%) percentages of findings amenable to property-based testing is likely due to the fact that categories almost never detectable by automated testing (code quality, documentation, auditing and logging) are seldom high-low, and those where it is most effective (data validation, access controls, and numerics) constitute a large portion of the total set of high-low findings. Also, intuition tells us that if a finding has major detrimental consequences (high severity) but is not extremely hard to exploit (low difficulty), this is precisely the class of problems a set of key invariants plus effective fuzzing or symbolic execution is suited to find.
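The two steps of property-based testing can be sketched in Python as an invariant plus a random call-sequence search. This is a simplified stand-in for what a fuzzer does, not the algorithm of Echidna or Manticore; the `Token` model, its planted bug in `burn`, and the `find_violation` helper are all hypothetical.

```python
import random


class Token:
    """Hypothetical token model whose `burn` forgets to update total_supply."""

    def __init__(self):
        self.total_supply = 0
        self.balances = {}

    def mint(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.total_supply += amount

    def burn(self, who, amount):
        held = self.balances.get(who, 0)
        burned = min(held, amount)
        self.balances[who] = held - burned
        # Planted flaw: total_supply is not decreased by `burned`.


def invariant(token):
    """Step 1, the user-defined property: balances always sum to total_supply."""
    return sum(token.balances.values()) == token.total_supply


def find_violation(trials=200, max_len=6, seed=1):
    """Step 2, random call-sequence search: return the first sequence breaking the invariant."""
    rng = random.Random(seed)
    for _ in range(trials):
        token, sequence = Token(), []
        for _ in range(rng.randint(1, max_len)):
            call = (rng.choice(["mint", "burn"]), rng.choice("ab"), rng.randint(1, 9))
            sequence.append(call)
            getattr(token, call[0])(call[1], call[2])
            if not invariant(token):
                return sequence
    return None


violation = find_violation()
```

The search only succeeds because someone wrote down the right invariant; a flaw for which no property has been stated remains invisible to this approach, which is the limitation discussed above.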
5.2 Static Analysis
The full potential of static analysis is harder to estimate. Four of the issues in these findings were definitely detected using the Slither static analysis tool, which has continued to add new detectors and fix bugs since the majority of the audits were performed. Of these four issues, one was high severity, undetermined difficulty, a classic reentrancy. An additional four issues are certainly detectable using Slither (these involve deletion of mappings, which is also the root issue in one of the findings that was definitely detected by Slither). Some of the overflow/underflow problems, as noted above, might also be statically detectable if false positives are allowed. There are likely other individual findings amenable to static analysis, but determining the practicality of such detection is in some ways more difficult than with dynamic analysis using a property-based specification. The low-hanging fruit for static analysis is general patterns of bad code, not reachability of a complex bad state. While some cases in which we speculate that a finding is describable by a reachability property may not, in fact, prove practical—current tools may have too much trouble generating a transaction sequence demonstrating the problem—it is fairly easy to determine that there is indeed an actual state of the contract that can be identified with the finding. Whether a finding falls into a more general pattern not currently captured by, for instance, a Slither detector, is harder to say, since the rate of false positives and scalability of precision needed to identify a problem is very hard to estimate. Our conservative guess is that perhaps 65 of the 246 findings (26%), and 9 of the high-low findings (33%), are plausibly detectable by static analysis. 
While these are lower percentages than for dynamic approaches, the effort required is much, much lower: The dynamic analysis usually depends on a user actually thinking of, and correctly implementing, the right property, as well as a tool reaching the bad state. For the statically detectable problems, issues like those in these findings would almost always be produced, without any effort other than running the static analysis tool.
5.3 Unit Testing
There was no additional unit testing as part of the security audits performed. It is therefore impossible to say how effective adding unit tests would be in discovering flaws during audits, based on this data. However, it is possible to examine the relationship between pre-existing unit tests and the audit results. Fourteen of the contracts audited had what appeared to be considerable unit tests; it is impossible to determine the quality of these tests, but there was certainly quantity, and significant development effort. Two of the contracts had moderate unit tests; not as good as the 14 contracts in the first category, but still representing a serious effort to use unit testing. Two contracts had modest unit tests: non-trivial, but clearly far from complete tests. Three had weak unit tests; technically there were unit tests, but they are practically of almost no value in checking the correctness of the contract. Finally, two contracts appeared to have no unit tests at all. Did the quantity of unit tests have an impact on audit results? If so, the impact was far from clear. The contracts that appeared to lack unit tests had nine and four findings, respectively: fewer than most other contracts. The largest mean number of issues (11.5) was for contracts with modest unit tests, but essentially the mean finding counts for considerable (11.1), moderate (10.5), modest (11.5), and weak (11) unit tests were indistinguishable. Furthermore, restricting the analysis to counting only high-severity findings also produces no significant correlation. For total findings, the Kendall τ correlation is an extremely weak 0.09, with a p-value of 0.61, indicating even this correlation is likely to be pure chance. For high-severity findings, the correlation drops to 0.05, with a p-value of 0.78. Note further that these weak/unsupported correlations are in the “wrong” direction: more unit testing weakly accompanies more findings, not fewer.
It seems reasonable to say that having extensive unit tests is not a highly effective way to avoid the problems that are often found in a high-quality security audit.
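The rank correlation used above can be sketched in a few lines. The following is a minimal illustration, not the analysis script behind the reported numbers: it computes the tie-corrected Kendall τ (tau-b) only, with no p-value, and both the ordinal coding of unit-test quantity (0 = none up to 4 = considerable) and the finding counts in the usage note are hypothetical placeholders rather than the audit data.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall tau-b rank correlation, which corrects for tied values.

    Returns NaN when one variable is constant (the statistic is undefined).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two points")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    denominator = sqrt(
        (concordant + discordant + tied_x_only)
        * (concordant + discordant + tied_y_only)
    )
    if denominator == 0:
        return float("nan")
    return (concordant - discordant) / denominator


if __name__ == "__main__":
    # HYPOTHETICAL data: test-quantity level per contract and findings per contract.
    test_level = [4, 4, 4, 3, 2, 1, 1, 0, 0]
    findings = [12, 9, 14, 10, 11, 13, 8, 9, 4]
    print(f"tau-b = {kendall_tau_b(test_level, findings):.2f}")
```

Tau-b is chosen here because a five-level ordinal scale applied to 23 contracts necessarily produces many ties; whether the original analysis used this variant is an assumption.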
5.4 Manual Analysis
With few exceptions, the findings all demonstrate the effectiveness of manual analysis. Expert attention from experienced external auditors can reveal serious problems even in well-tested code bases. While four of the audits produced no high-severity findings, 11 audits found three or more high-severity issues. As far as we can tell, all of the high-low severity issues were the result of manual analysis alone, though there were recommendations for how to use tools to detect/confirm correction in some cases.
The set of findings that could possibly be detected by either dynamic or static analysis is slightly more than 50% of all findings, and, most importantly, includes 21 of the 27 high-low findings. That is, making generous assumptions about scalability, property-writing, and willingness to wade through false positives, a skilled user of both static and dynamic tools could detect more than three out of four high-low issues. Note that the use of both approaches is key: 61 findings overall and 12 high-low findings are likely to be detectable only dynamically, while 35 findings, four of them high-low, are likely to be found only using static analysis.
While static analysis alone is significantly less powerful than manual audits or dynamic analysis, the low effort required, and thus the favorable cost-benefit ratio, makes the use of all available high-quality static analysis tools an obvious recommendation. (Also, printers and code understanding tools often provided by static analyzers make manual audits more effective.) Some of the findings in these audits could have been easily detected even by developers using then-current versions of the best tools.
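The high-low counts quoted above can be split into disjoint groups with simple arithmetic. The sketch below uses only the four numbers stated in the text (27 high-low findings, 21 detectable by some tool, 12 dynamic-only, four static-only); the function and key names are illustrative, and the "either technique" and "manual only" figures are derived here rather than quoted from the paper.

```python
def detection_breakdown(total, detectable, dynamic_only, static_only):
    """Split findings into disjoint groups by which technique could detect them."""
    either_technique = detectable - dynamic_only - static_only
    if either_technique < 0 or detectable > total:
        raise ValueError("inconsistent counts")
    return {
        "dynamic_only": dynamic_only,
        "static_only": static_only,
        "either_technique": either_technique,  # detectable both ways
        "manual_only": total - detectable,     # out of reach of both tools
        "detectable_fraction": detectable / total,
    }


if __name__ == "__main__":
    high_low = detection_breakdown(
        total=27, detectable=21, dynamic_only=12, static_only=4
    )
    for name, value in high_low.items():
        print(name, round(value, 3) if isinstance(value, float) else value)
```

Under these counts, five high-low findings would be detectable by either technique and six by neither, so roughly 78% are within reach of tools and roughly 22% are not.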
When 35% of high-severity findings are not likely to be detected even with considerable tool improvement and manual effort to write correctness properties, it is implausible to claim that tools will be a “silver bullet” for smart contract security. It is difficult, at best, to imagine that nearly half of the total findings and almost 25% of the high-low findings would be detected even with high-effort, high-expertise construction of custom properties and the use of better-than-state-of-the-art dynamic and static analysis. Therefore, manual audits by external experts will remain a key part of serious security and correctness efforts for smart contracts for the foreseeable future.
On the other hand, the gap between current tool-based detection rates (very low) and our estimated upper limit on detection rates (50% of all issues, and over 75% of the most important issues) suggests that there is a large potential payoff from improving state-of-the-art standards for analysis tools and putting more effort into property-based testing. The experience of the security community using AFL, libFuzzer, and other tools also suggests that there are “missing” findings. The relatively immature state of analysis tools when most of these audits were performed likely means that bugs unlikely to be detected by human reasoning were probably not detected. The effectiveness of fuzzing in general suggests that such bugs likely exist in smart contracts as well, especially since the most important target category of findings for dynamic analyses, data validation, remains a major source of smart contract findings. In fact, a possible additional explanation for the gap between the 36% share of data validation findings in smart contract audits and the 51% share in non-smart-contract audits could be that non-smart-contract audits have access to more powerful fuzzers. Eliminating the low-hanging fruit for automated tools will give auditors more time to focus on the vulnerabilities that require humans in the loop and specialized skills. Moreover, effort spent writing custom properties is likely to pay off, even if dynamic analysis tools are not yet good enough to produce a failing test. Just understanding what invariants should hold is often enough to alert a human to a flaw.
Finally, while it is impossible to make strong claims based on a set of only 23 audits, it seems likely that unit tests, even quite substantial ones, do not provide an effective strategy for avoiding the kinds of problems detected during audits. Unit tests, of course, have other important uses, and should be considered an essential part of high-quality code development, but developer-constructed manual unit tests may not really help detect high-severity security issues. It does seem likely that the effort involved in writing high-quality unit tests would be very helpful in dynamic analysis: Generalizing from unit tests to invariants and properties for property-based testing seems likely to be an effective way to detect some of what the audits exposed.
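The step from a unit test to an invariant can be illustrated with a toy example. Everything in this sketch is hypothetical: the `Ledger` classes are a stand-in written for illustration, not code from any audited contract, and a seeded `random` loop replaces a real property-based testing tool (such as Echidna for Solidity or Hypothesis for Python) only to keep the example dependency-free. The buggy variant contains a data validation flaw that a typical hand-written unit test does not exercise.

```python
import random


class Ledger:
    """Toy token ledger (hypothetical, for illustration only)."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def transfer(self, src, dst, amount):
        if amount < 0 or amount > self.balances[src]:
            raise ValueError("invalid amount")
        self.balances[src] -= amount
        self.balances[dst] += amount


class BuggyLedger(Ledger):
    """Reads both balances up front, so a self-transfer mints tokens."""

    def transfer(self, src, dst, amount):
        src_balance, dst_balance = self.balances[src], self.balances[dst]
        if amount < 0 or amount > src_balance:
            raise ValueError("invalid amount")
        self.balances[src] = src_balance - amount
        self.balances[dst] = dst_balance + amount  # overwrites the debit if src == dst


def unit_test(ledger_cls):
    """A typical example-based test: one transfer between two distinct accounts."""
    ledger = ledger_cls({"alice": 100, "bob": 0})
    ledger.transfer("alice", "bob", 30)
    return ledger.balances == {"alice": 70, "bob": 30}


def find_violation(ledger_cls, steps=1000, seed=0):
    """Random transfers; check that supply is conserved and no balance is negative.

    Returns the first violating (step, src, dst, amount), or None if none is found.
    """
    rng = random.Random(seed)
    ledger = ledger_cls({"alice": 100, "bob": 100, "carol": 100})
    supply = sum(ledger.balances.values())
    accounts = sorted(ledger.balances)
    for step in range(steps):
        src, dst = rng.choice(accounts), rng.choice(accounts)
        amount = rng.randint(0, 150)
        try:
            ledger.transfer(src, dst, amount)
        except ValueError:
            continue  # a rejected transfer is acceptable behavior
        balances = ledger.balances.values()
        if sum(balances) != supply or min(balances) < 0:
            return (step, src, dst, amount)
    return None


if __name__ == "__main__":
    for cls in (Ledger, BuggyLedger):
        print(cls.__name__, "unit test passes:", unit_test(cls),
              "| violation:", find_violation(cls))
```

Both classes pass the example-based test, while the invariant check is expected to flag only the buggy one. A real tool would additionally shrink the failing input to a minimal case.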
7 Audits From Other Companies
In order to partially validate our findings, we also performed an analysis of audits prepared by two other leading companies in the field, ChainSecurity and ConsenSys Diligence. While differences in reporting standards and categorizations, and the fact that we do not have access to unpublished reports (which could bias statistics), make it difficult to analyze these results with the same confidence as our own reports, the overall picture that emerged was broadly compatible with our conclusions. The assignment of findings to semantically equivalent difficulties and severities, and the assessment of potential for automated analysis methods, was performed by a completely independent team. The results summarized here are for 225 findings in public reports for ChainSecurity and 168 from ConsenSys Diligence, over 19 and 18 audits, respectively. Appendix B provides detailed results on these findings.
First, the potential of automated methods is similar. For ChainSecurity, 39% of all issues were plausibly detectable by dynamic analysis (e.g., property-based testing, possibly with a custom property), and 22% by automated static analysis. For ConsenSys Diligence, those numbers were 41% and 24%. Restricting our interest to high-low findings, the percentages were 67% and 63% for dynamic analysis and 11% and 38% for static analysis, respectively. Combining both methods, the potential detection rates were 51% and 52% for all findings, and 67% and 75% for high-low findings. The extreme similarity of these results to ours affirms that our results concerning detection methods are unlikely to be an artifact of our audit methods or the specific set of contracts we audited.
Second, while the category frequencies were quite different from those in our audits (e.g., more numerics and access control problems, fewer data validation problems), there were no new categories, and all of our categories were present (though ChainSecurity found no race conditions). Contrary to what previous literature might lead one to suspect, reentrancy was again neither a prominent source of high-low problems nor even a notably common problem (there was only one high-low reentrancy between the two companies).
Understanding how best to protect high-value smart contracts against attackers (and against serious errors by non-malicious users or the creators of the contract) is difficult in the absence of information about the actual problems found in high-value smart contracts by experienced auditors using state-of-the-art technologies. This paper presents a wealth of empirical evidence to help smart-contract developers, security researchers, and security auditors improve their understanding of the types of faults found in contracts, and the potential for various methods to detect those faults. Based on an in-depth examination of 23 paid smart contract audits performed by Trail of Bits, validated by a more limited examination of public audits performed by ChainSecurity and ConsenSys Diligence, we conclude that 1) the literature is somewhat misleading with respect to the most important kinds of smart contract flaws, which are more like flaws in other critical code than one might think; 2) there is likely a large potential payoff in making more effective use of automatic static and dynamic analyses to detect the worst problems in smart contracts; 3) nonetheless, many key issues will never be amenable to purely automated or formal approaches; and 4) high-quality unit tests alone do not provide effective protection against serious contract flaws. As future work, we plan to extend our analysis of other companies’ audits to include unit test quality, and examine issues that cut across findings categories, such as the power of ERC20 standards to help find flaws.
-  Attack vector on ERC20 API. Note: https://github.com/ethereum/EIPs/issues/20#issuecomment-263524729 Cited by: §3.2.3.
-  (2017) A survey of attacks on Ethereum smart contracts (SoK). In International Conference on Principles of Security and Trust, pp. 164–186. Cited by: §2, §3.2.9.
-  On contract popularity analysis. Note: https://github.com/smartanvil/smartanvil.github.io/blob/master/_posts/2018-03-14-on-contract-popularity-analysis.md Cited by: §1.
-  (2018) Vandal: a scalable security analysis framework for smart contracts. CoRR abs/1809.03981. Cited by: §2.
-  (2013) Ethereum: a next-generation smart contract and decentralized application platform. Note: https://github.com/ethereum/wiki/wiki/White-Paper Cited by: §1.
-  (2000) QuickCheck: a lightweight tool for random testing of Haskell programs. In International Conference on Functional Programming (ICFP), pp. 268–279. Cited by: §5.1.
-  (2017) Mythril: a security analysis tool for ethereum smart contracts. Note: https://github.com/ConsenSys/mythril-classic Cited by: §2.
-  (2017) Ethereum smart contracts: security vulnerabilities and security tools. Master’s Thesis, NTNU. Cited by: §2.
-  SmartAnvil: open-source tool suite for smart contract analysis. Technical Report hal-01940287, HAL. Cited by: §2.
-  (2019) Slither: a static analysis framework for smart contracts. In International Workshop on Emerging Trends in Software Engineering for Blockchain. Cited by: §2, §3.1, §6.
-  (2018) A semantic framework for the security analysis of ethereum smart contracts. Note: arXiv:1802.08660. Accessed: 2018-03-12. Cited by: §2.
-  (2018) EtherTrust: sound static analysis of ethereum bytecode. Cited by: §2.
-  (2018) TSTL: the template scripting testing language. International Journal on Software Tools for Technology Transfer 20 (1), pp. 57–78. Cited by: §5.1.
-  (2018) TeEther: gnawing at ethereum to automatically exploit smart contracts. In USENIX Security. Cited by: §2.
-  (2016) Making smart contracts smarter. CCS ’16. Cited by: §2.
-  (2013-03) Hypothesis: test faster, fix more. Note: http://hypothesis.works/ Cited by: §5.1.
-  (2018) Security vulnerabilities in ethereum smart contracts. In Proceedings of the 20th International Conference on Information Integration and Web-based Applications & Services, iiWAS2018, New York, NY, USA, pp. 375–380. Cited by: §2.
-  Manticore: a user-friendly symbolic execution framework for binaries and smart contracts. In IEEE/ACM International Conference on Automated Software Engineering. Note: accepted for publication. Cited by: §3.1.
-  (2018) Finding the greedy, prodigal, and suicidal contracts at scale. In ACSAC. Cited by: §2.
-  (2019) Smart contract vulnerabilities: does anyone care?. Cited by: §2.
-  (June 18, 2016 (accessed on Jan 10, 2019)) Analysis of the DAO exploit. Note: http://hackingdistributed.com/2016/06/18/analysis-of-the-dao-exploit/ Cited by: §1.
-  (2009) Error propagation analysis for file systems. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 270–280. Cited by: §3.2.10.
-  (2018) SmartCheck: static analysis of ethereum smart contracts. WETSEB. Cited by: §2.
-  (2019) Exploring the attack surface of blockchain: a systematic overview. arXiv preprint arXiv:1904.03487. Cited by: §2.
-  (Oct 8, 2018 (accessed on Jan 10, 2019)) We got spanked: what we know so far. Note: https://medium.com/spankchain/we-got-spanked-what-we-know-so-far-d5ed3a0f38fe Cited by: §1.
-  (2017) Manticore: symbolic execution for humans. Note: https://github.com/trailofbits/manticore Cited by: §3.1.
-  (2018) Echidna: ethereum fuzz testing framework. Note: https://github.com/trailofbits/echidna Cited by: §3.1.
-  (2019) Analysis of external audits. Note: https://github.com/trailofbits/publications/tree/master/datasets/smart_contract_audit_findings/other_audit_sources Cited by: §7, Appendix B: Detailed Results for ChainSecurity and ConsenSys Diligence Smart Contract Audits.
-  (2019) Smart contract audit findings. Note: https://github.com/trailofbits/publications/tree/master/datasets/smart_contract_audit_findings Cited by: §3.2.
-  (2019) Trail of bits security reviews. Note: https://github.com/trailofbits/publications#security-reviews Cited by: §3.3, §3.
-  (2018) Securify: practical security analysis of smart contracts. CCS ’18. Cited by: §2.
-  (2014) Ethereum: a secure decentralised generalised transaction ledger. Note: http://gavwood.com/paper.pdf Cited by: §1.
Appendix A: Raw Counts for Finding Categories
This table provides exact counts for categories, and severities within categories, for our analysis.
|denial of service||10||1||2||3||3||2||0||5||0||4||1|
|auditing and logging||9||0||0||0||3||4||2||3||0||5||1|
Appendix B: Detailed Results for ChainSecurity and ConsenSys Diligence Smart Contract Audits
The process for analyzing findings in other companies’ audits involved 1) mapping the category of the finding to our set, which was not always simple or obvious, and 2) translating a different formulation of worst-case impact and probability estimation into our high-low severity and high-low difficulty schemes. For information on the original categorizations of issues (using a different severity and likelihood scheme), see the full data set online. A potential source of bias in these results is that we do not know the results for non-public audits for these companies; for our own audits, however, there was no obvious difference between public and non-public audits. Due to the lack of access to source code versions associated with audits, we were unfortunately unable to correlate unit test quality at time of audit with issue counts for external audits.
Note that both companies reported a large number of code quality issues that would not have been considered findings at all in our own audits, but simply noted in a Code Quality appendix to an audit report. We removed 66 and 168 such relatively trivial (“lint-like”) findings, respectively, for ChainSecurity and ConsenSys Diligence; including these would greatly increase the counts for informational issues and the code-quality category.
The first two tables show severity and difficulty distributions for finding categories for other company audits, as in Table 1. In all cases, the first table in each pair of tables is for ChainSecurity, and the second is for ConsenSys Diligence.
|denial of service||5%||0%||17%||25%||58%||0%||0%||67%||33%||0%||0%|
|auditing and logging||3%||0%||0%||14%||29%||57%||0%||14%||0%||86%||0%|
|denial of service||2%||0%||0%||67%||0%||0%||0%||0%||33%||33%||0%|
|auditing and logging||3%||0%||0%||0%||0%||100%||0%||0%||0%||100%||0%|
The next two tables show absolute severity and difficulty counts for finding categories for other company audits, as in Appendix A.
|denial of service||12||0||2||3||7||0||0||8||4||0||0|
|auditing and logging||7||0||0||1||2||4||0||1||0||6||0|
|denial of service||3||0||0||2||0||0||0||0||1||1||0|
|auditing and logging||5||0||0||0||0||5||0||0||0||5||0|
The final two tables report the estimated automated dynamic and static analysis detection potential for the categories in the other companies’ audits.
|Category||% Dynamic||% Static||Category||% Dynamic||% Static|
|denial of service||33%||25%||front-running||0%||0%|
|configuration||29%||0%||auditing and logging||0%||0%|
|Category||% Dynamic||% Static||Category||% Dynamic||% Static|
|coding-bug||50%||10%||denial of service||67%||0%|
|error reporting||20%||40%||auditing and logging||0%||0%|