Vulnerability Detection in Open Source Software: An Introduction

by Stuart Millar, et al.

This paper is an introductory discussion of the causes of open source software vulnerabilities, their importance in the cybersecurity ecosystem, and a selection of detection methods. A recent application security report showed 44% of applications contain critical vulnerabilities in an open source component, a concerning proportion. Most companies do not have a reliable way of being directly and promptly notified when zero-day vulnerabilities are found and when patches are made available. This means attack vectors in open source exist longer than necessary. Conventional approaches to vulnerability detection are outlined alongside some newer research trends. A conclusion is made that it may not be possible to entirely replace expert human inspection of open source software, although it can be effectively augmented with techniques such as machine learning, IDE plug-ins and repository linking to make implementation and review less time intensive. Underpinning any technological advances should be better knowledge at the human level: development teams need to be trained, coached and improved so they can implement open source more securely, know what vulnerabilities to look for and how to handle them. It is the use of this blended approach to detection which is key.





I Introduction

Open source software (OSS) is developed collaboratively in the public domain with a licence granting rights to the user base that are usually reserved for copyright holders. A well-known OSS licence is the GNU General Public Licence that allows free distribution under the condition that further developments are also free. In a globally connected software society, a sizeable amount of development work is effectively crowd-sourced to an international community of OSS developers, with little awareness of the potential security problems this creates [12]. OSS libraries increase development speed but there is a tangible increase in risk also, with the Heartbleed bug in OpenSSL being a prime example. Research into vulnerability detection in OSS is crucial given its prevalence with many companies using vulnerable OSS components and vulnerable libraries being repackaged in software. This OSS uptake shows no sign of reversing or slowing, with a recent survey [2] indicating that 43% of respondents think OSS is superior to its commercial equivalent.

At the crux of OSS vulnerability is that today’s applications commonly use thirty or more libraries, which in turn can comprise up to 80% of the code in any such application. These libraries have the same full privileges as the applications that use them, letting them access data, write to files or send data to the internet. Anything the application can do, the library can do. Some estimate that custom-built Java applications contain 5-10 vulnerabilities per 10,000 lines of code [7]. A library can have on average 10,000 to 200,000 lines of code, so the chances a library has never had a vulnerability are very slim; indeed, it is perhaps more likely that it has never even been examined for vulnerabilities. Hence libraries with no reported vulnerabilities should not automatically be considered ‘safe’. Most vulnerabilities are undiscovered, and it might be said the only way to deal with the risk of unknown vulnerabilities is to have someone who understands security manually analyse the source code. Tool support provides hints but is not a replacement for experts, because the lack of context within libraries makes it very difficult for current tools, at the time of writing, to conclusively identify vulnerabilities.

This paper contributes a short introductory discussion on vulnerability detection in OSS, and is organised as follows: Section II is a background overview and Section III contains information on conventional and emerging detection methods. Conclusions are presented with ideas for future work in Section IV.

II Background

A study carried out in 2012 [7] found that more than 50% of the Fortune Global 500 companies have downloaded vulnerable OSS components, security libraries and web frameworks. This report analysed 113 million Java framework and security library downloads by more than 60,000 commercial, government and non-profit organisations from the Central Repository. Central is the software industry's most widely used repository of OSS, with more than 300,000 libraries. It was found that the vast majority of library flaws remain undiscovered, that the presence of a vulnerability (or the absence of one) is not a security indicator, and that typical Java applications are likely to include at least one vulnerable library. Furthermore, the same study showed most organisations did not have a strong process in place for ensuring the libraries they rely upon are up-to-date and free from vulnerabilities. The authors of [7] stress there are no shortcuts, and go as far as saying the only useful indicator of library security is a thorough review that finds minimal vulnerabilities – in other words, software assurance, the measure of how safe the software is to use, needs to be generated internally. One might say this is surprising, as in many other product or service industries this assurance – consider it a kind of warranty or seal of approval – is offered by the supplier without hesitation to help build trust and sell to the customer. [11] adopt a similar stance, agreeing that recurring vulnerabilities in software are due to reuse. This reuse includes the same code base with an identical or very similar code structure, method calls and variables. Interestingly, these attributes form the basis of a proposed method of detecting unreported vulnerabilities in one system by consulting knowledge of reported vulnerabilities in other systems that reuse the same code.

Linus’ Law [16] is often quoted in relation to OSS: “given enough eyeballs, all bugs are shallow”, meaning that with a large enough number of developers looking at code, errors will be found. However, this can be questioned from a scientific viewpoint, and an empirical study of Linus’ Law appeared to show more collaboration meant more vulnerabilities, not fewer. [8] found that files with changes from nine or more developers were sixteen times more likely to have a vulnerability than files changed by fewer than nine developers. Thus the inherent collaborative nature of OSS creates potentially unavoidable vulnerabilities that require addressing.

III OSS Vulnerability Detection

III-A Conventional Detection Methods

There is not currently an abundance of publications on vulnerability detection in OSS. However, those written thus far describe three conventional methods: static analysis, dynamic analysis and code reviews.

III-A1 Static Analysis

Many black-box static analysis techniques and tools scan source code and detect vulnerabilities in software after it has been written, which encourages late detection and produces a lot of false positives. [13] explicitly referenced the cut and thrust of the software development process, saying external static tools for secure programming do not fit into such a workflow, since they do not work within the IDE and are retrospective. [17] concur that static analysis produces high levels of false positives, as do [5] and [10]. [14] point out it is hard to know which vulnerabilities a static analysis tool deals with, and that there are difficulties in obtaining and maintaining up-to-date tooling.

[4] specifically wrote about the capability of static code analysis to detect vulnerabilities, concluding that tools are not effective. They tested three widely used commercial tools and found 27% of C/C++ vulnerabilities and 11% of Java vulnerabilities in their dataset were missed by all three. In some cases, they were comparable to or worse than random guessing. They too make the point about tools being prone to false positives, and this consolidates the need to find other methods of detection rather than rely solely on static analysis. That is not to say static analysis is of little use, as some compliance regulations require inventories of OSS components so that risks can be addressed. Static tools can scan open source code and create an inventory, so when a new vulnerability is disclosed, it is known which applications use the vulnerable OSS. The OWASP Dependency-Check tool [9] analyses code and creates reports on associated CVE entries.
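As a minimal illustration of the inventory idea (not the actual Dependency-Check implementation), the sketch below matches a project's dependency list against a hand-made stand-in for a CVE feed; all library names and versions here are invented:

```python
# Hypothetical sketch: match a project's dependency inventory against a
# local set of known-vulnerable library versions (a stand-in for CVE data).
# Names and versions are illustrative only, not taken from any real feed.

KNOWN_VULNERABLE = {
    ("commons-collections", "3.2.1"),
    ("struts2-core", "2.3.15"),
}

def scan_inventory(dependencies):
    """Return the sorted subset of (name, version) pairs with a known vulnerability."""
    return sorted(dep for dep in dependencies if dep in KNOWN_VULNERABLE)

inventory = [
    ("commons-collections", "3.2.1"),
    ("guava", "31.1"),
]
flagged = scan_inventory(inventory)
```

The value of such an inventory is that when a new vulnerability is disclosed, the affected applications can be identified immediately rather than re-scanned from scratch.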

III-A2 Dynamic Analysis

White-box dynamic analysis can also be called run-time analysis. Fuzzing is often used here, where inputs are changed using random values to detect unwanted behavior [14]. [6] researched the nuances of how vulnerabilities were discovered by researchers, and how those same researchers shared their findings with the OSS community. They found running a fuzzer and debugging was the chosen method for developers exploring binary executables to find buffer overflows. Vulnerability researchers tend to make their own fuzzing tools, seeing it as part of the learning process and preferring this approach over more systematic exploration methods. [5] notes the usefulness of fuzzing, and that it needs only basic knowledge to undertake. However fuzzing does not allow the control of program execution, large campaigns are needed for results, and it is time consuming. [17] contends fuzzing doesn’t scale if dynamic symbolic execution is used, as it explores code paths simultaneously which could create large workloads. Symbolic execution uses symbolic values for variables instead of concrete values to execute all paths in a program.
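A minimal random fuzzer conveys the idea; the target below is a deliberately buggy toy parser written for this sketch, not a real program, and real fuzzing campaigns run far more iterations with instrumentation:

```python
import random

def toy_parser(data: bytes) -> int:
    # Deliberately buggy stand-in: crashes on inputs longer than 8 bytes
    # that start with b"!".
    if data[:1] == b"!" and len(data) > 8:
        raise ValueError("unexpected internal state")
    return len(data)

def fuzz(target, iterations=10_000, seed=0):
    """Feed random byte strings to the target and record crashing inputs."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(16)))
        try:
            target(data)
        except ValueError:
            crashes.append(data)
    return crashes

crashes = fuzz(toy_parser)
```

Even this naive approach finds the planted bug, which illustrates why fuzzing needs only basic knowledge to undertake, while also hinting at its cost: most iterations are wasted on uninteresting inputs.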

III-A3 Manual Code Reviews

These involve manual inspection of the source code in a white-box manner. Consequently, this method requires a lot of human effort [10]. Working on source code manually does however detect vulnerabilities [6], and recall that [7] argued code reviews, conducted by someone with appropriate security knowledge, may be the only way to properly deal with vulnerabilities.

Fig. 1: Hub and spoke layout for distributed demand-driven security testing

III-B Emerging Detection Methods

The main issues with the conventional methods are that static analysis produces too many false positives, dynamic analysis does not scale, and code reviews are time consuming. Research into newer methods tries to address these problems via some interesting and novel approaches.

III-B1 Distributed demand-driven security testing

Proposed by [17], this involves many clients using OSS and one main testing server, in a hub and spoke style per Figure 1. When a new path in a program is about to be exercised by user input, it is sent to the testing hub for security testing. Symbolic execution is applied to the execution trace to check potential vulnerabilities on this new path, and if one is detected then a signature is generated and updated back to all the clients for protection. If a path exercised by an input triggers any vulnerability that has already been detected, the execution is terminated. This allows testing to focus on paths being used and helps stop attackers exploiting unreported vulnerabilities at a client site.
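As a rough sketch of the client-side logic just described (all details here, including the signature scheme, are assumptions for illustration rather than the authors' implementation):

```python
import hashlib

# Client-side state in the hub-and-spoke scheme (illustrative sketch).
known_vulnerable = set()   # signatures pushed down from the testing hub
tested_paths = set()       # paths already sent for vetting
to_server = []             # new path signatures queued for the hub

def path_signature(branch_trace):
    """Summarise a path as a hash of its branch sequence (hypothetical scheme)."""
    return hashlib.sha256(",".join(branch_trace).encode()).hexdigest()

def on_new_path(branch_trace):
    sig = path_signature(branch_trace)
    if sig in known_vulnerable:
        return "terminate"       # path triggers an already-detected vulnerability
    if sig not in tested_paths:
        to_server.append(sig)    # hub will symbolically execute this trace
        tested_paths.add(sig)
    return "continue"
```

The key property is that testing effort concentrates on paths users actually exercise, and the hub's findings protect every client once a signature is distributed.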

However, questions remain over how to handle large time and space overheads at client sites, how sensitive data is transmitted and handled, and actual implementation details are scarce. That said, the principle of increasing test coverage of important paths as users exercise them is sound, and [17] offers an interesting conclusion that machine learning can in future identify patterns of bugs at the testing server and use them to predict problematic code.

III-B2 Use of Execution Complexity Metrics

[15] examined complexity metrics collected during code execution, considering them potential indicators of vulnerable code locations. Table I describes these metrics. They measured the frequency of function calls and the execution time of functions using Callgrind, a Valgrind tool for profiling programs. The collected data consists of the number of instructions executed in a run, their relationship to source lines, and the call relationships among functions together with call counts. Firefox and Wireshark were analysed with Callgrind to gather the metrics, and results showed execution complexity metrics may be better indicators of vulnerable code than the conventional static complexity metric of lines of code (LoC).

Name Definition
NumCalls The number of calls to the functions defined in a file.
InclusiveExeTime Execution time of the functions defined in a file, including all execution time spent by functions called directly or indirectly by them.
ExclusiveExeTime Execution time of the functions defined in a file, excluding the execution time spent by functions they call.
TABLE I: Execution complexity metrics defined in [15]

Their initial results, shown in Table II, indicate the percentage of vulnerable files among executed files is higher than the percentage of vulnerable files overall, and hence execution complexity metrics could be good indicators of vulnerability. This can reduce code inspection effort, as prioritisation can take place based on the metrics.

Program  % vulnerable (all files)  % vulnerable (executed files)
Firefox  3.8  11
Wireshark  7.8  19
TABLE II: Execution statistics from [15]
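The same style of metrics can be approximated with Python's built-in profiler rather than Callgrind. The sketch below is illustrative only, using a made-up workload, and aggregates per-function call counts, exclusive time (tottime) and inclusive time (cumtime) by source file, mirroring NumCalls, ExclusiveExeTime and InclusiveExeTime:

```python
import cProfile
import pstats
from collections import defaultdict

def helper(n):
    return sum(i * i for i in range(n))

def workload():
    # Illustrative workload standing in for a real program run.
    return [helper(1000) for _ in range(50)]

profiler = cProfile.Profile()
profiler.runcall(workload)

metrics = defaultdict(lambda: {"NumCalls": 0, "ExclusiveExeTime": 0.0,
                               "InclusiveExeTime": 0.0})
for (filename, _, _), (cc, nc, tottime, cumtime, _) in pstats.Stats(profiler).stats.items():
    m = metrics[filename]
    m["NumCalls"] += nc                  # total calls to functions in this file
    m["ExclusiveExeTime"] += tottime     # time in the functions themselves
    m["InclusiveExeTime"] += cumtime     # includes time in their callees
```

Files could then be ranked by these metrics to prioritise inspection, which is the use [15] propose for the Callgrind data.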

III-B3 IDE Plugins for Early Detection


[13] attempted to detect vulnerabilities earlier in the development process by using an Eclipse Java plug-in, arguing developers should be aware of security vulnerabilities as they are coding. To reduce false positives, they proposed context-sensitive data flow analysis, which uses a program’s context of variables and methods when searching for vulnerabilities instead of pattern matching.

[18] presented interactive static analysis, also known as IDE static analysis. They too developed an Eclipse Java plug-in for detecting code patterns that gives a two-way interaction between the IDE and the developer. According to [18], their tool detected multiple zero day vulnerabilities. Figure 2 shows a screenshot where the developer is instructed to annotate access control logic for a highlighted sensitive method call.

Fig. 2: An IDE static analysis tool from [18]

III-B4 Machine Learning

Machine learning is a type of artificial intelligence where computers use algorithms to learn iteratively, teaching themselves to recognise patterns. Most OSS code is managed using version control systems like Git or CVS, with vulnerable code inserted via commits from the developer to the main data repository. But many tools can’t run on a small code snippet in an individual commit, and checking the whole project is time consuming.


[10] implemented a type of machine learning algorithm called a Support Vector Machine (SVM) that used metadata from commits made to OSS repositories. The SVM used features from the metadata such as the number of added, deleted or modified functions and how often a contributor had contributed to a given project before. Their results showed that false positives were reduced by over 99% compared to those generated by a static analysis tool – to be exact, their SVM-driven tool generated 36 false positives compared to 5,460 generated by the static analysis tool. The goal of their work was to reduce the chance of vulnerabilities getting from a vulnerable commit into the fully deployed software.
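The shape of this commit-metadata pipeline can be sketched as follows. This is a toy illustration: the original work trained an SVM on real repository data, whereas here a simple perceptron (a stand-in linear classifier) is trained on invented features and labels:

```python
# Toy stand-in for a commit-metadata classifier. All feature names, data and
# labels are hypothetical; label 1 = potentially vulnerability-introducing.

def features(commit):
    return [commit["added"], commit["deleted"], commit["modified"],
            commit["author_prior_commits"]]

def train_perceptron(data, labels, epochs=200, lr=0.1):
    """Train a linear threshold classifier with the classic perceptron rule."""
    w = [0.0] * len(features(data[0]))
    b = 0.0
    for _ in range(epochs):
        for commit, y in zip(data, labels):
            x = features(commit)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                       # -1, 0 or +1
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, commit):
    x = features(commit)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

commits = [
    {"added": 12, "deleted": 1, "modified": 9, "author_prior_commits": 0},
    {"added": 1,  "deleted": 0, "modified": 1, "author_prior_commits": 40},
    {"added": 15, "deleted": 3, "modified": 8, "author_prior_commits": 1},
    {"added": 2,  "deleted": 1, "modified": 0, "author_prior_commits": 55},
]
labels = [1, 0, 1, 0]
w, b = train_perceptron(commits, labels)
```

The appeal of the approach is that these features are cheap to extract per commit, so classification can run at commit time rather than on the whole project.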

[5] also developed a machine learning tool to predict vulnerabilities for large scale software like operating systems. They took the popular Debian OS as an example, since it has 30,000 programs and 80,000 bug reports. Clearly, code flaws can be hard to find manually in a code base of that size, so the application of machine learning is of interest. Their classification results were not conclusive but nevertheless, as an initial study, they showed promise for large-scale vulnerability detection only using binary executables, an approach which does not appear to have been attempted elsewhere.

III-B5 Further Knowledge Formalisation and Linking Repositories

[1] discussed formalising knowledge representation to determine transitive dependencies in software. The idea is the various vulnerability repositories that exist online like the NIST National Vulnerability Database (NVD), or the Common Weakness Enumeration (CWE) database can be linked and simultaneously used to find out if a project is indirectly dependent on vulnerable components.
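The underlying check is essentially reachability over a dependency graph. The sketch below uses an invented graph and vulnerability set, standing in for the linked repositories such as the NVD:

```python
# Hypothetical dependency graph and vulnerable-component set; in the approach
# of [1] these would be populated from linked online repositories.
DEPENDS_ON = {
    "my-app": ["web-framework", "json-lib"],
    "web-framework": ["logging-lib"],
    "json-lib": [],
    "logging-lib": [],
}
VULNERABLE = {"logging-lib"}

def transitive_deps(project, graph):
    """Collect every component reachable from the project, depth-first."""
    seen, stack = set(), list(graph.get(project, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

def is_indirectly_vulnerable(project):
    return bool(transitive_deps(project, DEPENDS_ON) & VULNERABLE)
```

Here "my-app" is flagged even though it never imports the vulnerable component directly, which is exactly the transitive case that unlinked vulnerability databases miss.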

IV Conclusions & Future Work

The global use of OSS presents such a huge number of attack vectors that discovering novel techniques of vulnerability detection is an essential area of research. Of the new methods mentioned in this paper, machine learning, early detection IDE plug-ins and linking repositories show much promise for future work. Machine learning lends itself well to feature-rich OSS, speeding up classification of vulnerable code and reducing the time burden on development teams. Early detection IDE plug-ins will help developers implementing OSS to grow and consolidate their secure coding knowledge. Linking repositories ensures better value from the separate, unconnected datastores of vulnerabilities as they presently exist. It may also be possible to use machine learning to leverage all these options in a modular system. Improvements in OSS vulnerability detection might be quicker to realise than one would think – [3] mention Pareto’s law, where 80% of effects can be attributed to 20% of causes, and so identifying a small proportion of problematic OSS code then focusing testing efforts using a selection of detection methods could improve code quality and time-to-release, whilst reducing development and maintenance costs. The exact mix of techniques might vary from one OSS scenario to another, but in the first instance a strategy using a blend of methods that augment each other is likely to be considerably more performant than one approach in isolation.


  • [1] S. S. Alqahtani, E. E. Eghan, and J. Rilling (2016) Tracing known security vulnerabilities in software repositories – a semantic web enabled modeling approach. Science of Computer Programming 121, pp. 153–175. Note: Special Issue on Knowledge-based Software Engineering External Links: ISSN 0167-6423, Document, Link Cited by: §III-B5.
  • [2] Black Duck Software ([Online: last accessed March 2017]) 2015 Future of Open Source Survey. Note: Cited by: §I.
  • [3] M. English, C. Exton, I. Rigon, and B. Cleary (2009) Fault detection and prediction in an open-source software project. In Proceedings of the 5th International Conference on Predictor Models in Software Engineering, PROMISE ’09, New York, NY, USA. External Links: ISBN 9781605586342, Link, Document Cited by: §IV.
  • [4] K. Goseva-Popstojanova and A. Perhinschi (2015) On the capability of static code analysis to detect security vulnerabilities. Information and Software Technology 68, pp. 18–33. External Links: ISSN 0950-5849, Document, Link Cited by: §III-A1.
  • [5] G. Grieco, G. L. Grinblat, L. Uzal, S. Rawat, J. Feist, and L. Mounier (2016) Toward large-scale vulnerability discovery using machine learning. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, CODASPY ’16, New York, NY, USA, pp. 85–96. External Links: ISBN 9781450339353, Link, Document Cited by: §III-A1, §III-A2, §III-B4.
  • [6] M. Hafiz and M. Fang (2016-10) Game of detections: how are security vulnerabilities discovered in the wild?. Empirical Softw. Engg. 21 (5), pp. 1920–1959. External Links: ISSN 1382-3256, Link, Document Cited by: §III-A2, §III-A3.
  • [7] J. Williams, A. Dabirsiaghi ([Online: last accessed March 2017]) The Unfortunate Reality of Insecure Libraries. Note: Cited by: §I, §II, §III-A3.
  • [8] A. Meneely and L. Williams (2009) Secure open source collaboration: an empirical study of linus’ law. In Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS ’09, New York, NY, USA, pp. 453–462. External Links: ISBN 9781605588940, Link, Document Cited by: §II.
  • [9] OWASP ([Online: last accessed March 2017]) OWASP Dependency Check. Note: Cited by: §III-A1.
  • [10] H. Perl, S. Dechand, M. Smith, D. Arp, F. Yamaguchi, K. Rieck, S. Fahl, and Y. Acar (2015) VCCFinder: finding potential vulnerabilities in open-source projects to assist code audits. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, New York, NY, USA, pp. 426–437. External Links: ISBN 9781450338325, Link, Document Cited by: §III-A1, §III-A3, §III-B4.
  • [11] N. H. Pham, T. T. Nguyen, H. A. Nguyen, and T. N. Nguyen (2010) Detection of recurring software vulnerabilities. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, ASE ’10, New York, NY, USA, pp. 447–456. External Links: ISBN 9781450301169, Link, Document Cited by: §II.
  • [12] S. Koussa ([Online: last accessed March 2017]) Techbeacon: 13 tools for checking the security risk of open-source dependencies. Note: Cited by: §I.
  • [13] L. Sampaio and A. Garcia (2016) Exploring context-sensitive data flow analysis for early vulnerability detection. Journal of Systems and Software 113, pp. 337–361. External Links: ISSN 0164-1212, Document, Link Cited by: §III-A1, §III-B3.
  • [14] N. Shahmehri, A. Mammar, E. Montes De Oca, D. Byers, A. Cavalli, S. Ardi, and W. Jimenez (2012-09) An advanced approach for modeling and detecting software vulnerabilities. Inf. Softw. Technol. 54 (9), pp. 997–1013. External Links: ISSN 0950-5849, Link, Document Cited by: §III-A1, §III-A2.
  • [15] Y. Shin and L. Williams (2011) An initial study on the use of execution complexity metrics as indicators of software vulnerabilities. In Proceedings of the 7th International Workshop on Software Engineering for Secure Systems, SESS ’11, New York, NY, USA, pp. 1–7. External Links: ISBN 9781450305815, Link, Document Cited by: §III-B2, TABLE I, TABLE II.
  • [16] Wikipedia ([Online: last accessed March 2017]) Linus’ Law. Note: Cited by: §II.
  • [17] D. Zhang, D. Liu, C. Csallner, D. Kung, and Y. Lei (2014) A distributed framework for demand-driven software vulnerability detection. Journal of Systems and Software 87, pp. 60–73. External Links: ISSN 0164-1212, Document, Link Cited by: §III-A1, §III-A2, §III-B1, §III-B1.
  • [18] J. Zhu, J. Xie, H. R. Lipford, and B. Chu (2014) Supporting secure programming in web applications through interactive static analysis. Journal of Advanced Research 5 (4), pp. 449–462. Note: Cyber Security External Links: ISSN 2090-1232, Document, Link Cited by: Fig. 2, §III-B3.