Open source software (OSS) is developed collaboratively in the public domain, with a licence granting rights to the user base that are usually reserved for copyright holders. A well-known OSS licence is the GNU General Public Licence, which allows free distribution on the condition that further developments are also free. In a globally connected software society, a sizeable amount of development work is effectively crowd-sourced to an international community of OSS developers, with little awareness of the potential security problems this creates. OSS libraries increase development speed, but they also bring a tangible increase in risk, the Heartbleed bug in OpenSSL being a prime example. Research into vulnerability detection in OSS is crucial given its prevalence, with many companies using vulnerable OSS components and vulnerable libraries being repackaged in software. This OSS uptake shows no sign of reversing or slowing, with a recent survey indicating that 43% of respondents think OSS is superior to its commercial equivalent.
At the crux of OSS vulnerability is that today’s applications commonly use thirty or more libraries, which in turn can comprise up to 80% of the code in any such application. These libraries run with the same full privileges as the applications that use them, letting them access data, write to files or send data to the internet. Anything the application can do, the library can do. Some estimate that custom-built Java applications contain 5-10 vulnerabilities per 10,000 lines of code. A library can have on average 10,000 to 200,000 lines of code, so the chance that a library has never had a vulnerability is very slim; it is perhaps more likely that it has simply never been examined for vulnerabilities. Hence libraries with no reported vulnerabilities should not automatically be considered ‘safe’. Most vulnerabilities are undiscovered, and it might be said the only way to deal with the risk of unknown vulnerabilities is to have someone who understands security manually analyse the source code. Tool support provides hints but is not a replacement for experts, because the lack of context within libraries makes it very difficult for current tools, at the time of writing, to conclusively identify vulnerabilities.
This paper contributes a short introductory discussion on vulnerability detection in OSS, and is organised as follows: Section II is a background overview and Section III contains information on conventional and emerging detection methods. Conclusions are presented with ideas for future work in Section IV.
A study carried out in 2012 found that more than 50% of the Fortune Global 500 companies had downloaded vulnerable OSS components, security libraries and web frameworks. This report analysed 113 million Java framework and security library downloads by more than 60,000 commercial, government and non-profit organisations from the Central Repository. Central is the software industry’s most widely used repository of OSS, with more than 300,000 libraries. It was found that the vast majority of library flaws remain undiscovered, that the presence of a vulnerability (or the absence of one) is not a security indicator, and that typical Java applications are likely to include at least one vulnerable library. Furthermore, the same study showed most organisations did not have a strong process in place for ensuring the libraries they rely upon are up-to-date and free from vulnerabilities. The authors stress there are no shortcuts, going as far as saying the only useful indicator of library security is a thorough review that finds minimal vulnerabilities – in other words, software assurance, the measure of how safe the software is to use, needs to be generated internally. One might say this is surprising, as in many other product or service industries this assurance – consider it a kind of warranty or seal of approval – is offered by the supplier without hesitation to help build trust and sell to the customer. Others adopt a similar stance, agreeing that recurring vulnerabilities in software are due to reuse. This reuse includes the same code base with an identical or very similar code structure, method calls and variables. Interestingly, these attributes form the basis of a proposed method of detecting unreported vulnerabilities in one system by consulting knowledge of reported vulnerabilities in other systems that reuse the same code.
Linus’ Law is often quoted in relation to OSS: “given enough eyeballs, all bugs are shallow”, meaning that with a large enough number of developers looking at code, errors can be found. However, this can be questioned from a scientific viewpoint, and an empirical study of Linus’ Law appeared to show that more collaboration meant more vulnerabilities, not fewer. It found that files with changes from nine or more developers were sixteen times more likely to have a vulnerability than files changed by fewer than nine developers. Thus the inherently collaborative nature of OSS creates potentially unavoidable vulnerabilities that require addressing.
III. OSS Vulnerability Detection
III-A Conventional Detection Methods
There is not currently an abundance of publications on vulnerability detection in OSS. However, those written thus far describe three conventional methods – static analysis, dynamic analysis and code reviews.
III-A1 Static Analysis
Many white-box static analysis techniques and tools scan source code and detect vulnerabilities after the software has been written, which encourages late detection and produces many false positives. One study explicitly referenced the cut and thrust of the software development process, saying that external static tools for secure programming do not fit into such a workflow, since they do not work within the IDE and are retrospective. Several others concur that static analysis produces high levels of false positives, and point out that it is hard to know which vulnerabilities a static analysis tool deals with, and that there are difficulties in obtaining and maintaining up-to-date tooling.
One study specifically examined the capability of static code analysis to detect vulnerabilities, concluding that tools are not effective. The authors tested three widely used commercial tools and found that 27% of C/C++ vulnerabilities and 11% of Java vulnerabilities in their dataset were missed by all three; in some cases the tools were comparable to, or worse than, random guessing. They too make the point that tools are prone to false positives, which reinforces the need to find other methods of detection rather than relying solely on static analysis. That is not to say static analysis is of little use: some compliance regulations require inventories of OSS components so that risks can be addressed, and static tools can scan open source code and create such an inventory, so that when a new vulnerability is disclosed it is known which applications use the vulnerable OSS. The OWASP Dependency-Check tool, for example, analyses code and creates reports on associated CVE entries.
III-A2 Dynamic Analysis
Black-box dynamic analysis can also be called run-time analysis. Fuzzing is often used here, where inputs are mutated with random values to detect unwanted behaviour. One study researched the nuances of how vulnerabilities were discovered by researchers, and how those same researchers shared their findings with the OSS community. It found that running a fuzzer and debugging was the chosen method for developers exploring binary executables to find buffer overflows. Vulnerability researchers tend to build their own fuzzing tools, seeing this as part of the learning process and preferring it over more systematic exploration methods. Fuzzing is useful and needs only basic knowledge to undertake; however, it does not allow control of program execution, large campaigns are needed for results, and it is time consuming. It has also been contended that fuzzing does not scale if dynamic symbolic execution is used, as this explores code paths simultaneously, which can create large workloads. Symbolic execution uses symbolic values for variables instead of concrete values in order to explore all paths in a program.
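The core fuzzing loop described above can be sketched in a few lines. This is a hedged illustration, not any tool from the literature: the toy parser, its deliberate bug, and the mutation strategy are all invented for demonstration. The fuzzer mutates a seed input and treats any exception other than the parser’s own graceful rejection as unwanted behaviour.

```python
# A minimal fuzzing sketch (illustrative only): mutate a seed input with
# random byte flips and deletions, and record any exception the target did
# not mean to raise.
import random

def parse_record(data: bytes) -> bytes:
    """Toy parser with a bug: it assumes at least one byte is present."""
    length = data[0]                    # IndexError on empty input (the bug)
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated record")   # expected, graceful rejection
    return payload

def fuzz(target, seed: bytes, iterations: int = 1000):
    rng = random.Random(0)              # fixed seed for a reproducible campaign
    crashes = []
    for _ in range(iterations):
        data = bytearray(seed)
        for _ in range(rng.randint(1, 4)):
            if not data:
                break
            if rng.random() < 0.5:      # flip a random byte...
                data[rng.randrange(len(data))] = rng.randrange(256)
            else:                       # ...or delete one
                del data[rng.randrange(len(data))]
        try:
            target(bytes(data))
        except ValueError:
            pass                        # handled error, not a finding
        except Exception as exc:        # anything else is unwanted behaviour
            crashes.append((bytes(data), exc))
    return crashes

findings = fuzz(parse_record, b"\x03abc")
```

Because the mutations are purely random, a large campaign is needed before the deletions happen to produce the empty input that triggers the bug – echoing the scalability criticism noted above.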
III-A3 Manual Code Reviews
These involve manual inspection of the source code in a white-box manner and consequently require a lot of human effort. Working on source code manually does, however, detect vulnerabilities, and recall the argument that code reviews, conducted by someone with appropriate security knowledge, may be the only way to properly deal with vulnerabilities.
III-B Emerging Detection Methods
In the main, the issues with the conventional methods are that static analysis produces too many false positives, dynamic analysis does not scale, and code reviews are time consuming. Research into newer methods tries to address these problems via some interesting and novel approaches.
III-B1 Distributed Demand-Driven Security Testing
Proposed in recent work, this involves many clients using OSS and one main testing server, in a hub-and-spoke style per Figure 1. When a new path in a program is about to be exercised by user input, it is sent to the testing hub for security testing. Symbolic execution is applied to the execution trace to check for potential vulnerabilities on this new path, and if one is detected a signature is generated and pushed back to all the clients for protection. If a path exercised by an input triggers any vulnerability that has already been detected, the execution is terminated. This allows testing to focus on paths actually being used and helps stop attackers exploiting unreported vulnerabilities at a client site.
However, questions remain over how to handle large time and space overheads at client sites and how sensitive data is transmitted and handled, and actual implementation details are scarce. That said, the principle of increasing test coverage of important paths as users exercise them is sound, and the authors offer an interesting conclusion that machine learning could in future identify patterns of bugs at the testing server and use them to predict problematic code.
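As implementation details are scarce, the client/server interaction can only be sketched in outline. In the hypothetical sketch below, path “testing” is a stand-in predicate rather than real symbolic execution, and all class and function names are invented; it shows only the protocol: clients report unseen paths, the hub tests them, and signatures of vulnerable paths are distributed so clients can terminate matching executions.

```python
# Hypothetical sketch of the hub-and-spoke flow: clients report unseen
# execution paths to a central testing server; the server checks each path
# (a simple predicate stands in for symbolic execution) and distributes
# signatures of paths found vulnerable.
import hashlib

def path_signature(path):
    return hashlib.sha256("->".join(path).encode()).hexdigest()

class TestingServer:
    def __init__(self, checker):
        self.checker = checker          # stand-in for symbolic execution
        self.seen = set()
        self.signatures = set()         # signatures of vulnerable paths

    def report_path(self, path):
        sig = path_signature(path)
        if sig not in self.seen:        # only test paths not yet explored
            self.seen.add(sig)
            if self.checker(path):
                self.signatures.add(sig)
        return self.signatures          # updated signature set for clients

class Client:
    def __init__(self, server):
        self.server = server
        self.signatures = set()

    def execute(self, path):
        sig = path_signature(path)
        if sig in self.signatures:
            return "terminated"         # known-vulnerable path: stop execution
        self.signatures = self.server.report_path(path)
        if sig in self.signatures:
            return "terminated"
        return "ok"

# Toy checker: flag any path that reaches an unchecked buffer write.
server = TestingServer(lambda path: "unchecked_write" in path)
client = Client(server)
print(client.execute(["main", "parse", "unchecked_write"]))  # terminated
print(client.execute(["main", "parse", "validate"]))         # ok
```

Note how a second client reporting the first path would be protected immediately, since the server returns the accumulated signature set on every report.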
III-B2 Use of Execution Complexity Metrics
One study examined complexity metrics collected during code execution, considering them potential indicators of vulnerable code locations; Table I describes these metrics. The authors measured the frequency of function calls and the execution duration of functions using Callgrind, a Valgrind tool for profiling programs. The collected data consists of the number of instructions executed on a run, their relationship to source lines, and the call relationships among functions together with call counts. Firefox and Wireshark were analysed with Callgrind to gather the metrics, and results showed that execution complexity metrics may be better indicators of vulnerable code than the conventional static complexity metric of lines of code (LoC).
TABLE I: Execution complexity metrics
|Metric||Description|
|NumCalls||The number of calls to the functions defined in a file.|
|InclusiveExeTime||Execution time of the functions defined in a file, including time spent in functions they call directly or indirectly.|
|ExclusiveExeTime||Execution time of the functions defined in a file, excluding time spent in functions they call.|
Their initial results, shown in Table II, indicate that the percentage of vulnerable files among executed files is higher than the percentage of vulnerable files overall, and hence execution complexity metrics could be good indicators of vulnerability. This can reduce code inspection effort, as prioritisation can take place based on the metrics.
TABLE II: Vulnerable files overall versus vulnerable files among executed files
|Program||% of vulnerable files||% of vulnerable files in executed files|
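Callgrind itself profiles native binaries, but purely as an illustration of the three metrics in Table I, Python’s built-in cProfile exposes close per-function analogues: ncalls corresponds to NumCalls, cumtime to InclusiveExeTime and tottime to ExclusiveExeTime. The functions profiled below are invented for demonstration.

```python
# Illustration only: per-function analogues of the Table I metrics using
# the standard-library profiler (ncalls ~ NumCalls, cumtime ~
# InclusiveExeTime, tottime ~ ExclusiveExeTime).
import cProfile
import pstats

def helper():
    return sum(i * i for i in range(1000))

def work():
    return [helper() for _ in range(50)]   # helper is called 50 times

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (file, line, name) -> (cc, ncalls, tottime, cumtime, callers)
for (filename, lineno, name), (cc, ncalls, tottime, cumtime, callers) in stats.stats.items():
    if name in ("helper", "work"):
        print(f"{name}: NumCalls={ncalls} "
              f"InclusiveExeTime={cumtime:.4f}s ExclusiveExeTime={tottime:.4f}s")
```

Note that for `work`, inclusive time covers the 50 `helper` calls while exclusive time does not, mirroring the distinction between the last two metrics in Table I.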
III-B3 IDE Plugins for Early Detection
One group attempted to detect vulnerabilities earlier in the development process using an Eclipse Java plug-in, arguing that developers should be made aware of security vulnerabilities as they are coding. To reduce false positives, they proposed context-sensitive data flow analysis, which uses a program’s context of variables and methods when searching for vulnerabilities instead of pattern matching. Another group presented interactive static analysis, also known as IDE static analysis; they too developed an Eclipse Java plug-in for detecting code patterns, providing two-way interaction between the IDE and the developer. According to its authors, the tool detected multiple zero-day vulnerabilities. Figure 2 shows a screenshot in which the developer is instructed to annotate access control logic for a highlighted sensitive method call.
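To make the data flow idea concrete, the sketch below shows a toy taint analysis of the kind such plug-ins build on, tracking untrusted data from sources to sinks through assignments rather than pattern-matching code text. It is not the cited tool: the statement encoding and the source/sink/sanitiser names are invented for illustration.

```python
# Toy intraprocedural taint analysis (illustrative, not the authors' tool):
# statements are tuples, and a warning is raised when a tainted variable
# reaches a sensitive sink without being sanitised first.
def analyse(program):
    tainted, warnings = set(), []
    for stmt in program:
        kind = stmt[0]
        if kind == "source":            # ("source", var): var holds untrusted input
            tainted.add(stmt[1])
        elif kind == "assign":          # ("assign", dst, src): taint flows dst <- src
            if stmt[2] in tainted:
                tainted.add(stmt[1])
            else:
                tainted.discard(stmt[1])
        elif kind == "sanitise":        # ("sanitise", var): var is now considered safe
            tainted.discard(stmt[1])
        elif kind == "sink":            # ("sink", var): var reaches a sensitive call
            if stmt[1] in tainted:
                warnings.append(f"tainted variable {stmt[1]!r} reaches a sink")
    return warnings

program = [
    ("source", "q"),           # q = request.args["q"]   (hypothetical source)
    ("assign", "query", "q"),  # query = q
    ("sink", "query"),         # db.execute(query)        -> should warn
    ("sanitise", "q"),         # q = escape(q)
    ("sink", "q"),             # db.execute(q)            -> safe
]
print(analyse(program))        # one warning, for 'query'
```

Because the analysis follows variables rather than text patterns, it warns on `query` even though `query` never appears near a source, which is the kind of false-positive reduction context-sensitivity aims for.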
III-B4 Machine Learning
Machine learning is a type of artificial intelligence in which computers use algorithms to learn iteratively, teaching themselves to recognise patterns. Most OSS code is managed using version control systems like Git or CVS, with vulnerable code inserted via commits from the developer to the main repository. However, many tools cannot run on the small code snippet in an individual commit, and checking the whole project is time consuming.
One group implemented a type of machine learning algorithm called a Support Vector Machine (SVM) that used metadata from commits made to OSS repositories. The SVM used features from the metadata such as the number of added, deleted or modified functions and how often a contributor had contributed to a given project before. Their results showed that false positives were reduced by over 99% compared to those generated by a static analysis tool – to be exact, their SVM-driven tool generated 36 false positives compared to 5,460 from the static analysis tool. The goal of their work was to reduce the chance of vulnerabilities passing from a vulnerable commit into the fully deployed software. Another group developed a machine learning tool to predict vulnerabilities in large-scale software such as operating systems. They took the popular Debian OS as an example, since it has 30,000 programs and 80,000 bug reports. Clearly, code flaws can be hard to find manually in a code base of that size, so the application of machine learning is of interest. Their classification results were not conclusive but, as an initial study, they showed promise for large-scale vulnerability detection using only binary executables, an approach which does not appear to have been attempted elsewhere.
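The commit-metadata idea can be illustrated with a tiny classifier. The cited work trains an SVM; the sketch below substitutes a hand-rolled perceptron as a dependency-free stand-in for a linear classifier, and the feature names, values and labels are entirely invented for illustration.

```python
# Illustrative only: a linear classifier over commit metadata. A perceptron
# stands in for the SVM of the cited work; features and toy data are invented.
def features(commit):
    # e.g. (#functions added, #functions deleted, author's prior commit count)
    return [commit["added"], commit["deleted"], commit["prior_commits"]]

def train(commits, labels, epochs=100, lr=0.1):
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(map(features, commits), labels):   # y is +1 or -1
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y                                # mistake-driven update
    return w, b

def predict(model, commit):
    w, b = model
    score = sum(wi * xi for wi, xi in zip(w, features(commit))) + b
    return 1 if score > 0 else -1      # +1: flag the commit for review

# Toy training data: risky commits touch many functions and come from
# contributors with little history on the project.
commits = [
    {"added": 12, "deleted": 8, "prior_commits": 1},
    {"added": 9,  "deleted": 6, "prior_commits": 0},
    {"added": 1,  "deleted": 0, "prior_commits": 40},
    {"added": 2,  "deleted": 1, "prior_commits": 55},
]
labels = [1, 1, -1, -1]
model = train(commits, labels)
```

The pay-off, as in the cited work, is that only flagged commits need expensive review, rather than every commit or the whole project.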
III-B5 Further Knowledge Formalisation and Linking Repositories
Recent work discussed formalising knowledge representation to determine transitive dependencies in software. The idea is that the various vulnerability repositories that exist online, like the NIST National Vulnerability Database (NVD) or the Common Weakness Enumeration (CWE) database, can be linked and used simultaneously to find out whether a project is indirectly dependent on vulnerable components.
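Once the repositories are linked, the transitive-dependency question reduces to a graph walk. The sketch below is a hedged illustration, not the cited modelling approach: the dependency graph, component names and the advisory entry are all invented, with a placeholder standing in for a real vulnerability identifier.

```python
# Hedged sketch: cross-reference a (hypothetical) dependency graph against a
# (hypothetical) vulnerability feed to flag projects that transitively depend
# on a vulnerable component.
DEPENDENCIES = {                 # project -> direct dependencies (invented)
    "webapp": ["http-lib", "json-lib"],
    "http-lib": ["ssl-lib"],
    "json-lib": [],
    "ssl-lib": [],
}
VULNERABLE = {"ssl-lib": "advisory-id-placeholder"}   # stand-in for an NVD entry

def vulnerable_dependencies(project):
    """Depth-first walk of the dependency graph, collecting vulnerable nodes."""
    found, stack, seen = {}, [project], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue                       # guard against cycles and repeats
        seen.add(node)
        if node in VULNERABLE and node != project:
            found[node] = VULNERABLE[node]
        stack.extend(DEPENDENCIES.get(node, []))
    return found

print(vulnerable_dependencies("webapp"))   # flags ssl-lib via http-lib
```

Here `webapp` never lists `ssl-lib` directly, so the flaw is only visible once the graph is traversed – exactly the indirect dependency the linked-repository approach aims to expose.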
IV. Conclusions & Future Work
The global use of OSS presents such a huge number of attack vectors that discovering novel techniques of vulnerability detection is an essential area of research. Of the new methods mentioned in this paper, machine learning, early-detection IDE plug-ins and linking repositories show much promise for future work. Machine learning lends itself well to feature-rich OSS, speeding up classification of vulnerable code and reducing the time burden on development teams. Early-detection IDE plug-ins will help developers implementing OSS to grow and consolidate their secure coding knowledge. Linking repositories ensures better value from the separate, unconnected datastores of vulnerabilities as they presently exist. It may also be possible to use machine learning to leverage all these options in a modular system. Improvements in OSS vulnerability detection might be quicker to realise than one would think – consider Pareto’s law, where 80% of effects can be attributed to 20% of causes: identifying a small proportion of problematic OSS code and then focusing testing efforts using a selection of detection methods could improve code quality and time-to-release whilst reducing development and maintenance costs. The exact mix of techniques might vary from one OSS scenario to another, but in the first instance a strategy using a blend of methods that augment each other is likely to be considerably more effective than one approach in isolation.
-  (2016) Tracing known security vulnerabilities in software repositories – a semantic web enabled modeling approach. Science of Computer Programming 121, pp. 153–175. Note: Special Issue on Knowledge-based Software Engineering External Links: Cited by: §III-B5.
-  ([Online: last accessed March 2017]) 2015 Future of Open Source Survey. Note: https://www.slideshare.net/blackducksoftware/2015-future-of-open-source-survey-results Cited by: §I.
-  (2009) Fault detection and prediction in an open-source software project. In Proceedings of the 5th International Conference on Predictor Models in Software Engineering, PROMISE ’09, New York, NY, USA. External Links: Cited by: §IV.
-  (2015) On the capability of static code analysis to detect security vulnerabilities. Information and Software Technology 68, pp. 18–33. External Links: Cited by: §III-A1.
-  (2016) Toward large-scale vulnerability discovery using machine learning. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, CODASPY ’16, New York, NY, USA, pp. 85–96. External Links: Cited by: §III-A1, §III-A2, §III-B4.
-  (2016-10) Game of detections: how are security vulnerabilities discovered in the wild?. Empirical Softw. Engg. 21 (5), pp. 1920–1959. External Links: Cited by: §III-A2, §III-A3.
-  ([Online: last accessed March 2017]) The Unfortunate Reality of Insecure Libraries. Cited by: §I, §II, §III-A3.
-  (2009) Secure open source collaboration: an empirical study of Linus’ Law. In Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS ’09, New York, NY, USA, pp. 453–462. External Links: Cited by: §II.
-  ([Online: last accessed March 2017]) OWASP Dependency Check. Note: https://www.owasp.org/index.php/OWASP_Dependency_Check Cited by: §III-A1.
-  (2015) VCCFinder: finding potential vulnerabilities in open-source projects to assist code audits. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, New York, NY, USA, pp. 426–437. External Links: Cited by: §III-A1, §III-A3, §III-B4.
-  (2010) Detection of recurring software vulnerabilities. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, ASE ’10, New York, NY, USA, pp. 447–456. External Links: Cited by: §II.
-  ([Online: last accessed March 2017]) Techbeacon: 13 tools for checking the security risk of open-source dependencies. Note: https://techbeacon.com/13-tools-checking-security-risk-open-source-dependencies-0 Cited by: §I.
-  (2016) Exploring context-sensitive data flow analysis for early vulnerability detection. Journal of Systems and Software 113, pp. 337–361. External Links: Cited by: §III-A1, §III-B3.
-  (2012-09) An advanced approach for modeling and detecting software vulnerabilities. Inf. Softw. Technol. 54 (9), pp. 997–1013. External Links: Cited by: §III-A1, §III-A2.
-  (2011) An initial study on the use of execution complexity metrics as indicators of software vulnerabilities. In Proceedings of the 7th International Workshop on Software Engineering for Secure Systems, SESS ’11, New York, NY, USA, pp. 1–7. External Links: Cited by: §III-B2, TABLE I, TABLE II.
-  ([Online: last accessed March 2017]) Linus’ Law. Note: https://en.wikipedia.org/wiki/Linus%27s_Law Cited by: §II.
-  (2014) A distributed framework for demand-driven software vulnerability detection. Journal of Systems and Software 87, pp. 60–73. External Links: Cited by: §III-A1, §III-A2, §III-B1, §III-B1.
-  (2014) Supporting secure programming in web applications through interactive static analysis. Journal of Advanced Research 5 (4), pp. 449–462. Note: Cyber Security External Links: Cited by: Fig. 2, §III-B3.