PDF-Malware: An Overview on Threats, Detection and Evasion Attacks

07/27/2021 ∙ by Nicolas Fleury, et al. ∙ Université Polytechnique Hauts-de-France 6

In recent years, the Portable Document Format, commonly known as PDF, has become a widely adopted standard for document exchange and dissemination. This trend is due to characteristics such as its flexibility and portability across platforms. The widespread use of PDF has instilled a false impression of inherent safety among benign users. However, these same characteristics have motivated attackers to exploit various types of vulnerabilities and overcome security safeguards, making the PDF format one of the most efficient attack vectors for malicious code. Efficiently detecting malicious PDF files is therefore crucial for information security. Several analysis techniques, both static and dynamic, have been proposed in the literature to extract the main features that allow malware files to be discriminated from benign ones. Since classical analysis techniques may be limited in the case of zero-days, machine-learning based techniques have recently emerged as an automatic PDF-malware detection method that is able to generalize from a set of training samples. These techniques themselves face the challenge of evasion attacks, in which a malicious PDF is transformed to look benign. In this work, we give an overview of the PDF-malware detection problem and a perspective on the new challenges and emerging solutions.




I Introduction

Malicious attackers compromise systems to install malware [18, 26], to gain access and privilege, to compromise personal or sensitive data, to sabotage systems, or to use them in other attacks such as DDoS [34]. Preventing the compromise of information systems is practically impossible. In fact, attackers succeed in intruding in a variety of ways, such as drive-by downloads from websites that exploit browser vulnerabilities [2], or network-accessible vulnerabilities [25]. In addition, social engineering attacks such as phishing and malicious email attachments lead to user-authorized installation of malicious binaries [1].

Regular end users readily recognize the threat posed by plain binary and executable files. Their awareness is also increasing with respect to many threat vectors, such as Microsoft Office documents that include macros. However, despite the complexity of the PDF format, end users still tend to consider PDF files harmless static documents. This implicit assumption mainly results from ignoring the fact that what a PDF file displays is the execution output of a potentially complex program, mainly JavaScript code running in the background. In 2010, Symantec [10] reported a large rise in PDF-driven attacks, mainly attributing it to a corresponding rise in the vulnerabilities identified in the Adobe Reader software. More recently, Ke Liu reported [15] his discovery, since December 2015, of more than 150 vulnerabilities in the most common PDF reader software products. This shows how, even today, PDF is an important infection vector that provides a large attack surface. As shown in Figure 1, the number of PDF attacks recorded by Symantec [10] has increased dramatically, which shows that the PDF file format is being targeted more often. The spikes in these graphs coincide with the release of specific PDF-related CVEs.

Fig. 1: Number of attacks: Microsoft Office vs. PDF [10].

The potential attack vector of PDF files combined with a widespread wrong assumption of harmlessness makes the detection of malicious PDF an important topic for the information security community.

Malware developers typically exploit the possibility to supply Javascript to the PDF reader interpretation engine to execute malicious code. Such code is usually sandboxed for execution, but it may still exploit unpatched vulnerabilities to escape the environment boundaries and execute shellcode at the user level. Complex payloads can be included in the PDF as obfuscated text to evade inspection techniques, or can be downloaded from the Internet as soon as the attacker takes control of the user shell. Malicious PDF files are then delivered through different methods [10]: from drive-by downloads, to targeted attacks or mass mailing approaches.

This paper presents a brief overview of the main PDF-malware threats and the main detection techniques, and gives a perspective on emerging challenges in detecting PDF-malware.

The remainder of the paper is organized as follows: Section 2 presents a brief background on PDF format as well as on machine learning. Section 3 presents the PDF-based threat used by attackers. Section 4 gives an overview on state of the art malware detection techniques. The evasion attacks challenge is explained in Section 5. We finally give concluding remarks in Section 6.

II Background

II-A The Portable Document Format

Fig. 2: Simplified structure of a PDF file.

The Portable Document Format is the world’s most widely used format for both printed and online documents. PDF was defined in 1993 by Adobe Systems and is still used today to exchange and print documents regardless of the underlying hardware architecture, software platform, and operating system. In 2008, PDF became an open standard released as ISO 32000-1. A PDF file may contain a mix of textual and binary data and is composed of different abstraction layers. The layers define the sequential flow by which a PDF viewer application reads the contents and renders them on the screen. According to the PDF Reference [29], the internal structure of a PDF file is made up of the elements depicted in Figure 2.

The PDF contains four sections: header, body, cross-reference table and trailer.

  • The header identifies the file format and the version with a line of the form %PDF-1.x, where x is a number between 0 and 7. However, the header may be placed anywhere in the first 1024 bytes. If the PDF file contains binary data, that line is followed by a comment line containing at least four binary characters whose codes are 128 or greater.

  • The body contains multiple types of objects, of which these are the most important: (i) Objects: they may be either direct (embedded in another object) or indirect. Indirect objects are identified by an object number and a generation number (the object’s version number) and are defined between the obj and endobj keywords if residing in the document root.
    (ii) A dictionary: an object that starts with “<<” and ends with “>>” and is enclosed by the obj and endobj keywords.
    (iii) A stream object: it is represented by a sequence of bytes and may be unlimited in length, which is why images, JavaScript and other large data blocks are usually represented as streams. A stream object can make use of a special feature called filters. Filters can be used for different purposes such as encoding or decoding of content, and compression or decompression. Furthermore, multiple filters can be applied to a stream object.

  • The cross-reference table indexes the locations of all objects in the file. The table can have multiple subsections. Each subsection starts with a line of two numbers: the first is the object number of the first object in the subsection, and the second is the number of objects in the subsection, so a subsection starting at object number n with k objects covers objects n through n + k − 1. Each object is represented by one 20-byte entry: the first 10 bytes give the offset from the start of the PDF document to the beginning of that object, followed by a space separator and another number specifying the object’s generation number. After that comes another space separator followed by the letter f or n, indicating whether the object is free or in use.

  • The trailer is the first thing to be processed in a PDF. It specifies how the application reading the PDF document should find the cross-reference table and other special objects. The trailer’s dictionary generally contains the document’s catalog object and sometimes the document’s information dictionary, in which we can find the creation and modification dates of the file together with some simple metadata.
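The header and cross-reference layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `find_header` and `parse_xref_entry` and the sample bytes are hypothetical, and the 5-digit width of the generation number follows the usual 20-byte entry layout rather than anything stated in the list above.

```python
import re

# Matches a header such as %PDF-1.4 (major and minor version are single digits).
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def find_header(data: bytes):
    """Return (major, minor) from the first header found in the first 1024 bytes, or None."""
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1).decode("ascii")), int(match.group(2).decode("ascii"))


def parse_xref_entry(entry: bytes):
    """Parse one 20-byte cross-reference entry into (offset, generation, in_use)."""
    if len(entry) != 20:
        raise ValueError("a cross-reference entry is exactly 20 bytes")
    offset = int(entry[0:10].decode("ascii"))       # 10-digit byte offset
    generation = int(entry[11:16].decode("ascii"))  # generation number (assumed 5 digits)
    flag = entry[17:18]                             # b"n" = in use, b"f" = free
    if flag not in (b"n", b"f"):
        raise ValueError("flag must be 'n' (in use) or 'f' (free)")
    return offset, generation, flag == b"n"


# Hypothetical sample: some leading junk, then the header and the binary comment line.
sample = b"junk\n%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
print(find_header(sample))
print(parse_xref_entry(b"0000000017 00000 n \n"))
```

A real parser would additionally locate the trailer and follow it to the cross-reference table; the sketch only decodes the two fixed-format pieces.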

An important but security-critical feature of PDF is that a document can be modified or updated incrementally. This means that a file is updated by appending a new body, cross-reference table and trailer, without changing anything in the rest of the file. This feature allows any user to recover the original data by cancelling the modifications.

II-B Machine Learning

Fig. 3: Design methodologies of different machine learning categories [5].

Machine Learning (ML) is one of the most useful tools available today. In recent years, it has shown an impressive capability to deal effectively with a plethora of complex real-life problems. The main characteristic of the ML computing paradigm is that it creates knowledge from data. An ML algorithm goes through a training phase on a dataset until it converges to a trained state, in which it can be tested and then validated on a separate dataset. A trained model is expected to generalize to unseen samples. As shown in Figure 3 [5], the training depends mostly on the structure of the available data: it can be supervised when the data is labeled, unsupervised when no labels are available, or semi-supervised when the data is partially labeled.

There are four important steps in ML design. The first is to determine the category that suits the problem. There are four main categories: clustering, classification, regression and rule extraction. Second, once the category is fixed, a specific model corresponding to that category needs to be identified, for example an Artificial Neural Network, a Random Forest, a Naive Bayes classifier, or a Support Vector Machine (SVM). Third, using the available dataset, the model goes through a training process to identify the parameters that best solve the considered problem. The final step is to test the model with data that was not seen during training. This step is important because it yields the metrics used to validate or reject the model. Accuracy and false-negative rate are the most representative, but there are also others such as average precision, specificity, and F1-score.
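The test-step metrics mentioned above follow directly from the confusion matrix. The sketch below computes them for a made-up set of ten labels (1 = malicious, 0 = benign); the label vectors and function names are illustrative assumptions, not data from any surveyed work.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = malicious and 0 = benign."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Derive common evaluation metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }


# Hypothetical test set: 4 malicious and 6 benign samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(metrics(*confusion_counts(y_true, y_pred)))
```

In a malware setting the false-negative rate deserves particular attention, since a false negative is a malicious file that reaches the user.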

III PDF Malware

PDF can be a powerful attack vector because many people believe it is safe and do not suspect that a PDF could be dangerous. Email attachments combined with social engineering are among the many attack vectors that cybercriminals take advantage of. In addition to email attachments, web-based malware exploitation is one of the most widely used attack vectors.

There are different ways to perform malicious actions using PDF. The most common attack vector for malware PDFs derives from embedded JavaScript code that can be executed by the PDF reader. Indeed, many surveyed papers consider features derived in different ways from embedded JavaScript code [7, 12, 14, 16, 24, 32].

The following are some well-known PDF-based attack scenarios:

  • The OpenAction feature can be used to trigger an exploit when the file is opened. An action is a legitimate PDF feature. Some potentially dangerous actions include Launch, Go-to, Uniform Resource Identifier (URI), Named and JavaScript actions [33, 17].

  • The Launch action makes it possible to launch commands on the operating system, and can run an executable if the user clicks OK in the confirmation window that opens [33].

  • Embedded files can be extracted and opened by the reader. This may be used to hide malicious executables or malicious PDFs, Flash applications stored as embedded SWF files, or malicious ActionScript code [33, 17].

  • The GotoEmbedded action can be dangerous because PDF files can contain embedded PDF files, which can be encrypted. When a user loads the main PDF file, it can immediately load its embedded PDF file. This allows attackers to hide malicious PDF files inside other PDF files, fooling antivirus scanners by preventing them from examining the hidden PDF file [33, 17].

  • The URI action allows access to a remote resource by means of a Uniform Resource Identifier. In this way, an attacker could redirect a user to a malicious website [11] or exfiltrate data [19] by combining that feature with JavaScript, OpenAction or PDF forms (with the Submit Form action).

IV PDF Malware Detection

The most common way of detecting PDF malware is to search files for signatures or patterns of known malware. While this technique, widely used in classical anti-virus software, is fast and pragmatic, it is easily fooled by attackers through simple evasion and obfuscation techniques. In fact, in addition to being ineffective against zero-days, syntactic detection can be evaded through obfuscation even when the vulnerable APIs that the malware uses as an attack vector are known. Several public datasets are available for developing PDF-malware detection techniques; Contagio [21] is one of the most widely used.

Fig. 4: Output of PDFiD [8].

IV-A Static analysis

Static analysis can be done by looking directly at the content of the file or by using specific tools. PDFiD [27] and peepdf (https://github.com/jesparza/peepdf) are among the most widely used tools for statically analyzing PDFs. PDFiD is sufficient for a quick overview of what a PDF file contains, while peepdf may be a better choice for a deeper analysis. PDFiD is a Python script designed by Didier Stevens [27]. It scans through a PDF file and counts the number of occurrences of each of 21 features that are commonly found in malicious files. As shown in Figure 4, PDFiD gives a simple and fast overview of what the PDF contains (JavaScript, OpenAction, Launch action, etc.).
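The idea of keyword counting can be illustrated with a minimal sketch. It is not PDFiD itself: the list `NAMES` is a small hypothetical subset chosen for illustration, the function name `count_names` is an assumption, and the sketch does not decode streams or handle obfuscated (hex-escaped) names.

```python
import re

# Hypothetical subset of PDF names that often matter for triage.
NAMES = [b"/JS", b"/JavaScript", b"/OpenAction", b"/Launch", b"/EmbeddedFile", b"/URI"]


def count_names(data: bytes) -> dict:
    """Count raw occurrences of each name in the file bytes."""
    counts = {}
    for name in NAMES:
        # A name ends at whitespace or a delimiter, so reject matches that
        # continue with another regular name character (e.g. /JSfoo).
        pattern = re.escape(name) + rb"(?![A-Za-z0-9_#.\-])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts


# Made-up two-object document with an OpenAction pointing at a JavaScript action.
sample = (
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
print(count_names(sample))
```

The resulting dictionary of counts is the kind of fixed-length feature vector that a classifier can later be trained on.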

IV-B Dynamic analysis

Fig. 5: Example system call trace of process (truncated to 10 calls) [6].
Fig. 6: An overview on PDF-malware detection features.

Unlike static analysis, dynamic analysis is performed at runtime. One challenge specific to PDF-malware is that PDF documents are not executable and are opened through a PDF reader. The analysis therefore needs to be performed on a vulnerable machine so that the payload, if any, can be triggered and thereby analysed. If the payload does not run because of security measures, the results are useless. Varying the PDF viewer is also essential, since some malicious PDFs are made for a specific viewer or even for a specific version of a viewer (e.g., Adobe Acrobat Reader DC). Once again, the priority in dynamic analysis is to capture what happens when the payload is running. Once this is achieved, the analysis process has to collect API calls and system calls. These are the main traces that are useful for detecting potentially malicious requests for operating system services [6]. An illustration of such traces is given in Figure 5.

Tools that can be used include, for example, strace (https://github.com/strace) on Linux and dtrace (https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/dtrace) on Windows. Once traces are collected, postprocessing is needed to make them more human-readable, for example by sorting them by type of syscall.
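As an illustration of such postprocessing, the following sketch groups the lines of a textual trace by syscall name. It assumes the common strace line shape `name(args) = result`, optionally prefixed by `[pid N]`; the trace lines and the function name `count_syscalls` are made up for the example.

```python
import re
from collections import Counter

# Syscall name at the start of a line, optionally after a "[pid 1234]" prefix.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")


def count_syscalls(trace_lines):
    """Return a Counter mapping each syscall name to its number of occurrences."""
    counts = Counter()
    for line in trace_lines:
        match = SYSCALL_RE.match(line.strip())
        if match:  # lines such as "+++ exited with 0 +++" are skipped
            counts[match.group(1)] += 1
    return counts


# Hypothetical trace excerpt.
trace = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'read(3, "\\177ELF", 832) = 832',
    "close(3) = 0",
    'openat(AT_FDCWD, "/tmp/doc.pdf", O_RDONLY) = 4',
    "+++ exited with 0 +++",
]
for name, count in count_syscalls(trace).most_common():
    print(name, count)
```

The per-syscall counts can be read directly by an analyst or used as dynamic features for a classifier.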

IV-C Hardware Malware Detection

While most existing malware analysis approaches tackle the problem at the software abstraction level, a number of works have looked at using low-level features. These approaches are referred to as Hardware Malware Detectors (HMDs) and rely on micro-architectural features such as the frequency of opcodes [4] and the evaluation of opcode sequence signatures [23]. These features are collected while a binary is running and are analysed to detect malware behavior. In [22], offline analysis is performed through opcode sequence similarity graphs. In the same direction, Demme et al. [9] proposed collecting performance counter statistics for programs and malware under execution and used them to show that offline detection of malware is effective. A real-time hardware malware detector was then built by Ozsoy et al. [20]. Tang et al. [30] used unsupervised learning to detect malware exploits, which make the regular program deviate from the baseline execution model. Kazdagli et al. [13] identified some pitfalls in training and evaluating HMDs for mobile malware, and proposed several improvements to them.

IV-D Machine-learning based techniques

The main goal of using machine learning for malware detection is to build a classifier that is able to detect malicious PDFs that it has never seen. Ideally, this should help to prevent new attacks and be more robust than a classic antivirus. One can, for example, extract features using static analysis and classify them with an artificial neural network.

As explained in Section IV-A, PDFiD can be used to extract features to train a model. For example, related work [8] implemented this solution. They used a dataset of clean and malicious PDF documents. The model was an SVM implemented in Python, using 60% of the dataset for training and the other 40% for testing. They obtain an accuracy of and a false-positive rate of .

Notice that static, dynamic, software and hardware features can be used to design a ML-based PDF-malware detection system.

V Evasion Attacks

V-A Attack Mechanism

The main goal of evasion attacks is to fool the classifier by changing the features of infected PDF files so that the classifier considers them clean. To be effective, these modifications should not be noticeable to a defender who inspects the appearance of the file. Removing objects is not effective for evasion because, most of the time, it changes the behavior and probably the display of the file. On the other hand, adding empty objects seems to be a good and easy way to modify a PDF file without damaging its original content. We consider a white-box adversary. In this model, the adversary has access to everything the defender has, namely the training dataset used to train the classifier, the classifier algorithm, the classifier parameters (kernel, features used for the vector, etc.), and infected PDF files that are detected by the classifier. We experimented with an evasion attack against a classifier (ANN) that we trained on the Contagio dataset [21]. Our intuition was that, in our dataset and in general, infected PDFs contain only the payload and no other content. Hence, a simple solution was to add enough objects to an infected PDF to make it look like a normal PDF (in terms of number of objects). Most PDF readers were able to find a PDF file’s objects even when the object locations in the cross-reference table were wrong. This means that adding objects to a PDF file is easy and does not affect the PDF’s behavior in most cases. We implemented this attack to generate evasive examples and we obtained more than attack success rate, i.e., the classifier was not able to recognize the evasive malware. A very similar attack has already been implemented [8]: the attacker picks one feature and increments it until the vector is considered clean by the classifier.
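The single-feature increment strategy can be sketched against a toy linear classifier. Everything numeric here is a hypothetical assumption made for illustration: the three features (object count, JavaScript count, OpenAction count), the weights `W`, the bias `B`, and the starting vector are not taken from the ANN or the SVM discussed above.

```python
def is_malicious(weights, bias, x):
    """Toy linear classifier: a positive score means 'malicious'."""
    return sum(w * v for w, v in zip(weights, x)) + bias > 0


def increment_attack(weights, bias, x, feature, max_steps=1000):
    """Increment one feature until the vector is classified clean.

    Returns (evasive_vector, steps), or (None, max_steps) if the budget runs out.
    """
    x = list(x)
    steps = 0
    while is_malicious(weights, bias, x):
        if steps == max_steps:
            return None, steps
        x[feature] += 1  # e.g. append one more empty object to the file
        steps += 1
    return x, steps


# Hypothetical model: many objects look benign, JavaScript and OpenAction look malicious.
W = [-0.5, 3.0, 2.0]
B = -0.25
infected = [4, 1, 1]  # [object count, /JavaScript count, /OpenAction count]

evasive, steps = increment_attack(W, B, infected, feature=0)
print(evasive, steps)
```

The attack only succeeds when the chosen feature pushes the score toward the clean class; incrementing a feature with a positive weight in this toy model exhausts the step budget without evading.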

Other techniques, inspired by adversarial attacks in image applications, are based on gradient descent to analytically find the minimum noise needed to fool the system. This approach has been used to evade Support Vector Machines (SVMs) and neural network classifiers [8, 3]. Moreover, it is applicable to any classifier with a differentiable discriminant function.
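A minimal sketch of the gradient-based idea, assuming a logistic discriminant function, is given below. The weights, bias and starting vector are hypothetical, and the projection step encodes the assumption that content can be added to a PDF but not removed, so no feature may drop below its original value. A practical attack would also round the continuous result up to integer counts.

```python
import math


def malicious_probability(weights, bias, x):
    """Logistic discriminant: probability that feature vector x is malicious."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gradient_wrt_features(weights, bias, x):
    """Gradient of the malicious probability with respect to each feature."""
    p = malicious_probability(weights, bias, x)
    return [p * (1.0 - p) * w for w in weights]


def gradient_evasion(weights, bias, x0, step=1.0, max_iter=1000):
    """Projected gradient descent on the features until the probability drops below 0.5."""
    x = [float(v) for v in x0]
    for iteration in range(max_iter):
        if malicious_probability(weights, bias, x) < 0.5:
            return x, iteration
        grad = gradient_wrt_features(weights, bias, x)
        # Descend along the gradient, then project so that no feature
        # falls below its original value (content is only ever added).
        x = [max(orig, v - step * g) for v, g, orig in zip(x, grad, x0)]
    return None, max_iter


# Hypothetical model and infected sample.
W = [-0.5, 3.0, 2.0]
B = -0.25
X0 = [4, 1, 1]

adversarial, iterations = gradient_evasion(W, B, X0)
print(adversarial, iterations)
```

For a linear model the gradient direction is fixed, so the search simply grows the features with negative weights; the same loop applies unchanged to any model for which `gradient_wrt_features` can be computed.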

V-B Defenses & Perspectives

A counter-measure against the first attack we proposed is to impose a maximum value on the features; this totally blocks the attack when that value is set to 1. The gradient-descent attack works very well because the algorithm has a huge degree of freedom, owing to the possibility of increasing every component of the vector as much as required. Selecting robust features could be a solution, but that would require a deeper analysis of the PDF [35], and using a threshold for the features could counter the gradient-descent attack [8].
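The effect of capping feature values can be illustrated with a toy linear classifier. The weights, bias and vectors are hypothetical; in practice the classifier would be retrained on the capped features, which this sketch does not do.

```python
def clip_features(x, cap):
    """Cap every feature at a maximum value (cap=1 turns counts into presence flags)."""
    return [min(v, cap) for v in x]


def is_malicious(weights, bias, x, cap=None):
    """Toy linear classifier, optionally applied to capped features."""
    if cap is not None:
        x = clip_features(x, cap)
    return sum(w * v for w, v in zip(weights, x)) + bias > 0


# Hypothetical model: [object count, /JavaScript count, /OpenAction count].
W = [-0.5, 3.0, 2.0]
B = -0.25

infected = [4, 1, 1]
padded = [10, 1, 1]  # same file after six empty objects were added

print(is_malicious(W, B, padded))         # uncapped: padding evades the classifier
print(is_malicious(W, B, padded, cap=1))  # capped at 1: padding has no effect
```

With a cap of 1, every additional object beyond the first is invisible to the classifier, which removes the degree of freedom that the increment attack relies on.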

However, we could also simply train our classifier on the files we used to evade it: this is called adversarial learning [31].

In previous work [28], combining static and dynamic features appeared to improve the detection rate of malicious mobile apps, and we think this approach is worth exploring in the PDF-malware context. We believe that combining static, dynamic and hardware features can enhance the classifier’s robustness against evasion attacks.

VI Conclusion

In this paper, we presented a brief overview of the main PDF-malware threats and the main detection techniques. We gave a perspective on emerging challenges in detecting PDF-malware and suggested ideas to enhance the robustness of PDF malware detectors.


  • [1] S. Abraham and I. Chengalur-Smith (2010) An overview of social engineering malware: trends, tactics, and implications. Technology in Society 32 (3), pp. 183–196. ISSN 0160-791X.
  • [2] S. Bandhakavi, N. Tiku, W. Pittman, S. T. King, P. Madhusudan, and M. Winslett (2011-09) Vetting browser extensions for security vulnerabilities with vex. Commun. ACM 54 (9), pp. 91–99. ISSN 0001-0782.
  • [3] B. Biggio, I. Corona, D. Maiorca, B. Nelson, P. Laskov, G. Giacinto, and F. Roli (2013-01) Evasion attacks against machine learning at test time. pp. 387–402.
  • [4] D. Bilar (2007-01) Opcodes as predictor for malware. Int. J. Electron. Secur. Digit. Forensic 1 (2), pp. 156–168. ISSN 1751-911X.
  • [5] R. Boutaba, M. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, and O. Caicedo Rendon (2018-05) A comprehensive survey on machine learning for networking: evolution, applications and research opportunities. Journal of Internet Services and Applications 9.
  • [6] R. Canzanese, S. Mancoridis, and M. Kam (2015) System call-based detection of malicious processes. In 2015 IEEE International Conference on Software Quality, Reliability and Security, pp. 119–124.
  • [7] I. Corona, D. Maiorca, D. Ariu, and G. Giacinto (2014) Lux0R: detection of malicious pdf-embedded javascript code through discriminant analysis of api references. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop, AISec ’14, New York, NY, USA, pp. 47–57. ISBN 9781450331531.
  • [8] B. Cuan, A. Damien, C. Delaplace, and M. Valois (2018-01) Malware detection in pdf files using machine learning. pp. 578–585.
  • [9] J. Demme, M. Maycock, J. Schmitz, A. Tang, A. Waksman, S. Sethumadhavan, and S. Stolfo (2013-06) On the feasibility of online malware detection with performance counters. SIGARCH Comput. Archit. News 41 (3), pp. 559–570. ISSN 0163-5964.
  • [10] N. F. Gutierrez (2010) (Website). Accessed: 2020-05-18.
  • [11] V. Hamon (2013-05) Malicious uri resolving in pdf documents. Journal of Computer Virology and Hacking Techniques 9.
  • [12] S. Karademir, T. Dean, and S. Leblanc (2013) Using clone detection to find malware in acrobat files. In Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research, CASCON ’13, USA, pp. 70–80.
  • [13] M. Kazdagli, L. Huang, V. J. Reddi, and M. Tiwari (2016) EMMA: a new platform to evaluate hardware-based mobile malware analyses. ArXiv abs/1603.03086.
  • [14] P. Laskov and N. Šrndić (2011) Static detection of malicious javascript-bearing pdf documents. In Proceedings of the 27th Annual Computer Security Applications Conference, ACSAC ’11, New York, NY, USA, pp. 373–382. ISBN 9781450306720.
  • [15] K. Liu (2017) Dig Into the Attack Surface of PDF and Gain 100+ CVEs in 1 Year. Technical report, Black Hat Asia.
  • [16] X. Lu, J. Zhuge, R. Wang, Y. Cao, and Y. Chen (2013) De-obfuscation and detection of malicious pdf files with high accuracy. In 2013 46th Hawaii International Conference on System Sciences, pp. 4890–4899.
  • [17] D. Maiorca, G. Giacinto, and I. Corona (2012-07) A pattern recognition system for malicious pdf files detection. Vol. 7376, pp. 510–524. ISBN 9783642315367.
  • [18] G. McGraw and G. Morrisett (2000) Attacking malicious code: a report to the infosec research council. IEEE Software 17 (5), pp. 33–41.
  • [19] J. Müller, F. Ising, V. Mladenov, C. Mainka, S. Schinzel, and J. Schwenk (2019-11) Practical decryption exfiltration: breaking pdf encryption. pp. 15–29. ISBN 978-1-4503-6747-9.
  • [20] M. Ozsoy, C. Donovick, I. Gorelik, N. Abu-Ghazaleh, and D. Ponomarev (2015) Malware-aware processors: a framework for efficient online malware detection. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 651–661.
  • [21] M. Parkour (Website).
  • [22] N. Runwal, R. M. Low, and M. Stamp (2012-05) Opcode graph similarity and metamorphic detection. J. Comput. Virol. 8 (1–2), pp. 37–52. ISSN 1772-9890.
  • [23] I. Santos, F. Brezo, J. Nieves, Y. K. Penya, B. Sanz, C. Laorden, and P. G. Bringas (2010) Idea: opcode-sequence-based malware detection. In Proceedings of the Second International Conference on Engineering Secure Software and Systems, ESSoS’10, Berlin, Heidelberg, pp. 35–43. ISBN 3642117465.
  • [24] F. Schmitt, J. Gassen, and E. Gerhards-Padilla (2012) PDF scrutinizer: detecting javascript-based attacks in pdf documents. In 2012 Tenth Annual International Conference on Privacy, Security and Trust, pp. 104–111.
  • [25] H. Shacham (2007) The geometry of innocent flesh on the bone: return-into-libc without function calls (on the x86). In Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS ’07, New York, NY, USA, pp. 552–561. ISBN 9781595937032.
  • [26] E. Skoudis and L. Zeltser (2003) Malware: fighting malicious code. Prentice Hall PTR, USA. ISBN 0131014056.
  • [27] D. Stevens (2009) (Website).
  • [28] M. Su, J. Chang, and K. Fung (2017-07) Machine learning on merging static and dynamic features to identify malicious mobile apps. pp. 863–867.
  • [29] A. Systems (2008) (Website). Accessed: 2020-05-18.
  • [30] A. Tang, S. Sethumadhavan, and S. J. Stolfo (2014) Unsupervised anomaly-based malware detection using hardware features. In Research in Attacks, Intrusions and Defenses, A. Stavrou, H. Bos, and G. Portokalidis (Eds.), Cham, pp. 109–129. ISBN 978-3-319-11379-1.
  • [31] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2017) Ensemble adversarial training: attacks and defenses. arXiv:1705.07204.
  • [32] Z. Tzermias, G. Sykiotakis, M. Polychronakis, and E. P. Markatos (2011) Combining static and dynamic analysis for the detection of malicious documents. In Proceedings of the Fourth European Workshop on System Security, EUROSEC ’11, New York, NY, USA. ISBN 9781450306133.
  • [33] C. Ulucenk, V. Varadharajan, V. Balakrishnan, and U. Tupakula (2011-12) Techniques for analysing pdf malware. pp. 41–48.
  • [34] W. Xu, G. Hu, D. W. C. Ho, and Z. Feng (2019) Distributed secure cooperative control under denial-of-service attacks from multiple adversaries. IEEE Transactions on Cybernetics, pp. 1–10.
  • [35] W. Xu, Y. Qi, and D. Evans (2016-01) Automatically evading classifiers: a case study on pdf malware classifiers.