Toward the Detection of Polyglot Files

03/14/2022
by Luke Koch, et al.

Standardized file formats play a key role in the development and use of computer software. However, standardized file formats can be abused by creating a file that is valid in multiple formats. The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file format identification for feature extraction. File format identification processes that depend on file signatures can be easily evaded because the specifications of certain file formats are flexible. Although work has been done to identify file formats using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Since malware detection systems routinely perform file format-specific feature extraction, polyglot files need to be filtered out prior to ingestion by these systems. Otherwise, malicious content could pass through undetected. To address the problem of polyglot detection, we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tool, file. Finally, we demonstrated the accuracy, precision, recall and F1 score of a range of machine learning and deep learning models. Malconv2 and Catboost demonstrated the highest recall on our data set with 95.16 respectively. These models can be incorporated into a malware detector's file processing pipeline to filter out potentially malicious polyglots before file format-dependent feature extraction takes place.
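To make the signature problem concrete, the following is a minimal sketch of a multi-signature check in Python. It is not the method evaluated in the paper (which relies on learned models) and it is not how the file tool works internally. The function names, the short signature list, and the search windows (the first 1024 bytes for a PDF header, the last 65557 bytes for a ZIP end-of-central-directory record) are illustrative assumptions based on commonly described parser tolerances.

```python
from typing import List

# Hypothetical helper names; the signature list is deliberately small.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"
GIF_MAGICS = (b"GIF87a", b"GIF89a")
PDF_MAGIC = b"%PDF-"
ZIP_EOCD_MAGIC = b"PK\x05\x06"

# Assumption: lenient PDF readers accept the header within the first 1024 bytes.
PDF_HEADER_WINDOW = 1024
# A ZIP end-of-central-directory record is 22 bytes plus a comment of up to 65535 bytes.
ZIP_EOCD_WINDOW = 22 + 65535


def candidate_formats(data: bytes) -> List[str]:
    """Return every format whose signature appears where a parser would look."""
    found = []
    # Formats that are anchored at offset 0.
    if data.startswith(PNG_MAGIC):
        found.append("png")
    if data.startswith(JPEG_MAGIC):
        found.append("jpeg")
    if data.startswith(GIF_MAGICS):
        found.append("gif")
    # Formats whose markers may legally sit away from offset 0.
    if PDF_MAGIC in data[:PDF_HEADER_WINDOW]:
        found.append("pdf")
    if ZIP_EOCD_MAGIC in data[-ZIP_EOCD_WINDOW:]:
        found.append("zip")
    return found


def looks_like_polyglot(data: bytes) -> bool:
    """Flag a file when more than one format signature matches."""
    return len(candidate_formats(data)) > 1
```

A tool that stops at the first matching signature would label a GIF+PDF+ZIP combination simply as a GIF, whereas this check reports all three candidates. A heuristic like this covers only the handful of formats it lists and will produce false positives on files that merely contain these byte strings, which is part of why the abstract turns to learned detectors.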


