Unsupervised clustering of file dialects according to monotonic decompositions of mixtures

02/09/2023
by Michael Robinson, et al.

This paper proposes an unsupervised classification method that partitions a set of files into non-overlapping dialects based upon their behaviors, determined by messages produced by a collection of programs that consume them. The pattern of messages can be used as the signature of a particular kind of behavior, with the understanding that some messages are likely to co-occur, while others are not. Patterns of messages can be used to classify files into dialects. A dialect is defined by a subset of messages, called the required messages. Once files are conditioned upon dialect and its required messages, the remaining messages are statistically independent. With this definition of dialect in hand, we present a greedy algorithm that deduces candidate dialects from a dataset consisting of a matrix of file-message data, demonstrate its performance on several file formats, and prove conditions under which it is optimal. We show that an analyst needs to consider fewer dialects than distinct message patterns, which reduces their cognitive load when studying a complex format.
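The abstract does not spell out the greedy algorithm, so the following is only a toy sketch of the general idea and not the authors' method. It assumes a binary file-message matrix (rows are files, columns are messages) and uses a hypothetical heuristic: repeatedly take the most frequent remaining message pattern as a candidate set of required messages, then assign to it every unassigned file that produces all of those messages. The names `distinct_patterns`, `greedy_dialects`, and the sample `files` matrix are made up for illustration, and the statistical independence condition on the remaining messages is not modeled here.

```python
from collections import Counter


def row_pattern(row):
    """Return the set of message (column) indices that a file triggered."""
    return frozenset(j for j, value in enumerate(row) if value)


def distinct_patterns(matrix):
    """Count how many files share each distinct message pattern."""
    return Counter(row_pattern(row) for row in matrix)


def greedy_dialects(matrix):
    """Toy greedy partition of files into non-overlapping groups.

    Hypothetical heuristic, NOT the algorithm from the paper: the most
    frequent pattern among unassigned files becomes the required-message
    set, and every unassigned file containing all required messages joins.
    """
    patterns = [row_pattern(row) for row in matrix]
    unassigned = set(range(len(matrix)))
    dialects = []
    while unassigned:
        counts = Counter(patterns[i] for i in unassigned)
        # Prefer frequent patterns, then smaller ones; sorted() makes ties deterministic.
        required = max(counts, key=lambda p: (counts[p], -len(p), sorted(p)))
        members = sorted(i for i in unassigned if required <= patterns[i])
        dialects.append((required, members))
        unassigned -= set(members)
    return dialects


# Made-up example data: 6 files, 4 messages.
files = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
]

if __name__ == "__main__":
    print("distinct patterns:", len(distinct_patterns(files)))
    for required, members in greedy_dialects(files):
        print("required messages", sorted(required), "-> files", members)
```

On this made-up matrix the grouping yields fewer groups than distinct message patterns, which is the kind of reduction the abstract describes, although a faithful implementation would also need to test the independence of the non-required messages within each group.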

research
01/20/2022

Statistical detection of format dialects using the weighted Dowker complex

This paper provides an experimentally validated, probabilistic model of ...
research
12/15/2020

Looking for non-compliant documents using error messages from multiple parsers

Whether a file is accepted by a single parser is not a reliable indicati...
research
09/17/2019

Breaking Imphash

There are numerous schemes to generically signature artifacts. We specif...
research
08/17/2018

Single-Server Multi-Message Private Information Retrieval with Side Information

We study the problem of single-server multi-message private information ...
research
07/11/2021

On the Expressiveness of Assignment Messages

In this note we prove that the class of valuation functions representabl...
research
08/23/2022

Null Messages, Information and Coordination

This paper investigates the transfer of information in fault-prone synch...
research
02/13/2020

Scheduling periodic messages on a shared link

Cloud-RAN is a recent architecture for mobile networks where the process...
