Notarial timestamps savings in logs management via Merkle trees and Key Derivation Functions

10/05/2021 ∙ by Andrea Barontini, et al.

Nowadays the handling of log files imposes strict regulatory duties on ISPs (intended in their widest scope), on top of the usual technological issues. This work analyses how retention time policies and timestamping are deeply interlinked from the point of view of service providers, possibly leading to rising costs. A new schema is proposed that tries to reduce the need for third-party suppliers by applying cryptographic primitives that are well established in other fields of Information Technology but perhaps not yet widespread in log management. The foundations of timestamping are recapped, and the properties of cryptographic primitives are introduced as a natural way to bypass the inefficiency of the legacy schema and as an extra level of protection: these choices are justified by an estimation of the savings (for ISPs of different sizes) and by some basic security considerations.





1 The context

In recent years awareness of and concern about privacy have risen: data handling by Internet and telecommunication providers is under the spotlight as never before, lawmakers are finally entering the field, and a fair balance between the convenience of cloud services and worries about digital identities is far from being reached.

It is easy to see that application, server and network logs are no longer just an IT operations commodity for troubleshooting technical problems: duties about their management are emerging. The landscape is varied, and this work deals only with a couple of very basic requirements and the link between them:

  • retention times, i.e. the minimum amount of time a log must be available upon request and the maximum amount of time it can be kept (e.g. police can ask a provider for some data one week after it has been recorded, but after, say, five years that data must be deleted);

  • tamper-proofness, meaning that once data has been recorded it cannot be changed without invalidating it.

Retention times are usually enforced through disincentives: a service owner found without the required data, or found using data beyond its retention limit, should be sanctioned. Claiming data integrity instead requires a technological proof, in the form of a notarized timestamp of the log file, given how easily an unfaithful service provider could alter the logs by forging a new file (whether that is really easier than deleting a log only long after the time limit is a moot point, but that is how the world goes). Timestamps are commonly applied during log file rotation.

It is worth noting that retention time is actually a property of each event recorded in a log file, not of the log file as a whole: for example, on a network device a root login and an interface UP/DOWN will probably have very different retention requirements even if they are collected in the same file. This means that, even with a data-lake aggregation strategy producing a single huge log file (and that is not always the case), the events still have to be split by type so that different files can be timestamped separately, each containing only events that share the same retention policy: this way all retention constraints can be respected without breaking any timestamp.

Organizations subject to legislation (or to a security department) mandating many different retention times, or unable to process their logs as a single data lake, or needing frequent rotations because of the amount of data, could easily see the number of required notarized timestamps grow. Since these are often a paid third-party service, it makes sense to propose a way to reduce that number.

2 Timestamps fundamentals

Notarization of a file is not only a technological matter: since it deals with legal validity, it is of course also driven by legal constraints. For example, lawmakers could value guarantees coming from external factors, such as the assets of the issuing organization. Choices of this kind influence the type of infrastructure and the level of (de)centralization of the issuer [tskinds]; some examples, without any claim of completeness:

  • timestamps based on a Proof-of-Work blockchain with a sufficiently fine-grained block mining schedule (note that, even if fully decentralized, this option is not free of charge because of transaction fees) [blocknot];

  • timestamps as digital certificates delivered to the requester by an accredited issuer [RFC3161];

  • timestamps as an entry in a carefully protected database managed by an organization guaranteeing its existence for a certain number of years [bucap];

  • some mix of the previous ones.

However, all solutions share a common technological ground: to save space they use a digest of the file produced by a cryptographic hash (we will simply refer to it as “hash output” or even just “hash”) and associate it with a time tag (by writing the digest in a block at a specific height that can be traced back to that time, by signing the association, by storing the association in a secure DB, …).
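As a minimal sketch of the digest step only, the following Python snippet hashes a log file's content with SHA-256 and shows that any change to the content changes the digest. The function name `file_digest` and the sample log lines are illustrative assumptions, not part of the original work; the association with a time tag is left to the timestamping service.

```python
import hashlib


def file_digest(content: bytes) -> str:
    """Return the SHA-256 digest of a log file's content as a hex string."""
    return hashlib.sha256(content).hexdigest()


# Illustrative log content (made up for this sketch).
original = b"2021-10-05T00:00:01Z if0 UP\n"
tampered = original.replace(b"UP", b"DOWN")

digest_original = file_digest(original)
digest_tampered = file_digest(tampered)

# The digest has a fixed size regardless of the file size,
# and the altered file no longer matches the original digest.
print(len(digest_original), digest_original != digest_tampered)
```

Only the fixed-size digest needs to reach the timestamping service, whatever the size of the log file.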

A cryptographic hash is employed because [preimage]:

  • preimage resistance guarantees that a file to be notarized must exist before the timestamp is issued, because we cannot timestamp a fake hash output and only later forge a document corresponding to it (put simply, the hash is one-way, so we cannot invert it);

  • second-preimage resistance makes it infeasible, once a document has been hashed and timestamped, to forge a second, different document with the same hash (so we cannot pretend to have timestamped a document different from the original one).

As background for the considerations in the next sections, it is important to note that:

  • the foregoing hash properties are computational properties, meaning that they hold because of the limits of our computational resources: with an unrealistically fast computer or, equivalently, an unrealistically long time at hand we could break them by brute force (trying hash inputs until one succeeds);

  • in general we cannot say whether constraining the hash input to be a valid log file (so considering only “well-formed” inputs, respecting the log grammar/syntax) strengthens or weakens the hash properties: roughly speaking, it depends on how the hash output distribution “reacts” to the restricted case. As for brute-force attacks, the input restriction certainly could reduce the domain; however, the set of candidates would still be so large to search that it is not obviously advantageous to add the overhead of an a priori input selection, considering how fast hashing is. In any case we are only interested in comparing a new timestamping schema with the current one: we will therefore consider their relative security, ignoring their effectiveness with respect to general hash theory and taking the security of the current timestamping procedure as an accepted matter of fact.

In light of the above, to timestamp many files separately we usually invoke the timestamping service many independent times, each time for a different hash. To obtain savings we therefore need to reduce the number of hashes processed by the third party, while trying not to lose the mutual independence of each file's timestamp markers. The “markers” qualifier is added here to underline a more complex structure: if the number of processed hashes is lower than the number of files, the issued timestamps will not be enough, by themselves, for each file, so we will also need something else.

3 Merkle trees as Accumulators

The “reduce the number of hashes processed by the third party” above is actually an understatement: it is possible to use just one hash to timestamp $N$ files (which of course will share the time certification; nightly rotated logs are good candidates). To get this result we exploit the accumulator nature of Merkle trees [merkletree]:

  • in a Merkle tree each internal node is the hash of the concatenation of its two child nodes, while each leaf node is the hash of one of the tree's inputs (in our case, the log files). If an internal node has only one child (which happens when the cardinality of the input set is not a power of 2), the single child is concatenated with itself. The top-level node is called the tree root. E.g.:

    [Figure: example Merkle tree.]

    With 6 input files, for instance, the second-level node above the fifth and sixth files would be the only node with a single child, and its value would be the hash of that child concatenated with itself.

  • It is easy to see that the “membership” of an input file in a Merkle tree can always be proved knowing the tree root, which represents the whole tree, and $\lceil \log_2 N \rceil$ other node values (where $N$ is the number of inputs): that is the accumulator nature of Merkle trees. E.g.: to prove the membership of a file in our example tree we use the file contents and the values of the black nodes to calculate the tree root, and check that it is equal to the one we already know:

    [Figure: example Merkle tree with the nodes needed for the proof in black.]

    The ordered set (from the inputs side to the tree root) of the black node values is called the Merkle path.

    Lastly, note that checking inputs whose way towards the tree root meets single-child nodes could need fewer values; however, to avoid dealing with special cases, we choose to use the “canonical” proof style for them as well (repeating the same value when needed).
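The construction and the membership proof described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: SHA-256 is used as the hash, the helper names (`build_levels`, `merkle_path`, `root_from_path`) are hypothetical, and each path element carries a flag telling on which side the sibling is concatenated, a detail the text leaves implicit.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA-256 digest as raw bytes."""
    return hashlib.sha256(data).digest()


def build_levels(files: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, from the leaves (index 0) up to the root."""
    if not files:
        raise ValueError("at least one input file is required")
    levels = [[h(f) for f in files]]
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = []
        for i in range(0, len(current), 2):
            left = current[i]
            # A single child is concatenated with itself.
            right = current[i + 1] if i + 1 < len(current) else left
            parents.append(h(left + right))
        levels.append(parents)
    return levels


def merkle_path(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Canonical path for leaf `index`: (sibling value, sibling_is_left) pairs."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # single child: repeat the same value
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def root_from_path(content: bytes, path: list[tuple[bytes, bool]]) -> bytes:
    """Recompute the tree root from a file's content and its Merkle path."""
    node = h(content)
    for sibling, sibling_is_left in path:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node


# Six illustrative inputs: not a power of 2, so a single-child node appears.
files = [f"log-{i}".encode() for i in range(6)]
levels = build_levels(files)
root = levels[-1][0]
path_of_fifth = merkle_path(levels, 4)
print(root_from_path(files[4], path_of_fifth) == root)
```

With 6 inputs every path has 3 elements, i.e. $\lceil \log_2 6 \rceil$, including the paths that cross the single-child node.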

4 The new timestamping schema and its savings

We now have all the elements to go back to our original problem: how to save timestamps. At this point the strategy should be self-evident:

  • the only timestamped hash will be the tree root;

  • for each log file, the timestamp markers introduced earlier consist of the same, unique timestamped tree root and of the file's Merkle path;

  • to prove that a file has not been changed since timestamping, we check that the Merkle root calculated from its content and its Merkle path is equal to the previously timestamped one.
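The three steps above can be illustrated with a small, hand-built tree over four files. The sketch below is only an assumption-laden illustration: `request_timestamp` is a hypothetical stand-in for the third-party service (it merely counts invocations), `Marker` is an illustrative container, and the file names are made up.

```python
import hashlib
from dataclasses import dataclass


def h(data: bytes) -> bytes:
    """SHA-256 digest as raw bytes."""
    return hashlib.sha256(data).digest()


@dataclass(frozen=True)
class Marker:
    """Per-file timestamp marker: the shared timestamped root plus the file's path."""
    timestamped_root: bytes
    path: tuple  # ((sibling value, sibling_is_left), ...) from leaves to root


tsa_calls = 0


def request_timestamp(digest: bytes) -> bytes:
    """Hypothetical stand-in for the timestamping service: it only counts calls."""
    global tsa_calls
    tsa_calls += 1
    return digest


def is_unchanged(content: bytes, marker: Marker) -> bool:
    """Recompute the root from content and path, then compare with the timestamped one."""
    node = h(content)
    for sibling, sibling_is_left in marker.path:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == marker.timestamped_root


# Four nightly rotated files; the tree is written out by hand for clarity.
files = [b"auth.log", b"iface.log", b"dns.log", b"mail.log"]
l0, l1, l2, l3 = (h(f) for f in files)
n01, n23 = h(l0 + l1), h(l2 + l3)
root = request_timestamp(h(n01 + n23))  # the only hash sent to the service

markers = [
    Marker(root, ((l1, False), (n23, False))),
    Marker(root, ((l0, True), (n23, False))),
    Marker(root, ((l3, False), (n01, True))),
    Marker(root, ((l2, True), (n01, True))),
]
print(tsa_calls, all(is_unchanged(f, m) for f, m in zip(files, markers)))
```

The service is invoked once for four files, while each file keeps its own independently verifiable marker.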

It seems a win-win situation. The comparison between the legacy and the new schema assumes: one log rotation per day; 32-byte SHA256 hashes; a Timestamp Authority (TSA) providing timestamps as X.509 certificates with a common size of about 5 KBytes; a per-timestamp price taken from a quick search of bundles available on the market during Q3 2021 [ufficiocame]. Storage values are rounded down to the nearest multiple of the byte unit for readability, and they account only for the space taken by timestamps or timestamp markers, not by the source log files.
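The way such a comparison can be computed from the stated assumptions is sketched below. The number of files per day and the price used in the example are hypothetical inputs chosen only for illustration, not figures from the original comparison; 5 KBytes is taken as 5000 bytes, and the new schema is assumed to store one TSA timestamp per day plus a Merkle path of $\lceil \log_2 N \rceil$ hashes per file.

```python
HASH_BYTES = 32          # SHA256 output size, as assumed in the text
TIMESTAMP_BYTES = 5_000  # "about 5 KBytes" per TSA timestamp (rounded assumption)


def yearly_figures(files_per_day: int, price_per_timestamp: float, days: int = 365) -> dict:
    """Compare timestamp counts, cost and marker storage for the two schemas."""
    if files_per_day < 1:
        raise ValueError("files_per_day must be at least 1")
    # ceil(log2(N)) computed with integers only: canonical Merkle path length.
    path_len = (files_per_day - 1).bit_length()
    legacy_timestamps = files_per_day * days
    new_timestamps = days  # one timestamped tree root per daily rotation
    return {
        "legacy_timestamps": legacy_timestamps,
        "new_timestamps": new_timestamps,
        "legacy_cost": legacy_timestamps * price_per_timestamp,
        "new_cost": new_timestamps * price_per_timestamp,
        "legacy_storage_bytes": legacy_timestamps * TIMESTAMP_BYTES,
        "new_storage_bytes": days * (TIMESTAMP_BYTES + files_per_day * path_len * HASH_BYTES),
    }


# Hypothetical provider: 100 separately timestamped files per day, 0.10 per timestamp.
example = yearly_figures(files_per_day=100, price_per_timestamp=0.10)
for name, value in example.items():
    print(f"{name}: {value}")
```

In this sketch the number of purchased timestamps no longer depends on the number of files, while the marker storage grows only with $N \lceil \log_2 N \rceil$ hashes per rotation.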

5 Security considerations

An easy (and to some extent correct) objection to the proposed schema is that Merkle paths provide a search space for a cheating log manager trying to forge a fake file after the timestamping procedure. The calculation of the Merkle root during verification can be seen as:

$$\text{root} = F(\text{logfile}, \text{merklepath})$$

It does not matter that the function $F$ itself contains hashes $H$: from our point of view it is just a function, with the log file and the Merkle path as inputs. Note that the schema does not require the Merkle path to be advertised before the timestamp verification phase, so a cheating log manager could try a second-preimage attack against $F$, treated as a hash function:

$$F(\text{logfile}', \text{merklepath}') = F(\text{logfile}, \text{merklepath})$$

The fake Merkle path offers a relevant search space: since each of its elements is itself a hash output, the input bits to play with, once a forged log file has been chosen, are a multiple of the size of the outer codomain.

It is known that finding a collision is easier than finding a second preimage or a preimage, given that in a collision search we can choose both inputs: so if finding a collision turns out to be unaffordable, the other attacks are at least as hard. The literature (i.e. the Birthday paradox [birthdayp] [birthdaya]) states that the expected number of tries to find a hash collision is proportional to the square root of the size of the hash codomain. Considering SHA256:

$$\sqrt{2^{256}} = 2^{128} \text{ tries}$$

Let us try to understand what $2^{128}$ tries means with an example.

The maximum time a cheating log manager could have to find a collision is the sum of the log retention time and of the maximum handling time of an access request (in the unlikely case that the search begins right after log collection, the access request is made near the end of the retention time, and all the permitted time is used to deal with the request). Italian telephone and Internet service providers can reach 6 years of metadata retention; logs that old are commonly stored offline, so let us consider 30 days for their handling: roughly $1.9 \times 10^{8}$ seconds overall. Given the $2^{128}$ expected tries for SHA256, succeeding in cheating within this time means trying about $1.8 \times 10^{30}$ fake Merkle paths per second. Is it possible?

The state of the art in hash computation (double-round hashing, actually) is currently (Q3 2021) represented by Bitcoin ASIC miners delivering 100 TH/s [bitmain], that is about $10^{14}$ double-round hashes per second. Even “forgetting” that the calculation of a Merkle root involves multiple hashes, it is evident that today's computational power does not allow collisions under the given assumptions (so, as said earlier, we are safe from second-preimage attacks as well).
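The orders of magnitude involved can be checked with a few lines of Python. The sketch uses only the assumptions stated in the text (6 years of retention plus 30 days of handling, SHA256, a 100 TH/s miner) and additionally assumes 365-day years; the variable names are illustrative.

```python
RETENTION_DAYS = 6 * 365      # 6 years of retention, assuming 365-day years
HANDLING_DAYS = 30            # handling time of an access request
SECONDS_PER_DAY = 24 * 3600

available_seconds = (RETENTION_DAYS + HANDLING_DAYS) * SECONDS_PER_DAY

# Birthday bound for SHA256: square root of the codomain size 2**256.
expected_tries = 2 ** 128

required_rate = expected_tries / available_seconds  # tries per second needed
asic_rate = 100e12                                  # 100 TH/s miner

shortfall = required_rate / asic_rate               # how many such miners would be needed

print(f"available time : {available_seconds:.2e} s")
print(f"required rate  : {required_rate:.2e} tries/s")
print(f"miners needed  : {shortfall:.2e}")
```

Even ignoring that each try involves several hashes, the required rate exceeds a single miner's throughput by about sixteen orders of magnitude.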

Given the foregoing result, it is also worth noting that we do not strictly need a reduction from attacks on the new timestamp markers to attacks on legacy timestamps (or, from a more technical point of view, from Merkle root attacks to hash attacks), because we have seen that exploiting the new extra attack possibilities, if any, is infeasible. This is no surprise, because the expected number of tries depends only on the size of the hash codomain, which is the same in the two cases.

The “given assumptions” statement can be a risky point when dealing with security, so let us state ours explicitly: the hashes are modeled as Random Oracles [rom] returning random and uniform outputs. Unfortunately the hashes actually employed are not ideal ROs; however, a pragmatic approach to this is widely accepted (at least in the Information Technology field where, e.g., SHA256 is not in question and Merkle tree structures play an important role in the Bitcoin decentralized consensus process [masteringmerkle]).

6 Schema improvement

Given the foregoing security considerations, we could take some extra precautions to complicate (and so discourage) brute-force attacks:

  • we could make cheaters' lives more difficult by using hashes with a huge codomain, but computational capabilities are always improving, and having to periodically change the hash type to preserve the security level does not seem a feasible, future-proof approach;

  • it is better to adopt a tuneable solution, using a slow Key Derivation Function [kdf] “between” the Merkle root and the third-party timestamping service: by imposing enough KDF repetitions, it would take longer to try enough fake Merkle paths during an attack:

    [Figure: KDF placed between the Merkle root and the timestamping service.]

The salt is a random value that prevents attacks by means of pre-computed rainbow tables, while repetitions is the number of KDF iterations (so it is the tuneable slowness parameter):

  • both of them (or any other KDF parameter) could also be committed by means of an extra hash, not explicitly shown in the picture, that compresses them together with the KDF output before TSA processing, so as to constrain their values;

  • retaining them requires only a minor increase in storage.
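A minimal sketch of this step is given below, assuming PBKDF2-HMAC-SHA256 from Python's standard library as the KDF and an extra SHA256 as the committing hash. The function names, the byte layout of the commitment and the number of repetitions are illustrative assumptions, not prescriptions from the original work.

```python
import hashlib
import os


def stretch_root(merkle_root: bytes, salt: bytes, repetitions: int) -> bytes:
    """Apply a deliberately slow KDF to the Merkle root."""
    return hashlib.pbkdf2_hmac("sha256", merkle_root, salt, repetitions)


def commitment(kdf_output: bytes, salt: bytes, repetitions: int) -> bytes:
    """Extra hash committing the KDF output together with salt and repetitions.

    The salt is length-prefixed so that the concatenation is unambiguous
    (an illustrative encoding choice).
    """
    encoded = (
        kdf_output
        + len(salt).to_bytes(2, "big") + salt
        + repetitions.to_bytes(8, "big")
    )
    return hashlib.sha256(encoded).digest()


# Illustrative values: a stand-in Merkle root, a random salt, a repetition count.
merkle_root = hashlib.sha256(b"stand-in for a real Merkle root").digest()
salt = os.urandom(16)
repetitions = 100_000

# This digest, instead of the bare Merkle root, would be sent to the TSA.
to_timestamp = commitment(stretch_root(merkle_root, salt, repetitions), salt, repetitions)
print(to_timestamp.hex())
```

At verification time the Merkle root is recomputed from the file and its Merkle path, the same KDF and commitment are applied with the retained salt and repetitions, and the result is compared with the timestamped value; every fake Merkle path tried by an attacker pays the same KDF cost.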

It is not among the aims of this work to indicate which Key Derivation Function to use (and with which parameters); it nevertheless seems right to cite an often adopted one, PBKDF2 (used in BIP39 Bitcoin wallets, for example [masteringpbkdf2]), and the “new” (2015) kid in town, Argon2. Their parameters should be chosen to enforce the maximum process delay compatible with the delay allowed for log timestamping (e.g. if logs have to be timestamped within 10 minutes of rotation, the KDF parameters cannot be set to force a 15-minute delay on the process). To find this compromise, conservative assumptions have to be made about the computational resources of the log collectors at the time of timestamping. This conservative estimate, together with the usual improvement of computational capabilities over time, could make the process delay less effective as time passes, but an overall slowdown of the attack would still be attained.

Lastly, the number of timestamped files can also be committed in the same way as salt and repetitions: since it determines the cardinality of the Merkle path, it imposes a constraint on the size of the path's representation as a bit string, thus reducing the attacker's degrees of freedom.