Defensive Distillation

What is Defensive Distillation?

Defensive distillation is an adversarial-defense training technique that adds flexibility to an algorithm’s classification process so the model is less susceptible to exploitation. In distillation training, a second model is trained to predict the output probabilities of a first model, which was itself trained against a baseline standard that emphasizes accuracy.

The first model is trained with “hard” labels to achieve maximum accuracy, for example by requiring a 100% probability that a biometric scan matches the fingerprint on record. The problem is that the algorithm doesn’t compare every single pixel, since that would take too much time. If an attacker learns which features and parameters the system scans for, the scammer can submit a fake fingerprint image containing just the handful of pixels the system’s programming looks for, which generates a false-positive match.

The first model then provides “soft” labels, for example a 95% probability that a fingerprint matches the scan on record. This uncertainty is used to train the second model, which acts as an additional filter. Because a perfect match now involves an element of randomness, the second, or “distilled”, algorithm is far more robust and can spot spoofing attempts more easily. It is far more difficult for a scammer to “game the system” and artificially create a perfect match for both algorithms simply by mimicking the first model’s training scheme.
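The two-step recipe above, as an added worked example that is not part of this page: a fixed linear “teacher” stands in for the accurately trained first model, its softmax is evaluated at a distillation temperature T (the standard knob for producing soft labels, which the text does not name), and a linear “student” is fit to those soft labels by gradient descent. The teacher weights, the input points and T = 20 are constructed for illustration.

```python
import math

# Constructed illustration (not from the page): a fixed linear "teacher" plays the
# accurately trained first model; a linear "student" is trained on its soft labels.
TEACHER_W = [[3.0, -1.0], [-3.0, 1.0]]   # 2 classes x 2 input features
TEACHER_B = [0.0, 0.0]
POINTS = [(1.0, 0.5), (0.8, -0.3), (-1.0, 0.4), (-0.6, -0.9),  # made-up inputs
          (0.2, 1.2), (-0.3, -1.1), (1.5, -0.2), (-1.4, 0.8)]

def logits(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def softmax(z, T=1.0):
    """Softmax at distillation temperature T; T > 1 flattens the output (soft labels)."""
    m = max(z)
    e = [math.exp((zi - m) / T) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def soft_labels(x, T):
    """The teacher's tempered probabilities: the student's training target."""
    return softmax(logits(TEACHER_W, TEACHER_B, x), T)

def train_student(T, epochs=3000, lr=0.1):
    """Full-batch gradient descent on cross-entropy between student and soft labels.
    d(loss)/d(student logit k) is (p_k - t_k) / T; it is multiplied by T**2, the
    usual convention, so the effective step does not shrink as T grows."""
    W, b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
    for _ in range(epochs):
        gW, gb = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
        for x in POINTS:
            p, t = softmax(logits(W, b, x), T), soft_labels(x, T)
            for k in range(2):
                g = (p[k] - t[k]) * T / len(POINTS)
                gb[k] += g
                for j in range(2):
                    gW[k][j] += g * x[j]
        for k in range(2):
            b[k] -= lr * gb[k]
            for j in range(2):
                W[k][j] -= lr * gW[k][j]
    return W, b

def student_probs(W, b, x, T=1.0):
    """Deployed student; at T = 1 this is the ordinary softmax over its logits."""
    return softmax(logits(W, b, x), T)

if __name__ == "__main__":
    W, b = train_student(T=20.0)
    x = POINTS[0]
    print("hard-ish teacher output at T=1:", softmax(logits(TEACHER_W, TEACHER_B, x)))
    print("soft label at T=20           :", soft_labels(x, 20.0))
    print("distilled student at T=1     :", student_probs(W, b, x))
```

The soft labels stay close to 50/50 at T = 20, so the student learns the teacher's relative confidences rather than only its hard decisions, and once it has matched the teacher's logits its T = 1 output reproduces the teacher's verdict.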

What are the Advantages and Disadvantages of Defensive Distillation?

The biggest advantage of the distillation approach is that it adapts to unknown threats. The other most effective adversarial-defense training method requires continuously feeding the signatures of all known vulnerabilities and attacks into the system; distillation is more dynamic and requires less human intervention.

The biggest disadvantage is that while the second model has more wiggle room to reject input manipulation, it is still bound by the general rules of the first model. With enough computing power and fine-tuning on the attacker’s part, both models can therefore be reverse-engineered to discover fundamental exploits. Distillation models are also vulnerable to so-called poisoning attacks, in which the initial training database is corrupted by a malicious actor.