Adversarial Machine Learning

What is Adversarial Machine Learning?

Adversarial machine learning is a collection of techniques for training neural networks to spot intentionally misleading data or behaviors. It differs from the standard classification problem in machine learning because the goal is not just to spot “bad” inputs, but also to preemptively locate vulnerabilities and craft more flexible learning algorithms.

Although there are countless attacks and attack vectors for exploiting machine learning systems, in broad strokes they all boil down to one of two types:

  • Classification evasion: The most common form of attack, where the adversary disguises malicious content so that it passes the algorithm’s filters.
  • Data poisoning: This more sophisticated attack tries to manipulate the learning process by introducing fake or misleading data that compromises the algorithm’s outputs.
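
As a rough illustration of the two attack types above, the sketch below uses a deliberately tiny setup. Everything in it is an assumption made for the example: the linear "filter" with hand-picked weights, the helper names (score, is_flagged, evade, fit_threshold), and the one-dimensional threshold learner. Real attacks target far more complex models, but the mechanics are similar in spirit.

```python
# Toy illustration only: weights, inputs and helper names are made up.

# --- Classification evasion against a linear filter ---
# score = w . x + b; the input is flagged as malicious when score > 0.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.5

def score(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS

def is_flagged(x):
    return score(x) > 0

def sign(v):
    return (v > 0) - (v < 0)

def evade(x, epsilon):
    # For a linear model the gradient of the score with respect to x is
    # simply WEIGHTS, so stepping each feature by epsilon against the sign
    # of its weight lowers the score as much as a per-feature budget allows.
    return [xi - epsilon * sign(w) for xi, w in zip(x, WEIGHTS)]

malicious = [1.0, 0.5, 1.0]
adversarial = evade(malicious, epsilon=0.6)

# --- Data poisoning against a midpoint-threshold learner ---
def mean(values):
    return sum(values) / len(values)

def fit_threshold(benign, bad):
    # "Training" places the decision threshold halfway between class means.
    return (mean(benign) + mean(bad)) / 2

clean_threshold = fit_threshold([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
# The attacker slips high-valued samples into the data labeled as benign,
# dragging the learned threshold upward.
poisoned_threshold = fit_threshold([1.0, 2.0, 3.0, 8.0, 9.0, 10.0],
                                   [7.0, 8.0, 9.0])

print(is_flagged(malicious), is_flagged(adversarial))
print(clean_threshold, poisoned_threshold)
```

In the evasion half, the input is changed at prediction time and the model is untouched. In the poisoning half, the inputs at prediction time are honest but the model itself has been shifted, so a sample with value 6.0 would be caught by the clean threshold and missed by the poisoned one.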

Note: this field is security-oriented and is not the same as generative adversarial networks (GANs), an unsupervised machine learning technique that pits two neural networks, a generator and a discriminator, against one another so that the generator learns to produce increasingly realistic data.

Adversarial Machine Learning Defenses

The most successful techniques for training AI systems to withstand these attacks fall into two classes:

  1. Adversarial training

    – This is a brute-force supervised learning method in which as many adversarial examples as possible are fed into the model and explicitly labeled as threatening. Typical antivirus software on personal computers takes the same approach, receiving multiple updates every day. While quite effective, it requires continuous maintenance to stay abreast of new threats, and it suffers from a fundamental limitation: it can only stop something that has already happened from happening again.

  2. Defensive distillation

    – This strategy adds flexibility to an algorithm’s classification process so the model is less susceptible to exploitation. In distillation training, a second model is trained to predict the output probabilities of a first model, which was itself trained earlier on the baseline labels with an emphasis on accuracy.
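
To make the first of these defenses more concrete, here is a minimal sketch of adversarial training on a toy problem. The model (a two-feature logistic regression with no bias term), the four data points, the perturbation budget eps, and all function names are assumptions chosen for illustration. The perturbation used is a sign-of-gradient step, which is one common way to generate adversarial examples, not the only one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def perturb(x, y, w, eps):
    # Worst-case per-feature perturbation for a linear model: shift every
    # feature by eps in the direction that pushes the score toward the
    # wrong class (down for label 1, up for label 0).
    direction = -1 if y == 1 else 1
    return [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]

def predict(x, w):
    return 1 if dot(w, x) > 0 else 0

def adversarial_train(data, eps=0.5, lr=0.1, epochs=20):
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            # Train on the clean sample and on its adversarial twin; both
            # carry the true label, which is the core of the method.
            for sample in (x, perturb(x, y, w, eps)):
                p = sigmoid(dot(w, sample))
                w = [wi + lr * (y - p) * si for wi, si in zip(w, sample)]
    return w

# Hypothetical data: label 1 = malicious, label 0 = benign.
DATA = [([2.0, 1.5], 1), ([1.0, 2.5], 1),
        ([-2.0, -1.0], 0), ([-1.5, -2.5], 0)]
EPS = 0.5

weights = adversarial_train(DATA, eps=EPS)
clean_ok = all(predict(x, weights) == y for x, y in DATA)
robust_ok = all(predict(perturb(x, y, weights, EPS), weights) == y
                for x, y in DATA)
print(clean_ok, robust_ok)
```

The sketch also shows the limitation described above: the model is only hardened against the kind of perturbation it was shown during training (here, shifts of at most eps per feature), so a new attack style would require new examples and retraining.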

The biggest advantage of the distillation approach is that it’s adaptable to unknown threats. While not foolproof, distillation is more dynamic and requires less human intervention than adversarial training. The biggest disadvantage is that while the second model has more wiggle room to reject input manipulation, it is still bound by the general rules of the first model. So with enough computing power and fine-tuning on the attacker’s part, both models can be reverse-engineered to discover fundamental exploits.
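
The core of distillation, training one model on another model's output probabilities, can be sketched in a few lines. The example below is an assumption-laden illustration: the teacher logits are invented, and the softmax "temperature" parameter comes from the usual formulation of distillation rather than from the description above. It shows only how the soft labels and the second model's loss would be computed, not a full training loop.

```python
import math

def softmax(logits, temperature=1.0):
    # Dividing logits by a temperature above 1 flattens the distribution,
    # which is what produces "soft" labels.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    # Loss the second model would minimize against the first model's
    # soft labels.
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

# Hypothetical raw outputs of the first (baseline) model for one input.
teacher_logits = [8.0, 2.0, 0.0]

hard_probs = softmax(teacher_logits, temperature=1.0)
soft_labels = softmax(teacher_logits, temperature=20.0)

# A second model whose outputs match the soft labels has lower loss than
# one that is confidently peaked on a single class.
matching_student = soft_labels
peaked_student = hard_probs
loss_matching = cross_entropy(soft_labels, matching_student)
loss_peaked = cross_entropy(soft_labels, peaked_student)

print([round(p, 3) for p in hard_probs])
print([round(p, 3) for p in soft_labels])
print(loss_matching < loss_peaked)
```

Because the soft labels keep the first model's class ranking while spreading probability across classes, the second model is encouraged toward smoother decision boundaries. That is the "wiggle room" mentioned above, and also the reason it remains tied to what the first model learned.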