How can we train our models so that they are harder to break with adversarial attacks?

Adversarial Training

A basic idea is to add adversarial samples to our training set and re-train. That helps somewhat, especially against perturbations similar to the samples we added.

What if we built this process into our training? We can incorporate a mini adversarial attack into every gradient step while training.

TRADES

"TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization" uses an idea like this.

Consider:

  • Model $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$, producing a score for each of $k$ classes
  • Dataset with inputs $x_i$ and targets $y_i$, for $i = 1, \dots, n$.

Then, $\hat{y}(x) = \arg\max_c \, [f_\theta(x)]_c$ is the predicted class of $x$. The classification is correct if $\hat{y}(x) = y$.
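For concreteness, a tiny sketch of the prediction rule (the score vector and label are made-up values):

```python
import numpy as np

# Hypothetical score vector f_theta(x) for k = 3 classes.
scores = np.array([0.2, 1.5, -0.3])

y_hat = int(np.argmax(scores))  # predicted class: index of the largest score
y = 1                           # true label

correct = (y_hat == y)
print(y_hat, correct)  # 1 True
```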

The classification loss can be written as:

$$ \mathcal{L}_{0/1} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\hat{y}(x_i) \neq y_i\} $$

where we are using the indicator function $\mathbf{1}\{\cdot\}$ to count how many times we get a wrong class.
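As a quick sketch, the 0-1 loss is just the fraction of mismatched predictions (the arrays below are hypothetical):

```python
import numpy as np

# Hypothetical predicted and true labels for n = 5 points.
y_hat = np.array([0, 2, 1, 1, 0])
y     = np.array([0, 1, 1, 2, 0])

# 0-1 loss: average of the indicator 1{y_hat != y}.
zero_one_loss = np.mean(y_hat != y)
print(zero_one_loss)  # 0.4 (2 mistakes out of 5)
```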

If we want to consider how our model will perform under adversarial attack, we consider the robust loss

$$ \mathcal{L}_{\mathrm{rob}} = \frac{1}{n} \sum_{i=1}^{n} \max_{\|x' - x_i\|_\infty \le \epsilon} \mathbf{1}\{\hat{y}(x') \neq y_i\} $$

This has some built-in pessimism: it looks for the worst case in the $\epsilon$-neighborhood of each $x_i$.
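A minimal sketch of this pessimism, assuming a toy 1-D threshold classifier and a grid search standing in for the inner maximization (in practice the neighborhood is searched with gradient ascent, not a grid):

```python
import numpy as np

# Toy 1-D classifier: predict class 1 iff x > 0.
def predict(x):
    return int(x > 0)

x, y, eps = 0.05, 1, 0.1

# Natural 0-1 loss at x: the point itself is classified correctly.
natural = int(predict(x) != y)

# Robust 0-1 loss: worst case over a grid covering [x - eps, x + eps].
neighborhood = np.linspace(x - eps, x + eps, 201)
robust = max(int(predict(xp) != y) for xp in neighborhood)

print(natural, robust)  # 0 1: correct at x, but breakable within eps
```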

Instead of counting misclassified points directly, we make the objective differentiable by approximating the indicator with a surrogate loss function $\ell$ (e.g., the cross-entropy loss):

$$ \mathbf{1}\{\hat{y}(x) \neq y\} \;\approx\; \ell(f_\theta(x), y) $$

Then, the natural loss becomes:

$$ \mathcal{L}_{\mathrm{nat}} = \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i) $$
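A sketch of the natural loss with cross-entropy as the surrogate (the score vectors and labels are made-up values):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(scores, y):
    # Surrogate ell(f(x), y) = -log p_y: differentiable, and large
    # exactly when the true class gets low probability.
    return -np.log(softmax(scores)[y])

# Hypothetical scores for two points with k = 3 classes.
scores = [np.array([2.0, 0.1, -1.0]), np.array([0.0, 0.5, 0.2])]
labels = [0, 2]

natural_loss = np.mean([cross_entropy(s, y) for s, y in zip(scores, labels)])
print(natural_loss)
```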

We can train a robust model with the combined loss

$$ \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \Big[ \ell(f_\theta(x_i), y_i) \;+\; \beta \max_{\|x' - x_i\|_\infty \le \epsilon} \ell(f_\theta(x'), f_\theta(x_i)) \Big] $$

  • The first term ensures that each $x_i$ is correctly classified
  • The second term adds a penalty for models that put $x_i$ within $\epsilon$ of the decision boundary

Implementation:

  • For each gradient descent step
    • Run several steps of (projected) gradient ascent to find $x'_i \approx \arg\max_{\|x' - x_i\|_\infty \le \epsilon} \ell(f_\theta(x'), f_\theta(x_i))$
    • Evaluate the joint loss $\ell(f_\theta(x_i), y_i) + \beta \, \ell(f_\theta(x'_i), f_\theta(x_i))$, where $\beta$ is a regularization parameter
    • Use the gradient of the joint loss with respect to $\theta$ for the gradient step
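The steps above can be sketched end-to-end on a toy binary problem. This is not the reference TRADES implementation: it assumes a plain logistic model, uses a few sign-of-gradient ascent steps with a random start for the inner maximization, and treats the clean prediction as a fixed target when differentiating the second term — all simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: two separable 2-D clusters (hypothetical).
X = np.vstack([rng.normal(-1.0, 0.3, (20, 2)), rng.normal(1.0, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

w = np.zeros(2)                           # model parameters (no bias term)
eps, beta, lr, ascent_lr = 0.25, 1.0, 0.5, 0.1

for step in range(200):
    grad_w = np.zeros(2)
    for xi, yi in zip(X, y):
        p_nat = sigmoid(w @ xi)           # clean prediction f(x)

        # Inner maximization: push x' to disagree with the clean
        # prediction, starting from a random point in the eps-ball.
        x_adv = xi + rng.uniform(-eps, eps, 2)
        for _ in range(5):
            p_adv = sigmoid(w @ x_adv)
            # Ascent direction of the surrogate gap: (p_adv - p_nat) * w.
            x_adv += ascent_lr * np.sign((p_adv - p_nat) * w)
            # Project back into the l-infinity eps-ball around xi.
            x_adv = xi + np.clip(x_adv - xi, -eps, eps)
        p_adv = sigmoid(w @ x_adv)

        # Joint loss gradient w.r.t. w: cross-entropy term (p - y) x,
        # plus the regularizer with p_nat treated as a fixed target.
        grad_w += (p_nat - yi) * xi + beta * (p_adv - p_nat) * x_adv
    w -= lr * grad_w / len(X)

acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(acc)
```

Note the design choice in the last gradient line: detaching the clean prediction from the second term keeps the hand-derived gradient simple; an autodiff framework would let both terms be differentiated exactly.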