How can we train our models so that they are harder to break with adversarial attacks?

Adversarial Training

A basic idea is to add adversarial samples to our training set and re-train. That helps somewhat, especially against perturbations similar to the samples we added.

What if we built this process into our training? We can incorporate a mini adversarial attack into every gradient step while training.

TRADES

"TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization" uses an idea like this.

Consider:

  • Model $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$, producing a score for each of $k$ classes
  • Dataset with inputs $x_i$ and targets $y_i$, for $i = 1, \dots, n$.

Then, $\hat{y}(x) = \arg\max_c \, [f_\theta(x)]_c$ is the predicted class of $x$. The classification is correct if $\hat{y}(x) = y$.
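For concreteness, a tiny sketch of the prediction rule (the score vector and label are made-up values):

```python
import numpy as np

# Hypothetical score vector f_theta(x) for k = 3 classes.
scores = np.array([0.2, 1.5, -0.3])

y_hat = int(np.argmax(scores))  # predicted class: index of the largest score
y = 1                           # true label

correct = (y_hat == y)
print(y_hat, correct)  # 1 True
```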

The classification loss can be written as:

$$ \mathcal{L}_{0/1} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\hat{y}(x_i) \neq y_i\} $$

where we are using the indicator function $\mathbf{1}\{\cdot\}$ to count how many times we get a wrong class.
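As a quick sketch, the 0-1 loss is just the fraction of mismatched predictions (the arrays below are hypothetical):

```python
import numpy as np

# Hypothetical predicted and true labels for n = 5 points.
y_hat = np.array([0, 2, 1, 1, 0])
y     = np.array([0, 1, 1, 2, 0])

# 0-1 loss: average of the indicator 1{y_hat != y}.
zero_one_loss = np.mean(y_hat != y)
print(zero_one_loss)  # 0.4 (2 mistakes out of 5)
```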

If we want to consider how our model will perform under adversarial attack, we consider the robust loss

$$ \mathcal{L}_{\mathrm{rob}} = \frac{1}{n} \sum_{i=1}^{n} \max_{\|x' - x_i\|_\infty \le \epsilon} \mathbf{1}\{\hat{y}(x') \neq y_i\} $$

This has some built-in pessimism: it looks for the worst case in the $\epsilon$-neighborhood of each $x_i$.
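A minimal sketch of this pessimism, assuming a toy 1-D threshold classifier and a grid search standing in for the inner maximization (in practice the neighborhood is searched with gradient ascent, not a grid):

```python
import numpy as np

# Toy 1-D classifier: predict class 1 iff x > 0.
def predict(x):
    return int(x > 0)

x, y, eps = 0.05, 1, 0.1

# Natural 0-1 loss at x: the point itself is classified correctly.
natural = int(predict(x) != y)

# Robust 0-1 loss: worst case over a grid covering [x - eps, x + eps].
neighborhood = np.linspace(x - eps, x + eps, 201)
robust = max(int(predict(xp) != y) for xp in neighborhood)

print(natural, robust)  # 0 1: correct at x, but breakable within eps
```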

Instead of counting misclassified points directly, we make the objective differentiable by approximating the indicator with a surrogate loss function $\ell$ (e.g., the cross-entropy loss):

$$ \mathbf{1}\{\hat{y}(x) \neq y\} \;\approx\; \ell(f_\theta(x), y) $$

Then, the natural loss becomes:

$$ \mathcal{L}_{\mathrm{nat}} = \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i) $$
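A sketch of the natural loss with cross-entropy as the surrogate (the score vectors and labels are made-up values):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(scores, y):
    # Surrogate ell(f(x), y) = -log p_y: differentiable, and large
    # exactly when the true class gets low probability.
    return -np.log(softmax(scores)[y])

# Hypothetical scores for two points with k = 3 classes.
scores = [np.array([2.0, 0.1, -1.0]), np.array([0.0, 0.5, 0.2])]
labels = [0, 2]

natural_loss = np.mean([cross_entropy(s, y) for s, y in zip(scores, labels)])
print(natural_loss)
```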

We can train a robust model with the combined loss

$$ \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \Big[ \ell(f_\theta(x_i), y_i) \;+\; \beta \max_{\|x' - x_i\|_\infty \le \epsilon} \ell(f_\theta(x'), f_\theta(x_i)) \Big] $$

  • The first term ensures that each $x_i$ is correctly classified
  • The second term adds a penalty for models that put $x_i$ within $\epsilon$ of the decision boundary

Implementation:

  • For each gradient descent step
    • Run several steps of (projected) gradient ascent to find $x'_i \approx \arg\max_{\|x' - x_i\|_\infty \le \epsilon} \ell(f_\theta(x'), f_\theta(x_i))$
    • Evaluate the joint loss $\ell(f_\theta(x_i), y_i) + \beta \, \ell(f_\theta(x'_i), f_\theta(x_i))$, where $\beta$ is a regularization parameter
    • Use the gradient of the joint loss with respect to $\theta$ for the gradient step
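The steps above can be sketched end-to-end on a toy binary problem. This is not the reference TRADES implementation: it assumes a plain logistic model, uses a few sign-of-gradient ascent steps with a random start for the inner maximization, and treats the clean prediction as a fixed target when differentiating the second term — all simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: two separable 2-D clusters (hypothetical).
X = np.vstack([rng.normal(-1.0, 0.3, (20, 2)), rng.normal(1.0, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

w = np.zeros(2)                           # model parameters (no bias term)
eps, beta, lr, ascent_lr = 0.25, 1.0, 0.5, 0.1

for step in range(200):
    grad_w = np.zeros(2)
    for xi, yi in zip(X, y):
        p_nat = sigmoid(w @ xi)           # clean prediction f(x)

        # Inner maximization: push x' to disagree with the clean
        # prediction, starting from a random point in the eps-ball.
        x_adv = xi + rng.uniform(-eps, eps, 2)
        for _ in range(5):
            p_adv = sigmoid(w @ x_adv)
            # Ascent direction of the surrogate gap: (p_adv - p_nat) * w.
            x_adv += ascent_lr * np.sign((p_adv - p_nat) * w)
            # Project back into the l-infinity eps-ball around xi.
            x_adv = xi + np.clip(x_adv - xi, -eps, eps)
        p_adv = sigmoid(w @ x_adv)

        # Joint loss gradient w.r.t. w: cross-entropy term (p - y) x,
        # plus the regularizer with p_nat treated as a fixed target.
        grad_w += (p_nat - yi) * xi + beta * (p_adv - p_nat) * x_adv
    w -= lr * grad_w / len(X)

acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(acc)
```

Note the design choice in the last gradient line: detaching the clean prediction from the second term keeps the hand-derived gradient simple; an autodiff framework would let both terms be differentiated exactly.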