Adversarial Attacks

We can easily fool a classifier network into misclassifying an input.

Suppose we are given a dataset.

$$D = \{ (x, t) \mid x \in X, \ t \in \{1, \dots, K\} \}$$

where $X$ is the space of inputs and the class of $x$ is $t$.

Let $f : X \to P^{K}$ be a classifier network, where

$$P^{K} = \{ y \in \mathbb{R}^{K} \mid 0 \leq y_{i} \leq 1, \ \sum_{i} y_{i} = 1 \}$$

represents probability vectors, such as those produced by a softmax function.

The classification error is defined as:

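$$R(f) \equiv \mathbb{E}_{D} \left[ \mathbb{1}\{ \arg\max_{i} y_{i} \neq t \} \mid (x, t) \in D, \ y = f(x) \right]$$

where:

$\mathbb{1}\{\cdot\}$ is the indicator function: it is $1$ when its condition holds and $0$ otherwise, so the expectation is the fraction of misclassified samples
$\arg\max_{i} y_{i}$ gives the index of the largest element of $y$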
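
The error defined above is the fraction of samples whose most probable class differs from the target. A minimal sketch, assuming the network outputs are already available as lists of probabilities and that classes are indexed from 0 (the notes index them from 1); the data below is made up for illustration:

```python
def classification_error(outputs, targets):
    """Fraction of samples whose arg-max class differs from the target."""
    mistakes = 0
    for y, t in zip(outputs, targets):
        predicted = max(range(len(y)), key=lambda i: y[i])  # arg max_i y_i
        mistakes += int(predicted != t)                      # indicator function
    return mistakes / len(targets)

# Hypothetical softmax outputs for four samples and their true classes
outputs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1], [0.2, 0.2, 0.6]]
targets = [0, 2, 0, 1]

print(classification_error(outputs, targets))  # the last two samples are wrong
```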

Let’s define the $ϵ$-ball (neighborhood) of an input $x$ as:

$$B(x, ϵ) = \{ x' \in X \mid \| x' - x \| \leq ϵ \}$$

We ask ourselves: given an $(x, t) \in D$, is there an $x' \in B(x, ϵ)$ such that

$$\arg\max_{i} y_{i} \neq t, \quad y = f(x')$$

In other words, is there a very nearby input that would fool the network and yield an incorrect classification?
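
As a small illustration of what "very nearby" means, the membership test for the $ϵ$-ball can be sketched as follows. The notes do not say which norm is used; this sketch assumes the $L_{\infty}$ norm (largest per-pixel change), and the pixel values are made up:

```python
def linf_distance(a, b):
    """Largest absolute coordinate-wise difference between a and b."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

def in_ball(x_prime, x, eps):
    """True if x_prime lies in the eps-ball around x (L-infinity norm assumed)."""
    return linf_distance(x_prime, x) <= eps

x = [120, 64, 200, 33]                          # hypothetical pixel intensities
print(in_ball([121, 63, 200, 34], x, eps=1))    # every pixel moved by at most 1
print(in_ball([123, 64, 200, 33], x, eps=1))    # one pixel moved by 3
```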

Such inputs can actually be found quite easily. Constructing one is called an adversarial attack. There are two main classes of adversarial attacks:

Whitebox: Attacker has access to the whole model, e.g., weights, activations, etc.
Blackbox: Attacker only has access to inputs and outputs

Gradient-Based Whitebox Attack

This is a common whitebox attack method. Recall that learning is done by gradient descent:

$$θ = θ - κ \nabla_{θ} E$$

where $E$ is our loss function. Using backpropagation, we propagate the gradient of the loss function down through the layers of the network.

We can calculate $\nabla_{x} E$ using $\nabla_{z^{1}} E$:

$$
\begin{aligned}
z^{1} &= x W^{0} + b^{1} \\
\nabla_{x} E &= \nabla_{z^{1}} E \, \frac{\partial z^{1}}{\partial x} = \nabla_{z^{1}} E \cdot (W^{0})^{T}
\end{aligned}
$$

This gives us the gradient of the loss with respect to the input, telling us how to adjust our input in order to decrease (or increase) the loss.
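
To make the input gradient concrete, here is a minimal sketch for the simplest possible case: a single linear layer followed by a softmax, trained with cross-entropy. Those choices, the function names, and the weights are assumptions for illustration; the notes keep the network and $E$ generic. The analytic gradient is checked against central finite differences.

```python
import math

def softmax(z):
    m = max(z)                                   # subtract the max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W, b):
    # z_j = sum_i x_i W[i][j] + b_j, followed by y = softmax(z)
    z = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    return softmax(z)

def loss(x, W, b, t):
    # Cross-entropy for class t (an assumption; the notes leave E generic)
    return -math.log(forward(x, W, b)[t])

def input_gradient(x, W, b, t):
    y = forward(x, W, b)
    # For softmax + cross-entropy, grad_z E = y - onehot(t)
    grad_z = [y[j] - (1.0 if j == t else 0.0) for j in range(len(y))]
    # grad_x E = grad_z E . W^T
    return [sum(grad_z[j] * W[i][j] for j in range(len(y))) for i in range(len(x))]

# Made-up weights: 4 inputs, 3 classes
W = [[0.5, -0.3, 0.1],
     [-0.2, 0.4, 0.3],
     [0.1, 0.2, -0.5],
     [0.3, -0.1, 0.2]]
b = [0.0, 0.1, -0.1]
x = [1.0, 0.5, -0.5, 2.0]
t = 0

grad = input_gradient(x, W, b, t)

# Central finite differences as an independent check
h = 1e-6
numeric = []
for i in range(len(x)):
    x_plus = list(x)
    x_plus[i] += h
    x_minus = list(x)
    x_minus[i] -= h
    numeric.append((loss(x_plus, W, b, t) - loss(x_minus, W, b, t)) / (2 * h))

max_error = max(abs(g - n) for g, n in zip(grad, numeric))
print(max_error)  # should be tiny if the analytic gradient is right
```

In a deeper network the same idea applies: backpropagation already produces $\nabla_{z^{1}} E$, and one more multiplication by $(W^{0})^{T}$ carries it to the input.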

Untargeted attack:

$$x' = x + k \nabla_{x} E(f(x; θ), t(x))$$

This is essentially gradient ascent, pushing the image in a direction that increases the loss.

Targeted attack:

$$x' = x - k \nabla_{x} E(f(x; θ), l)$$

where $l \neq t(x)$. This is gradient descent nudging the input to decrease loss for the wrong target class.
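
The two update rules differ only in the sign of the step and in which label the loss is computed against. A minimal sketch on a toy linear-softmax classifier with cross-entropy loss (the model, the helper `attack_step`, the weights, and the step size are all assumptions for illustration):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W, b):
    z = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    return softmax(z)

def loss(x, W, b, label):
    return -math.log(forward(x, W, b)[label])    # cross-entropy (assumed)

def input_gradient(x, W, b, label):
    y = forward(x, W, b)
    grad_z = [y[j] - (1.0 if j == label else 0.0) for j in range(len(y))]
    return [sum(grad_z[j] * W[i][j] for j in range(len(y))) for i in range(len(x))]

def attack_step(x, W, b, label, k, targeted=False):
    """One gradient step on the input: ascent if untargeted, descent if targeted."""
    grad = input_gradient(x, W, b, label)
    step = -k if targeted else k
    return [xi + step * g for xi, g in zip(x, grad)]

W = [[0.5, -0.3, 0.1],
     [-0.2, 0.4, 0.3],
     [0.1, 0.2, -0.5],
     [0.3, -0.1, 0.2]]
b = [0.0, 0.1, -0.1]
x = [1.0, 0.5, -0.5, 2.0]
t = 0          # true class
l = 1          # wrong class chosen by the attacker

x_untargeted = attack_step(x, W, b, t, k=0.1)
x_targeted = attack_step(x, W, b, l, k=0.1, targeted=True)

print(loss(x, W, b, t), "->", loss(x_untargeted, W, b, t))  # loss on t goes up
print(loss(x, W, b, l), "->", loss(x_targeted, W, b, l))    # loss on l goes down
```

A single small step usually does not flip the prediction by itself; it only moves the input in the right direction.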

For example, a change in pixel intensity of $1$ in an 8-bit image is imperceptible to the human eye. If we want to perturb our image by $1$ for each pixel, then we let the perturbation be:

$$Δx = \operatorname{sign}(\nabla_{x} E)$$

such that each component of $Δx$ is $\pm 1$. This means that $\| Δx \|_{\infty} = 1$.

The more general version of this is FGSM:

Fast Gradient Sign Method

FGSM adjusts each pixel by $ϵ$, such that $Δx = ϵ \, \operatorname{sign}(\nabla_{x} E)$.

For example, for a 24-bit image where $(R, G, B) \in \{0, \dots, 255\}^{3}$, the perturbed image is computed as:
$$(R, G, B)' = (R, G, B) \pm ϵ \, \operatorname{sign}(\nabla_{x} E)$$
with $+$ for an untargeted attack and $-$ for a targeted one, which ensures that the perturbation satisfies $\| Δx \|_{\infty} = ϵ$.
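
A minimal sketch of the FGSM update on 8-bit channel values. The gradient values are made up, and the final clipping to $[0, 255]$ is an added assumption so the result remains a valid image; the notes do not mention it.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(pixels, grad, eps, targeted=False):
    """Shift every channel value by eps along (or against) the gradient sign."""
    direction = -1 if targeted else 1
    perturbed = [p + direction * eps * sign(g) for p, g in zip(pixels, grad)]
    # Assumption: clip back into the valid 8-bit range
    return [min(255, max(0, p)) for p in perturbed]

pixels = [0, 17, 128, 255]          # hypothetical channel values
grad = [-0.3, 0.02, -1.5, 0.7]      # hypothetical loss gradient w.r.t. the input

print(fgsm(pixels, grad, eps=2))                 # [0, 19, 126, 255]
print(fgsm(pixels, grad, eps=2, targeted=True))  # [2, 15, 130, 253]
```

Note that clipping can make the actual change smaller than $ϵ$ for pixels already at 0 or 255, as in the first and last entries of the untargeted result.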

Instead of applying a fixed perturbation, one can also search for the smallest $\| Δx \|$ that causes misclassification:
$$\min_{Δx} \| Δx \| \quad \text{such that} \quad \arg\max_{i} y_{i}(x + Δx) \neq t(x)$$
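
Solving this minimization exactly is hard in general. One simple approximation, sketched below, is to fix the FGSM direction and increase $ϵ$ one intensity level at a time until the prediction changes. This only gives an upper bound on the true minimum, since other directions might need a smaller change. The two-pixel linear-softmax model, its weights, and the function name `smallest_fgsm_epsilon` are assumptions for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W, b):
    z = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    return softmax(z)

def predict(x, W, b):
    y = forward(x, W, b)
    return max(range(len(y)), key=lambda i: y[i])

def input_gradient(x, W, b, t):
    # Gradient of the cross-entropy loss (assumed) with respect to the input
    y = forward(x, W, b)
    grad_z = [y[j] - (1.0 if j == t else 0.0) for j in range(len(y))]
    return [sum(grad_z[j] * W[i][j] for j in range(len(y))) for i in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def smallest_fgsm_epsilon(x, W, b, t, max_eps=255):
    """Smallest integer eps along the FGSM direction that changes the prediction."""
    signs = [sign(g) for g in input_gradient(x, W, b, t)]
    for eps in range(1, max_eps + 1):
        x_adv = [min(255, max(0, xi + eps * s)) for xi, s in zip(x, signs)]
        if predict(x_adv, W, b) != t:
            return eps, x_adv
    return None  # no misclassification found within max_eps

# Toy two-pixel, two-class model: class 0 wins when pixel 0 is brighter than pixel 1
W = [[0.01, -0.01],
     [-0.01, 0.01]]
b = [0.0, 0.0]
x = [121, 100]
t = 0

result = smallest_fgsm_epsilon(x, W, b, t)
print(result)  # (11, [110, 111])
```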

Intuition

Why are classification networks so easily fooled?

Consider the input space for MNIST: each image is $28 \times 28$ pixels, so there are $784$ dimensions in total. That’s a lot of space. The classifier partitions this high-dimensional space into 10 regions, one for each class.

It turns out that most points are not too far away from a decision boundary.

Learning is:

$$\min_{θ} \mathbb{E}_{D} \left[ L(f(x), t) \right]$$

Untargeted attack:

$$\max_{x' \in B(x, ϵ)} L(f(x'), t)$$

Targeted attack:

$$\min_{x' \in B(x, ϵ)} L(f(x'), ℓ), \quad ℓ \neq t$$
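
One common way to approximate the constrained maximization is to take several small sign-gradient steps and clip back into the $ϵ$-ball after each one, often called projected gradient descent (PGD). This technique is not described in the notes; it is shown here only as a sketch of how the $x' \in B(x, ϵ)$ constraint can be enforced. It assumes an $L_{\infty}$ ball, a cross-entropy loss, and a toy linear-softmax model with made-up weights.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W, b):
    z = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    return softmax(z)

def loss(x, W, b, t):
    return -math.log(forward(x, W, b)[t])

def input_gradient(x, W, b, t):
    y = forward(x, W, b)
    grad_z = [y[j] - (1.0 if j == t else 0.0) for j in range(len(y))]
    return [sum(grad_z[j] * W[i][j] for j in range(len(y))) for i in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def project(x_adv, x, eps):
    # Clip each coordinate back into [x_i - eps, x_i + eps] (L-infinity ball)
    return [min(xi + eps, max(xi - eps, v)) for v, xi in zip(x_adv, x)]

def pgd_untargeted(x, W, b, t, eps, alpha, steps):
    """Repeated sign-gradient ascent on the loss, projected onto B(x, eps)."""
    x_adv = list(x)
    for _ in range(steps):
        grad = input_gradient(x_adv, W, b, t)
        x_adv = [v + alpha * sign(g) for v, g in zip(x_adv, grad)]
        x_adv = project(x_adv, x, eps)
    return x_adv

W = [[0.5, -0.3, 0.1],
     [-0.2, 0.4, 0.3],
     [0.1, 0.2, -0.5],
     [0.3, -0.1, 0.2]]
b = [0.0, 0.1, -0.1]
x = [1.0, 0.5, -0.5, 2.0]
t = 0

x_adv = pgd_untargeted(x, W, b, t, eps=0.3, alpha=0.1, steps=5)
print(loss(x, W, b, t), "->", loss(x_adv, W, b, t))
print(max(abs(a - xi) for a, xi in zip(x_adv, x)))  # stays within eps
```

The targeted version would descend instead of ascend and compute the loss against the wrong label $ℓ$.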
