Problem 3.1

What kind of mapping from input to output would be created if the activation function in equation 3.1 was linear, so that $a[z] = \psi_0 + \psi_1 z$? What kind of mapping would be created if the activation function was removed, so that $a[z] = z$?

If the activation function were linear, the input-output mapping would also be a linear function. We would have

$$y = \phi_0 + \phi_1\bigl(\psi_0 + \psi_1(\theta_{10} + \theta_{11}x)\bigr) + \phi_2\bigl(\psi_0 + \psi_1(\theta_{20} + \theta_{21}x)\bigr) + \phi_3\bigl(\psi_0 + \psi_1(\theta_{30} + \theta_{31}x)\bigr).$$

Grouping terms gives us a linear function of $x$:

$$y = \underbrace{\phi_0 + \psi_0(\phi_1 + \phi_2 + \phi_3) + \psi_1(\phi_1\theta_{10} + \phi_2\theta_{20} + \phi_3\theta_{30})}_{\text{intercept}} \;+\; \underbrace{\psi_1(\phi_1\theta_{11} + \phi_2\theta_{21} + \phi_3\theta_{31})}_{\text{slope}}\,x.$$

If the activation function were removed, the mapping would again be linear. With $a[z] = z$, we would simply have

$$y = \phi_0 + \phi_1(\theta_{10} + \theta_{11}x) + \phi_2(\theta_{20} + \theta_{21}x) + \phi_3(\theta_{30} + \theta_{31}x),$$

which is also a linear function of $x$.
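
As a quick numerical sanity check, here is a minimal NumPy sketch (with arbitrary made-up parameter values, not those from the book's figures) confirming that the output is exactly linear in $x$ when the activation is linear or removed:

```python
import numpy as np

# Arbitrary example parameters (not the values used in the book's figures).
theta = np.array([[0.2, -1.0], [-0.4, 0.7], [0.1, 1.5]])  # rows: [theta_j0, theta_j1]
phi0, phi = 0.5, np.array([1.0, -2.0, 0.8])
psi0, psi1 = 0.3, 1.7                                      # linear activation a[z] = psi0 + psi1*z

def shallow_net(x, act):
    pre = theta[:, 0][:, None] + theta[:, 1][:, None] * x  # pre-activations, shape (3, len(x))
    return phi0 + phi @ act(pre)

x = np.linspace(-2, 2, 101)
for act in (lambda z: psi0 + psi1 * z, lambda z: z):       # linear activation, then no activation
    y = shallow_net(x, act)
    # A linear function has constant slope: all finite differences are equal.
    slopes = np.diff(y) / np.diff(x)
    print(np.allclose(slopes, slopes[0]))                  # prints True both times
```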

Problem 3.2

For each of the four linear regions in figure 3.3j, indicate which hidden units are inactive and which are active (i.e., which do and do not clip their inputs).

  • Region 1: only $h_3$ is active ($h_1$ and $h_2$ clip their inputs)
  • Region 2: $h_1$ and $h_3$ are active
  • Region 3: $h_1$, $h_2$, and $h_3$ are all active
  • Region 4: $h_1$ and $h_2$ are active

Problem 3.3

Derive expressions for the positions of the “joints” in the function in figure 3.3j in terms of the ten parameters and the input $x$. Derive expressions for the slopes of the four linear regions.

Joints are created where a ReLU switches between clipping and passing its input, i.e., where a hidden unit's pre-activation crosses zero. Thus, we set each pre-activation (the function computed before the ReLU is applied) to zero and solve for $x$, so hidden unit $j$ creates a joint at $x = -\theta_{j0}/\theta_{j1}$.

  • Joint 1 (between regions 1 and 2), where $h_1$ becomes active: $\theta_{10} + \theta_{11}x = 0$, so $x = -\theta_{10}/\theta_{11}$
  • Joint 2 (between regions 2 and 3), where $h_2$ becomes active: $\theta_{20} + \theta_{21}x = 0$, so $x = -\theta_{20}/\theta_{21}$
  • Joint 3 (between regions 3 and 4), where $h_3$ becomes inactive: $\theta_{30} + \theta_{31}x = 0$, so $x = -\theta_{30}/\theta_{31}$

The slope of each region is determined by which hidden units are active in that region: each active unit $j$ contributes $\phi_j\theta_{j1}$, so the slope is the sum of these contributions over the active units (verified numerically in the sketch after this list).

  • Region 1: only $h_3$ is active, so the slope is $\phi_3\theta_{31}$
  • Region 2: $h_1$ and $h_3$ are active, so the slope is $\phi_1\theta_{11} + \phi_3\theta_{31}$
  • Region 3: $h_1$, $h_2$, and $h_3$ are all active, so the slope is $\phi_1\theta_{11} + \phi_2\theta_{21} + \phi_3\theta_{31}$
  • Region 4: $h_1$ and $h_2$ are active, so the slope is $\phi_1\theta_{11} + \phi_2\theta_{21}$
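
The following NumPy sketch uses illustrative parameters chosen only to reproduce the activation pattern above (not the figure's actual values). It computes the joint positions $-\theta_{j0}/\theta_{j1}$ and checks each region's predicted slope against a finite difference of the network output:

```python
import numpy as np

# Illustrative parameters reproducing the pattern: region 1: h3; region 2: h1,h3;
# region 3: all; region 4: h1,h2. These are NOT the values from figure 3.3.
theta = np.array([[-0.5, 1.0], [-1.0, 1.0], [1.5, -1.0]])   # rows: [theta_j0, theta_j1]
phi0, phi = 0.0, np.array([-1.0, 1.3, 0.7])

def net(x):
    pre = theta[:, 0][:, None] + theta[:, 1][:, None] * x
    return phi0 + phi @ np.maximum(pre, 0.0)                 # ReLU shallow network

print("joints:", np.sort(-theta[:, 0] / theta[:, 1]))        # [0.5, 1.0, 1.5]

# Slope of each region = sum of phi_j * theta_j1 over its active hidden units.
regions = {1: [2], 2: [0, 2], 3: [0, 1, 2], 4: [0, 1]}       # active unit indices (0-based)
midpoints = [0.25, 0.75, 1.25, 1.75]                         # one point inside each region
for (r, active), xm in zip(regions.items(), midpoints):
    predicted = sum(phi[j] * theta[j, 1] for j in active)
    numeric = (net(np.array([xm + 1e-3])) - net(np.array([xm - 1e-3])))[0] / 2e-3
    print(r, np.isclose(predicted, numeric))                 # True for all four regions
```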

Problem 3.4

Draw a version of figure 3.3 where the $y$-intercept and slope of the third hidden unit have changed as in figure 3.14c. Assume that the remaining parameters remain the same.

Changing the third hidden unit's parameters moves the joint it creates to its new $x$-intercept $x = -\theta_{30}/\theta_{31}$ and changes its slope contribution $\phi_3\theta_{31}$ in the regions where it is active; the joints created by the other two hidden units stay in place.

Problem 3.5

Prove that the following property holds for $\alpha \ge 0$:

$$\mathrm{ReLU}[\alpha \cdot z] = \alpha \cdot \mathrm{ReLU}[z].$$

This is known as the non-negative homogeneity property of the ReLU function.

We have

$$\mathrm{ReLU}[z] = \begin{cases} 0 & z < 0, \\ z & z \ge 0. \end{cases}$$

For $\alpha > 0$, $\alpha z$ has the same sign as $z$. Thus,

$$\mathrm{ReLU}[\alpha z] = \begin{cases} 0 & z < 0, \\ \alpha z & z \ge 0 \end{cases} \;=\; \alpha \cdot \begin{cases} 0 & z < 0, \\ z & z \ge 0 \end{cases} \;=\; \alpha \cdot \mathrm{ReLU}[z],$$

and for $\alpha = 0$ both sides equal zero, so the property holds for all $\alpha \ge 0$.
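
As a quick check (not a substitute for the proof), a short NumPy sketch confirming the identity on random inputs for non-negative $\alpha$:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(0)
z = rng.normal(size=1000)
for alpha in (0.0, 0.5, 2.0, 10.0):                        # non-negative scalars only
    print(np.allclose(relu(alpha * z), alpha * relu(z)))   # True for every alpha >= 0
```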

Problem 3.6

Following on from problem 3.5, what happens to the shallow network defined in equations 3.3 and 3.4 when we multiply the parameters $\theta_{10}$ and $\theta_{11}$ by a positive constant $\alpha$ and divide the slope $\phi_1$ by the same parameter $\alpha$? What happens if $\alpha$ is negative?

Equation 3.3 defines the hidden units:

$$h_1 = a[\theta_{10} + \theta_{11}x], \qquad h_2 = a[\theta_{20} + \theta_{21}x], \qquad h_3 = a[\theta_{30} + \theta_{31}x].$$

Equation 3.4 defines the network output:

$$y = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3.$$

For $\alpha > 0$, multiplying $\theta_{10}$ and $\theta_{11}$ by $\alpha$ and then dividing $\phi_1$ by $\alpha$ has no effect: by the non-negative homogeneity property of problem 3.5, the factor $\alpha$ passes through the ReLU and cancels,

$$\frac{\phi_1}{\alpha}\, a[\alpha\theta_{10} + \alpha\theta_{11}x] = \frac{\phi_1}{\alpha}\,\alpha\, a[\theta_{10} + \theta_{11}x] = \phi_1\, a[\theta_{10} + \theta_{11}x].$$

If $\alpha$ is negative, the homogeneity property no longer applies (instead $a[\alpha z] = -\alpha\, a[-z]$), so the first hidden unit becomes active on the opposite side of its joint and the network generally computes a different function.
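
A minimal NumPy sketch (with made-up parameters) checking that the rescaling leaves the network unchanged for positive $\alpha$ but not for negative $\alpha$:

```python
import numpy as np

theta = np.array([[0.2, -1.0], [-0.4, 0.7], [0.1, 1.5]])   # made-up [theta_j0, theta_j1] rows
phi0, phi = 0.5, np.array([1.0, -2.0, 0.8])

def net(x, theta, phi0, phi):
    pre = theta[:, 0][:, None] + theta[:, 1][:, None] * x
    return phi0 + phi @ np.maximum(pre, 0.0)

x = np.linspace(-2, 2, 201)
for alpha in (2.0, -2.0):
    theta2, phi2 = theta.copy(), phi.copy()
    theta2[0] *= alpha                                      # scale theta_10, theta_11 by alpha
    phi2[0] /= alpha                                        # divide phi_1 by alpha
    print(alpha, np.allclose(net(x, theta, phi0, phi), net(x, theta2, phi0, phi2)))
    # 2.0 True   (function unchanged)
    # -2.0 False (function changes when alpha is negative)
```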

Problem 3.7

Consider fitting the model in equation 3.1 using a least squares loss function. Does this loss function have a unique minimum, i.e., is there a single “best” set of parameters?

The least squares loss function for this model does not necessarily have a unique minimum: the non-linearities introduced by the activation function make the optimization landscape non-convex, so there can be multiple local minima. Furthermore, different parameter combinations can produce exactly the same predictions, for example by permuting the hidden units or by rescaling a unit's parameters as in problem 3.6, leading to whole families of equivalent solutions (as the sketch below illustrates).
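
A short NumPy sketch illustrating the permutation symmetry: swapping two hidden units (together with their output weights) gives different parameter vectors but an identical least-squares loss on any dataset (made-up parameters and data):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.uniform(-1, 1, 20), rng.normal(size=20)          # made-up training data

theta = np.array([[0.2, -1.0], [-0.4, 0.7], [0.1, 1.5]])
phi0, phi = 0.5, np.array([1.0, -2.0, 0.8])

def loss(theta, phi0, phi):
    pre = theta[:, 0][:, None] + theta[:, 1][:, None] * x
    pred = phi0 + phi @ np.maximum(pre, 0.0)
    return np.sum((pred - y) ** 2)                          # least-squares loss

perm = [1, 0, 2]                                            # swap hidden units 1 and 2
print(np.isclose(loss(theta, phi0, phi), loss(theta[perm], phi0, phi[perm])))  # True
```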

Problem 3.8

Consider replacing the ReLU activation function with (i) the Heaviside step function $\mathrm{heaviside}[z]$, (ii) the hyperbolic tangent function $\tanh[z]$, and (iii) the rectangular function $\mathrm{rect}[z]$, where

$$\mathrm{heaviside}[z] = \begin{cases} 0 & z < 0 \\ 1 & z \ge 0 \end{cases}$$

and

$$\mathrm{rect}[z] = \begin{cases} 0 & z < 0 \\ 1 & 0 \le z \le 1 \\ 0 & z > 1. \end{cases}$$

Redraw a version of figure 3.3 for each of these functions, keeping the original parameters used to create figure 3.3. Provide an informal description of the family of functions that can be created by neural networks with one input, three hidden units, and one output for each activation function.

Heaviside: We would produce piecewise constant functions with up to four regions. Each hidden unit creates a single step in the output at its $x$-intercept (where its pre-activation crosses zero). The output weights and bias of the network adjust the heights of these constant segments, but the general shape remains a sequence of horizontal steps.

Hyperbolic tangent: This network would produce smooth, continuously differentiable functions: a sum of shifted and scaled sigmoidal (tanh) curves plus an offset. The complexity of the curve is determined by how the three sigmoids created by the hidden units are positioned, scaled, and blended.

Rectangular: This network would produce piecewise constant functions built from rectangular pulses: each hidden unit contributes a constant “bump” over the interval of inputs for which its pre-activation lies between 0 and 1. The output can therefore have up to three overlapping or non-overlapping bumps (up to seven constant regions in total), depending on how the hidden units’ activations interact.
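
A minimal NumPy sketch of the three alternative activations plugged into the same three-hidden-unit network (the parameters are made-up placeholders, not the figure's values). Evaluating on a grid reproduces the qualitative families described above:

```python
import numpy as np

heaviside = lambda z: (z >= 0).astype(float)                 # step: 0 for z<0, 1 for z>=0
tanh      = np.tanh
rect      = lambda z: ((z >= 0) & (z <= 1)).astype(float)    # 1 only for 0<=z<=1

theta = np.array([[-0.2, 0.4], [-0.9, 0.9], [1.1, -0.7]])    # placeholder [theta_j0, theta_j1]
phi0, phi = -0.2, np.array([-1.3, 1.3, 0.7])                 # placeholder output weights

def net(x, act):
    pre = theta[:, 0][:, None] + theta[:, 1][:, None] * x
    return phi0 + phi @ act(pre)

x = np.linspace(0.0, 2.0, 400)
for name, act in [("heaviside", heaviside), ("tanh", tanh), ("rect", rect)]:
    y = net(x, act)
    print(name, "distinct output values:", len(np.unique(np.round(y, 6))))
# heaviside and rect give only a few distinct values (piecewise constant);
# tanh gives many distinct values (a smooth curve).
```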

Problem 3.9

Show that the third linear region in figure 3.3 has a slope that is the sum of the slopes of the first and fourth linear regions

From problem 3.3, we had:

  • Region 1: only $h_3$ is active, so the slope is $\phi_3\theta_{31}$
  • Region 3: $h_1$, $h_2$, and $h_3$ are all active, so the slope is $\phi_1\theta_{11} + \phi_2\theta_{21} + \phi_3\theta_{31}$
  • Region 4: $h_1$ and $h_2$ are active, so the slope is $\phi_1\theta_{11} + \phi_2\theta_{21}$

Thus, it is easy to see that:

$$\underbrace{\phi_1\theta_{11} + \phi_2\theta_{21} + \phi_3\theta_{31}}_{\text{slope of region 3}} \;=\; \underbrace{\phi_3\theta_{31}}_{\text{slope of region 1}} \;+\; \underbrace{\phi_1\theta_{11} + \phi_2\theta_{21}}_{\text{slope of region 4}}.$$

Problem 3.10

Consider a neural network with one input, one output, and three hidden units. The construction in figure 3.3 shows how this creates four linear regions. Under what circumstances could this network produce a function with fewer than four linear regions?

The four linear regions result from the three hidden units creating three distinct joints through the ReLU non-linearities. We get fewer regions if two or more joints coincide, i.e., if two or more hidden units have the same $x$-intercept (their weights and biases are proportional to each other), or if one or more hidden units remain inactive (or active) for all relevant inputs, so that their ReLUs never switch and they create no joint at all (see the sketch below for a numerical illustration).
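
A small NumPy sketch (made-up parameters) that counts the linear regions by counting the distinct ReLU on/off patterns over a fine input grid; when two hidden units share the same $x$-intercept, only three regions appear:

```python
import numpy as np

def count_regions(theta, lo=-3.0, hi=3.0, n=10001):
    """Count 1D linear regions as the number of distinct ReLU on/off patterns on a grid."""
    x = np.linspace(lo, hi, n)
    pre = theta[:, 0][:, None] + theta[:, 1][:, None] * x    # pre-activations, shape (3, n)
    patterns = {tuple(col) for col in (pre > 0).T.astype(int).tolist()}
    return len(patterns)

theta_distinct = np.array([[0.5, 1.0], [-1.0, 1.0], [2.0, -1.0]])  # three distinct joints
theta_shared   = np.array([[0.5, 1.0], [1.0, 2.0], [2.0, -1.0]])   # units 1 and 2 share joint x = -0.5
print(count_regions(theta_distinct))  # 4
print(count_regions(theta_shared))    # 3
```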

Problem 3.11

How many parameters does the model in figure 3.6 have?

Each of the four hidden units has 1 weight and 1 bias. Each of the two output units has four weights and 1 bias. In total:

$$4 \times (1 + 1) + 2 \times (4 + 1) = 8 + 10 = 18 \text{ parameters}.$$

Problem 3.12

How many parameters does the model in figure 3.7 have?

Each of the three hidden units has 2 weights and 1 bias (see equation 3.9). The output unit has 3 weights and 1 bias. In total:

$$3 \times (2 + 1) + (3 + 1) = 9 + 4 = 13 \text{ parameters}.$$

Problem 3.13

What is the activation pattern for each of the seven regions in figure 3.8? In other words, which hidden units are active (pass the input) and which are inactive (clip the input) for each region?

Each region corresponds to a different pattern of active hidden units. With three hidden units there are $2^3 = 8$ possible activation patterns; the three lines along which each unit's pre-activation equals zero divide the input plane into seven regions, so seven of the eight patterns occur (each in exactly one region) and one pattern does not occur at all. Region 3 is the region in which none of the hidden units is active (all three clip their inputs); each of the other six regions is labeled by the particular subset of units whose pre-activations are positive there, which can be read off figure 3.8 by noting which side of each of the three lines the region lies on.

Problem 3.14

Write out the equations that define the network in figure 3.11. There should be three equations to compute the three hidden units from the inputs and two equations to compute the outputs from the hidden units.

Hidden units (with inputs $x_1$, $x_2$, $x_3$):

$$\begin{aligned}
h_1 &= a[\theta_{10} + \theta_{11}x_1 + \theta_{12}x_2 + \theta_{13}x_3],\\
h_2 &= a[\theta_{20} + \theta_{21}x_1 + \theta_{22}x_2 + \theta_{23}x_3],\\
h_3 &= a[\theta_{30} + \theta_{31}x_1 + \theta_{32}x_2 + \theta_{33}x_3].
\end{aligned}$$

Outputs:

$$\begin{aligned}
y_1 &= \phi_{10} + \phi_{11}h_1 + \phi_{12}h_2 + \phi_{13}h_3,\\
y_2 &= \phi_{20} + \phi_{21}h_1 + \phi_{22}h_2 + \phi_{23}h_3.
\end{aligned}$$

Problem 3.15

What is the maximum possible number of 3D linear regions that can be created by the network in figure 3.11?

Each hidden unit divides the input space into two half-spaces: one where its pre-activation is positive (the unit is active) and one where the ReLU clips it to zero. The maximum number of regions created by three planes intersecting in 3D space, following figure 3.10c, is eight.

We can also use Zaslavsky's result (see problem 3.18) with $D_i = 3$ input dimensions and $D = 3$ hidden units to get:

$$\sum_{j=0}^{3}\binom{3}{j} = \binom{3}{0} + \binom{3}{1} + \binom{3}{2} + \binom{3}{3} = 1 + 3 + 3 + 1 = 8.$$

Problem 3.16

Write out the equations for a network with two inputs, four hidden units, and three outputs.

  • Two inputs: $x_1$ and $x_2$.
  • Four hidden units: $h_j = a[\theta_{j0} + \theta_{j1}x_1 + \theta_{j2}x_2]$ for $j = 1, \dots, 4$.
  • Three outputs: $y_k = \phi_{k0} + \phi_{k1}h_1 + \phi_{k2}h_2 + \phi_{k3}h_3 + \phi_{k4}h_4$ for $k = 1, 2, 3$ (see the sketch below).
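
A minimal NumPy sketch of this 2-input, 4-hidden-unit, 3-output network in matrix form, with randomly chosen (made-up) parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
D_i, D, D_o = 2, 4, 3                         # inputs, hidden units, outputs

theta0 = rng.normal(size=(D, 1))              # hidden biases  theta_j0
Theta  = rng.normal(size=(D, D_i))            # hidden weights theta_jk
phi0   = rng.normal(size=(D_o, 1))            # output biases  phi_k0
Phi    = rng.normal(size=(D_o, D))            # output weights phi_kj

def shallow_net(x):
    """x has shape (D_i, batch); returns outputs of shape (D_o, batch)."""
    h = np.maximum(theta0 + Theta @ x, 0.0)   # ReLU hidden units
    return phi0 + Phi @ h

x = rng.uniform(-1, 1, size=(D_i, 5))         # five example input points
print(shallow_net(x).shape)                   # (3, 5)
```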

Problem 3.17

Equations 3.11 and 3.12 define a general neural network with $D_i$ inputs, one hidden layer containing $D$ hidden units, and $D_o$ outputs. Find an expression for the number of parameters in the model in terms of $D_i$, $D$, and $D_o$.

Each of the $D$ hidden units has $D_i$ weights and one bias, and each of the $D_o$ outputs has $D$ weights and one bias, so the total number of parameters is

$$D(D_i + 1) + D_o(D + 1).$$

For example, figure 3.6 ($D_i = 1$, $D = 4$, $D_o = 2$) gives $4 \cdot 2 + 2 \cdot 5 = 18$ and figure 3.7 ($D_i = 2$, $D = 3$, $D_o = 1$) gives $3 \cdot 3 + 1 \cdot 4 = 13$, matching problems 3.11 and 3.12.
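
A one-function sketch of this count, checked against the totals from problems 3.11 and 3.12:

```python
def num_params(D_i, D, D_o):
    """Parameters of a shallow network: D*(D_i+1) hidden-layer + D_o*(D+1) output-layer."""
    return D * (D_i + 1) + D_o * (D + 1)

print(num_params(1, 4, 2))   # 18, figure 3.6 (problem 3.11)
print(num_params(2, 3, 1))   # 13, figure 3.7 (problem 3.12)
```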

Problem 3.18

Show that the maximum number of regions created by a shallow network with two-dimensional input, one-dimensional output, and $D = 3$ hidden units is seven, as in figure 3.8j. Use the result of Zaslavsky (1975) that the maximum number of regions created by partitioning a $D_i$-dimensional space with $D$ hyperplanes is $\sum_{j=0}^{D_i}\binom{D}{j}$. What is the maximum number of regions if we add two more hidden units to this model, so $D = 5$?

Original ($D_i = 2$, $D = 3$):

$$\sum_{j=0}^{2}\binom{3}{j} = \binom{3}{0} + \binom{3}{1} + \binom{3}{2} = 1 + 3 + 3 = 7.$$

With $D = 5$:

$$\sum_{j=0}^{2}\binom{5}{j} = \binom{5}{0} + \binom{5}{1} + \binom{5}{2} = 1 + 5 + 10 = 16.$$
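
A tiny sketch of Zaslavsky's bound, reproducing both counts (and the eight regions of problem 3.15):

```python
from math import comb

def max_regions(D_i, D):
    """Zaslavsky (1975): max regions from D hyperplanes in D_i-dimensional space."""
    return sum(comb(D, j) for j in range(D_i + 1))

print(max_regions(2, 3))   # 7  (figure 3.8j)
print(max_regions(2, 5))   # 16 (after adding two hidden units)
print(max_regions(3, 3))   # 8  (problem 3.15)
```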