Gradient Calculations

1A

Write an expression for $\nabla_\theta L$ using the symbols x_i, y_i, theta, and theta_0, where $\nabla_\theta L$ is the derivative of the $L$ function described above with respect to $\theta$. Remember that you can use @ for matrix product, and you can use transpose(v) to transpose a vector. Note that this $\nabla_\theta L$ is just the derivative with respect to a single data point.

$L$ refers to the squared loss, such that:

$$L(x_i, y_i, \theta, \theta_0) = \left(\theta^T x_i + \theta_0 - y_i\right)^2$$

So we want to take the derivative of this with respect to $\theta$:

$$\nabla_\theta L = 2\left(\theta^T x_i + \theta_0 - y_i\right)\,\nabla_\theta\!\left(\theta^T x_i + \theta_0 - y_i\right) = 2\left(\theta^T x_i + \theta_0 - y_i\right) x_i$$

The terms in the second expression simplify to $x_i$ because $\theta^T x_i$ is treated as a linear function of $\theta$, while $\theta_0$ and $y_i$ are treated as constants, since they do not change depending on the value of $\theta$.
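
A quick way to convince yourself of this is a finite-difference check. Here is a minimal numpy sketch (my own made-up shapes and random data, not from the pset), written in the same @/transpose style the question asks for:

```python
import numpy as np

# Sketch: numerically check the 1A gradient 2*(theta^T x_i + theta_0 - y_i)*x_i.
# Shapes assumed: x_i is d x 1, y_i is a scalar, theta is d x 1, theta_0 is a scalar.
d = 3
rng = np.random.default_rng(0)
x_i = rng.standard_normal((d, 1))
y_i = 1.7
theta = rng.standard_normal((d, 1))
theta_0 = 0.5

def loss(theta, theta_0):
    # squared loss for a single data point
    return ((np.transpose(theta) @ x_i + theta_0 - y_i) ** 2).item()

# analytic gradient from 1A
grad = 2 * (np.transpose(theta) @ x_i + theta_0 - y_i) * x_i

# finite-difference estimate, one coordinate of theta at a time
eps = 1e-6
fd = np.zeros_like(theta)
for j in range(d):
    bump = np.zeros_like(theta)
    bump[j] = eps
    fd[j] = (loss(theta + bump, theta_0) - loss(theta - bump, theta_0)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-5))  # should print True
```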

1B

What is the gradient of the squared loss/empirical risk now with respect to $\theta_0$?
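
Working through the same chain rule as in 1A, the inner derivative with respect to $\theta_0$ is just $1$, so the $x_i$ factor drops out:

$$\frac{\partial L}{\partial \theta_0} = 2\left(\theta^T x_i + \theta_0 - y_i\right)$$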

1C

Next we’re interested in the gradient of the empirical loss with respect to $\theta$, but now for a whole data set $X$ (of dimensions $d$ by $n$).

We can recognize that $\left(\theta^T x_i + \theta_0 - y_i\right)^2$ is the square of the $i$-th element of the row vector $\theta^T X + \theta_0 - Y$. So, a sum of this over the whole dataset is equivalent to the norm squared:

$$\sum_{i=1}^{n} \left(\theta^T x_i + \theta_0 - y_i\right)^2 = \left\| \theta^T X + \theta_0 - Y \right\|^2$$

The square of a norm is equivalent to taking the dot product of the vector with itself:

$$\left\| \theta^T X + \theta_0 - Y \right\|^2 = \left(\theta^T X + \theta_0 - Y\right)\left(\theta^T X + \theta_0 - Y\right)^T$$

Using the chain rule to take our derivative:

$$\nabla_\theta \left\| Z \right\|^2 = \nabla_\theta\!\left(Z Z^T\right) = 2\,X Z^T$$

where $Z = \theta^T X + \theta_0 - Y$. (Each element $Z_i$ contributes $2 Z_i \nabla_\theta Z_i = 2 Z_i x_i$, and summing those terms over $i$ is exactly the matrix product $2\,X Z^T$.)

Checking dimensions:

  • $X$ is $d \times n$
  • $\theta$ is $d \times 1$
  • $Y$ is $1 \times n$
  • Therefore, $Z = \theta^T X + \theta_0 - Y$ is $1 \times n$
  • The matrix product of a $d \times n$ matrix with an $n \times 1$ matrix (the transpose of $Z$) thus gives us a gradient that is $d \times 1$, as expected (i.e., the gradient is a vector where each element is the derivative of the loss with respect to the corresponding element of $\theta$).
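
As a quick numerical check of 1C (my own sketch, with made-up data of the shapes listed above), the vectorized gradient $2\,X Z^T$ should equal the sum of the per-point gradients from 1A:

```python
import numpy as np

# Sketch: verify that 2 * X @ transpose(Z) matches the sum of the per-point
# gradients 2 * (theta^T x_i + theta_0 - y_i) * x_i.
# Shapes assumed: X is d x n, Y is 1 x n, theta is d x 1, theta_0 is a scalar.
d, n = 3, 5
rng = np.random.default_rng(1)
X = rng.standard_normal((d, n))
Y = rng.standard_normal((1, n))
theta = rng.standard_normal((d, 1))
theta_0 = 0.2

# vectorized gradient of the summed squared loss
Z = np.transpose(theta) @ X + theta_0 - Y          # 1 x n
grad_vec = 2 * X @ np.transpose(Z)                 # d x 1

# sum of the single-point gradients from 1A
grad_sum = np.zeros((d, 1))
for i in range(n):
    x_i = X[:, i:i+1]                              # d x 1 column
    y_i = Y[0, i]
    grad_sum += 2 * (np.transpose(theta) @ x_i + theta_0 - y_i) * x_i

print(np.allclose(grad_vec, grad_sum))  # should print True
```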

Sources of Error

2A. Penalizing $\|\theta\|^2$ during training can reduce estimation error.

  • Structural error is the error due to selecting an inadequate model class.
  • Estimation error arises when the parameters of a hypothesis were not estimated well during training.
  • Adding a $\|\theta\|^2$ penalty is not selecting a model class (aka selecting the order of the polynomial basis) but preventing overfitting – thus it reduces estimation error. (The penalized objective is written out below.)
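
For concreteness, one standard way to write that penalized objective is ridge regression, i.e. the mean squared loss from part 1 plus the penalty ($\lambda \ge 0$ is my notation for the regularization weight; it isn't a symbol used elsewhere in this writeup):

$$J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n}\sum_{i=1}^{n}\left(\theta^T x_i + \theta_0 - y_i\right)^2 + \lambda\,\|\theta\|^2$$

Larger $\lambda$ pulls $\theta$ toward zero, trading a little training fit for less variance in the estimated parameters, which is the estimation-error tradeoff described above.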

Minimizing empirical risk

3A

Let the data matrix be $\tilde{X} = X^T$ (so $\tilde{X}$ is $n \times d$, one data point per row), let the target output vector be $\tilde{Y} = Y^T$ (an $n \times 1$ column), and recall that $\theta$ is a $d \times 1$ vector. Then we can write the whole linear regression prediction as $\tilde{X}\theta$. Write an equation expressing the mean squared loss of $\theta$ in terms of $\tilde{X}$, $\tilde{Y}$, $\theta$, and $n$:

$$J(\theta) = \frac{1}{n}\left\| \tilde{X}\theta - \tilde{Y} \right\|^2$$

3B

What is $\nabla_\theta J$ in terms of $\tilde{X}$, $\tilde{Y}$, $\theta$, and $n$?

Let’s re-write the above expression so that we can find the gradient:

$$J(\theta) = \frac{1}{n}\left(\tilde{X}\theta - \tilde{Y}\right)^T\left(\tilde{X}\theta - \tilde{Y}\right)$$

Taking the derivative with respect to $\theta$:

$$\nabla_\theta J = \frac{2}{n}\,\tilde{X}^T\left(\tilde{X}\theta - \tilde{Y}\right)$$
…or something like that?
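
To double-check, here is the same gradient worked out term by term (my own expansion, same symbols as above), using $\nabla_\theta(\theta^T A \theta) = 2A\theta$ for symmetric $A$ and $\nabla_\theta(b^T\theta) = b$:

$$J(\theta) = \frac{1}{n}\left(\theta^T \tilde{X}^T\tilde{X}\,\theta - 2\,\tilde{Y}^T\tilde{X}\,\theta + \tilde{Y}^T\tilde{Y}\right)
\;\;\Longrightarrow\;\;
\nabla_\theta J = \frac{1}{n}\left(2\,\tilde{X}^T\tilde{X}\,\theta - 2\,\tilde{X}^T\tilde{Y}\right) = \frac{2}{n}\,\tilde{X}^T\left(\tilde{X}\theta - \tilde{Y}\right)$$

So the expression above checks out.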

3C

What if we set the above equation to 0 and solve for $\theta^*$, the optimal $\theta$? Assuming $\tilde{X}^T\tilde{X}$ is invertible:

$$\frac{2}{n}\,\tilde{X}^T\left(\tilde{X}\theta^* - \tilde{Y}\right) = 0
\;\;\Longrightarrow\;\;
\tilde{X}^T\tilde{X}\,\theta^* = \tilde{X}^T\tilde{Y}
\;\;\Longrightarrow\;\;
\theta^* = \left(\tilde{X}^T\tilde{X}\right)^{-1}\tilde{X}^T\tilde{Y}$$

3C+D

Converting back to the data matrix format we’ve been using ($X$ is $d \times n$ and $Y$ is $1 \times n$, so $\tilde{X} = X^T$ and $\tilde{Y} = Y^T$):

$$\theta^* = \left(X X^T\right)^{-1} X\, Y^T$$

In code form: np.linalg.inv(X@np.transpose(X))@X@np.transpose(Y)
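
And a small sketch to sanity-check that closed form on made-up data (my own random example, shapes as above, no offset term):

```python
import numpy as np

# Sketch: verify the closed-form theta* on random data.
# Shapes assumed: X is d x n, Y is 1 x n, no offset theta_0.
d, n = 4, 50
rng = np.random.default_rng(2)
X = rng.standard_normal((d, n))
true_theta = rng.standard_normal((d, 1))
Y = np.transpose(true_theta) @ X + 0.01 * rng.standard_normal((1, n))  # noisy targets

# closed-form solution from above
theta_star = np.linalg.inv(X @ np.transpose(X)) @ X @ np.transpose(Y)

# the gradient (2/n) * X @ transpose(theta^T X - Y) should be ~0 at theta_star
grad = (2 / n) * X @ np.transpose(np.transpose(theta_star) @ X - Y)
print(np.allclose(grad, 0, atol=1e-8))       # should print True

# compare against numpy's least-squares solver on the transposed system
theta_lstsq, *_ = np.linalg.lstsq(np.transpose(X), np.transpose(Y), rcond=None)
print(np.allclose(theta_star, theta_lstsq))  # should print True
```

In practice, np.linalg.solve(X @ np.transpose(X), X @ np.transpose(Y)) or np.linalg.lstsq is preferred over an explicit inverse for numerical stability, but the inv version matches the formula above.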

Evaluation