Gradient Calculations
1A
Write an expression for $\nabla_\theta L$ using the symbols `x_i`, `y_i`, `theta`, and `theta_0`, where $\nabla_\theta L$ is the derivative of the function described above with respect to $\theta$. Remember that you can use `@` for matrix product, and you can use `transpose(v)` to transpose a vector. Note that this function is just the derivative with respect to a single data point.
$L$ refers to the squared loss, such that:

$$L(x_i, y_i; \theta, \theta_0) = (\theta^T x_i + \theta_0 - y_i)^2$$

So we want to take the derivative of this with respect to $\theta$:

$$\nabla_\theta L = 2(\theta^T x_i + \theta_0 - y_i) \cdot \nabla_\theta(\theta^T x_i + \theta_0 - y_i)$$

The terms in the second factor simplify to $x_i$ because $\theta^T x_i$ is treated as a linear function of $\theta$, while $\theta_0$ and $y_i$ are treated as constants, since they do not change depending on the value of $\theta$. So:

$$\nabla_\theta L = 2(\theta^T x_i + \theta_0 - y_i)\, x_i$$

or, in the requested code form, `x_i * 2 * (transpose(theta)@x_i + theta_0 - y_i)`.
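Quick sanity check: compare this analytic per-point gradient against a finite-difference approximation of the loss. This is just a sketch; the helper names `point_loss` and `grad_theta` and the random test values are made up for the check.

```python
import numpy as np

def point_loss(x_i, y_i, theta, theta_0):
    # squared loss on a single data point: (theta.T @ x_i + theta_0 - y_i)^2
    residual = (theta.T @ x_i).item() + theta_0 - y_i
    return residual ** 2

def grad_theta(x_i, y_i, theta, theta_0):
    # analytic gradient: 2 * (theta.T @ x_i + theta_0 - y_i) * x_i, shape (d, 1)
    residual = (theta.T @ x_i).item() + theta_0 - y_i
    return 2 * residual * x_i

# random test point and parameters
rng = np.random.default_rng(0)
d = 3
x_i = rng.standard_normal((d, 1))
y_i = rng.standard_normal()
theta = rng.standard_normal((d, 1))
theta_0 = rng.standard_normal()

analytic = grad_theta(x_i, y_i, theta, theta_0)

# central finite differences, one coordinate of theta at a time
eps = 1e-6
numeric = np.zeros((d, 1))
for j in range(d):
    bump = np.zeros((d, 1))
    bump[j, 0] = eps
    numeric[j, 0] = (point_loss(x_i, y_i, theta + bump, theta_0)
                     - point_loss(x_i, y_i, theta - bump, theta_0)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # expect True
```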
1B
What is the gradient of the squared loss/empirical risk now with respect to $\theta_0$?

$$\nabla_{\theta_0} L = 2(\theta^T x_i + \theta_0 - y_i)$$

The reasoning is the same as above, except that the derivative of $\theta^T x_i + \theta_0 - y_i$ with respect to $\theta_0$ is just $1$.
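The same kind of finite-difference check works for $\theta_0$ (again just a sketch with made-up test values):

```python
import numpy as np

# check the theta_0 gradient, 2 * (theta.T @ x_i + theta_0 - y_i), a scalar
rng = np.random.default_rng(1)
d = 3
x_i = rng.standard_normal((d, 1))
y_i = rng.standard_normal()
theta = rng.standard_normal((d, 1))
theta_0 = rng.standard_normal()

def loss_of_theta_0(b):
    # squared loss viewed as a function of theta_0 only
    return ((theta.T @ x_i).item() + b - y_i) ** 2

analytic = 2 * ((theta.T @ x_i).item() + theta_0 - y_i)
eps = 1e-6
numeric = (loss_of_theta_0(theta_0 + eps) - loss_of_theta_0(theta_0 - eps)) / (2 * eps)
print(np.isclose(analytic, numeric))  # expect True
```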
1C
Next we’re interested in the gradient of the empirical loss with respect to $\theta$, $\nabla_\theta J$, but now for a whole data set $X$ (of dimensions $d$ by $n$).
We can recognize that $(\theta^T x_i + \theta_0 - y_i)^2$ is the square of the $i$-th element of the row vector $\theta^T X + \theta_0 - Y$. So the sum of the per-point losses over the whole dataset is equivalent to the norm squared, and the empirical risk is that norm squared divided by $n$:

$$J(\theta, \theta_0) = \frac{1}{n}\sum_{i=1}^{n}(\theta^T x_i + \theta_0 - y_i)^2 = \frac{1}{n}\,\lVert \theta^T X + \theta_0 - Y \rVert^2$$

The square of a norm is equivalent to taking the dot product of the vector with itself:

$$\lVert Z \rVert^2 = Z Z^T$$

Using the chain rule to take our derivative:

$$\nabla_\theta J = \frac{1}{n}\,\nabla_\theta\!\left(Z Z^T\right) = \frac{2}{n}\, X Z^T$$

where $Z = \theta^T X + \theta_0 - Y$.
Checking dimensions:
- $X$ is $d \times n$
- $\theta$ is $d \times 1$
- $Y$ is $1 \times n$
- Therefore, $Z = \theta^T X + \theta_0 - Y$ is $1 \times n$
- The matrix product of a $d \times n$ matrix ($X$) with an $n \times 1$ matrix (the transpose of $Z$) thus gives us a gradient that is $d \times 1$, as expected (i.e., $\nabla_\theta J$ is a vector where each element is the gradient of $J$ with respect to the corresponding element of $\theta$).
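To make sure the vectorized form matches the per-point gradient from 1A, here is a small sketch with random data (shapes follow the $d \times n$ / $1 \times n$ convention above):

```python
import numpy as np

# vectorized gradient 2/n * X @ Z.T vs. the average of per-point gradients
rng = np.random.default_rng(0)
d, n = 3, 10
X = rng.standard_normal((d, n))      # one data point per column
Y = rng.standard_normal((1, n))
theta = rng.standard_normal((d, 1))
theta_0 = rng.standard_normal()

Z = theta.T @ X + theta_0 - Y        # residuals, shape (1, n)
grad_vectorized = (2 / n) * X @ Z.T  # shape (d, 1)

# average of the per-point gradients 2 * (theta.T @ x_i + theta_0 - y_i) * x_i
grad_loop = np.zeros((d, 1))
for i in range(n):
    x_i = X[:, i:i+1]
    residual = (theta.T @ x_i).item() + theta_0 - Y[0, i]
    grad_loop += 2 * residual * x_i
grad_loop /= n

print(np.allclose(grad_vectorized, grad_loop))  # expect True
```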
Sources of Error
2A. Penalizing $\lVert\theta\rVert^2$ during training can reduce estimation error.
- Structural error is the error due to selecting an inadequate model class
- Estimation error arises when parameters of a hypothesis were not estimated well during training.
- Adding an $\lVert\theta\rVert^2$ penalty is not selecting a model class (e.g., the order of a polynomial basis) but preventing overfitting, so it reduces estimation error.
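As a concrete illustration (a sketch that goes slightly beyond the question): adding a $\lambda\lVert\theta\rVert^2$ penalty to the mean squared loss gives the ridge-regression closed form $(XX^T + n\lambda I)^{-1}XY^T$, which keeps the same model class but shrinks the estimated parameters. The helper name `ridge_theta` and the random data are made up for the demo, and $\theta_0$ is ignored for simplicity.

```python
import numpy as np

def ridge_theta(X, Y, lam):
    # closed form for the lam * ||theta||^2 penalized mean squared loss:
    #   theta = (X @ X.T + n*lam*I)^-1 @ X @ Y.T
    d, n = X.shape
    return np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ Y.T)

rng = np.random.default_rng(0)
d, n = 5, 20
X = rng.standard_normal((d, n))
Y = rng.standard_normal((1, n))

for lam in [0.0, 0.1, 10.0]:
    theta = ridge_theta(X, Y, lam)
    print(lam, np.linalg.norm(theta))  # larger lam -> smaller ||theta||
```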
Minimizing empirical risk
3A
Let the data matrix be $\tilde{X} = X^T$ (an $n \times d$ matrix with one data point per row), let the target output vector be $\tilde{Y} = Y^T$ (an $n \times 1$ column vector), and recall that $\theta$ is a $d \times 1$ vector. Then we can write the whole linear regression as $g = \tilde{X}\theta$. Write an equation expressing the mean squared loss of $g$ in terms of $\tilde{X}$, $\tilde{Y}$, and $\theta$:

$$J(\theta) = \frac{1}{n}\left(\tilde{X}\theta - \tilde{Y}\right)^T\left(\tilde{X}\theta - \tilde{Y}\right)$$
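A quick check that the matrix form equals the mean of the per-point squared errors; this is a sketch where `Xt` and `Yt` stand in for $\tilde{X}$ and $\tilde{Y}$, and $\theta_0$ is left out (or assumed folded into the data as a constant feature).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 12, 4
Xt = rng.standard_normal((n, d))   # one data point per row
Yt = rng.standard_normal((n, 1))
theta = rng.standard_normal((d, 1))

residual = Xt @ theta - Yt                       # shape (n, 1)
J_matrix = (residual.T @ residual).item() / n    # 1/n * (Xt@theta - Yt).T @ (Xt@theta - Yt)
J_mean = np.mean((Xt @ theta - Yt) ** 2)         # mean of per-point squared errors

print(np.isclose(J_matrix, J_mean))  # expect True
```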
3B
What is $\nabla_\theta J$ in terms of $\tilde{X}$, $\tilde{Y}$, and $\theta$?

Let’s re-write the above expression so that we can find the gradient:

$$J(\theta) = \frac{1}{n}\left(\theta^T \tilde{X}^T \tilde{X}\,\theta - 2\,\theta^T \tilde{X}^T \tilde{Y} + \tilde{Y}^T \tilde{Y}\right)$$

Taking the derivative with respect to $\theta$:

$$\nabla_\theta J = \frac{1}{n}\left(2\,\tilde{X}^T \tilde{X}\,\theta - 2\,\tilde{X}^T \tilde{Y}\right) = \frac{2}{n}\,\tilde{X}^T\left(\tilde{X}\theta - \tilde{Y}\right)$$
…or something like that?
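To settle the “…or something like that?”, here is a quick numerical spot-check of the formula (a sketch; `Xt`, `Yt` as above):

```python
import numpy as np

# compare 2/n * Xt.T @ (Xt @ theta - Yt) to finite differences of J
rng = np.random.default_rng(0)
n, d = 12, 4
Xt = rng.standard_normal((n, d))
Yt = rng.standard_normal((n, 1))
theta = rng.standard_normal((d, 1))

J = lambda th: np.mean((Xt @ th - Yt) ** 2)       # mean squared loss
analytic = (2 / n) * Xt.T @ (Xt @ theta - Yt)     # shape (d, 1)

eps = 1e-6
numeric = np.zeros((d, 1))
for j in range(d):
    bump = np.zeros((d, 1))
    bump[j, 0] = eps
    numeric[j, 0] = (J(theta + bump) - J(theta - bump)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # expect True
```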
3C
What if we set the above equation to 0 and solve for $\theta$, the optimal $\theta^*$?

$$\tilde{X}^T \tilde{X}\,\theta^* = \tilde{X}^T \tilde{Y} \quad\Rightarrow\quad \theta^* = \left(\tilde{X}^T \tilde{X}\right)^{-1}\tilde{X}^T \tilde{Y}$$
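And a quick check (again a sketch with random data) that the gradient really is zero at this $\theta^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 4
Xt = rng.standard_normal((n, d))
Yt = rng.standard_normal((n, 1))

theta_star = np.linalg.inv(Xt.T @ Xt) @ Xt.T @ Yt          # closed-form solution
grad_at_star = (2 / n) * Xt.T @ (Xt @ theta_star - Yt)     # gradient at theta*
print(np.allclose(grad_at_star, 0))  # expect True
```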
3D
Converting back to the data matrix format we’ve been using ($X = \tilde{X}^T$ is $d \times n$ and $Y = \tilde{Y}^T$ is $1 \times n$):

$$\theta^* = \left(X X^T\right)^{-1} X\, Y^T$$
In code form: `np.linalg.inv(X@np.transpose(X))@X@np.transpose(Y)`
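Wrapping that one-liner in a function and sanity-checking it against `np.linalg.lstsq` on random data (a sketch; the function name `analytic_theta` is made up, and $\theta_0$ is ignored or assumed folded into $X$ as a row of ones):

```python
import numpy as np

def analytic_theta(X, Y):
    # X is (d, n), Y is (1, n); returns theta* = (X X^T)^-1 X Y^T, shape (d, 1)
    return np.linalg.inv(X @ np.transpose(X)) @ X @ np.transpose(Y)

rng = np.random.default_rng(0)
d, n = 4, 50
X = rng.standard_normal((d, n))
Y = rng.standard_normal((1, n))

theta_star = analytic_theta(X, Y)
theta_lstsq, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)    # least-squares reference

print(np.allclose(theta_star, theta_lstsq))  # expect True
```

In practice, `np.linalg.solve(X @ X.T, X @ Y.T)` or `np.linalg.lstsq` is usually preferred over an explicit `np.linalg.inv`, since it is more numerically stable.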