A supervised learning model defines a mapping from one or more inputs to one or more outputs. The model is a mathematical equation; when the inputs are passed through this equation, it computes the output (inference). The model equation also contains parameters. Different parameter values change the outcome of the computation; the model equation describes a family of possible input-output mappings, and the parameters specify the particular relationship.

When we train or learn a model, we find parameters that describe the true relationship between inputs and outputs. A learning algorithm takes a training set of input/output pairs and manipulates the parameters until the model predicts each output from its corresponding input as closely as possible.

Specifically, we aim to build a model that takes an input $\mathbf{x}$ and outputs a prediction $\mathbf{y}$, both of which are vectors. To make the prediction, we need a model $\mathbf{f}[\cdot]$ that takes the input $\mathbf{x}$ and returns $\mathbf{y}$, so:

$$\mathbf{y} = \mathbf{f}[\mathbf{x}].$$

The model is a mathematical equation with a fixed form, representing a family of relations between input and output. The model contains parameters $\boldsymbol{\phi}$, where the choice of parameters determines the particular relation between input and output:

$$\mathbf{y} = \mathbf{f}[\mathbf{x}, \boldsymbol{\phi}].$$
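
To make this concrete in code, a model can be written as an ordinary function of both the input and the parameters. The following minimal Python sketch is illustrative only; the linear form and the names f, W, and b are assumptions, not something fixed by the text.

```python
import numpy as np

def f(x, phi):
    """A model: maps an input vector x to a prediction y for parameters phi.

    The fixed form of this function (here, a linear map plus an offset) defines
    a family of input-output mappings; the values in phi pick out one member.
    """
    W, b = phi          # phi gathers all the parameters of the model
    return W @ x + b    # different choices of W and b give different mappings

# Example (hypothetical shapes): a model from 3 inputs to 2 outputs.
phi = (np.ones((2, 3)), np.zeros(2))
y = f(np.array([1.0, 2.0, 3.0]), phi)
```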

We learn these parameters using a training dataset of $I$ pairs of input and output examples $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{I}$. We aim to select parameters that map each training input to its associated output as closely as possible. We quantify the degree of mismatch in this mapping with the loss $L$. This is a scalar value that summarizes how poorly the model predicts the training outputs from their corresponding inputs for parameters $\boldsymbol{\phi}$.

We can treat the loss $L[\boldsymbol{\phi}]$ as a function of these parameters. When we train the model, we are seeking parameters $\hat{\boldsymbol{\phi}}$ that minimize this loss function:

$$\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmin}}\; L[\boldsymbol{\phi}].$$

If the loss is small after this minimization, we have found model parameters that accurately predict the training outputs $\mathbf{y}_i$ from the training inputs $\mathbf{x}_i$.
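
In code, training treats the loss as an ordinary function of the parameters and searches for the minimizing values. The sketch below is a hypothetical example: the toy data, the linear model form, and the use of SciPy's general-purpose minimizer are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy training set of input/output pairs (assumed for illustration).
x_train = np.array([0.1, 0.5, 0.9, 1.3])
y_train = np.array([0.3, 0.7, 1.1, 1.5])

def loss(phi):
    """L[phi]: how poorly the model (here, a line) predicts the training
    outputs from the training inputs for parameters phi."""
    predictions = phi[0] + phi[1] * x_train
    return np.sum((predictions - y_train) ** 2)

# phi_hat = argmin_phi L[phi], starting from an arbitrary initial guess.
result = minimize(loss, x0=np.zeros(2))
phi_hat = result.x
print(phi_hat, loss(phi_hat))   # a small final loss means the fit is good
```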

After training a model, we assess its performance by running the model on separate test data to see how well it generalizes to examples that it didn’t observe during training.

Linear Regression Example

We can make the idea above more concrete with a simple example of regression. We consider a model that predicts a single output $y$ from a single input $x$. A 1D linear regression model describes a straight line:

$$y = f[x, \boldsymbol{\phi}] = \phi_0 + \phi_1 x.$$

This model has two parameters $\boldsymbol{\phi} = [\phi_0, \phi_1]^{\mathsf{T}}$, where $\phi_0$ is the $y$-intercept of the line and $\phi_1$ is the slope. Different choices for the intercept and the slope result in different relations, hence the model defines a family of possible input-output relations.
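
As a small illustration (the function name and the example data are assumptions), the 1D linear regression model can be written directly in code:

```python
import numpy as np

def linear_model(x, phi_0, phi_1):
    """1D linear regression: a straight line with intercept phi_0 and slope phi_1."""
    return phi_0 + phi_1 * x

# Two different parameter choices give two different lines (input-output relations).
x = np.linspace(0.0, 2.0, 5)
print(linear_model(x, 0.0, 1.0))   # line through the origin with slope 1
print(linear_model(x, 0.5, -2.0))  # different intercept and slope
```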

For this model, the training dataset consists of $I$ input/output pairs $\{x_i, y_i\}$. The mismatch between the model predictions $f[x_i, \boldsymbol{\phi}]$ and the ground-truth outputs $y_i$ is quantified using a sum of squares over all $I$ training pairs:

$$L[\boldsymbol{\phi}] = \sum_{i=1}^{I} \bigl(f[x_i, \boldsymbol{\phi}] - y_i\bigr)^2 = \sum_{i=1}^{I} \bigl(\phi_0 + \phi_1 x_i - y_i\bigr)^2.$$

This is called a least-squares loss. The squaring operation means that the direction of the deviation (above/below the data) is unimportant.
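
A minimal sketch of this least-squares loss (with placeholder data) might look like the following; summing the squared residuals means deviations above and below the line contribute equally.

```python
import numpy as np

def least_squares_loss(phi_0, phi_1, x_data, y_data):
    """Sum of squared deviations between the line's predictions and the outputs."""
    residuals = phi_0 + phi_1 * x_data - y_data
    return np.sum(residuals ** 2)   # squaring removes the sign of each deviation

# Toy input/output pairs, assumed for illustration.
x_data = np.array([0.1, 0.4, 0.7, 1.0, 1.3])
y_data = np.array([0.2, 0.5, 0.9, 1.1, 1.4])
print(least_squares_loss(0.0, 1.0, x_data, y_data))
```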

The goal of the training process is then to find the parameters $\hat{\boldsymbol{\phi}} = [\hat{\phi}_0, \hat{\phi}_1]^{\mathsf{T}}$ that minimize this quantity:

$$\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmin}}\; L[\boldsymbol{\phi}].$$

There are only two parameters, so we can calculate the loss for every combination of values and visualize the loss function as a surface. The “best” parameters are at the minimum of this surface.
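
Because there are only two parameters, the loss surface can be tabulated by brute force. This sketch (with assumed data and grid ranges) evaluates the least-squares loss over a grid of intercept/slope values; plotting loss_surface as a heatmap or surface would show the minimum directly.

```python
import numpy as np

# Toy training pairs, assumed for illustration.
x_data = np.array([0.1, 0.4, 0.7, 1.0, 1.3])
y_data = np.array([0.2, 0.5, 0.9, 1.1, 1.4])

# Grid of candidate (intercept, slope) values.
intercepts = np.linspace(-1.0, 1.0, 200)
slopes = np.linspace(-1.0, 2.0, 200)

loss_surface = np.zeros((len(intercepts), len(slopes)))
for i, phi_0 in enumerate(intercepts):
    for j, phi_1 in enumerate(slopes):
        residuals = phi_0 + phi_1 * x_data - y_data
        loss_surface[i, j] = np.sum(residuals ** 2)

# The "best" parameters lie at the minimum of the tabulated surface.
i_best, j_best = np.unravel_index(np.argmin(loss_surface), loss_surface.shape)
print(intercepts[i_best], slopes[j_best])
```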

To train the model, the basic method is to choose the initial parameters randomly and then improve them by “walking down” the loss function until we reach the bottom. One way to do this is to measure the gradient of the surface at the current position and take a step in the direction that is most steeply downhill. We then repeat this process until the gradient is flat and we can improve no further.
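
A bare-bones version of this “walk downhill” procedure is gradient descent, sketched below for the 1D least-squares loss. The data, step size, and stopping threshold are assumptions chosen for illustration; the gradient expressions follow from differentiating the sum-of-squares loss with respect to the intercept and slope.

```python
import numpy as np

# Toy training pairs, assumed for illustration.
x_data = np.array([0.1, 0.4, 0.7, 1.0, 1.3])
y_data = np.array([0.2, 0.5, 0.9, 1.1, 1.4])

phi = np.random.randn(2)   # random initial [intercept, slope]
step_size = 0.01           # how far to move on each downhill step

for _ in range(10_000):
    residuals = phi[0] + phi[1] * x_data - y_data
    # Gradient of L[phi] = sum_i (phi_0 + phi_1 * x_i - y_i)^2.
    grad = np.array([2.0 * np.sum(residuals),
                     2.0 * np.sum(residuals * x_data)])
    if np.linalg.norm(grad) < 1e-8:   # surface is (nearly) flat: stop
        break
    phi -= step_size * grad           # step in the steepest downhill direction

print(phi)   # estimated intercept and slope
```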

Having trained the model, we test it by computing the loss on a separate set of test data. This shows how well the model generalizes to data it did not observe during training.
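
Testing then amounts to evaluating the same loss on data that played no part in training. In this sketch, both the trained parameter values and the test pairs are placeholders.

```python
import numpy as np

# Parameters produced by training (placeholder values for illustration).
phi = np.array([0.08, 1.02])

# Held-out test pairs that were not seen during training.
x_test = np.array([0.25, 0.85, 1.15])
y_test = np.array([0.35, 1.00, 1.25])

test_loss = np.sum((phi[0] + phi[1] * x_test - y_test) ** 2)
print(test_loss)   # a small value suggests the model generalizes to unseen data
```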

  • A simple model like a line might not be able to capture the true relationship between input and output. This is known as underfitting.
  • Conversely, a very expressive model may describe statistical peculiarities of the training data that are atypical and lead to unusual predictions. This is known as overfitting. (Both behaviors are illustrated in the sketch after this list.)
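
As a hedged illustration of these two failure modes (the data, the polynomial model family, and the chosen degrees are all assumptions, not part of the text), one can fit models of different flexibility to noisy samples from a curved relationship and compare training and test losses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a curved (quadratic) relationship -- assumed data.
x_train = np.linspace(0.0, 1.0, 8)
y_train = x_train ** 2 + 0.05 * rng.standard_normal(x_train.shape)
x_test = np.linspace(0.05, 0.95, 8)
y_test = x_test ** 2 + 0.05 * rng.standard_normal(x_test.shape)

for degree in (1, 7):   # a straight line vs. a very expressive polynomial
    coeffs = np.polyfit(x_train, y_train, degree)
    train_loss = np.sum((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_loss = np.sum((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_loss, test_loss)

# The line (degree 1) cannot capture the curvature: it underfits, and both losses
# stay relatively large. The degree-7 polynomial passes (almost) exactly through
# the training points, but its extra flexibility can chase the noise, so its test
# loss may be much larger than its training loss: it overfits.
```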