Supervised Learning Setup and Bias-Variance Trade-off

Walid Hadri
9 min read · Mar 19, 2021

Let us recall that the goal of supervised learning is to find the best function that estimates the mapping from some inputs X to some outputs Y. The mapping function is what we call the target function. The goal of this lecture is to measure the performance of a model. Keep in mind that when we train a machine learning model, we don’t just want it to learn to model the training data (not simply memorize it); we want it to generalize to data it hasn’t seen before. To do so, we keep a held-out set that we call the test data, consisting of examples the model hasn’t seen before. The goal is for the model to perform well on both the training data and the test data.

0. Setup

Examples of label spaces:

  1. Classification: {0,1}, {-1,1}… For example, spam −1, not spam 1.
  2. Multiclass classification: {1,2,3,…,k}
  3. Regression: a real number (ℝ)

Examples of feature vectors:

Client data: (id, name, age, country, number of purchases, amount spent,…)

Image: RGB values of the pixels

Hypothesis class

Before we can find a function h, we must specify what type of function we are looking for. It could be an artificial neural network, a decision tree, or one of many other types of classifiers. We call the set of possible functions the hypothesis class. By specifying the hypothesis class, we encode important assumptions about the type of problem we are trying to learn. Why is it important to define a hypothesis class? Simply because we need to limit the space we are searching in; we cannot go through all possible functions. Take the example of mapping binary vectors with 100 components to {0,1}: there are 2¹⁰⁰ possible inputs, and therefore 2^(2¹⁰⁰) possible functions…
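As a minimal sketch (the names and the two-feature setup are illustrative), a hypothesis class can be viewed as a parameterized family of functions, so that searching for h means searching over parameters rather than over all possible functions:

```python
import numpy as np

# Hypothesis class: all linear classifiers h_w(x) = sign(w . x).
# Choosing this class encodes the assumption that the two classes
# can be (approximately) separated by a hyperplane.
def linear_classifier(w):
    """Return one hypothesis h from the class, identified by its weights w."""
    return lambda x: np.sign(np.dot(w, x))

# Two members of the same hypothesis class:
h1 = linear_classifier(np.array([1.0, -2.0]))
h2 = linear_classifier(np.array([-0.5, 0.5]))

x = np.array([3.0, 1.0])
print(h1(x), h2(x))  # 1.0 -1.0: different hypotheses, different predictions
```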

Loss function

Once we have defined the hypothesis class in which we will look for the best estimating function (the one that makes the fewest mistakes), we need some kind of evaluation. The loss function (risk function) tells us how bad a prediction is. The higher the loss, the worse it is; a loss of zero means the model makes perfect predictions. There are many loss functions, and choosing the best one depends on the problem we are dealing with and the type of hypothesis class we are working with.

Generalization error

The goal is to find the function within the hypothesis class that minimizes the risk for both seen and unseen data, that is, a function that is able to generalize.

Think about it this way: suppose you have built a “memorizer” function that returns the exact label of any seen data point (with some kind of if condition) and returns a random value for any unseen data point. This function performs perfectly on the seen data but very badly on the unseen data.
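As a toy illustration (a hypothetical sketch, using a lookup table instead of if conditions):

```python
import random

class Memorizer:
    """Zero training error, no generalization: a pure lookup table."""
    def fit(self, X, y):
        # Store every training example verbatim.
        self.table = {tuple(x): label for x, label in zip(X, y)}
    def predict(self, x):
        # Exact recall on seen inputs, a coin flip on anything unseen.
        return self.table.get(tuple(x), random.choice([0, 1]))

m = Memorizer()
m.fit([[1, 2], [3, 4]], [0, 1])
print(m.predict([1, 2]))  # 0: perfect on seen data
print(m.predict([9, 9]))  # random: useless on unseen data
```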

Train/Test split

This is why we evaluate the model on a new split, called the test split, after training it on the training split. How should we split the data? That is a question we will answer in a later lecture.

I. Training Set , Validation Set and Test set

There are three different sets:

  1. Training set: the set of input-output pairs used to train the model. The model sees and learns from this data.
  2. Validation set: the set used for hyperparameter tuning (finding the best hyperparameters). Examples of hyperparameters: for an ANN, the number of hidden units in each layer; for a polynomial regression, the degree… The model sees this data, but it doesn’t learn from it. We use the validation results to decide on hyperparameters, so the validation set affects the model, but only indirectly through the hyperparameters. The validation set is also known as the dev set or the development set.
  3. Test set: a set of examples used only to assess the performance of the final model.

Leaving the test set out, a dataset can be repeatedly split into a training dataset and a validation dataset: this is known as cross-validation. We will get there later.
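As a quick sketch with scikit-learn (the 60/20/20 ratios and the toy data are just an illustrative choice), a three-way split can be obtained with two successive calls:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50) % 2  # toy data

# First carve out the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
# Result: 60% train, 20% validation, 20% test.
```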

II. Loss Function

Suppose you have a model and you want a metric that tells how good it is, especially if you want to present the model to a client, or to compare the performance of a set of models and choose the best one. This is where loss functions come into play.

A loss function maps decisions to their associated costs.

I want to emphasize this here: although the terms cost function and loss function are often used interchangeably, they are different. A loss function is for a single training example; it is also sometimes called an error function. A cost function, on the other hand, is the average loss over the entire training dataset. Optimization strategies aim at minimizing the cost function.
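In code, the distinction is just an average (a minimal sketch using the squared error):

```python
import numpy as np

def loss(y_true, y_pred):
    """Loss: the penalty for a single training example (here, squared error)."""
    return (y_true - y_pred) ** 2

def cost(y_true, y_pred):
    """Cost: the average loss over the whole training set (here, the MSE)."""
    return np.mean([loss(t, p) for t, p in zip(y_true, y_pred)])

print(loss(3.0, 2.5))                # 0.25, one example
print(cost([3.0, 1.0], [2.5, 0.0]))  # (0.25 + 1.0) / 2 = 0.625
```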

There’s no one-size-fits-all loss function in machine learning. Various factors are involved in choosing a loss function for a specific problem, such as the type of machine learning algorithm chosen and the ease of calculating the derivatives.

Loss functions can be classified into two major categories: regression losses for regression problems (Y continuous) and classification losses for classification problems (Y discrete). I will introduce some examples without discussing advantages, drawbacks, and use cases; we will have a specific lecture for that.

Regression Losses

1. Squared Error Loss / L2 Loss:

The corresponding cost function is the Mean of these Squared Errors (MSE).
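Written out, for a single example with true value y and prediction ŷ, and for n examples:

$$L(y, \hat{y}) = (y - \hat{y})^2, \qquad \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2$$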

2. Absolute Error Loss / L1 Loss:

The corresponding cost function is the Mean of these Absolute Errors (MAE).
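With the same notation:

$$L(y, \hat{y}) = |y - \hat{y}|, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\big|y_i - \hat{y}_i\big|$$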

3. Huber Loss

The Huber loss combines the best properties of the MSE and the MAE: it is quadratic for small errors and linear otherwise (and similarly for its gradient). It is parameterized by its delta parameter.
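With the error a = y − ŷ and the threshold δ:

$$L_\delta(a) = \begin{cases} \frac{1}{2}a^2 & \text{if } |a| \le \delta \\ \delta\big(|a| - \frac{1}{2}\delta\big) & \text{otherwise} \end{cases}$$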

Classification Losses

For binary classification (two classes), there are the Binary Cross Entropy Loss (Log Loss) and the Hinge Loss.
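For a true label $y \in \{0,1\}$ and a predicted probability $p$, the binary cross entropy is $L = -\big(y \log p + (1-y)\log(1-p)\big)$; for $y \in \{-1,1\}$ and a raw score $f(x)$, the hinge loss is $L = \max\big(0,\, 1 - y\, f(x)\big)$.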

For multi-class classification, there are the Multi-Class Cross Entropy Loss and the KL-Divergence.
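With one-hot labels $y_c$ and predicted class probabilities $p_c$ over $k$ classes, the multi-class cross entropy is $L = -\sum_{c=1}^{k} y_c \log p_c$.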

We will get back to these loss functions in detail in the following lectures.

III. Generalization Error

Once we have a loss function, we can measure the prediction errors made on any dataset.

Let us define some notations:
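We assume the data is generated as $y = f(x) + \epsilon$, where $f$ is the unknown target function, $\hat{f}$ is the estimate of $f$ learned from the training set, and $\epsilon$ is a noise term with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.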

The noise represents an irreducible error. The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.

The cost function on the training set is given by the formula:
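$$\hat{R}(h) = \frac{1}{n}\sum_{i=1}^{n} L\big(h(x_i), y_i\big)$$

where $(x_1, y_1), \dots, (x_n, y_n)$ are the training examples and $L$ is the chosen loss.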

But we don’t just want the model to get the training examples right; we also want it to generalize to new instances it hasn’t seen before. For example, suppose we train our model on training data where some property holds (say we are classifying images of animals into dogs and cats, and in the training set all dogs have big ears), and in the test set this property no longer holds (some images show dogs with small ears). There is then a high risk of misclassifying / predicting a wrong value if the model decides to use this property (assigning “cat” to images of dogs with small ears). This is why we need the generalization error, also called the expected loss or the risk.

Mathematically, the generalization error of a function f is:
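$$R(f) = \mathbb{E}_{(x,y)\sim p(x,y)}\big[L\big(f(x), y\big)\big]$$

that is, the expected loss over the whole data distribution, not just over the training sample.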

In the following decomposition, I will take the squared loss as our loss function, and let us assume that the training set and the test set are generated from the same distribution p(x,y).

For a single sample point x, we can decompose the generalization error as follows (you can try to derive it using the properties of expectation and variance):
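$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$

where the expectations are taken over training sets drawn from $p(x,y)$ (and over the noise).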

The three terms represent:
1. The square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. E.g., when approximating a non-linear function f using a learning method for linear models, there will be error in the estimated function (f̂) due to this assumption.
2. The variance of the learning method, or, intuitively, how much the learning method will move around its mean.
3. The irreducible error sigma squared.

Make sure to keep in mind the experiment we repeat here: draw a new training set (always generated from the same distribution p(x,y)), train the model, and make a prediction at the same single point x. This is a very important point to understand.

A model with low bias and high variance is said to overfit the training data, and a model with high bias and low variance is said to underfit it. As the flexibility of the model changes, the bias and the variance move in opposite directions. If the model is complex, it tries to learn the details of the training data and to capture some special properties, and then becomes unable to generalize: it has low bias and high variance, and a small change in the training data induces important changes in the predicted value. If the model is not complex, we tend to have high bias and low variance.

Keep in mind that when you need metrics to judge your model, the bias and the variance are almost never calculated explicitly; they just have to be kept in mind to reason about overfitting and underfitting. There are, however, methods to estimate them, such as bootstrap sampling of the training data; if you understood the previous decomposition, you already have the idea of how this should work.
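A minimal simulation of the repeated experiment described above (drawing fresh training sets from the same distribution; the target function, noise level, and model are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)   # "true" target, known here only because we simulate
x0, sigma = 1.0, 0.3      # the single evaluation point and the noise level

preds = []
for _ in range(1000):
    # Draw a fresh training set: uniform x, y = f(x) + Gaussian noise.
    X = rng.uniform(0, 3, 30)
    y = f(X) + rng.normal(0, sigma, 30)
    # Fit a rigid model (a straight line) and predict at the same point x0.
    w = np.polyfit(X, y, 1)
    preds.append(np.polyval(w, x0))

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2  # squared bias at x0
variance = preds.var()               # variance at x0
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```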

This is what we call the Bias-Variance Trade-off.

Overfitting means good performance on the training data but poor generalization to other data. It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model’s ability to generalize.

Underfitting means poor performance on the training data and poor generalization to other data. Underfitting is often not discussed, as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms.

IV. Reduce overfitting

In this section, we will briefly present, without going into detail, some of the ways used to reduce overfitting. Notice that I said reduce, rather than eliminate, overfitting. A good model will probably still overfit at least a little bit. Every algorithm has its own ways to reduce overfitting, but there are some common approaches:

  1. Improve the data (this is what we do in corporate life, all day long) and add more of it
  2. Simplify the model
  3. Use k-fold cross-validation for validation and hyperparameter tuning
  4. Dimensionality reduction and feature selection
  5. Early stopping, regularization… (see the sketch after this list)
  6. Ensemble learning
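As one concrete example of point 5, here is a sketch of L2 regularization with scikit-learn’s Ridge (the data and the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] + rng.normal(size=200)  # only feature 0 matters

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# alpha controls the strength of the L2 penalty on the weights:
# larger alpha -> simpler model -> higher bias, lower variance.
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, model.score(X_val, y_val))  # validation R^2
```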

If you don’t know some of the points mentioned here, don’t worry, we will get to them later. I hope that was useful.
