In machine learning, overfitting is a problem that results from attempting to capture every variation in a data set. An overfit model leads to major errors when deployed to production, causing inaccurate predictions and unreliable results. In this article, we’ll explore what causes overfitting in the machine learning model development process and how to fix it to ensure your machine learning projects are reliable.
What Is Overfitting in Machine Learning?
An overfit machine learning model fits the training data too precisely, attempting to capture every variation in the data set. In trying to be so precise, it risks producing large errors in production and unreliable predictions and analysis.
What Is Overfitting?
Although many machine learning model development methodologies exist, they all include the steps shown below in figure one.
One critical step in the process is the model fitting (or training) step. In this step, we use the training data set to fit the model and the validation data set to assess its quality. At this phase of the process, the problem of overfitting or underfitting occurs.
Model overfitting (or underfitting) is one aspect of assessing the quality of a model. A good machine learning model must be:
- Accurate
- Robust
- Small (parsimonious)
- Explainable
Data scientists sometimes sacrifice the last attribute for better accuracy and robustness. Achieving all four attributes in any model is not straightforward, however. Improving the accuracy of the model fitted using the training data set usually comes at the expense of the other three characteristics, as shown in figure two.
Improving accuracy while fitting the model on the training data set could result in overfitting, which leads to high prediction errors when we test the model using the validation data set. It also increases the size and complexity of the model, thus reducing our ability to explain the model’s predictions.
Let’s explain this through a simple example. A shipping company that ships parcels would like to develop a model to calculate the cost of shipping parcels (y) in terms of:
- The shipping distance in miles (x)
- The weight of the parcel in pounds (w)
- The volume of the parcel in cubic feet (h)
An example of a small, simple model would be a straight-line (linear) combination of the three variables.
An example of a slightly more complex model would bring in more variables.
Finally, an example of a complex model would use only the same three variables but combine them in many higher-order and interaction terms, as sketched below.
Note that these are not actual models, merely examples.
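For instance, one way such models might look, written with purely hypothetical coefficients $a_0, a_1, \dots$ and an illustrative extra variable $t$ (say, handling time), is:

A small, simple model:

$$y = a_0 + a_1 x + a_2 w + a_3 h$$

A slightly more complex model with more variables:

$$y = a_0 + a_1 x + a_2 w + a_3 h + a_4 t$$

A complex model using only the same three variables:

$$y = a_0 + a_1 x + a_2 w + a_3 h + a_4 x^2 + a_5 w^2 + a_6 h^2 + a_7 xw + a_8 xh + a_9 wh$$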
Overfitting vs. Underfitting
Let’s now examine in detail what the terms “overfitting” and “underfitting” mean. Figure three shows an example of three models fitted to data representing the cost of shipping a parcel (y) versus the shipping distance (x).
The solid black line is the training data used to fit the models. The red dashed line is a simple model (a straight line in this case), which is clearly an underfit model that would result in large model errors, even on the training data set. We say it is underfit because it is oversimplified, meaning it is not capturing data trends in their full complexity.
The black dashed line represents the predictions of a complex model that attempts to capture all the variabilities in the data. This is an overfit model because it would result in large errors when tested using the validation data set. Representing the other extreme of the underfit red dashed line, the overfit model is trying to be too precise. Thus, it misses the data’s broader trends.
The blue dashed line is considered a good model because it captures the trends in the data without oversimplification and without trying to replicate the data’s every ripple and variation.
How to Detect Overfitting
In figure three, we visually assessed the three models and guessed which one would be the best fit, avoiding both overfitting and underfitting. Another way to visualize this trade-off is shown in figure four.
Figure four shows that increasing the model’s complexity and size by introducing more predictors reduces the error on the training data set and, up to a point, reduces the error on the validation data set as well. Beyond that point, the validation error begins to grow again because the model is starting to overfit. This threshold corresponds to the dashed blue line in figure three and separates overfitting from underfitting. It represents the most appropriate balance of model size and complexity.
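As a rough sketch of this idea (using scikit-learn on synthetic data that stands in for the shipping example; the names and numbers here are illustrative, not from the article), the snippet below fits polynomial models of increasing complexity and compares the error on the training and validation sets:

```python
# A minimal sketch: watch training vs. validation error as model complexity grows.
# Synthetic data stands in for the shipping-cost example.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=60)                                      # a single predictor, e.g. distance (rescaled)
y = 5 + 2 * x + np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=60)  # mildly nonlinear "true" cost plus noise

X_train, X_val, y_train, y_val = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=0)

# Sweep model complexity: degree 1 underfits; very high degrees tend to overfit.
for degree in (1, 3, 8, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  validation MSE={val_mse:.3f}")
```

The validation error typically falls at first and then rises again once the polynomial degree becomes too high, which is exactly the turning point described above.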
An example of how to measure the model prediction error is the mean square error (MSE), which is the average of the squared differences between the actual and predicted values.
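In symbols, for a data set of $n$ points:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ is the actual value and $\hat{y}_i$ is the model’s prediction for the $i$-th point.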
Machine Learning Model Bias and Variance
As figure four shows, inspecting the error on the validation data set versus the model size and complexity will reveal the best model configuration. Two specific concepts, model bias and model variance, help us understand the behavior of the error as a function of the model predictions.
Bias
Let’s define the model error (e) as the difference between the actual value and the predicted value, or:
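$$e = y - \hat{y}$$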
where $y$ is the actual value and $\hat{y}$ (y with a circumflex) is the predicted value.
The bias is the difference between the actual value (y) and the expected prediction (expected value from the model). In other words, bias measures the systematic error when the model consistently misses the target. Figure five demonstrates the meaning of the bias using a target practice analogy, where the target’s inner circle represents low bias.
In the case of high bias in the target on the right, the relative distribution of the model predictions did not change, but their overall location shifted because of the bias.
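In the notation used earlier, the bias can be written as follows (the sign is only a convention, since just its square enters the error decomposition below):

$$\mathrm{Bias} = y - \mathbb{E}[\hat{y}]$$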
Variance
Variance, on the other hand, is the average of the square of the deviation of the predicted value from the mean predicted value. In other words, it is the variance (square value of the standard deviation) of the predicted values. It measures the amount of scatter of the predicted values around a central expected value. Figure six shows the target practice analogy in the case of low and high variance.
Figure six shows that the location of the center of the model predictions did not change, but the scatter around this center increased in the case of higher variance.
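In the same notation:

$$\mathrm{Variance} = \mathbb{E}\left[\left(\hat{y} - \mathbb{E}[\hat{y}]\right)^2\right]$$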
The MSE can be decomposed into three components as follows:
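$$\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Variance} + \text{Irreducible error}$$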
Irreducible error here represents the inherent limit on how well the variables in the model can predict the target value, no matter how the model is fitted.
Bias and variance are interesting because they are directly related to the concepts of overfitting and underfitting. In figure seven below, we can plot a chart similar to that in figure four, which demonstrates the trade-off between model bias and variance as they relate to overfitting and underfitting.
How to Avoid Overfitting and Underfitting
Because avoiding overfitting and underfitting is an important aspect of model development, most modeling algorithms include mechanisms to guard against this problem. These methods all follow the general strategy shown in figures four and seven: vary the model size and complexity and attempt to minimize the total model error. We will review three such examples for commonly used models: decision trees, regression, and neural networks. Most of these methods are implemented in open-source and proprietary machine learning platforms.
Decision Trees
In general, decision trees tend to overfit when they become large. Terminal nodes end up with a small number of records, and thus, predictions based on these small samples are not robust. Therefore, many pruning algorithms have been developed to minimize the total error, as shown in figure seven.
You can perform tree pruning while generating the splits by eliminating those splits that increase the total error on a validation data set. Alternatively, the algorithm can grow a large tree and then prune it back to reduce the error.
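As a rough sketch of the grow-then-prune approach (using scikit-learn’s cost-complexity pruning on synthetic data; the variable names are illustrative), one can grow a full tree and then pick the pruning strength that minimizes error on a validation set:

```python
# A minimal sketch: grow a large regression tree, then prune it back by choosing
# the cost-complexity parameter (ccp_alpha) with the lowest validation error.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 3))          # illustrative predictors: distance, weight, volume
y = 5 + 0.1 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 2, size=500)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths come from the cost-complexity path of the fully grown tree.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
ccp_alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

best_alpha, best_mse = 0.0, float("inf")
for alpha in ccp_alphas:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_val, pruned.predict(X_val))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

print(f"best ccp_alpha={best_alpha:.4f}  validation MSE={best_mse:.2f}")
```

Choosing ccp_alpha on a held-out validation set is one common way to implement this strategy; cross-validating over the same path of candidate values is another.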
Regression Models
Regression models and their derivatives have been around for a long time, and several schemes have been introduced to find the optimal model. These schemes insert and remove predictors in a specific order to hunt for the minimum error. For example, in the forward variable selection method, predictors are inserted one at a time, and the variable expected to yield the highest increase in model accuracy is added at each step. The process stops when no remaining predictor can significantly improve the model’s accuracy.
The backward scheme is just the opposite of the forward scheme: we start with a model containing all the possible predictors and then remove them one by one. A hybrid scheme known as the stepwise selection method works by inserting and removing variables iteratively until the best model is attained. Other model development schemes rely on a measure of model accuracy, such as R², to add and remove variables at each step.
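A hedged sketch of forward selection, using scikit-learn’s SequentialFeatureSelector on made-up data (and assuming scikit-learn 1.1 or newer for the "auto" stopping rule), might look like this:

```python
# A minimal sketch: forward variable selection with cross-validated scoring.
# Predictors are added one at a time until adding another no longer improves
# the score by at least `tol`.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                                    # eight candidate predictors (illustrative)
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 0.5, size=300)   # only two of them actually matter

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select="auto",       # stop when the improvement falls below `tol`
    tol=0.01,
    direction="forward",               # use "backward" for backward elimination
    scoring="neg_mean_squared_error",
    cv=5,
)
selector.fit(X, y)
print("selected predictor columns:", np.flatnonzero(selector.get_support()))
```

Switching direction to "backward" gives the backward scheme described above; stepwise-style procedures alternate between adding and removing variables.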
Neural Networks
Neural networks, specifically multilayer perceptron networks, are trained by iteratively adjusting the values of the weights between the neurons in the different layers. During the training iterations, commonly known as epochs, the weights change, and with each epoch the network reduces its error on the training data set. The prediction error on the validation data set typically behaves as shown in figure four. Sometimes, however, the validation error is not unimodal, meaning it has more than one minimum, as shown in figure eight.
To ensure that the model fitting process captures the global minimum and not just a local one, after finding a minimum (point A), training is resumed for a number of additional epochs to check whether a better minimum (point B) exists.
Although this simple strategy does not guarantee finding the global minimum’s location, it is usually sufficient for most practical cases.
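One hedged way to implement this strategy is with scikit-learn’s MLPRegressor, where the early_stopping option holds out part of the training data as a validation set and the n_iter_no_change parameter plays a role similar to the extra training epochs described above (the data and architecture below are purely illustrative):

```python
# A minimal sketch: early stopping for a multilayer perceptron. Training stops only
# after the validation score fails to improve for `n_iter_no_change` consecutive
# epochs, so a single local dip in the validation error does not end training early.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                   # illustrative predictors: distance, weight, volume
y = 5 + 2 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 0.2, size=1000)

model = MLPRegressor(
    hidden_layer_sizes=(16, 16),
    early_stopping=True,        # hold out 10% of the training data to monitor validation error
    validation_fraction=0.1,
    n_iter_no_change=20,        # keep training 20 more epochs before accepting a minimum
    max_iter=2000,
    random_state=0,
)
model.fit(X, y)
print(f"stopped after {model.n_iter_} epochs; best validation score (R^2): {model.best_validation_score_:.3f}")
```

The same idea carries over to deep learning frameworks, where early-stopping callbacks with a patience parameter play an equivalent role.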
Frequently Asked Questions
What are bias and variance in machine learning?
Bias is a metric that measures the extent to which model estimates deviate from the true answer in a systematic way. Variance, on the other hand, is the amount of uncertainty (scatter) in the estimated values.
What causes overfitting of a machine learning model?
Overfitting in machine learning happens when the model attempts to capture all the variability in the training data. This results in high errors on the validation data set and, later, when the model is used for scoring in production.