A Simple Intuition for Overfitting, or Why Testing on Training Data Is a Bad Idea
When you first start out in machine learning, you load a dataset and try out models. You might ask yourself: why can't I just build a model with all of the data and evaluate it on that same dataset?
It seems reasonable. More data to train the model on is better, right? Evaluating the model and reporting results on the same dataset will tell you how good the model is. Right? Wrong.
In this post, you will discover the problem with this reasoning and develop an intuition for why it is important to evaluate a model on unseen data.
Train and Test on the Same Dataset
If you have a dataset, such as the iris flowers dataset, what is the best model of that dataset?
The best model is the dataset itself. Take any given data instance and ask for its classification; you can look that instance up in the dataset and report the correct result every time.
This is the problem you are solving when you train and test a model on the same dataset.
You are asking the model to make predictions on data that it has seen before, the very data used to create the model. The best model for this problem is the look-up model described above.
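A minimal sketch of such a look-up model in Python (the measurements and labels below are invented for illustration, loosely styled on iris):

```python
# A toy "look-up model": it simply memorizes every training instance.
# The measurements and labels are made up for illustration.
training_data = {
    (5.1, 3.5): "setosa",
    (7.0, 3.2): "versicolor",
    (6.3, 3.3): "virginica",
}

def lookup_predict(instance):
    # Perfect on anything it has seen before, useless otherwise.
    return training_data.get(instance, "unknown")

print(lookup_predict((5.1, 3.5)))  # a training instance: always correct
print(lookup_predict((4.9, 3.0)))  # an unseen instance: no answer at all
```

Evaluated on its own training data, this "model" scores 100% accuracy, which is exactly why that score tells you nothing.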
There are a few cases where you do want to train a model and evaluate it on the same dataset.
You might want to simplify the explanation of a predictive variable from data. For example, you might want a set of simple rules or a decision tree that best describes the observations you have collected.
In this case, you are building a descriptive model.
These models can be very useful and can help you, in your project or your business, to better understand how the attributes relate to the value being predicted. You can add meaning to the results using the domain expertise you have.
The important limitation of a descriptive model is that it is limited to describing the data on which it was trained. You have no idea how accurate it is as a predictive model.
Modelling a Target Function
Consider a made-up classification problem whose goal is to classify data instances as either red or green.
For this problem, assume that there exists a perfect model, an ideal function that can correctly discriminate any data instance from the domain as red or green. For a specific problem, the ideal discrimination function very likely has deep meaning in the problem domain to the domain experts. We want to think about that, try to tap into that perspective, and deliver that result.
Our goal when making a predictive model for this problem is to best approximate this perfect discrimination function.
We build our approximation of the ideal discrimination function using sample data collected from the domain. It is not all possible data; it is a sample or subset of all possible data. If we had all of the data, there would be no need to make predictions, because the answers could simply be looked up.
The data we use to build our approximate model contains structure that relates to the ideal discrimination function. Your goal in data preparation is to best expose that structure to the modelling algorithm. The data also contains things that are irrelevant to the discrimination function, such as biases from the selection of the data and random noise that perturbs and hides the structure. The model you choose to approximate the function must navigate these obstacles.
This framework helps us understand the deeper difference between a descriptive and a predictive model.
Descriptive vs Predictive Models
The descriptive model is only concerned with modelling the structure in the observed data. It makes sense to train and evaluate it on the same dataset.
The predictive model is attempting a much harder problem: approximating the true discrimination function from a sample of data. We want to use algorithms that do not pick out and model all of the noise in our sample. We do want algorithms that generalize beyond the observed data. It follows that we can only evaluate the model's ability to generalize by testing it on data it did not see during training.
The best descriptive model is accurate on the observed data. The best predictive model is accurate on unseen data.
The flaw in evaluating a predictive model on its training data is that it does not tell you how well the model generalizes to new, unseen data.
A model selected for its accuracy on the training dataset rather than on a held-out test dataset is very likely to have lower accuracy on unseen data. The reason is that the model is less general: it has specialized itself to the structure in the training dataset. This is called overfitting, and it is more insidious than you might think.
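The gap between training and test accuracy can be sketched with a deliberately memorizing learner, a 1-nearest-neighbour classifier on a noisy toy problem (this setup is invented here for illustration, not taken from any particular dataset):

```python
import random

random.seed(42)

def make_sample(n):
    # True rule: label is 1 when x > 0.5; flip 10% of labels as noise.
    sample = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:
            y = 1 - y
        sample.append((x, y))
    return sample

train = make_sample(200)
test = make_sample(200)

def predict(x, memory):
    # 1-nearest neighbour: memorizes the sample, noisy labels included.
    return min(memory, key=lambda point: abs(point[0] - x))[1]

def accuracy(sample, memory):
    return sum(predict(x, memory) == y for x, y in sample) / len(sample)

train_acc = accuracy(train, train)  # each point is its own nearest neighbour
test_acc = accuracy(test, train)    # generalization is noticeably worse
print(train_acc, test_acc)
```

Training accuracy is a perfect 1.0 because the model has memorized the noise; accuracy on the fresh sample is substantially lower, which is the overfitting gap in miniature.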
For example, you might want to stop training your model once its accuracy stops improving. In this situation, there will be a point where accuracy on the training set continues to improve while accuracy on unseen data starts to degrade.
You may be thinking: "So I'll train on the training dataset and peek at the test dataset as I go." A nice idea, but then the test dataset is no longer unseen data, because it has been involved in and has influenced the training process.
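One common remedy for the peeking problem (a standard practice, though not spelled out in the text above) is a three-way split: peek at a separate validation set during training and touch the test set exactly once, at the end. The numbers and split sizes below are arbitrary:

```python
import random

examples = list(range(100))            # stand-ins for labelled instances
random.Random(0).shuffle(examples)

# Tune and "peek" on the validation set; the test set stays unseen
# until the final evaluation.
train, validation, test = examples[:60], examples[60:80], examples[80:]

print(len(train), len(validation), len(test))
```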
You must test your model on unseen data to counter overfitting.
A 66%/34% split of the data into training and test datasets is a good start. Using cross-validation is better, and using multiple repeats of cross-validation is better still. You want to put in the effort and get the best estimate of the model's accuracy on unseen data.
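The mechanics of k-fold cross-validation can be sketched in a few lines: shuffle the indices once, deal them into k folds, and hold each fold out in turn (a minimal sketch of the splitting logic only, with no model attached):

```python
import random

def k_fold_splits(n_instances, k, seed=7):
    # Shuffle indices once, then deal them into k roughly equal folds.
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test_idx = folds[held_out]
        # Train on the other k-1 folds; test on the held-out fold.
        train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
        splits.append((train_idx, test_idx))
    return splits

for train_idx, test_idx in k_fold_splits(12, 3):
    print(len(train_idx), len(test_idx))
```

Every instance is used for testing exactly once, which is why the averaged score is a less wasteful estimate than a single train/test split.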
You can improve the accuracy of your model on unseen data by reducing its complexity.
In the case of decision trees, for example, you can prune the tree (remove leaves) after training. This reduces the amount of specialization to the particular training dataset and increases generalization on unseen data. If you are using regression, for example, you can use regularization to constrain the complexity (the magnitude of the coefficients) during the training process.
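The coefficient-shrinking effect of regularization can be shown in miniature with one-variable ridge regression without an intercept, where the closed form is beta = sum(x*y) / (sum(x*x) + lambda). The data points are invented; this is a sketch of the shrinkage effect, not a full regression implementation:

```python
# One-variable ridge regression (no intercept), made-up data:
# a larger penalty lam shrinks the fitted coefficient toward zero.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

def ridge_coefficient(lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

ols = ridge_coefficient(0.0)      # ordinary least squares fit
shrunk = ridge_coefficient(10.0)  # penalized: smaller magnitude
print(ols, shrunk)
```

The penalized coefficient is strictly smaller in magnitude, which is the sense in which regularization constrains model complexity during training.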
In this post, you learned the important framework of phrasing the development of a predictive model as the approximation of an unknown, ideal discrimination function.
Under this framework, you learned that evaluating the model on training data alone is insufficient, and that the best and most meaningful way to assess a predictive model's ability to generalize is to evaluate it on unseen data.
This intuition provides the basis for why it is critical to use train/test splits, cross-validation, and ideally repeated cross-validation in your test harness when evaluating predictive models.