
How to evaluate ML algorithms

After you have defined your problem and prepared your data, you need to apply machine learning algorithms to the data in order to solve your problem.

You can spend a lot of time selecting, running, and tuning algorithms. You want to make sure you are using that time effectively to get closer to your goal.

In this blog post, you will step through a process to quickly test algorithms and discover whether there is structure in your problem for the algorithms to learn, and which algorithms are effective.

Test Harness 

You need to define a test harness. The test harness is the data you will train and test an algorithm against, together with the performance measure you will use to assess its performance. It is important to define your test harness well so that you can focus on evaluating different algorithms and thinking deeply about the problem.

The goal of the test harness is to be able to quickly and consistently test algorithms against a fair representation of the problem being solved. The outcome of testing multiple algorithms against the harness is an estimate of how a variety of algorithms perform on the problem against a chosen performance measure. You will know which algorithms might be worth tuning on the problem and which should not be considered further.

The results will also give you an indication of how learnable the problem is. If a variety of different learning algorithms all perform poorly on the problem, it may indicate a lack of structure available for algorithms to learn. This may be because there really is a lack of learnable structure in the selected data, or it may be an opportunity to try different transforms to expose the structure to the learning algorithms.

Performance Measure 

The performance measure is the way you wish to assess a solution to the problem. It is the measurement you will make of the predictions made by a trained model on the test dataset. 

Performance measures are typically specialized to the class of problem you are working with, for example classification, regression, or clustering. Many standard performance measures will give you a score that is meaningful to your problem domain. For example, classification accuracy for classification: the number of correct predictions divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.
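As a minimal sketch of the accuracy calculation described above, the following uses only plain Python. The function name `accuracy_percent` and the "spam"/"ham" labels are made up for illustration; they are not taken from any particular library.

```python
def accuracy_percent(y_true, y_pred):
    """Return the percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of predictions")
    # Count the positions where the prediction equals the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct * 100.0 / len(y_true)


# Hypothetical labels for a five-message spam problem.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]

# Four of the five predictions are correct.
print(accuracy_percent(actual, predicted))
```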

You may also want a more detailed breakdown of performance. For example, you might want to know about the false positives on a spam classification problem, because a false positive means a legitimate email is marked as spam and never gets read.
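One common way to get such a breakdown is to count true and false positives and negatives separately. The sketch below assumes a two-class problem where "spam" is treated as the positive class; the function name `confusion_counts` and the sample labels are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            # Predicted positive: correct only if the true label is positive.
            counts["tp" if t == positive else "fp"] += 1
        else:
            # Predicted negative: a miss if the true label was positive.
            counts["fn" if t == positive else "tn"] += 1
    return counts


# Hypothetical labels: "ham" stands for legitimate email.
actual = ["spam", "ham", "ham", "spam", "ham", "spam"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham"]

counts = confusion_counts(actual, predicted)
# counts["fp"] is the number of legitimate emails wrongly marked as spam.
print(counts)
```

Two models with the same overall accuracy can have very different false positive counts, which is why the breakdown is worth looking at on a problem like this.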

There are many standard performance measures to choose from. You rarely need to devise a new performance measure yourself, as you can generally find or adapt one that best captures the requirements of the problem being solved. Look at similar problems you have come across and at the performance measures used on them to see whether any can be adopted.

Test and Train Datasets 

From the transformed data, you will need to select a test set and a training set. An algorithm is trained on the training dataset and evaluated against the test set. This may be as simple as selecting a random split of the data (66% for training, 34% for testing) or may involve more complicated sampling methods.
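A simple random 66%/34% split could look like the sketch below. The helper name `train_test_split`, the fixed seed, and the stand-in dataset are assumptions made for illustration; the seed is only there so the split is repeatable between runs.

```python
import random


def train_test_split(rows, train_fraction=0.66, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in dataset of 100 rows.
data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))
```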

A trained model is not exposed to the test dataset during training, so any predictions made on that dataset are designed to be indicative of the performance of the model in general. As such, you want to make sure the selection of your datasets is representative of the problem you are solving.

Cross Validation 

A more sophisticated approach than using a single test and train dataset is to use the entire transformed dataset to train and test a given algorithm. A method you could use in your test harness that does this is called cross validation.

It first involves separating the dataset into a number of equally sized groups of instances called folds. The model is then trained on all folds except one that is left out, and the trained model is tested on that left-out fold. The process is repeated so that each fold gets an opportunity to be left out and act as the test dataset. Finally, the performance measures are averaged across all folds to estimate the capability of the algorithm on the problem.

For instance, a 3-fold cross validation would consist of training and evaluating a model 3 times. 

  • #1: Train on folds 1+2, test on fold 3 
  • #2: Train on folds 1+3, test on fold 2 
  • #3: Train on folds 2+3, test on fold 1 
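The fold rotation listed above can be sketched in plain Python as follows. The names `k_fold_indices`, `cross_validate` and `majority_class_score` are hypothetical, and the rows are assumed to be shuffled already. Folds are tested in ascending order here, which differs from the listing only in the order of the runs.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(rows, k, train_and_score):
    """Average the score returned by train_and_score(train, test) over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        train = [rows[i] for i in train_idx]
        test = [rows[i] for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


def majority_class_score(train, test):
    """Toy scorer: always predict the most common training label."""
    labels = [label for _, label in train]
    guess = max(sorted(set(labels)), key=labels.count)
    hits = sum(1 for _, label in test if label == guess)
    return hits * 100.0 / len(test)


# Six hypothetical (feature, label) rows, split into 3 folds of 2 rows each.
rows = [(0, "a"), (1, "a"), (2, "a"), (3, "b"), (4, "a"), (5, "b")]
for train_idx, test_idx in k_fold_indices(len(rows), 3):
    print("train on", train_idx, "test on", test_idx)
print(cross_validate(rows, 3, majority_class_score))
```

Every row is used for testing exactly once and for training in the remaining runs, which is what lets the whole dataset contribute to the estimate.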

The number of folds can vary based on the size of your dataset, but common numbers are 3, 5, 7 and 10 folds. The goal is to strike a good balance between the size and representation of data in your train and test sets.

When you are just getting started, stick with a simple split of train and test data (such as 66%/34%) and move on to cross validation once you have more confidence.

Testing Algorithms 

Once you have defined the problem and a test harness you are happy with, it is time to spot check a variety of ML algorithms. Spot checking is useful because it lets you very quickly see whether there is any learnable structure in the data and estimate which algorithms may be effective on the problem.

Spot checking also helps you work out any issues in your test harness and make sure the chosen performance measure is appropriate.

The best first algorithm to spot check is a random one. Plug in a random number generator to generate predictions in the appropriate range. This should be the worst result you achieve, and it will be the baseline against which all improvements are measured.
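A random baseline of this kind might be sketched as below, with one helper for classification and one for regression. Both helper names, the seeds, and the sample labels are assumptions made for illustration. "In the appropriate range" is interpreted here as "among the classes, or between the minimum and maximum target values, seen in the training data".

```python
import random


def random_class_baseline(train_labels, n_predictions, seed=0):
    """Classification: draw uniformly from the classes seen in training."""
    classes = sorted(set(train_labels))
    rng = random.Random(seed)
    return [rng.choice(classes) for _ in range(n_predictions)]


def random_value_baseline(train_targets, n_predictions, seed=0):
    """Regression: draw uniformly between the smallest and largest target."""
    low, high = min(train_targets), max(train_targets)
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n_predictions)]


# Hypothetical training and test labels for a spam problem.
train_labels = ["spam", "ham", "ham", "spam", "ham", "ham"]
test_labels = ["ham", "spam", "ham", "ham"]

guesses = random_class_baseline(train_labels, len(test_labels))
hits = sum(1 for a, g in zip(test_labels, guesses) if a == g)
print(f"baseline accuracy: {hits * 100.0 / len(test_labels):.1f}%")
```

Any algorithm that cannot beat this score on your test harness is not learning anything useful from the data.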

Choose 5-10 standard algorithms that are appropriate for your problem and run them through your test harness. By standard algorithms, we mean popular methods with no special configuration.

Appropriate for your problem means, for example, that the algorithms can handle regression if you have a regression problem.

Select methods from the groupings of algorithms we have already reviewed. We would like to include a diverse mixture of 10-20 different algorithms drawn from a diverse range of algorithm types. Depending on the library we are using, we may spot check 50 or more popular methods to flush out promising ones quickly.
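To make the idea concrete, here is a rough sketch of a spot-check loop. The three "algorithms" are tiny hand-written stand-ins (majority class, 1-nearest neighbour, nearest centroid), not library implementations, and the two-cluster dataset is synthetic. In practice you would plug in the standard algorithms from whichever library you use. All names below are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def fit_majority(train):
    """Always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_neighbour(train):
    """Predict the label of the single closest training row."""
    return lambda x: min(train, key=lambda row: sq_dist(row[0], x))[1]


def fit_nearest_centroid(train):
    """Predict the label whose mean feature vector is closest."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    centroids = {y: [sum(col) / len(xs) for col in zip(*xs)]
                 for y, xs in groups.items()}
    return lambda x: min(centroids, key=lambda y: sq_dist(centroids[y], x))


def accuracy_percent(model, test):
    hits = sum(1 for x, y in test if model(x) == y)
    return hits * 100.0 / len(test)


def spot_check(algorithms, train, test):
    """Fit every algorithm on train and score it on test with one measure."""
    return {name: accuracy_percent(fit(train), test)
            for name, fit in algorithms.items()}


# Synthetic data: two well-separated clusters labelled "a" and "b".
rng = random.Random(1)
rows = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(60)]
rows += [((rng.gauss(4, 1), rng.gauss(4, 1)), "b") for _ in range(60)]
rng.shuffle(rows)
train, test = rows[:80], rows[80:]

algorithms = {
    "majority_class": fit_majority,
    "nearest_neighbour": fit_nearest_neighbour,
    "nearest_centroid": fit_nearest_centroid,
}
results = spot_check(algorithms, train, test)
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:>18}: {score:.1f}%")
```

Because every algorithm is fitted on the same training rows and scored on the same test rows with the same measure, the scores can be compared directly. Adding another candidate is a one-line change to the dictionary.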

If you want to run a lot of methods, you may have to revisit data preparation and reduce the size of your selected dataset. This may reduce your confidence in the results, so test with a few different dataset sizes. You may want to use a smaller dataset for algorithm spot checking and a fuller dataset for algorithm tuning.


In this blog post by AICorespot, you learned about the importance of setting up a trustworthy test harness that consists of the selection of test and training datasets and a performance measure meaningful to your problem.

You also learned about the technique of spot checking a diverse range of machine learning algorithms on your problem using your test harness. You found out that this technique can quickly highlight whether there is learnable structure in your dataset (and if not, you can revisit data preparation) and which algorithms perform generally well on the problem (these may be candidates for further investigation and tuning).


If you want to dig deeper into this subject, you can learn more from the resource below.

  • Data Mining: Practical Machine Learning Tools and Techniques, Chapter 5: Credibility: Evaluating What’s Been Learned