An Intro to Feature Selection
What features should you leverage to develop a predictive model?
This is a tough question that may need in-depth know-how of the problem domain.
It is possible to automatically choose the features in your data that are most relevant to the issue you are working on. This procedure is referred to as feature selection.
In this blog post by AICoreSpot, you will find out about feature selection, the varieties of methods you can leverage, and a useful checklist that you can adhere to the next time you need to choose features for a machine learning model.
What is feature selection?
Feature Selection is also referred to as variable selection or attribute selection.
It is the automatic choosing of attributes in your data (like columns in tabular data) that are most relevant to the predictive modelling issue that you are working on.
Feature selection differs from dimensionality reduction. Both strategies look to minimize the number of attributes in the dataset, but a dimensionality reduction strategy does so by developing new combinations of attributes, whereas feature selection strategies include and exclude attributes already present in the data without altering them.
Instances of dimensionality reduction methods consist of Principal Component Analysis, Singular Value Decomposition, and Sammon’s mapping.
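To make the contrast concrete, here is a minimal sketch of dimensionality reduction with PCA, assuming scikit-learn and a small synthetic dataset. Note that the two output columns are new combined attributes, not a subset of the original five:

```python
# Sketch: dimensionality reduction with PCA, which creates new
# combinations of attributes rather than selecting existing ones.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 samples, 5 original attributes

pca = PCA(n_components=2)       # project onto 2 new components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)          # (100, 2)
```

Each of the two components is a weighted mix of all five original attributes, which is exactly what feature selection avoids.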
The issue that feature selection resolves
Feature selection strategies assist you in your mission to develop a precise predictive model. They assist you by selecting features that give you as good or better accuracy while requiring less data.
Feature selection strategies can be leveraged to detect and delete unwanted, redundant, and irrelevant attributes from the data that do not contribute to the accuracy of a predictive model, or may in fact reduce the accuracy of the model.
Fewer attributes are desired as it minimizes the intricacy of the model, and a simpler model is easier to comprehend and explain.
The objective of variable selection is three-fold: enhancing the forecasting performance of the predictors, furnishing quicker and more affordable predictors, and furnishing an improved comprehension of the underlying process that produced the data.
Feature Selection Algorithms
There are three general categories of feature selection algorithms: filter strategies, wrapper strategies, and embedded strategies.
Filter feature selection methods apply a statistical measure to assign a score to every feature. The features are ranked by the score and are either retained or deleted from the dataset. These strategies are typically univariate and consider each feature on its own, or with regard to the dependent variable.
Some instances of filter strategies consist of the Chi-squared test, information gain, and correlation coefficient scores.
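A filter method can be sketched in a few lines, assuming scikit-learn: score every feature with the Chi-squared test and keep only the top k. The iris dataset is used here purely for illustration:

```python
# Sketch of a filter method: score each feature with the
# chi-squared test and retain the top k highest-scoring ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # 4 features, all non-negative
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)   # (150, 4) -> (150, 2)
print(selector.scores_)             # per-feature chi-squared scores
```

No model is trained at any point: the scoring is a pure statistical measure of each feature against the target, which is what makes filter methods cheap.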
Wrapper methods treat the choosing of a grouping of features as a search problem, where differing combinations are prepped, assessed, and contrasted with other combinations. A predictive model is leveraged to assess a combination of features and assign a score on the basis of model accuracy.
The search procedure may be methodical, such as a best-first search; it may be stochastic, like a random hill-climbing algorithm; or it may leverage heuristics, such as forward and backward passes to add and remove features.
An instance of a wrapper method is the recursive feature elimination algorithm.
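A minimal sketch of recursive feature elimination, assuming scikit-learn: a logistic regression is fitted repeatedly, and the weakest feature is dropped each round until the requested number remains:

```python
# Sketch of the recursive feature elimination (RFE) wrapper method:
# fit a model, drop the least important feature, and repeat until
# only n_features_to_select remain.
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the retained features
print(rfe.ranking_)   # 1 = selected; higher = eliminated earlier
```

Because a model is trained at every step of the search, wrapper methods are far more expensive than filter methods, but they judge features by what actually matters: predictive accuracy.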
Embedded methods go about learning which features ideally contribute to the precision of the model as the model is being developed. The most typical variant of embedded feature selection strategies are regularization strategies.
Regularization methods, also referred to as penalization methods, introduce extra constraints into the optimization of a predictive algorithm (like a regression algorithm) that bias the model toward reduced complexity (smaller coefficients).
Instances of regularization algorithms are the LASSO, Elastic Net, and Ridge Regression.
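An embedded method can be sketched with the LASSO, assuming scikit-learn; the L1 penalty drives the coefficients of unhelpful features to exactly zero during training, so selection happens as a side effect of fitting. The alpha value here is illustrative, not tuned:

```python
# Sketch of an embedded method: L1 (LASSO) regularization zeroes
# out the coefficients of uninformative features while the model
# is being trained.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 10 features, but only 3 actually influence the target
X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)   # features with non-zero weight
print("selected feature indices:", kept)
```

The features whose coefficients survive are the selected set; no separate selection step is needed before or after training.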
A Trap when choosing features
Feature selection is another critical aspect of the applied machine learning process, just like model selection. You cannot fire and forget.
It is critical to treat feature selection as part of the model selection procedure. If you don’t, you may inadvertently inject bias into your models, which can result in overfitting.
For instance, you must integrate feature selection within the inner loop when you are leveraging accuracy estimation methods like cross-validation. This implies that feature selection is executed on the prepped fold right before the model receives training. A mistake would be to execute feature selection first to prep your data, then execute model selection and training on the chosen features.
If we adopt the correct procedure, and execute feature selection in every fold, there is no longer any information about the held-out cases in the choice of features leveraged in that fold.
The reason is that, in the incorrect procedure, the decisions made to choose the features were rendered on the complete training set, which in turn are passed on to the model. This may cause a model that is favoured by the chosen features to appear improved over the other models being evaluated, when as a matter of fact it is a biased result.
If you execute feature selection on all of the data and then cross-validate, then the test information in every fold of the cross-validation process was also leveraged to select the features and this is what biases the performance analysis.
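The correct procedure described above can be sketched, assuming scikit-learn, by wrapping the selector and model in a Pipeline so the selection is re-fitted inside every cross-validation fold and never sees the held-out data:

```python
# Sketch of the correct procedure: feature selection lives inside
# the Pipeline, so it is fitted only on each fold's training
# portion during cross-validation, never on the held-out fold.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),     # refit per fold
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```

The biased alternative would be to call `SelectKBest(...).fit_transform(X, y)` once on all the data first and only then cross-validate the model on the reduced matrix.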
Feature Selection Checklist
Isabelle Guyon and Andre Elisseeff, the authors of “An Introduction to Variable and Feature Selection”, furnish a brilliant checklist that you can leverage the next time you require to choose data features for your predictive modelling issue.
The critical parts of the checklist are reproduced below:
- Do you possess domain knowledge? If yes, develop an improved set of ad hoc features.
- Are your features commensurate? If no, think about normalizing them.
- Do you suspect interdependence of features? If yes, go about expanding your feature set by developing conjunctive features or products of features, as much as the computational resources you have available at your disposal will enable you to do.
- Do you require to prune the input variables (for example, for cost, speed, or data comprehension purposes)? If the answer is negative, develop disjunctive features or weighted sums of features.
- Do you require to evaluate features by themselves (for example, to comprehend their impact on the system, or due to their number being so large that you require a first filtering)? If the answer is positive, leverage a variable ranking methodology; otherwise, do it anyway to obtain baseline outcomes.
- Do you require a predictor? If no, stop.
- Do you have suspicions that your data is dirty (has a few meaningless input patterns and/or noisy outputs or incorrect class labels)? If yes, identify the outlier instances leveraging the top-ranking variables obtained in step 5 as a representation, then validate and/or discard them.
- Do you know what to try first? If no, leverage a linear predictor. Leverage a forward selection method with the “probe” strategy as a stopping criterion, or leverage the 0-norm embedded strategy for comparison. Adhering to the ranking of step 5, develop a sequence of predictors of the same nature leveraging increasing subsets of features. Can you match or improve performance with a reduced subset? If yes, try out a non-linear predictor with that subset.
- Do you have new ideas, time, computational assets, and enough instances? If the answer is positive, contrast various feature selection methods, including your new idea, correlation coefficients, backward selection, and embedded strategies. Leverage linear and non-linear predictors. Choose the best strategy with model selection.
- Do you desire a stable solution (to enhance performance and/or comprehension)? If the answer is positive, subsample your data and redo your analysis for several “bootstraps”.
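The variable-ranking step that several checklist items refer back to can be sketched, assuming scikit-learn: score every feature independently against the target and sort, giving a cheap first filter and a baseline for the later steps:

```python
# Sketch of the checklist's variable-ranking step: score each
# feature on its own (ANOVA F-test here) and rank, as a cheap
# first filter and a baseline for comparison.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)
scores, _ = f_classif(X, y)         # one F-score per feature

ranking = np.argsort(scores)[::-1]  # best feature first
for i in ranking:
    print(f"feature {i}: F = {scores[i]:.1f}")
```

Per the checklist, this ranking is worth computing even when you ultimately plan to use a wrapper or embedded method, purely as a baseline to beat.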