
How to detect outliers within your data

Many of you may have questions about how to detect outliers in a dataset when working with machine learning algorithms. 

This blog post by AICorespot aims to answer those questions. 

If you have a question about machine learning, our earlier blog posts are a goldmine of information on the subject. If you would like us to feature articles on specific topics, please send an email to editorial@aicorespot.io. 

Outliers 

Several ML algorithms are sensitive to the range and distribution of attribute values within the input data. 

Outliers in the input data can skew and misdirect the training of ML algorithms, resulting in longer training times, less accurate models, and ultimately poorer results. 

Even before predictive models are built on training data, outliers can produce misleading representations and, in turn, misleading interpretations of the collected data. Outliers can skew summary statistics of attribute values, such as the mean and standard deviation, and distort plots like histograms and scatterplots, compressing the body of the data. 
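
To make this concrete, here is a minimal sketch in Python with NumPy (not part of the original post; the sample values are illustrative) showing how a single extreme value can distort the mean and standard deviation:

```python
import numpy as np

# A small sample of values clustered around 50
values = np.array([48, 50, 51, 49, 52, 50, 47, 51])

# The same sample with a single extreme value appended
with_outlier = np.append(values, 500)

print(values.mean(), values.std())              # roughly 49.8 and 1.6
print(with_outlier.mean(), with_outlier.std())  # roughly 99.8 and 141.5
```

One extreme value drags the mean to double its typical level and inflates the standard deviation by nearly two orders of magnitude, which is exactly the compression effect described above.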

Lastly, outliers can signify data instances that are relevant to the problem itself, such as anomalies in fraud detection and computer security. 

Outlier Modelling 

Outliers are extreme values that fall far outside the other observations. For instance, in a normal distribution, outliers might be values in the tails of the distribution. 

The process of detecting outliers goes by several names within data mining and machine learning, such as outlier mining, outlier modelling, novelty detection, and anomaly detection. 

In his book Outlier Analysis, Aggarwal provides a useful taxonomy of outlier detection strategies, as follows: 

  • Extreme value analysis: Determine the statistical tails of the underlying distribution of the data. For instance, statistical methods like z-scores on univariate data. 
  • Probabilistic and statistical models: Determine improbable instances from a probabilistic model of the data. For instance, Gaussian mixture models optimised using expectation-maximisation (see the sketch after this list). 
  • Linear models: Projection methods that model the data into lower dimensions using linear correlations, for instance principal component analysis; data points with large residual errors may be outliers. 
  • Proximity-based models: Data instances that are isolated from the bulk of the data, as determined by cluster, density, or nearest-neighbour analysis.  
  • Information-theoretic models: Outliers are identified as data instances that increase the complexity (minimum code length) of the dataset. 
  • High-dimensional outlier detection: Methods that search subspaces for outliers, given the breakdown of distance-based measures in higher dimensions (the curse of dimensionality). 
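
To make the probabilistic approach concrete, here is a minimal sketch using scikit-learn's GaussianMixture. The synthetic two-cluster data, the injected outliers, and the 1% likelihood threshold are illustrative assumptions, not something prescribed by the taxonomy above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters plus a handful of far-away points
data = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.normal(8, 1, size=(200, 2)),
    [[20, 20], [-15, 10], [30, -5]],   # injected outliers
])

# Fit a two-component mixture with expectation-maximisation;
# score_samples returns the per-point log-likelihood under the model
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
log_likelihood = gmm.score_samples(data)

# Flag the least likely 1% of points as outlier candidates (threshold is a choice)
threshold = np.percentile(log_likelihood, 1)
outliers = data[log_likelihood < threshold]
print(outliers)
```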

Aggarwal notes that the interpretability of an outlier model is crucially important: context or rationale is needed for why a particular data instance is or is not an outlier. 

In his contributed chapter to the Data Mining and Knowledge Discovery Handbook, Irad Ben-Gal puts forth a taxonomy of outlier models as univariate or multivariate, and parametric or nonparametric. This is a good way to structure methods on the basis of what is known about the data. For instance: 

  • Are you concerned with outliers in one attribute or in more than one (univariate or multivariate methods)? 
  • Can you assume a statistical distribution from which the observations were sampled (parametric or nonparametric methods)? 

There are many methods, and a lot of work has gone into outlier detection. Begin by making some assumptions, then design experiments where you can clearly observe the effects of those assumptions against some performance or accuracy measure. 

Extreme Value Analysis 

You do not need to know sophisticated statistical techniques to look for, analyse, and filter out outliers from your data. Begin simply with extreme value analysis. 

  • Concentrate on univariate methods. 
  • Visualise the data using scatterplots, histograms, and box-and-whisker plots, and look for extreme values. 
  • Assume a distribution (e.g. Gaussian) and look for values more than two or three standard deviations from the mean, or more than 1.5 times the interquartile range below the first quartile or above the third quartile. 
  • Filter out candidate outliers from the training dataset and evaluate your model's performance (see the sketch after this list). 
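
Here is a minimal sketch of both rules in Python with NumPy; the synthetic Gaussian data and the exact cutoffs (3 standard deviations, 1.5 times the IQR) are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
# Gaussian data plus two extreme values
x = np.append(rng.normal(50, 5, 500), [120, -30])

# Rule 1: flag values more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 3]

# Rule 2: flag values beyond 1.5 * IQR below the first or above the third quartile
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)
```

Both rules flag the two injected extremes; on real data the two rules can disagree, which is itself useful information about the shape of your distribution.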

Proximity Methods 

Once you have explored the simpler extreme value methods, consider moving on to proximity-based methods. 

  • Use clustering methods to identify the natural clusters in the data (such as the k-means algorithm). 
  • Identify and mark the cluster centroids. 
  • Identify data instances that are a fixed distance or percentage distance away from the cluster centroids. 
  • Filter out candidate outliers from the training dataset and evaluate your model's performance (see the sketch after this list).  
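
A minimal sketch of these steps using scikit-learn's KMeans follows; the two-cluster synthetic data, the injected outliers, and the 99th-percentile distance cutoff are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
data = np.vstack([
    rng.normal(0, 1, size=(150, 2)),
    rng.normal(10, 1, size=(150, 2)),
    [[5, 25], [-20, -20]],            # injected outliers
])

# Steps 1-2: find the natural clusters and their centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
centroids = kmeans.cluster_centers_

# Step 3: distance from each point to its assigned centroid
distances = np.linalg.norm(data - centroids[kmeans.labels_], axis=1)

# Step 4: flag the farthest 1% of points as outlier candidates (the cutoff is a choice)
cutoff = np.percentile(distances, 99)
outliers = data[distances > cutoff]
print(outliers)
```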

Projection Methods 

Projection methods are relatively simple to apply and quickly highlight extraneous values.  

  • Use projection methods to summarise your data in two dimensions (such as PCA, SOM, or Sammon's mapping). 
  • Visualise the mapping and identify outliers by hand. 
  • Use proximity measures from the projected values or codebook vectors to identify outliers. 
  • Filter out candidate outliers from the training dataset and evaluate your model's performance (see the sketch after this list). 
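
Here is a minimal sketch of the PCA variant using scikit-learn; the 10-dimensional synthetic data, the anomalous rows, and the 99th-percentile distance cutoff in the projected space are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# 10-dimensional data plus a few anomalous rows
base = rng.normal(0, 1, size=(300, 10))
data = np.vstack([base, rng.normal(12, 1, size=(3, 10))])

# Project down to two dimensions
projected = PCA(n_components=2).fit_transform(data)

# A simple proximity measure in the projected space:
# distance from the centre of the 2-D cloud
center = projected.mean(axis=0)
distances = np.linalg.norm(projected - center, axis=1)

cutoff = np.percentile(distances, 99)  # the cutoff is a choice
outliers = data[distances > cutoff]
print(len(outliers))
```

In practice you would also plot the two projected dimensions and inspect the points beyond the cutoff by eye before discarding them.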

Strategies Robust to Outliers 

An alternative approach is to switch to models that are robust to outliers. There are robust forms of regression that minimise the median of squared errors instead of the mean (so-called robust regression), but they are more computationally expensive. There are also methods, such as decision trees, that are robust to outliers. 
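
As an illustration, the sketch below compares ordinary least squares with a robust alternative. Note that scikit-learn does not ship a least-median-of-squares estimator, so this example substitutes HuberRegressor, a different but related robust method; the synthetic data and the corruption scheme are also assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=100)
y[:5] += 50                     # corrupt a few targets with large errors

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

# The robust fit stays close to the true slope of 3;
# ordinary least squares is pulled toward the corrupted points
print(ols.coef_, huber.coef_)
```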

You should spot-check some methods that are robust to outliers. If they yield significant gains in model accuracy, there may be an opportunity to model and filter out outliers from your training data. 

Resources 

There are plenty of webpages that discuss outlier detection, but our recommendation is to read the literature on the topic, which is more authoritative. Even introductory books on machine learning and data mining offer little coverage of outliers. For a classical treatment of outliers by statisticians, see the following: 

  • Robust Regression and Outlier Detection by Rousseeuw and Leroy, published in 2003. 
  • Outliers in Statistical Data by Barnett and Lewis, published in 1994. 
  • Identification of Outliers, a monograph by Hawkins, published in 1980. 

For a modern rendering of outliers by the data mining community, take a look at: 

  • Outlier Analysis by Aggarwal, published in 2013. 
  • Chapter 7 by Irad Ben-Gal in the Data Mining and Knowledge Discovery Handbook, edited by Maimon and Rokach, published in 2010. 