How to manage missing values in machine learning data with Weka

Data is rarely clean, and it often contains corrupt or missing values.

It is important to identify, mark, and handle missing data when developing machine learning models in order to get the best performance.

In this article, you will discover how to handle missing values in your machine learning data using Weka.

After reading this article, you will know:

  • How to mark missing values in your dataset
  • How to remove data with missing values from your dataset
  • How to impute missing values

Predict the Onset of Diabetes

The problem used for this example is the Pima Indians onset of diabetes dataset.

It is a classification problem where each instance represents medical details for a single patient, and the task is to predict whether the patient will have an onset of diabetes within the next five years.

You can learn more about the dataset here:

Dataset file

Dataset details

You can also access this dataset in your Weka installation, under the data/ directory, in the file called diabetes.arff.

Mark Missing Values

The Pima Indians dataset is a good basis for exploring missing data.

Some attributes, such as blood pressure (pres) and Body Mass Index (mass), have values of zero, which are impossible. These are examples of corrupt or missing data that must be marked manually.

You can mark missing values in Weka using the NumericCleaner filter. The recipe below shows how to use this filter to mark the 11 missing values on the Body Mass Index (mass) attribute.

  1. Open the Weka Explorer.
  2. Load the Pima Indians onset of diabetes dataset.
  3. Click the “Choose” button for the Filter and select NumericCleaner; it is under unsupervised.attribute.NumericCleaner.
  4. Click on the filter to configure it.
  5. Set the attributeIndices to 6, the index of the mass attribute.
  6. Set minThreshold to 0.1E-8 (close to zero), which is the minimum value allowed for the attribute.
  7. Set minDefault to NaN, which is unknown and will replace values below the threshold.
  8. Select the “OK” button on the filter configuration.
  9. Select the “Apply” button to apply the filter.

Click “mass” in the “attributes” pane and review the details of the “selected attribute”. Notice that the 11 attribute values that were formerly set to 0 are now marked as missing.

In this example, we marked values below a threshold as missing.

You could just as easily mark them with a specific numerical value. You could also mark values as missing between an upper and lower range of values.
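
To make the thresholding rule concrete, here is a minimal plain-Java sketch of what the recipe above does to a single numeric column. It illustrates the rule only and is not Weka's implementation; the class name `MarkMissingSketch`, its helper methods, and the sample values are assumptions made up for this example.

```java
import java.util.Arrays;

// Hypothetical illustration: replace values below a minimum threshold
// with a default (here NaN, which stands for "missing").
class MarkMissingSketch {

    // Returns a copy of values where every entry below minThreshold
    // is replaced by minDefault. Existing NaN entries are left alone,
    // because any comparison with NaN is false.
    static double[] cleanBelow(double[] values, double minThreshold, double minDefault) {
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] < minThreshold) {
                out[i] = minDefault;
            }
        }
        return out;
    }

    // Counts the entries that are marked as missing (NaN).
    static long countMissing(double[] values) {
        return Arrays.stream(values).filter(Double::isNaN).count();
    }

    public static void main(String[] args) {
        // Made-up body mass index values; zeros play the role of corrupt entries.
        double[] mass = {30.5, 0.0, 25.0, 0.0, 41.2};
        double[] marked = cleanBelow(mass, 0.1E-8, Double.NaN);
        System.out.println(Arrays.toString(marked));
        System.out.println("missing: " + countMissing(marked));
    }
}
```

Passing a number instead of `Double.NaN` as the default corresponds to marking the values with a specific numerical value rather than as missing.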

Next, let’s look at how we can remove instances with missing values from our dataset.

Remove Missing Data

Now that you know how to mark missing values in your data, you need to learn how to handle them.

A simple way to handle missing data is to remove those instances that have one or more missing values.

You can do this in Weka using the RemoveWithValues filter.

Continuing on from the recipe above to mark missing values, you can remove instances with missing values as follows:

  1. Click the “Choose” button for the Filter and select RemoveWithValues; it is under unsupervised.instance.RemoveWithValues.
  2. Click on the filter to configure it.
  3. Set the attributeIndex to 6, the index of the mass attribute.
  4. Set matchMissingValues to “True”.
  5. Click the “OK” button to use the configuration for the filter.
  6. Select the “Apply” button to apply the filter.

Click “mass” in the “attributes” section and review the details of the “selected attribute”.

Notice that the 11 instances whose mass value was marked as missing have been removed from the dataset.

Note that you can undo this operation by clicking the “Undo” button.
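
The effect of this recipe can be sketched in a few lines of plain Java: keep only the rows whose value in the chosen column is not missing. This is an illustration of the idea, not Weka's code; the class `RemoveMissingSketch`, the method `dropRowsMissingAt`, and the two-column sample rows are hypothetical. Note that the sketch uses a 0-based column position, whereas the attribute index in the Weka recipe is 1-based.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: drop every row that has a missing (NaN)
// value in one chosen column.
class RemoveMissingSketch {

    // Returns a new list holding only the rows whose entry at the given
    // 0-based column is not NaN. The input list is not modified.
    static List<double[]> dropRowsMissingAt(List<double[]> rows, int column) {
        List<double[]> kept = new ArrayList<>();
        for (double[] row : rows) {
            if (!Double.isNaN(row[column])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up rows of {blood pressure, body mass index}.
        List<double[]> rows = new ArrayList<>();
        rows.add(new double[]{72.0, 30.5});
        rows.add(new double[]{66.0, Double.NaN});
        rows.add(new double[]{64.0, 25.0});

        List<double[]> kept = dropRowsMissingAt(rows, 1);
        System.out.println("rows before: " + rows.size() + ", rows after: " + kept.size());
    }
}
```

Because the helper returns a new list and leaves the input untouched, the original rows remain available, in the same spirit as the “Undo” button.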

Impute Missing Values

Instances with missing values do not have to be removed; you can replace the missing values with some other value instead.

This is called imputing missing values.

It is common to impute missing values with the mean of the numerical distribution. You can do this easily in Weka using the ReplaceMissingValues filter.

Continuing on from the first recipe above to mark missing values, you can impute the missing values as follows:

  1. Click the “Choose” button for the Filter and select ReplaceMissingValues; it is under unsupervised.attribute.ReplaceMissingValues.
  2. Click on the “Apply” button to apply the filter to your dataset.

Click “mass” in the “attributes” section and review the details of the “selected attribute”.

Notice that the 11 attribute values that were marked as missing have been set to the mean value of the distribution.
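
Mean imputation for one numeric column can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not Weka's implementation: the class `ImputeMeanSketch`, both helper methods, and the sample values are hypothetical, and the mean is computed only from the values that are present.

```java
import java.util.Arrays;

// Hypothetical illustration: replace missing (NaN) entries of a numeric
// column with the mean of the observed entries.
class ImputeMeanSketch {

    // Mean of the non-missing values; NaN if every value is missing.
    static double meanOfObserved(double[] values) {
        double sum = 0.0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    // Returns a copy of values with each NaN replaced by the observed mean.
    static double[] imputeMean(double[] values) {
        double mean = meanOfObserved(values);
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up values; NaN marks the missing entries.
        double[] mass = {30.0, Double.NaN, 20.0, Double.NaN, 40.0};
        System.out.println(Arrays.toString(imputeMean(mass)));
    }
}
```

With these sample values the observed mean is 30.0, so both missing entries become 30.0 while the three observed values are unchanged.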

Conclusion

In this article, you discovered how to handle missing data in your machine learning dataset using Weka.

Specifically, you learned:

  • How to mark corrupt values as missing in your dataset
  • How to remove instances with missing values from your dataset
  • How to impute mean values for missing values in your dataset