Techniques to handle missing values

Rushikesh Lavate
7 min read · Jan 27, 2022

Hello folks,
Before creating a machine learning model, one of the most important tasks is to deal with missing values. In this blog I try to explain how to handle quantitative and qualitative missing values.
Before dealing with missing values, let’s first understand the types of missing values :

  • Missing Completely At Random (MCAR): There is no relationship between whether a value is missing and any other values in the dataset, observed or missing.
  • Missing At Random (MAR): The missing data can be predicted from other features in the dataset; whether a value is missing is systematically related to the observed data, but not to the unobserved data.
  • Missing Not At Random (MNAR): Whether a value is missing is related to values that are themselves unobserved, such as the missing value itself.

Now that we understand the types of missing values, let’s see how to handle missing values.

Handling Categorical/Qualitative Missing Values:

  1. Delete the rows:

This method is commonly used to handle null values. Here, we delete a particular row if it has a null value for a particular feature. This method is advised only when there are enough samples in the data set, and one has to make sure that deleting the data does not add bias. Removing data leads to loss of information, which may prevent the model from giving the expected results when predicting the output.
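
As a quick illustration, here is a minimal pandas sketch; the DataFrame df and its column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a few missing entries
df = pd.DataFrame({
    "gender": ["M", "F", np.nan, "F"],
    "city":   ["Pune", np.nan, "Mumbai", "Delhi"],
})

# Drop every row that contains at least one missing value
df_complete = df.dropna(axis=0)

# Or drop rows only when one specific feature is missing
df_gender_known = df.dropna(subset=["gender"])
```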

Advantages:

  • Robust and highly accurate model
  • Deleting a particular row or column that carries no specific information is reasonable, since it does not have high weightage

Disadvantages:

  • Loss of information and data
  • Works poorly if the percentage of missing values is high (say 30% of the whole dataset).

2. Mode imputation:

Missing values are replaced with the most frequent value (the mode) of the feature column. The mode is a sensible choice when one value occurs much more often than the others. This method is suitable when data is Missing Completely At Random (MCAR) and no more than about 5% of the variable contains missing data.
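
A rough sketch with pandas and scikit-learn; the column name city and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with missing entries
df = pd.DataFrame({"city": ["Pune", "Pune", np.nan, "Mumbai", np.nan]})

# Plain pandas: fill NaN with the most frequent category (the mode)
df["city_filled"] = df["city"].fillna(df["city"].mode()[0])

# scikit-learn equivalent: fit on the training set, reuse the fitted imputer in production
imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = imputer.fit_transform(df[["city"]])
```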

Advantages:

  • Fast way of obtaining complete datasets.
  • It can be used in production, i.e. during model deployment.

Disadvantages:

  • Imputing approximations adds bias and distorts the variance of the variable.
  • Works poorly compared to multiple-imputation methods.

3. Develop a model to predict missing values: One smart way of doing this is to train a classifier that uses the column with missing values as the dependent variable and the other features of the data set as predictors, and then impute the missing entries with the trained classifier's predictions. This method may result in better accuracy, unless the missing value is expected to have very high variance.

One can experiment with different algorithms and check which gives the best accuracy instead of sticking to a single algorithm.
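
A minimal sketch of this idea using a random forest from scikit-learn; the DataFrame, the target column segment, and the predictor columns are hypothetical, and in practice the predictors must themselves be complete or already imputed:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical dataset: "segment" is categorical with missing entries,
# "age" and "income" are complete features used to predict it
df = pd.DataFrame({
    "age":     [25, 32, 47, 51, 38, 29],
    "income":  [30, 45, 80, 95, 60, 40],
    "segment": ["A", "B", np.nan, "C", np.nan, "A"],
})

known = df[df["segment"].notna()]
unknown = df[df["segment"].isna()]

# Fit a classifier on the rows where the value is observed
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(known[["age", "income"]], known["segment"])

# Impute the missing entries with the classifier's predictions
df.loc[df["segment"].isna(), "segment"] = clf.predict(unknown[["age", "income"]])
```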

Advantages :

  • Imputing the missing variable is an improvement as long as the resulting bias is smaller than the omitted-variable bias
  • Yields unbiased estimates of the model parameters

Disadvantages:

  • Bias also arises when an incomplete conditioning set is used for a categorical variable
  • The imputed values are only a proxy for the true values

4. Deleting the variable:

If a feature column has an exceptionally large number of missing values, we can drop the whole column. This method is commonly used to handle null values: here, we delete a particular column if more than 70–75% of its values are missing. It is advised only when there are enough features in the data set, and one has to make sure that deleting the column does not add bias. Removing the data leads to loss of information, which may prevent the model from giving the expected results when predicting the output.
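
A minimal pandas sketch, assuming a hypothetical DataFrame df and a 70% threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: "drop_me" is mostly missing
df = pd.DataFrame({
    "keep_me": [1, 2, 3, 4, 5],
    "drop_me": [np.nan, np.nan, np.nan, np.nan, 7],
})

# Fraction of missing values per column
missing_fraction = df.isna().mean()

# Drop columns where more than 70% of the values are missing
threshold = 0.70
df = df.drop(columns=missing_fraction[missing_fraction > threshold].index)
```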

Advantages:

  • Robust and highly accurate model
  • Deleting a particular row or column that carries no specific information is reasonable, since it does not have high weightage

Disadvantages :

  • Loss of information and data
  • Works poorly if the percentage of missing values is high (say 30% of the whole dataset).

5. Arbitrary imputation:

Replace all NaN values with an arbitrary value (e.g. "Missing"); this is used especially when there is a large number of frequent categories. Since categorical features have a definite number of classes, we can assign an extra class to the missing values. For example, if the features col1 and col2 have missing values, these can be replaced with a new category, say, U for 'unknown'.

This strategy adds more information to the dataset, which changes its variance.
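
A minimal sketch with pandas, using the hypothetical columns col1 and col2 mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical columns with missing entries
df = pd.DataFrame({
    "col1": ["red", np.nan, "blue", np.nan],
    "col2": ["S", "M", np.nan, "L"],
})

# Treat missingness as its own class instead of guessing a value: U for "unknown"
df[["col1", "col2"]] = df[["col1", "col2"]].fillna("U")
```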

Advantages:

  • Only one extra category is added, resulting in low variance after one-hot encoding.
  • Negates the loss of data by adding a unique category

Disadvantages:

  • The added category contributes little variance
  • Adds another feature to the model while encoding, which may result in poor performance

Handling Numerical/Quantitative Missing Values:

  1. Mean / median imputation:

Missing values (NA/NaN) are replaced with the mean or median value of the entire feature column. This method is suitable for numerical variables and is applied when data is Missing Completely At Random (MCAR). If the variable follows a normal distribution, the mean and median are approximately the same; if the variable has a skewed distribution, the median is a better representation.

You can use this method when data is missing completely at random, and no more than 5% of the variable contains missing data.
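
A minimal sketch, with a hypothetical numerical column age:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical column with missing entries
df = pd.DataFrame({"age": [25.0, 32.0, np.nan, 51.0, np.nan, 29.0]})

# Plain pandas: the median is the safer choice if the distribution is skewed
df["age_median"] = df["age"].fillna(df["age"].median())

# scikit-learn version: fit the statistic on the training data, reuse it in production
imputer = SimpleImputer(strategy="mean")  # or strategy="median"
df["age_mean"] = imputer.fit_transform(df[["age"]]).ravel()
```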

Advantages:

  • Easy to implement.
  • Fast way of obtaining complete datasets.
  • It can be used in production, i.e. during model deployment.

Disadvantages:

  • It distorts the original variable distribution and variance.
  • It distorts the covariance with the remaining dataset variables.
  • The higher the percentage of missing values, the higher the distortions.

2. Random sample imputation:

This method replaces missing data with values randomly sampled from the observed values of the same variable in the training set. It is suitable for both numerical and categorical variables. We replace the missing values with other values drawn from the same distribution as the original variable.
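
A minimal pandas sketch of the idea, with a hypothetical column income (a fixed random_state keeps the result reproducible):

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with missing entries
df = pd.DataFrame({"income": [30.0, np.nan, 80.0, 95.0, np.nan, 40.0]})

# Draw one observed value per missing entry from the same variable
missing_mask = df["income"].isna()
sampled = df["income"].dropna().sample(n=missing_mask.sum(), replace=True, random_state=0)

# Align the sampled values with the positions of the missing entries, then fill them in
sampled.index = df.index[missing_mask]
df.loc[missing_mask, "income"] = sampled
```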

Advantages:

  • Easy to implement and a fast way of obtaining complete datasets.
  • It can be used in production.
  • Preserves the variance of the variable.

Disadvantages:

  • Randomness: the imputed values change from run to run unless the random seed is fixed.
  • The relationship between imputed variables and other variables may be affected if there are a lot of missing values.
  • Memory requirements can be large at deployment time, as we need to store the original training set to extract values from and replace the missing values with the randomly selected values.

3. Capturing NAN values with a new feature:

A new binary feature is created that takes the value 1 where the original value is NaN and 0 otherwise. This method should be applied when the missing data is of type MNAR.
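
A minimal sketch, with a hypothetical column age:

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with missing entries
df = pd.DataFrame({"age": [25.0, np.nan, 47.0, np.nan]})

# New binary feature: 1 where the value was missing, 0 otherwise
df["age_was_missing"] = df["age"].isna().astype(int)

# The original column still needs to be imputed, e.g. with the median
df["age"] = df["age"].fillna(df["age"].median())
```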

Advantages of missing indicator:

  • Easy to implement.
  • It can capture the importance of missing data.
  • It can be integrated into production.

Disadvantages of missing indicator:

  • It expands the feature space.
  • The original variable still needs to be imputed.
  • Many missing indicators may end up being identical or very highly correlated.

4. End of distribution imputation:

Values that lie far away from the rest of the distribution are used: considering the standard deviation, a value beyond the 3rd standard deviation is chosen, and the NaN entries are replaced with it. The replacement value should be computed on the training set and then propagated to the test set. Because the cut-off is computed from the mean rather than the median, it is affected by outliers.
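
A minimal sketch, assuming a roughly normal, hypothetical column age; the cut-off is computed on the training data as mean + 3 standard deviations:

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with missing entries
df = pd.DataFrame({"age": [25.0, 32.0, np.nan, 51.0, np.nan, 29.0, 38.0]})

# End-of-distribution value: mean + 3 standard deviations (training data only)
end_value = df["age"].mean() + 3 * df["age"].std()

# Replace the missing entries with that value
df["age"] = df["age"].fillna(end_value)
```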

Advantage:

  • It is quick
  • Captures the importance of missing values

Disadvantages:

  • Performing this action may distort the variable
  • May mask the variable's predictive power if missingness is not important
  • May hide true outliers if there is a lot of missing data
  • May create unintended outliers if the number of NAs is small.

5. Arbitrary imputation:

Arbitrary value imputation consists of replacing all occurrences of missing values (NA) within a variable with an arbitrary value (for example 0, 999, -999, other combinations of 9s, or -1 if the distribution is positive). The arbitrary value should be different from the mean or median and should not lie within the normal range of values of the variable.
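
A minimal pandas sketch, with a hypothetical column income and -999 as the arbitrary value:

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with missing entries
df = pd.DataFrame({"income": [30.0, np.nan, 80.0, np.nan, 40.0]})

# Replace NaN with a value that cannot occur naturally in the variable
df["income"] = df["income"].fillna(-999)
```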

Advantages:

  • Easy to implement.
  • It’s a fast way to obtain complete datasets.
  • It can be used in production, i.e. during model deployment.
  • It captures the importance of a value being “missing”, if there is one.

Disadvantages:

  • Distortion of the original variable distribution and variance.
  • Distortion of the covariance with the remaining dataset variables.
  • If the arbitrary value is at the end of the distribution, it may mask or create outliers.
  • We need to be careful not to choose an arbitrary value too similar to the mean or median (or any other typical value of the variable distribution).
  • The higher the percentage of NA, the higher the distortions.

6. End of tail imputation:

End of tail imputation is roughly equivalent to arbitrary value imputation, but it automatically selects the arbitrary values at the end of the variable distributions.

If the variable follows a normal distribution, we can use the mean plus or minus 3 times the standard deviation. If the variable is skewed, we can use the IQR proximity rule.

Normal Distribution:

Most of the observations (~99.7%) of a normally distributed variable lie within the mean plus or minus three standard deviations, so the selected value is mean ± 3 × standard deviation.

Skewed distributions:

The general approach is to calculate the quantiles, and then the inter-quantile range (IQR), as follows:

IQR = 75th Quantile - 25th Quantile

Upper limit = 75th Quantile + 3 × IQR

Lower limit = 25th Quantile - 3 × IQR

So the selected value for imputation is the previously calculated upper limit or the lower limit.
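
A minimal sketch of the skewed-distribution rule, with a hypothetical column income; the limits should be computed on the training set only:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed numerical column with missing entries
df = pd.DataFrame({"income": [30.0, 45.0, np.nan, 80.0, 95.0, np.nan, 60.0, 40.0]})

# Inter-quantile range (IQR) proximity rule
q25, q75 = df["income"].quantile(0.25), df["income"].quantile(0.75)
iqr = q75 - q25
upper_limit = q75 + 3 * iqr
lower_limit = q25 - 3 * iqr

# Impute with the upper (or lower) limit, depending on which tail is used
df["income"] = df["income"].fillna(upper_limit)
```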

Summary :

In this tutorial, you learned how to handle missing values.

Specifically, you learned:

  • Types of missing values.
  • The types of variables covered (categorical/qualitative and numerical/quantitative).
  • How to deal with categorical and numerical missing values in a dataset.


Rushikesh Lavate

Working as a Data Engineer with experience in Python and SQL. I have developed applications using Python, MySQL, MongoDB, Data Science, and Machine Learning.