How to convert categorical data to mathematics?

Rushikesh Lavate
4 min read · Jan 12, 2021

Hello reader,

Categorical data is often textual in nature. Most machine learning models are mathematical models that require numbers to work with. This is one of the main reasons we need to convert categorical data into numerical form before we can feed it to a machine learning model.

Encoding Techniques

Now, let's look at some feature engineering techniques that are used to convert categorical data into numerical data.

1. OneHot Encoding Technique

  • In this technique, we convert the categorical data into columns of 0s and 1s.
  • This technique creates a new feature for every category, so it is most useful when the feature has only a few unique values.
  • Here we apply one-hot encoding to the ‘Sex’ feature:
[Figure: OneHot Encoding Technique]
  • It creates two new features, ‘female’ and ‘male’, arranged in alphabetical order.
  • Wherever the ‘Sex’ feature is female, the new ‘female’ feature is 1; otherwise it is 0.
  • Similarly, wherever the ‘Sex’ feature is male, the new ‘male’ feature is 1; otherwise it is 0.
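
The behaviour described above can be sketched with pandas. This is a minimal illustration, not the code from the original notebook: the toy values in the ‘Sex’ column are made up, and `pd.get_dummies` is just one common way to do it.

```python
import pandas as pd

# Toy data standing in for the 'Sex' feature (values are made up for illustration).
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# get_dummies creates one 0/1 column per category, ordered alphabetically.
# dtype=int asks for 0/1 integers instead of booleans.
one_hot = pd.get_dummies(df["Sex"], dtype=int)

print(one_hot)
```

Because the ‘male’ column is always the opposite of the ‘female’ column here, one of the two can be dropped without losing information; `pd.get_dummies` has a `drop_first=True` option for that.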

2. Ordinal Label Encoding Technique

  • This technique is applicable to ordinal categorical data.
  • Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known.
  • For example:

We have a feature ‘education’ with the values SSC, HSC, Diploma, Bachelor, Master, and Ph.D., so we can easily rank them: Ph.D. = rank 1, Master = rank 2, Bachelor = rank 3, and so on.

  • On the ‘Week Day’ column/feature, we perform ordinal label encoding and assign a rank to each day:
[Figure: Ordinal Label Encoding Technique]
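
A minimal pandas sketch of ordinal label encoding on a ‘Week Day’ column follows. The ranking used here (Monday = 1 through Sunday = 7) is an assumption chosen for illustration, and the sample rows are made up.

```python
import pandas as pd

# Toy 'Week Day' column (values are made up for illustration).
df = pd.DataFrame({"Week Day": ["Monday", "Wednesday", "Sunday", "Monday"]})

# Assumed ranking for this sketch: Monday = 1 ... Sunday = 7.
day_rank = {
    "Monday": 1,
    "Tuesday": 2,
    "Wednesday": 3,
    "Thursday": 4,
    "Friday": 5,
    "Saturday": 6,
    "Sunday": 7,
}

# Replace each day name with its rank.
df["Week Day_encoded"] = df["Week Day"].map(day_rank)

print(df)
```

Writing the mapping out by hand keeps the order under our control, which matters because the whole point of this technique is that the numbers reflect the natural order of the categories.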

3. Count Frequency Encoding Technique

  • This technique is useful when the categories in a feature repeat frequently.
  • Here we replace each category value with its frequency count.
  • It does not create a new feature, so it does not increase the feature space.
  • If two categories have the same frequency count, both get the same value, so the model cannot tell them apart.
[Figure: Applying Count Frequency Encoding Technique]
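
As a rough sketch of count frequency encoding in pandas, each category can be counted with `value_counts` and the counts mapped back onto the column. The ‘Cabin’ values below are toy data, and the new column name is hypothetical.

```python
import pandas as pd

# Toy 'Cabin' column (values are made up for illustration).
df = pd.DataFrame({"Cabin": ["A", "B", "A", "C", "A", "B"]})

# Count how many times each category appears.
counts = df["Cabin"].value_counts()

# Replace each category with its count; no extra columns are needed per category.
df["Cabin_count"] = df["Cabin"].map(counts)

print(df)
```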

4. Mean Encoding Technique

  • First, for each category, find the mean of the Target/Output feature (for a 0/1 target, this is the share of rows where the target is 1).
  • Now, we replace each category value with its mean with respect to the target feature.
[Figure: Applying Mean Encoding Technique]
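
Mean encoding can be sketched with a `groupby` on the categorical column. The ‘Cabin’ and ‘Survived’ columns and their values are assumptions made for this example, not data from the article's figure.

```python
import pandas as pd

# Toy data: a categorical feature and a 0/1 target (values are made up).
df = pd.DataFrame({
    "Cabin": ["A", "A", "B", "B", "B", "C"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# Mean of the target for each category.
means = df.groupby("Cabin")["Survived"].mean()

# Replace each category with its target mean.
df["Cabin_mean"] = df["Cabin"].map(means)

print(df)
```

In a real project, the means should be computed on the training data only and then applied to the test data, otherwise information about the target leaks into the features.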

5. Target Guided Encoding Technique

  • First, for each category, find the mean of the Target/Output feature.
  • After that, sort the categories by this mean in ascending order.
  • Now, we apply ordinal label encoding (we assign a rank/number to each category, here each cabin).
[Figure: Applying Target Guided Ordinal Encoding Technique]
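
The three steps above can be sketched as follows. The column names and values are toy assumptions, and starting the ranks at 0 is an arbitrary choice for this sketch; starting at 1 works just as well.

```python
import pandas as pd

# Toy data: a categorical feature and a 0/1 target (values are made up).
df = pd.DataFrame({
    "Cabin": ["A", "A", "B", "B", "B", "C"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# Step 1: mean of the target for each category.
means = df.groupby("Cabin")["Survived"].mean()

# Step 2: order the categories by that mean, ascending.
ordered = means.sort_values().index

# Step 3: assign a rank to each category in that order (0, 1, 2, ...).
rank = {category: position for position, category in enumerate(ordered)}
df["Cabin_rank"] = df["Cabin"].map(rank)

print(df)
```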

6. Probability Encoding Technique

  • First, find the probability ratio:
  • For each category, find the mean of the Target/Output feature
  • Then, subtract the mean from 1
  • Probability ratio = mean / (1 - mean)
  • Now, we replace each category value with its probability ratio
[Figure: Applying Probability Ratio Encoding Technique]
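
A small sketch of probability ratio encoding is shown below, taking the ratio as the target mean divided by one minus the target mean, i.e. P(target = 1) / P(target = 0) for each category. The data and column names are made up for illustration.

```python
import pandas as pd

# Toy data: a categorical feature and a 0/1 target (values are made up).
df = pd.DataFrame({
    "Cabin": ["A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "Survived": [1, 0, 1, 1, 0, 0, 0, 1, 0],
})

# P(target = 1) for each category is the mean of the 0/1 target.
p_one = df.groupby("Cabin")["Survived"].mean()

# P(target = 0) is one minus that mean.
p_zero = 1 - p_one

# Probability ratio. If a category has a mean of exactly 1, p_zero is 0 and
# the ratio becomes infinite, so such categories need special handling.
ratio = p_one / p_zero

# Replace each category with its probability ratio.
df["Cabin_ratio"] = df["Cabin"].map(ratio)

print(df)
```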

I believe this will help you develop your knowledge. In this blog, I covered how to convert categorical data into numbers, and I tried to include the theory behind each technique. For a practical implementation, please take a look at my GitHub repository, where I explain all the code line by line.

Make sure your data has no missing values before you encode it. Here are my notebooks that help you handle missing values.

Rushikesh Lavate

Working as a Data Engineer. Bringing experience in Python, SQL. I have developed applications using Python, MySQL, MongoDB, Data Science, and Machine Learning.