
Guide to Encoding Categorical Values in Python


By NIIT Editorial

Published on 16/07/2021

8 minutes

It takes hours of work to develop a machine learning model. Although performance depends on the model and its hyperparameters, data processing and the way variables are fed to the model matter just as much for optimizing results. Data sets often contain categorical values, usually in text format, which most machine learning models cannot process directly, so converting categorical data into numeric form before training is unavoidable. How the categorical variables are encoded can change the algorithm’s performance, which makes the choice of encoding important.

In this article, you will find the most suitable methods for encoding categorical values in Python, depending on your categorical data.

To learn more about encoding categorical values, you can check out the complete guide provided by NIIT.

Understanding categorical data and its types

Since we are operating on categorical values in this article, it is very important to understand what categorical variables are. 

Categorical data are typically stored in text format and represent traits. Some examples are:

  1. Colour: Red, Blue, Green, Yellow, etc.
  2. Blood group: A+, A-, B+, B-, etc.
  3. Working post: CEO, Director, Manager, Assistant Manager, etc.
  4. Working department: Finance, Marketing, Human Resource, Sales, etc.

Categorical data have a finite set of possible values, as in the examples above. These variables can be converted into numeric values, such as 1 for yes and 2 for no, but remember that those numbers have no mathematical meaning.

Based on order, categorical data is divided into two types:

1. Ordinal data: The categories have a precise order that must be maintained, such as First, Second, and Third according to the priority or requirement of the task.

2. Nominal data: The categories have no inherent order. For example, (Red, Blue, Green), (Blue, Red, Green), and (Green, Blue, Red) all describe the same set of categories. With nominal data, the presence or absence of a feature matters rather than the order.
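
The difference between the two types can be sketched with pandas categoricals. This is only an illustration: the sample values are made up, and pandas is assumed to be installed.

```python
import pandas as pd

# Ordinal: the order First < Second < Third is part of the data.
priority = pd.Categorical(
    ["Second", "First", "Third", "First"],
    categories=["First", "Second", "Third"],
    ordered=True,
)

# Nominal: colours have no inherent order, so `ordered` stays False.
colour = pd.Categorical(["Red", "Blue", "Green", "Blue"])

print(priority.ordered, colour.ordered)
print(priority.min(), priority.max())  # only meaningful for ordered data
```

Because `priority` is ordered, comparisons such as minimum and maximum make sense for it; asking for the minimum colour would not.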

Methods for Encoding Categorical Values in Python

There is no single rule for encoding categorical values in data science. Each approach has its pros and cons. Fortunately, Python provides different tools to convert categorical data into suitable numeric values for your model. You can choose from the following methods:

1. Label Encoding

In label encoding, you pick a column of categorical data, for example the departments in a company, and label each distinct value in the column with a different number. For example, if the company has five departments, you could encode them like this:

·        Finance -> 0

·        Production -> 1

·        Marketing -> 2

·        Human Resource -> 3

·        IT -> 4

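This categorical data encoding technique is used when the data is finite and ordinal, because the numbers imply an order. For nominal data such as the departments above, a model may read a ranking into the codes that does not exist. With label encoding, you get the benefits of pandas categories together with easy conversion to numeric values for further analysis.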

Check the example below:
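
A minimal sketch of label encoding with pandas, assuming pandas is installed. The column name `department` and the sample rows are made up for illustration; the codes follow the mapping listed above.

```python
import pandas as pd

df = pd.DataFrame(
    {"department": ["Finance", "Production", "Marketing", "Human Resource", "IT", "Finance"]}
)

# Fix the category order explicitly so the codes match the mapping in the text.
# Without `categories=`, pandas would sort the labels alphabetically instead.
departments = ["Finance", "Production", "Marketing", "Human Resource", "IT"]
df["department"] = pd.Categorical(df["department"], categories=departments)

# .cat.codes gives the integer label of each row.
df["department_code"] = df["department"].cat.codes

print(df)
```

Keeping the column as a pandas category means the original labels stay available next to the numeric codes.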


2. One Hot Encoding

One Hot Encoding is a common alternative used for nominal data. Each category is converted into a new column that holds either 0 or 1, where 1 represents presence (True) and 0 represents absence (False) of that category in a given row.

For instance, suppose we have a dataset with a column color whose values are Red, Blue, Yellow, Green, and Black. Each value gets its own binary column, also known as a dummy variable, so the number of dummy variables equals the number of distinct values in the column. The result is one dummy variable for each color.

Let’s implement this in Python.
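
One way to do this is with `pandas.get_dummies`. The sketch below assumes pandas is installed and uses a made-up `color` column with the five colors mentioned above.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Yellow", "Green", "Black", "Red"]})

# One new 0/1 column per distinct colour; dtype=int keeps the output as 0/1
# rather than True/False.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(one_hot)
```

Each row contains exactly one 1, in the column of its own color, and the five colors produce five columns.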


3. Dummy Encoding

 

In this categorical data encoding method, the categorical values are transformed into dummy variables. A dummy variable is a binary variable that indicates one of two possible results, such as absence and presence or true and false, for each separate categorical value.

Dummy encoding is similar to one-hot encoding; you can think of it as an improved version of it. If there are N categories in the categorical data, one-hot encoding uses N binary features, while dummy encoding uses N-1 features to represent the same categories. The dropped category is represented by all remaining features being 0.

For example, if there are 3 categories such as hot, cold, and warm, dummy encoding uses 2 variables whereas one-hot encoding uses 3.

Let’s implement this in Python.
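
A minimal sketch using `pandas.get_dummies` with `drop_first=True`, assuming pandas is installed. The `temperature` column and its rows are made up to match the hot, cold, and warm example.

```python
import pandas as pd

df = pd.DataFrame({"temperature": ["hot", "cold", "warm", "cold"]})

# drop_first=True drops the first category (alphabetically "cold"),
# leaving N-1 = 2 columns. A row of all zeros then means "cold".
dummies = pd.get_dummies(
    df["temperature"], prefix="temperature", drop_first=True, dtype=int
)

print(dummies)
```

Which category gets dropped is just a convention here; pandas drops the first one in sorted order.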


4. Binary Encoding

In binary encoding, the categories are first converted into integers starting from 1, in the order they appear in the data. Those integers are then converted into binary code, and the binary digits are split into separate columns.

Binary encoding works well when there is a high number of categories, for example the different cities in a country. It takes less processing time in the model than One Hot Encoding because it produces fewer feature columns. For example, where One Hot Encoding would create 100 features, Binary Encoding needs only 7, since every number up to 100 fits in 7 binary digits. In other words, it reduces dimensionality for data with high cardinality.
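
The column count can be checked with a few lines of standard-library Python. The helper name `binary_columns` is made up for this sketch, and it assumes the scheme described above, where categories are numbered from 1.

```python
def binary_columns(n_categories: int) -> int:
    """Binary digits needed to write the integers 1..n_categories."""
    return max(1, n_categories.bit_length())


# 100 categories: one-hot needs 100 columns, binary encoding far fewer.
print(binary_columns(100))
```

Library implementations may reserve extra codes, for example for unseen values, so their exact column counts can differ slightly from this arithmetic.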

Implementation in Python:
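
The steps above can be sketched in plain Python without any extra packages. The function name `binary_encode` and the city list are made up for illustration. Third-party packages such as category_encoders offer a ready-made `BinaryEncoder`, which this sketch does not try to reproduce exactly.

```python
def binary_encode(values):
    """Binary-encode a list of category labels.

    Returns one list of 0/1 digits per input value, most significant bit first.
    """
    # Step 1: number the categories from 1 in order of first appearance.
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1

    # Step 2: columns needed to write the largest code in binary.
    width = len(codes).bit_length()

    # Step 3: zero-padded binary string -> list of int digits, one per column.
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]


cities = ["Delhi", "Mumbai", "Chennai", "Delhi", "Kolkata", "Pune"]
for city, row in zip(cities, binary_encode(cities)):
    print(city, row)
```

Five distinct cities need only three columns here, and both rows for Delhi receive the same digits.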


5. Target Encoding

 

Target encoding, also known as mean encoding, uses a statistic of the target variable, such as its mean, to encode categorical data into numeric form for use in machine learning models.

In target encoding, we first specify the categorical column and the target variable. The mean of the target is then computed for each category, and each category value is replaced with that mean. The encoded values are correlated with the target: for a binary target, each category is replaced with the posterior probability of the target given that category.
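
A minimal sketch of mean encoding with pandas, assuming pandas is installed. The `department` column, the binary target `left_company`, and all values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "department": ["Sales", "Sales", "IT", "IT", "IT", "Finance"],
        "left_company": [1, 0, 0, 0, 1, 1],  # hypothetical binary target
    }
)

# Mean of the target within each category.
category_means = df.groupby("department")["left_company"].mean()

# Replace each category with its mean target value.
df["department_encoded"] = df["department"].map(category_means)

print(df)
```

In practice, the means should be computed on the training data only and then applied to the test data. Otherwise the encoded column leaks information about the target into the features.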


There are many more methods and approaches for encoding categorical values. You can visit NIIT for more details.

Conclusion

Every variable matters and can create a competitive advantage, so it is essential to include categorical variables too. Many machine learning algorithms accept only numeric values, so encoding categorical values is an important step in the data science process. The difficulty is that there are many approaches to choose from. It is therefore important to understand your data and the end result you want, because only then can you know which approach to use and how to apply it to your data set. So, whenever you switch on your analysis mode, keep all of this in mind.

 


