Guide to Encoding Categorical Values in Python
By NIIT Editorial
Published on 16/07/2021
Developing a machine learning model takes hours of work. Although performance depends on the model and its hyperparameters, how the data is processed and fed to the model matters just as much for the results. Data sets often contain categorical values, usually stored as text, which most machine learning models cannot process directly, so converting them into numeric form before training and evaluating the model is unavoidable. The way categorical variables are encoded can itself change the algorithm's performance, which is why encoding deserves attention.
In this article, you will find common methods of encoding categorical values in Python, so that you can choose the one that best suits your data.
To learn more about encoding categorical values, you can check out the complete guide provided by NIIT.
Understanding categorical data and its types
Since we are operating on categorical values in this article, it is very important to understand what categorical variables are.
Categorical data is typically stored as text and represents traits or labels. Some examples are:
- Colour: Red, Blue, Green, Yellow, etc.
- Blood group: A+, A-, B+, B-, etc.
- Working post: CEO, Director, Manager, Assistant Manager, etc.
- Working department: Finance, Marketing, Human Resource, Sales, etc.
Categorical data has a finite set of possible values, as in the examples above. These values can be converted into numbers, such as 1 for yes and 2 for no, but remember that those numbers have no mathematical meaning.
Based on whether order matters, categorical data is divided into two types:
1. Ordinal data: The categories have a definite order that must be preserved, such as First, Second, and Third, according to the priority or requirement of the task.
2. Nominal data: The categories have no inherent order. For example, (Red, Blue, Green), (Blue, Red, Green), and (Green, Blue, Red) all describe the same set of categories. With nominal data, what matters is the presence or absence of a category rather than its order.
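As a small illustration of that last point, the sketch below codes some yes/no answers as 1 and 2. The sample answers are hypothetical and chosen only for illustration. Arithmetic on the codes runs without error, but the result matches no category.

```python
# Hypothetical survey answers, used only for illustration.
answers = ["yes", "no", "yes", "yes", "no"]

# Arbitrary numeric codes, following the 1-for-yes, 2-for-no example.
codes = {"yes": 1, "no": 2}
encoded = [codes[answer] for answer in answers]

# The average of the codes is a valid number but not a valid category:
# no answer corresponds to a value between 1 and 2.
mean_code = sum(encoded) / len(encoded)

print(encoded)
print(mean_code)
```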
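One way to express this distinction in code, assuming the pandas library is available, is with `pandas.Categorical` and its `ordered` flag. The sample values below are hypothetical. This is only a sketch of how the two types can be represented, not the only way to do it.

```python
import pandas as pd

# Ordinal: the category order is stated explicitly and marked as ordered.
ranks = pd.Categorical(
    ["Second", "First", "Third", "First"],
    categories=["First", "Second", "Third"],
    ordered=True,
)

# Nominal: no order is declared, so pandas treats the colours as unordered.
colours = pd.Categorical(["Red", "Blue", "Green", "Blue"])

# Integer codes follow the declared order for the ordinal data.
rank_codes = ranks.codes.tolist()

# Order-based operations such as min() are only defined for ordered data.
lowest_rank = ranks.min()

print(rank_codes, lowest_rank)
print(ranks.ordered, colours.ordered)
```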
Methods for Encoding Categorical Values in Python
There is no single rule for encoding categorical values in data science. Each approach has its pros and cons. Fortunately, Python provides tools for converting categorical data into suitable numeric values for your model. The following are the methods you can choose from:
1. Label Encoding
Label encoding assigns each category a unique integer, such as 0, 1, 2, and so on.
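A minimal sketch of label encoding, assuming pandas is available. The `department` column and its values are hypothetical sample data, and `astype("category").cat.codes` is just one of several ways to produce integer labels in Python.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {"department": ["Finance", "Marketing", "Sales", "Finance", "Human Resource"]}
)

# Convert the text column to a categorical type; pandas sorts the
# categories alphabetically and numbers them from 0.
as_category = df["department"].astype("category")
df["department_code"] = as_category.cat.codes

# Keep the code-to-label mapping so the encoding can be reversed later.
mapping = dict(enumerate(as_category.cat.categories))

print(df)
print(mapping)
```

The integers can suggest an order that nominal data does not have, so it is worth checking how the chosen model interprets them.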
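2. One Hot Encoding
One hot encoding creates a separate binary column for each category and marks the row's category with 1 and every other column with 0.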
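One hot encoding can be sketched with `pandas.get_dummies`, assuming pandas is available. The colour values and the `colour` prefix below are hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
colours = pd.Series(["Red", "Blue", "Green", "Blue"])

# Create one 0/1 column per category. dtype=int gives integers
# rather than booleans, and prefix names the new columns.
one_hot = pd.get_dummies(colours, prefix="colour", dtype=int)

print(one_hot)
```

Each row contains exactly one 1, in the column for that row's colour, so no order between the colours is implied.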
There are many more methods and approaches for encoding categorical values. You can visit NIIT for more details.
Every variable is important and can create a competitive advantage, so it is essential to include categorical variables too. Many machine learning algorithms accept only numeric values, which makes encoding categorical values an important step in the data science process. The difficulty is that there are many approaches to choose from. To decide which one to use and how to apply it to your data set, you first need to understand your data and the result you want. Keep this in mind whenever you start an analysis.