Data Pre-Processing Using Scikit-Learn

Yash Chauhan
5 min read · Nov 7, 2021


Introduction

In any machine learning process, data preprocessing is the step in which the data is transformed, or encoded, into a state that the machine can easily parse. In other words, the features of the data can then be easily interpreted by the algorithm.

There are many preprocessing methods, but we will mainly focus on the following:

(1) Encoding the Data

(2) Normalization

(3) Standardization

(4) Imputing the Missing Values

(5) Discretization

Dataset Description

The Iris flower data set or Fisher’s Iris data set is a multivariate data set. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

Dataset: Iris Data Set

Data Encoding

Data encoding is the transformation of categorical variables into binary or numerical counterparts. We assign a unique value to each category of a categorical attribute. An example is to treat male or female for gender as 1 or 0. There are two types of data encoding: label encoding and one-hot encoding.

Label Encoding

If a column in the dataset has more than one category, we can use a label encoder to convert those categories into numerical features. The label encoder assigns a unique number to each category.

We will use the ‘hotel’ column for encoding.

The ‘hotel’ column has two categories: (1) City Hotel and (2) Resort Hotel. The label encoder converts them into 0 and 1: City Hotel becomes 0 and Resort Hotel becomes 1.

The classes_ attribute tells us which number stands for which label: the label at index 0 (City Hotel) is encoded as 0, and the label at index 1 (Resort Hotel) is encoded as 1.
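A minimal sketch of label encoding follows. The four rows are made up for illustration (only the two category names come from the text above), and the column name hotel_encoded is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows; only the category names come from the article.
df = pd.DataFrame(
    {"hotel": ["City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"]}
)

encoder = LabelEncoder()
# fit_transform learns the sorted set of labels and maps each one to an integer.
df["hotel_encoded"] = encoder.fit_transform(df["hotel"])

print(df)
# classes_[i] is the original label that was encoded as the integer i.
print(encoder.classes_)
```

To map the numbers back to labels, encoder.inverse_transform(df["hotel_encoded"]) can be used.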

One Hot Encoding

Though label encoding is straightforward, it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of order. This ordering issue is addressed by a common alternative approach called ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column, and each row gets a 1 or 0 (notation for true/false) in that column.
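One-hot encoding can be sketched as follows, again on made-up rows for the ‘hotel’ column. The variable names are assumptions; the result is converted with toarray() because the encoder returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample rows; only the category names come from the article.
df = pd.DataFrame(
    {"hotel": ["City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"]}
)

encoder = OneHotEncoder()
# The encoder expects 2D input, hence the double brackets.
# The default output is a sparse matrix; toarray() makes it dense.
one_hot = encoder.fit_transform(df[["hotel"]]).toarray()

# categories_[0] lists the categories of the first (and only) input column,
# in the same order as the output columns.
one_hot_df = pd.DataFrame(one_hot, columns=encoder.categories_[0])
print(one_hot_df)
```

Each row has exactly one 1, in the column of its category, so no ordering between categories is implied.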

Normalization

Normalization is the process of converting values to a common scale, for example so that they fall in the range [-1, 1]. This ensures that large values in the data set do not dominate the learning process and that all values have a similar impact on the model. The function normalize provides a quick and easy way to perform this operation on a single array-like dataset; by default it rescales each sample (row) to unit L2 norm, so every resulting value lies between -1 and 1.
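Here is a small sketch of normalize on a made-up array (the numbers are assumptions chosen to keep the arithmetic easy to follow). With the default settings, each row is divided by its own L2 norm.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical data: three samples with two features each.
X = np.array([[3.0, 4.0], [1.0, -1.0], [10.0, 0.0]])

# Default is norm="l2" and axis=1: each row is scaled to unit length.
X_normalized = normalize(X)

# After normalization every row has an L2 norm of 1.
row_norms = np.linalg.norm(X_normalized, axis=1)

print(X_normalized)
print(row_norms)
```

The first row [3, 4] has length 5, so it becomes [0.6, 0.8]. Passing axis=0 would normalize each feature (column) instead of each sample.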

Standardization

Data standardization is the process of rescaling one or more attributes so that they have a mean value of 0 and a standard deviation of 1. Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn. The preprocessing module provides the StandardScaler utility class, which is a quick and easy way to perform standardization.
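As a minimal sketch, StandardScaler can be applied to the four Iris features like this. Loading the data through load_iris is an assumption about how the dataset is obtained.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Iris features: 150 samples x 4 measurements (bundled with scikit-learn).
X = load_iris().data

scaler = StandardScaler()
# fit learns each column's mean and standard deviation;
# transform subtracts the mean and divides by the standard deviation.
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

In a train/test setting, the scaler would normally be fitted on the training data only and then used to transform the test data.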

Discretization

Data discretization is the process of transforming continuous variables, models, or functions into a discrete form. Basically, it is a method of converting the attribute values of continuous data into a finite set of intervals with minimal loss of information. Here, I have used the built-in iris dataset, which classifies flowers based on their characteristics.

There are 3 types of discretization available in scikit-learn:

(1) Quantile Discretization Transform

(2) Uniform Discretization Transform

(3) KMeans Discretization Transform
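The three transforms listed above can be sketched with scikit-learn's KBinsDiscretizer, which selects the binning method through its strategy parameter. The choice of 3 bins and ordinal encoding is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

# Iris features: 150 samples x 4 continuous measurements.
X = load_iris().data

binned = {}
for strategy in ("quantile", "uniform", "kmeans"):
    # encode="ordinal" returns the bin index (0, 1, 2) for each value.
    discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binned[strategy] = discretizer.fit_transform(X)
    # bin_edges_[0] holds the interval boundaries found for the first feature.
    print(strategy, discretizer.bin_edges_[0])
```

With quantile, each bin holds roughly the same number of samples; with uniform, all bins have the same width; with kmeans, the bins are built around 1D k-means cluster centers.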

Imputing Missing Values

Handling missing values is an important task that every data scientist has to do. We can handle missing values in two ways.

(1) Remove the data (the whole row) that has missing values.

(2) Fill in the values using some strategy, for example with an imputer.

We can remove the missing values when the ratio of the number of missing values to the total number of values is low. In that situation, we can remove the affected rows using dropna() in pandas.
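A small sketch of this check and of dropna() follows. The data frame and its column names are made up for illustration, and the toy ratio is not meant as a threshold recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, np.nan, 6.3, 5.8],
        "sepal_width": [3.5, 3.0, np.nan, 2.7],
    }
)

# Ratio of missing values to the total number of values.
missing_ratio = df.isna().sum().sum() / df.size
print(missing_ratio)

# Drop every row that contains at least one missing value.
df_clean = df.dropna()
print(df_clean)
```

Note that dropna() returns a new data frame and leaves the original unchanged.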

If the ratio is high, we have to impute the values instead.

Thankfully, scikit-learn provides the SimpleImputer class, which helps us fill in missing values. It replaces the NaN values with a specified placeholder.

With strategy=‘mean’, every missing value is filled with the mean of its column.

We can also use median, most_frequent, or constant as the strategy.
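Mean imputation can be sketched as follows. The small array is made up for illustration so that the column means are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])

# strategy="mean" replaces each NaN with the mean of its column,
# computed from the non-missing values only.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
# statistics_ holds the fill value learned for each column.
print(imputer.statistics_)
```

Here the first column is filled with 2.0 (the mean of 1 and 3) and the second with 15.0 (the mean of 10 and 20). For strategy="constant", the replacement is given through the fill_value parameter.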

Conclusion

There is a lot more to data preprocessing; I have discussed only some of the common methods.
