Data Preprocessing in Machine Learning with Python

Data preprocessing in machine learning is the step where we take raw data and clean it up so that it’s easier to work with. Raw data often has missing information, errors, or inconsistencies, making it hard to use directly in machine learning models. Preprocessing fixes these problems by filling in missing values, removing errors, and organizing the data in a consistent way. This step is crucial because machine learning models need clean and well-structured data to work effectively.

Data Cleaning

Data often includes unnecessary information or has missing values. To address these problems, we clean the data by removing irrelevant details and handling any missing or noisy data, ensuring it’s ready for accurate analysis.

Missing Data

When some data is missing, there are a few ways to deal with it:

  1. Ignoring tuples: This method works well when you have a large dataset and a record (tuple) has many missing values. In such cases, you can simply exclude that record.
  2. Filling in the values: You can fill in the missing values manually, use the most likely value (based on probability), or use the average (mean) value of that attribute. A short pandas sketch of both approaches follows this list.
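Both approaches are quick to apply with pandas. A minimal sketch, assuming the raw data lives in a hypothetical DataFrame called df with a numeric Salary column:

import pandas as pd

# Hypothetical data with one missing salary
df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany'],
                   'Salary': [72000.0, None, 54000.0]})

# Option 1: ignore (drop) the records that contain missing values
df_dropped = df.dropna()

# Option 2: fill the missing values with the mean of the attribute
df_filled = df.fillna({'Salary': df['Salary'].mean()})
print(df_filled)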

Noisy Data

Noisy data contains random errors, outliers, or meaningless values that are difficult for a machine learning model to interpret. It often comes from mistakes during data collection or data entry. To handle noisy data, you can use the following methods:

Binning

The binning method smooths noisy data by first sorting it and then dividing it into equal-sized groups, or “bins,” which are processed separately. In smoothing by bin means, every value in a bin is replaced by the bin’s mean; in smoothing by bin boundaries, each value is replaced by the nearest boundary value (the smallest or largest value in the bin). This reduces noise and makes the data more consistent.
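A minimal sketch of smoothing by bin means with NumPy (the values and bin size are made up for illustration):

import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26])  # already sorted
bins = data.reshape(3, 3)                            # three equal-sized bins

# Smoothing by bin means: every value in a bin becomes the bin's mean
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)   # [ 7.  7.  7. 19. 19. 19. 25. 25. 25.]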

Regression

Regression smooths data by fitting it to a regression function. Simple linear regression involves a single independent variable, while multiple regression involves two or more independent variables. By fitting the data to the regression line (or surface), the values are adjusted to follow a smoother trend, making it easier for models to learn and predict outcomes.
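A minimal sketch of regression-based smoothing with scikit-learn (toy values; the noisy y values are replaced by the fitted line's predictions):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10).reshape(-1, 1)                  # a single independent variable
y = 2 * x.ravel() + np.random.normal(0, 1, 10)    # noisy observations

reg = LinearRegression().fit(x, y)
y_smoothed = reg.predict(x)   # the values now lie on the fitted line
print(y_smoothed)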

Clustering

Clustering groups similar data points together into clusters, which helps organize the data and reveal patterns. Outliers, data points that don’t fit well into any group, fall outside the clusters and can then be detected and handled separately.
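A minimal sketch with scikit-learn's KMeans (toy one-dimensional values; the number of clusters is an assumption):

import numpy as np
from sklearn.cluster import KMeans

values = np.array([[10], [12], [11], [50], [52], [300]])   # 300 looks like an outlier
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)

# Points that lie far from their cluster centre are candidate outliers
distances = np.abs(values.ravel() - km.cluster_centers_[km.labels_].ravel())
print(km.labels_, distances)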

Data Transformation

In this step, data is converted into the proper format for the data mining or processing stage. This can be done in several ways:

Normalization

Normalization adjusts data values so they fall within a specific range, such as 0.0 to 1.0 or -1.0 to 1.0. This helps to ensure that different features in the data are on a similar scale, which is important for accurate analysis and modeling.
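A minimal sketch with scikit-learn's MinMaxScaler, which rescales each column to the 0.0–1.0 range (toy values):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[27.0], [35.0], [44.0], [50.0]])
scaler = MinMaxScaler(feature_range=(0.0, 1.0))
print(scaler.fit_transform(ages))   # 27 becomes 0.0 and 50 becomes 1.0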

Attribute Selection

In this process, existing attributes are used to create new ones that may be more relevant or useful for the analysis.
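A minimal sketch with pandas, constructing a new attribute from two existing ones (the column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'Price': [200000, 350000], 'Area_sqft': [1000, 1400]})

# A derived attribute that may be more useful than either original column
df['Price_per_sqft'] = df['Price'] / df['Area_sqft']
print(df)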

Discretization

This involves replacing the raw numeric values of an attribute with conceptual or interval levels, making the data easier to analyze and interpret.
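A minimal sketch using pandas' cut function to map raw ages onto interval labels (the bin edges and labels are assumptions):

import pandas as pd

ages = pd.Series([27, 35, 44, 50])
age_groups = pd.cut(ages, bins=[0, 30, 45, 100], labels=['young', 'middle-aged', 'senior'])
print(age_groups)   # young, middle-aged, middle-aged, senior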

Generation of Concept Hierarchy

Here, attributes are converted to a higher level within a hierarchy. For example, an attribute labeled “town” might be elevated to “city,” helping to organize the data into broader categories.

Data Reduction

When dealing with large amounts of data, analyzing it can become challenging. Data reduction is the process of making data storage and analysis more efficient by reducing the amount of data while maintaining its essential features.

Data Cube Aggregation

This involves summarizing or aggregating data to construct a data cube, which organizes the data in a way that makes it easier to analyze. Aggregation helps in reducing the data’s complexity while retaining its key information.
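A minimal sketch with pandas, aggregating quarterly sales up to yearly totals (the figures are made up for illustration):

import pandas as pd

sales = pd.DataFrame({'Year': [2023, 2023, 2024, 2024],
                      'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
                      'Sales': [100, 150, 120, 180]})

# Aggregate away the quarter dimension, keeping only yearly totals
yearly = sales.groupby('Year', as_index=False)['Sales'].sum()
print(yearly)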

Attribute Subset Selection

In this step, only the most relevant attributes are chosen for analysis, and the rest are excluded. This is done by evaluating each attribute’s p-value against a significance level; attributes with p-values higher than the significance level may be discarded.
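A minimal sketch using scikit-learn's f_classif, which returns a p-value for each feature; features whose p-value exceeds the chosen significance level (0.05 here, an assumption) are dropped:

import numpy as np
from sklearn.feature_selection import f_classif

X = np.array([[1, 10], [2, 12], [3, 9], [4, 11], [5, 10], [6, 12]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

f_scores, p_values = f_classif(X, y)
keep = p_values <= 0.05        # significance level
X_selected = X[:, keep]        # only the statistically significant attributes remain
print(p_values, keep)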

Numerosity Reduction

This technique involves storing only the data model or summary rather than the entire dataset. It helps reduce storage needs and is commonly used in regression models to simplify and speed up data analysis.

Dimensionality Reduction

Dimensionality reduction techniques reduce the size of the data while preserving its essential features. This is done using encoding mechanisms that can be either lossless or lossy:

  • Lossless Reduction: The original data can be perfectly reconstructed from the compressed data.
  • Lossy Reduction: The original data cannot be fully recovered from the compressed data.

Two main methods for dimensionality reduction are:

  • Principal Component Analysis (PCA): A technique that transforms the data into a new coordinate system, reducing the number of dimensions while retaining most of the variation in the data (see the sketch after this list).
  • Wavelet Transforms: A method that breaks down data into different frequency components, allowing for compression by focusing on the most significant details and discarding less important ones.
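A minimal sketch of PCA with scikit-learn, projecting two correlated features down to a single component (toy data):

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (5, 1): one dimension instead of two
print(pca.explained_variance_ratio_)     # share of the variance that was kept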

Data Pre-Processing Steps

The following code walks through these preprocessing steps in Python on a small Data.csv dataset.

Step One – Importing the Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the Dataset

# Load the data, then separate the feature matrix x from the target vector y
dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values   # every column except the last (independent variables)
y = dataset.iloc[:, -1].values    # the last column (dependent variable)
print(x)

Output

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
print(y)

Output

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

Taking care of missing data

from sklearn.impute import SimpleImputer

# Replace missing numeric values (np.nan) with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])                    # learn the means of the age and salary columns
x[:, 1:3] = imputer.transform(x[:, 1:3])  # fill in the missing entries
print(x)

Output

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
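Encoding the Independent Variable

The training-set output further below contains three extra binary columns for the country category, so a one-hot encoding step is implied before the split. The original listing does not show it; a minimal sketch with scikit-learn's ColumnTransformer and OneHotEncoder:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the first column (country) and pass the numeric columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))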

Encoding the Dependent Variable

from sklearn.preprocessing import LabelEncoder

# Convert the 'Yes'/'No' labels of the target into 1/0
le = LabelEncoder()
y = le.fit_transform(y)

print(y)

Output

[0 1 0 0 1 1 0 1 0 1]

Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

# Keep 80% of the records for training and hold out 20% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
print(y_train)

[0 1 0 0 1 1 0 1]
print(y_test)

[0 1]

Feature Scaling
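The scaling code itself is missing from the listing. Judging by the output below, where the dummy columns stay as 0/1 and only age and salary are standardized, a minimal sketch with StandardScaler would be:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Standardize only the numeric columns; the dummy variables are left untouched
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
x_test[:, 3:] = sc.transform(x_test[:, 3:])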

print(x_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
print(x_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]

Data Pre-Processing – Frequently Asked Questions

I can’t import the dataset. It says that the file is not found. What should I do?

Python: Make sure your script or notebook runs from the folder that contains Data.csv. This folder is called the working directory.
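From Python you can check and, if needed, change the working directory in code (the path below is a placeholder):

import os

print(os.getcwd())             # shows the current working directory
# os.chdir('/path/to/folder')  # uncomment and adjust to switch directories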

R: Ensure that you have set the correct working directory and that it contains the file Data.csv.

What is the difference between the independent variables and the dependent variable?

Independent Variables are the input data you use in your analysis. They are the variables you manipulate or observe to see how they affect something else. 

Dependent Variable is what you want to predict or measure based on the independent variables. It depends on the values of the independent variables.

In Python, why do we create X and y separately?

In Python, we create X and y separately because:

  • X contains the independent variables (input features), which we use for predicting.
  • y contains the dependent variable (target), which is what we want to predict.

Using Numpy arrays for X and y is often more convenient than using Pandas dataframes for data preprocessing and building machine learning models. Numpy arrays are well-suited for mathematical operations and model training.

In Python, what does ’iloc’ exactly do?

In Python, iloc locates rows and columns in a DataFrame by their integer index positions. It lets you access data based on numerical indices rather than labels, so you can select specific rows and columns simply by specifying their index numbers.

In Python, what does ’.values’ exactly do?

In Python, .values returns the data from a DataFrame or Series as a Numpy array. This is how you convert columns or rows from a DataFrame into Numpy arrays, which are often used for data processing and machine learning tasks.
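A quick illustration of both on the dataset used above (the column positions are assumed to match Data.csv):

import pandas as pd

dataset = pd.read_csv('Data.csv')

first_rows = dataset.iloc[0:3, :]   # rows 0-2, all columns, still a DataFrame
ages = dataset.iloc[:, 1].values    # the second column as a NumPy array
print(type(ages))                   # <class 'numpy.ndarray'>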

In R, why don’t we have to create arrays?

In R, you don’t need to create arrays separately because R is designed to work efficiently with dataframes directly. R provides built-in functions and tools that make it easy to manipulate and analyze dataframes without needing to convert them into arrays.

Missing Data 

In Python, what is the difference between fit and transform?

fit: This method is used to learn or extract information from the data. For example, if you’re using an Imputer to handle missing values, fit calculates the mean or other statistics from the training data.

transform: After fitting, transform applies the learned information to the data. In the case of an Imputer, transform will replace missing values with the mean that was calculated during the fit process.
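A toy illustration of the two calls (values made up):

import numpy as np
from sklearn.impute import SimpleImputer

col = np.array([[10.0], [np.nan], [30.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(col)                 # learns the mean of the observed values: 20.0
print(imputer.transform(col))    # [[10.], [20.], [30.]] - the nan is replaced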

Is replacing by the mean the best strategy to handle missing values?

Replacing missing values with the mean is a useful strategy, but it’s not always the best option. The choice depends on your specific problem, how your data is distributed, and how many missing values you have. For instance, if you have a lot of missing values, using the mean might not be ideal. Other methods include using the median, the most frequent value, or prediction-based imputation.

Prediction imputation is a more advanced and often better strategy. Here’s how it works: You treat the column with missing values as the target variable and use other columns as predictors. Then, you split your data into a training set (with no missing values) and a test set (with missing values). You build a classification model (like k-NN) on the training set to predict the missing values in the test set. Finally, you replace the missing values with the model’s predictions. This method can be more accurate than simply using the mean.
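scikit-learn ships a ready-made k-nearest-neighbours imputer that follows this idea, predicting each missing value from the most similar complete rows. A minimal sketch (toy values):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[25.0, 50000.0],
              [30.0, np.nan],
              [28.0, 52000.0],
              [45.0, 80000.0]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))   # the nan is filled in using the 2 nearest rows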

Categorical Data 

In Python, what do the two ’fit_transform’ methods do?

When the fit_transform() method is called from the LabelEncoder() class, it converts categorical string labels into integers. For example, it might convert “France,” “Spain,” and “Germany” into 0, 1, and 2, respectively.

When the fit_transform() method is called from the OneHotEncoder() class, it creates separate binary columns for each unique category. These columns are used to represent each category with binary values (0 or 1), turning categorical data into dummy variables.
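A toy illustration of both (the category values mirror the dataset above):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

print(LabelEncoder().fit_transform(countries))
# [0 2 1 2] - one integer per category, assigned in alphabetical order

print(OneHotEncoder().fit_transform(countries.reshape(-1, 1)).toarray())
# one binary 0/1 column per unique category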

In R, why don’t we manually create the dummy variables like we do in Python?

In R, you don’t need to manually create dummy variables because they are automatically generated when you use the factor() function. This function handles the conversion of categorical data into dummy variables for you, which will be visible when performing regression and classification.

Splitting the dataset into the Training set and Test set

What is the difference between the training set and the test set?

The training set is the portion of your data used to train your model. It helps the model learn how to make predictions based on the independent variables.

The test set is the remaining portion of your data that is not used in training. It’s used to evaluate how well the model performs and whether it can accurately predict the dependent variable with new, unseen data.

Why do we split on the dependent variable?

We split the dependent variable together with the independent variables so that every record in both the training and test sets keeps its matching outcome, and so that each set contains a representative mix of the dependent variable’s values. This way, the model can learn meaningful relationships between the independent and dependent variables and then be evaluated on a varied set of outcomes. If the training set contained only one value of the dependent variable, the model could not learn to handle different outcomes.

Feature Scaling

Do we really have to apply Feature Scaling on the dummy variables?

Yes, you should apply feature scaling to dummy variables if you want to optimize the accuracy of your model predictions. This helps ensure that all features, including dummy variables, are on a similar scale, which can improve model performance.

However, if maintaining interpretability is more important, you might choose not to scale dummy variables, as they are already in binary form (0 or 1) and don’t require scaling for interpretability.

When should we use Standardization and Normalization?

Normalization (scaling data to a range, like 0 to 1) is typically used when your data is normally distributed or when you want to ensure that all features are within the same range.

Standardization (scaling data to have a mean of 0 and a standard deviation of 1) is used when your data is not normally distributed or when the distribution of your data is unknown.

In practice, it’s often useful to test both normalization and standardization to see which method works best for your model.
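A quick way to compare the two in practice is to scale the same column with both and inspect the results (toy values):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

salaries = np.array([[48000.0], [61000.0], [72000.0], [83000.0]])

print(MinMaxScaler().fit_transform(salaries).ravel())    # values squeezed into [0, 1]
print(StandardScaler().fit_transform(salaries).ravel())  # mean 0, standard deviation 1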
