Data Preprocessing


Data preprocessing is a vital step in the machine learning pipeline. Just as visualization is necessary to understand the relationships in data, proper preparation, often called data munging, is required for machine learning models to perform well.

The process of data preparation is highly interactive and iterative. A typical process includes at least the following steps:

  1. Visualization of the dataset to understand the relationships and identify possible problems with the data.
  2. Data cleaning and transformation to address the problems identified. In many cases, step 1 is then repeated to verify that the cleaning and transformation had the desired effect.
  3. Construction and evaluation of a machine learning model. Visualizing the results often reveals that further data preparation is required, which means going back to step 1.

In this lab we will learn how to do the following:

  • Recode character strings to eliminate characters that will not be processed correctly.
  • Find and treat missing values.
  • Set the correct data type for each column.
  • Transform categorical features to create categories with more cases and coding likely to be useful in predicting the label.
  • Apply transformations to numeric features and the label to improve the distribution properties.
  • Locate and treat duplicate cases.
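
As a rough illustration of how these steps might fit together, here is a minimal sketch using pandas and NumPy. Everything in it is an assumption made for the example: the toy dataset, the column names, the use of "?" as the missing-value marker, the choice to drop rows with missing values, and the helper name preprocess are all hypothetical and not taken from the lab itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
raw = pd.DataFrame({
    "body-style": ["sedan", "sedan", "wagon", "hardtop", "sedan", "convertible"],
    "horsepower": ["111", "111", "?", "154", "102", "115"],
    "price": ["13495", "13495", "16500", "?", "13950", "17450"],
})


def preprocess(df):
    """Apply each preparation step in turn and return a cleaned copy."""
    df = df.copy()

    # Recode strings: replace '-' in column names, which is awkward to process.
    df.columns = [name.replace("-", "_") for name in df.columns]

    # Find and treat missing values: here '?' is assumed to mark a missing
    # value, and rows containing one are simply dropped.
    df = df.replace("?", np.nan).dropna()

    # Set the correct data type: numeric columns were read in as strings.
    for col in ["horsepower", "price"]:
        df[col] = pd.to_numeric(df[col])

    # Transform a categorical feature: merge rare categories into one
    # so that each remaining category has more cases.
    rare = {"hardtop": "other", "convertible": "other"}
    df["body_style"] = df["body_style"].replace(rare)

    # Transform the label: a log transform often reduces right skew.
    df["log_price"] = np.log(df["price"])

    # Locate and treat duplicate cases: keep the first of each repeated row.
    return df.drop_duplicates().reset_index(drop=True)


clean = preprocess(raw)
print(clean)
```

In practice each of these choices (dropping rather than imputing missing values, which categories to merge, which transformation to apply) should be checked by visualizing the data again, as described in the iterative process above.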