Essential Techniques for Data Preprocessing in Machine Learning
Chapter 1: Understanding Data Preprocessing for Machine Learning
In the realm of machine learning, algorithms identify patterns within a dataset's features to forecast target variables for previously unseen data. The trained model acts as a mathematical function that effectively links feature values (X) to the target variable (y).
Since machine learning relies heavily on numerical data, it's crucial to represent features numerically in a way the algorithm can interpret correctly. For instance, if we have a feature indicating car colors with values like red, blue, and grey, encoding them as numbers (e.g., red = 1, blue = 2, grey = 3) could mislead the algorithm into treating grey as more significant than red simply because it was assigned a higher number.
Data preprocessing refers to the conversion of raw features into a numerical format comprehensible to machine learning algorithms. This process requires careful analysis of the data to determine the most appropriate preprocessing methods.
In this tutorial, I will outline common preprocessing techniques using the Scikit-learn library, providing code examples for clarity. This guide does not aim to cover all preprocessing methods exhaustively, but rather to establish a solid foundation in widely-used strategies. For those interested in further exploration, links will be provided at the end of the article.
Setting Up the Dataset
For this tutorial, I will utilize the 'autos' dataset sourced from openml.org. This dataset contains various features related to car characteristics and a categorical target variable indicative of its insurance risk. You can download the dataset and convert it into a pandas DataFrame using the code snippet below.
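A minimal sketch of loading the dataset with fetch_openml is shown below; the dataset name and version number are assumptions and may need adjusting for your environment.

```python
# Load the 'autos' dataset from openml.org as pandas objects
from sklearn.datasets import fetch_openml

autos = fetch_openml(name="autos", version=1, as_frame=True)
df = autos.frame    # features and target combined in one DataFrame
X = autos.data      # feature columns only
y = autos.target    # categorical insurance-risk target
print(df.shape)
```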
Before proceeding with preprocessing, it’s vital to understand the data types of each column. By executing df.dtypes, we can observe that the dataset comprises both categorical and numerical data types.
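A sketch of inspecting the dtypes and splitting the columns by type follows; it assumes categorical features were loaded with an object or category dtype, and the categorical_cols and numerical_cols names are introduced here for reuse in later examples.

```python
# Inspect column data types
print(df.dtypes)

# Split feature columns into categorical and numerical groups
categorical_cols = X.select_dtypes(include=["category", "object"]).columns.tolist()
numerical_cols = X.select_dtypes(include=["number"]).columns.tolist()
```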
Section 1.1: Encoding Categorical Features
As previously mentioned, machine learning algorithms necessitate numerical data. Consequently, any categorical features must be converted into numerical format before model training.
One prevalent method for handling categorical variables is one-hot encoding (also known as dummy encoding). This technique generates a new column for each distinct value within the feature. Each new column acts as a binary feature, denoting a 0 if the value is absent and a 1 if present.
The Scikit-learn library offers a preprocessing method for one-hot encoding. The following code snippet transforms the dataset's categorical features into one-hot encoded columns.
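A minimal sketch of what that snippet might look like, assuming the categorical_cols list defined earlier:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X[categorical_cols])  # sparse matrix of 0/1 columns
print(encoder.get_feature_names_out()[:10])           # names of the new binary columns
```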
When employing this technique, it’s crucial to consider the cardinality of the feature, which refers to the count of unique values within a column. For example, a feature with 50 unique values would result in 50 new columns upon one-hot encoding, potentially leading to:
- An excessively large training set, resulting in longer training times.
- A sparse training set that may cause overfitting issues.
To gauge the cardinality of the features in our dataset, execute df[categorical_cols].nunique().
For instance, the 'make' column exhibits a notably high cardinality.
To address high-cardinality features, infrequent values can be aggregated into a new category. The OneHotEncoder method provides two options for this (see the sketch after this list):
- Set the min_frequency argument to a specified number, aggregating any values below this threshold into an 'infrequent' category (available in Scikit-learn 1.1.0 and above).
- Set the max_categories argument to limit the number of columns created.
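A sketch of both options follows; the thresholds are illustrative rather than tuned values, and both require Scikit-learn 1.1 or later.

```python
from sklearn.preprocessing import OneHotEncoder

# Any category seen fewer than 10 times is collapsed into an 'infrequent' column
encoder = OneHotEncoder(min_frequency=10, handle_unknown="infrequent_if_exist")

# Alternatively, cap the number of output columns created per feature
encoder_capped = OneHotEncoder(max_categories=10, handle_unknown="infrequent_if_exist")

encoded = encoder.fit_transform(X[categorical_cols])
```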
Section 1.2: Handling Missing Values
Most real-world datasets contain some missing values due to various reasons, such as data generation errors or irrelevance for specific samples. Since most machine learning algorithms cannot process null values, addressing these is essential.
One option is to remove rows with missing values; however, this can significantly reduce the training dataset's size. Alternatively, missing values can be replaced using a method known as imputation.
Numerous strategies exist for imputing missing values, ranging from simple methods (e.g., substituting missing values with the median, mean, or most frequent value) to more complex approaches employing machine learning algorithms to determine optimal imputation values.
Before selecting an imputation strategy, it's essential to assess the dataset for missing values. Execute the following command to reveal the count of missing values across features.
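A minimal sketch, assuming missing entries have been loaded as NaN:

```python
# Count missing values in each column, largest first
print(df.isnull().sum().sort_values(ascending=False))
```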
The results might indicate that five features have missing values, with a low percentage (under 2%) for all except the 'normalized-losses' column.
Typically, exploratory analysis informs the choice of imputation strategy. For this tutorial, I will showcase both a simple and a more complex imputation strategy. For the simple approach, Scikit-learn’s SimpleImputer replaces missing values in numerical features with the mean and in categorical features with the most frequent value.
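A sketch of the simple approach:

```python
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy="mean")            # mean for numerical columns
cat_imputer = SimpleImputer(strategy="most_frequent")   # mode for categorical columns

X_num = num_imputer.fit_transform(X[numerical_cols])
X_cat = cat_imputer.fit_transform(X[categorical_cols])
```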
Using a basic statistic like the mean for imputation may not yield optimal model performance. A more sophisticated method involves employing a machine learning algorithm for value imputation, such as the K-Nearest Neighbours algorithm, which utilizes distance metrics to determine nearest neighbors and imputes their mean.
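Scikit-learn provides this via KNNImputer; a minimal sketch for the numerical columns, using the library's default of five neighbours:

```python
from sklearn.impute import KNNImputer

# Each missing value is replaced by the mean of its 5 nearest neighbours
knn_imputer = KNNImputer(n_neighbors=5)
X_num_knn = knn_imputer.fit_transform(X[numerical_cols])
```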
Chapter 2: Feature Scaling and Binning
Numerical features in a training set often exhibit varying scales. For example, the 'price' feature starts at 5,118 and runs into the tens of thousands, while 'compression-ratio' spans only 7 to 23. If not addressed, a machine learning model may incorrectly assign greater importance to larger numerical values.
Another important preprocessing step is centering and scaling, which shifts each feature to zero mean and rescales it to comparable variance. Many machine learning algorithms assume features are centered and on a similar, roughly normal scale, and they may not perform optimally unless this condition is met.
The Scikit-learn StandardScaler method facilitates both centering and scaling by removing the mean and adjusting each feature to unit variance.
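A minimal sketch of applying StandardScaler to the (already imputed) numerical features:

```python
from sklearn.preprocessing import StandardScaler

# Remove the mean and scale each numerical feature to unit variance
scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_num)
```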
Binning, or discretization, is another technique that converts continuous variables into categories or buckets of similar values. This method is particularly beneficial when dealing with features that have numerous infrequent values, as it can mitigate noise and reduce overfitting risks.
In our dataset, the price variable has a wide range of values, with the most common price appearing only twice. Thus, binning is advantageous for this feature.
The Scikit-learn library offers the KBinsDiscretizer method, which performs both binning and categorical encoding in one step. The following code converts the price feature into six bins and subsequently one-hot encodes the new categorical variable, resulting in a sparse matrix.
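A sketch of what that code might look like; the quantile binning strategy is an assumption, and rows with a missing price are dropped since KBinsDiscretizer cannot handle NaN.

```python
from sklearn.preprocessing import KBinsDiscretizer

# Bin 'price' into 6 buckets and one-hot encode the resulting categories
discretizer = KBinsDiscretizer(n_bins=6, encode="onehot", strategy="quantile")
price_binned = discretizer.fit_transform(df[["price"]].dropna())
print(price_binned.shape)   # sparse matrix with six binary columns
```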
Putting It All Together
Thus far, we have independently executed various preprocessing steps. In practical machine learning applications, it’s essential to apply preprocessing to both the training set and any test or validation datasets, and to repeat this process during inference with new data. Therefore, writing code that consolidates all these transformations is more efficient.
Scikit-learn provides a valuable tool called pipelines, which allows for chaining preprocessing steps with estimators. The following code snippet creates a pipeline that incorporates all preprocessing steps covered in this tutorial while fitting a Random Forest classifier.
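A sketch of such a pipeline, using the column lists defined earlier and illustrative (untuned) hyperparameters; the min_frequency option again assumes Scikit-learn 1.1 or later.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Impute then scale numerical features
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Impute then one-hot encode categorical features, grouping rare values
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", min_frequency=10)),
])

# Apply each sub-pipeline to the appropriate columns
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numerical_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

# Chain preprocessing with the estimator
model = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=200, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Because the same fitted pipeline object is reused for the test set and at inference time, the transformations learned on the training data (means, category frequencies, scaling parameters) are applied consistently everywhere.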
Machine learning algorithms differ fundamentally from human learning. An algorithm cannot grasp what the number of doors on a car signifies as intuitively as humans do. For machines to learn effectively, data must be transformed into a format that aligns with how they learn.
In this article, we have reviewed several preprocessing techniques, including:
- Encoding Categorical Features: Converting categorical variables into numerical representations is essential since most algorithms can only process numerical data.
- Imputing Missing Values: Replacing null values with sensible alternatives is necessary, as many algorithms cannot interpret missing data.
- Feature Scaling: Ensuring that features with different scales are aligned prevents misinterpretation of their importance.
- Binning: Aggregating continuous variables with many infrequent values into groups reduces noise and the likelihood of overfitting during training.
This tutorial provides an introductory overview of the most common data preprocessing techniques in machine learning. While the methods outlined here offer various options, many other preprocessing steps exist.
For those seeking to deepen their understanding of these techniques, the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow" is an excellent resource, available as a free PDF.
For additional articles on Scikit-learn, feel free to explore my previous posts below.
Thank you for reading!
Chapter 3: Video Resources for Further Learning
The first video, Preprocessing Data for Machine Learning - Deep Dive, delves into the intricacies of data preprocessing techniques that enhance machine learning model performance.
The second video, Learn Data Science: Data Preprocessing in Python for Machine Learning, offers a practical guide to implementing preprocessing techniques in Python, focusing on hands-on examples.