Data Pre-processing Tasks Using Python

Data reduction using variance threshold, univariate feature selection, recursive feature elimination, PCA, and correlation

Nidhi Gajjar
4 min read · Oct 28, 2021

Data Set Description

For this practical I have chosen the Iris dataset. The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. The dataset contains 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
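For orientation, here is a minimal sketch of loading the dataset with scikit-learn; the variable names are mine, not part of the original write-up:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # X: (150, 4) measurements, y: species labels 0-2

print(iris.feature_names)       # sepal/petal length and width, in cm
print(iris.target_names)        # ['setosa' 'versicolor' 'virginica']
```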

Task to be performed

In this practical, various data pre-processing tasks are performed: data reduction using variance threshold, univariate feature selection, recursive feature elimination, PCA, and correlation.

Adding Noise to Dataset

The data have four features. To test the effectiveness of the different feature selection methods, we add some noise features to the dataset. The dataset now has more features than the original four, and before applying feature selection it is split into training and test sets, as sketched below.
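Here is a rough sketch of that step. The number of noise features (20), the 70/30 split, and the random seed are illustrative assumptions, not values prescribed by the write-up:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Append 20 random (uninformative) columns to the 4 real features.
rng = np.random.RandomState(42)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 20))])   # shape: (150, 24)

# Split before feature selection so the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=42
)
```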

Why is feature selection important?

The objective of feature selection in ML is to identify the best set of features for building useful models of the subject one is trying to study.

Advantages of feature selection are:

  • It enables the machine learning algorithm to train faster.
  • It reduces the complexity of a model and makes it easier to interpret.
  • It improves the accuracy of a model if the right subset is chosen.
  • It reduces overfitting.

Disadvantages of feature selection are:

  • Although the results are promising, exact approaches are only able to handle at most a few hundred or thousand variables, so they are not applicable to high-dimensional data.
  • Another shortcoming of the vast majority of the feature selection methods is that they arbitrarily seek to identify only one solution to the problem.

Variance Threshold

Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so the default threshold leaves the data unchanged, as the sketch below confirms.
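Continuing from the train/test split above, a minimal sketch with scikit-learn’s VarianceThreshold (the default threshold of 0.0 is assumed):

```python
from sklearn.feature_selection import VarianceThreshold

# Default threshold=0.0 removes only zero-variance columns.
selector = VarianceThreshold()
X_train_vt = selector.fit_transform(X_train)

print(X_train.shape, "->", X_train_vt.shape)   # unchanged: no zero-variance features
```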

Univariate Feature Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:

  • SelectKBest removes all but the k highest-scoring features, as sketched below.
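Continuing from the same split, a sketch of SelectKBest; the ANOVA F-statistic (f_classif) as the score function and k=4 are my assumptions:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score each column with the ANOVA F-statistic and keep the 4 best.
selector = SelectKBest(f_classif, k=4)
X_train_kbest = selector.fit_transform(X_train, y_train)

print(selector.get_support(indices=True))   # column indices of the kept features
```

With four informative real features and pure Gaussian noise, the original columns will typically score highest and survive the selection.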

Recursive Feature Elimination

Recursive Feature Elimination, or RFE for short, is a popular feature selection algorithm. It is popular because it is easy to configure and use, and because it is effective at selecting the features (columns) in a training dataset that are most relevant to predicting the target variable. There are two important configuration options when using RFE: the number of features to select and the algorithm used to rank features. Both of these hyperparameters can be explored, although the performance of the method is not strongly dependent on their being configured well.
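A sketch of RFE on the noisy training data; the choice of logistic regression as the ranking estimator and of 4 as the number of features to keep are illustrative assumptions:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Use a logistic regression as the ranking estimator; keep 4 features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X_train, y_train)

print(rfe.support_)    # boolean mask over the 24 columns
print(rfe.ranking_)    # 1 = selected; higher = eliminated earlier
```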

Principal Component Analysis

Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest. PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible.
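Continuing from the split above, a minimal sketch of reducing the noisy data to two principal components with scikit-learn’s PCA:

```python
from sklearn.decomposition import PCA

# Project the 24-column noisy data onto its first 2 principal components.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)

print(X_train_pca.shape)                 # (n_train_samples, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```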

PCA for 2D Projection
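The original post presumably showed a projection plot here. As a stand-in, the following sketch projects the original four Iris features to 2D and plots the three species; the plotting choices are my own:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X_2d = PCA(n_components=2).fit_transform(iris.data)

# One scatter series per species so the classes are distinguishable.
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=name)

plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.legend()
plt.title("Iris projected onto the first two principal components")
plt.show()
```

In this view, Iris setosa separates cleanly from the other two species, while versicolor and virginica overlap slightly.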

