Data Preprocessing Roadmap

  • Writer: Sairam Penjarla
  • Jun 7, 2024

Data Cleaning

  • Handling Missing Values

pandas.DataFrame.fillna()
pandas.DataFrame.dropna()
sklearn.impute.SimpleImputer
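
As a rough illustration of the three options for missing values, here is a minimal sketch on a made-up DataFrame. The column names and numbers are assumptions chosen for the example, not data from a real project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill each column's gaps with that column's median.
filled = df.fillna(df.median())

# Option 3: SimpleImputer learns the fill values in fit(), so the same
# values can later be applied to unseen data with transform().
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Dropping rows is the simplest choice but discards data; an imputer fitted on the training set only is usually the safer default inside a modeling pipeline.
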
  • Handling Outliers

    • Z-score method

    • IQR method

numpy.clip()
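
The two outlier rules above can be sketched in a few lines of NumPy. The sample values and the Z-score threshold are assumptions for illustration; the 1.5 × IQR fence is the conventional choice.

```python
import numpy as np

# Hypothetical sample: seven ordinary readings and one extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

# Z-score method: distance from the mean in standard-deviation units.
# 2.0 is an illustrative threshold for this tiny sample; 3.0 is a
# common choice on larger datasets.
z_scores = (values - values.mean()) / values.std()
z_outliers = np.abs(z_scores) > 2.0

# IQR method: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (values < lower) | (values > upper)

# numpy.clip() caps the extremes at the fences instead of dropping them.
clipped = np.clip(values, lower, upper)
```
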
  • Data Normalization and Scaling

sklearn.preprocessing.StandardScaler
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.RobustScaler
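
To make the difference between the three scalers concrete, here is a small sketch on a single made-up feature that contains one large value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: subtract the mean, divide by the standard deviation.
standardized = StandardScaler().fit_transform(X)

# MinMaxScaler: map the observed minimum to 0 and the maximum to 1.
min_maxed = MinMaxScaler().fit_transform(X)

# RobustScaler: subtract the median, divide by the interquartile range,
# so the extreme value has far less influence on the scale.
robust = RobustScaler().fit_transform(X)
```

In practice the scaler would be fitted on the training split only and then applied to the test split with `transform()`.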

Data Transformation

  • Encoding Categorical Variables

sklearn.preprocessing.OneHotEncoder
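sklearn.preprocessing.LabelEncoder (intended for target labels; sklearn.preprocessing.OrdinalEncoder is the equivalent for input features)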
pandas.get_dummies()
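
A minimal sketch of the three encoders on a made-up `color` column (the column name and values are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pandas.get_dummies(): quick one-hot encoding directly on a DataFrame.
dummies = pd.get_dummies(df, columns=["color"])

# OneHotEncoder: remembers the categories seen in fit(), so the same
# columns are produced for new data; unknown categories become all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(df[["color"]]).toarray()

# LabelEncoder: maps each class to an integer in sorted order.
labels = LabelEncoder().fit_transform(df["color"])
```

The integer codes from `LabelEncoder` imply an order (blue < green < red here), which is why one-hot encoding is usually preferred for nominal input features.
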
  • Binning

pandas.cut()
pandas.qcut()
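
The difference between the two binning functions is easiest to see side by side. This sketch uses made-up ages and bin labels:

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([5, 17, 25, 42, 67, 80])

# pandas.cut(): bins defined by fixed edges (right edge inclusive by default).
age_group = pd.cut(ages, bins=[0, 18, 65, 120],
                   labels=["child", "adult", "senior"])

# pandas.qcut(): bins defined by quantiles, so each bin holds roughly
# the same number of rows.
age_tercile = pd.qcut(ages, q=3, labels=["low", "mid", "high"])
```
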
  • Log Transformation

numpy.log()
numpy.log1p()
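
A short sketch of when to reach for each function, using a made-up right-skewed array that includes a zero:

```python
import numpy as np

# Hypothetical right-skewed values, including a zero.
skewed = np.array([0.0, 9.0, 99.0, 999.0])

# numpy.log1p() computes log(1 + x), so zeros are handled safely.
log1p_values = np.log1p(skewed)

# numpy.log() is only defined for positive inputs, so filter first.
positive_only = np.log(skewed[skewed > 0])

# numpy.expm1() is the inverse of log1p, useful for undoing the
# transform on model predictions.
restored = np.expm1(log1p_values)
```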

Feature Engineering

  • Polynomial Features

sklearn.preprocessing.PolynomialFeatures
  • Interaction Features

    • Custom interaction terms
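
Polynomial and interaction features can be sketched together. The `width` and `height` columns below are hypothetical, and the hand-written `area` column stands in for a custom interaction term.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical numeric features.
df = pd.DataFrame({"width": [2.0, 3.0], "height": [5.0, 7.0]})

# Degree-2 expansion: width, height, width^2, width*height, height^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df)

# interaction_only=True keeps the cross term but drops the squares:
# width, height, width*height.
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = inter.fit_transform(df)

# A custom interaction term is often just a product written by hand.
df["area"] = df["width"] * df["height"]
```
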

  • Datetime Features

    • Extracting year, month, day, etc. from datetime objects

pandas.DatetimeIndex
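
One way to extract datetime parts is the `.dt` accessor on a datetime column, with `pandas.DatetimeIndex` exposing the same attributes. The timestamps and new column names below are made up for the example.

```python
import pandas as pd

# Hypothetical timestamps.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-06-07 14:30:00", "2023-12-31 23:59:00"])
})

# The .dt accessor exposes calendar parts of a datetime column.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5

# pandas.DatetimeIndex offers the same attributes without .dt.
index = pd.DatetimeIndex(df["timestamp"])
quarters = index.quarter
```
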

  • Text Features

sklearn.feature_extraction.text.CountVectorizer
sklearn.feature_extraction.text.TfidfVectorizer
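
Here is a small sketch of both vectorizers on three made-up sentences, to show how raw counts differ from TF-IDF weights.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]

# CountVectorizer: one column per word, values are raw counts.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Words ordered by their column index in the count matrix.
vocabulary = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# TfidfVectorizer: counts reweighted so words that appear in every
# document (such as "the") contribute less than rarer words.
tfidf = TfidfVectorizer().fit_transform(docs)
```

Both return sparse matrices; call `.toarray()` only on small examples like this one.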

Feature Selection

  • Univariate Selection

sklearn.feature_selection.SelectKBest
sklearn.feature_selection.chi2
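
A minimal sketch of univariate selection, assuming the iris dataset bundled with scikit-learn as example data. Note that `chi2` is a scoring function passed to `SelectKBest`, and it expects non-negative feature values.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris ships with scikit-learn; all four features are non-negative.
X, y = load_iris(return_X_y=True)

# Score each feature against the target and keep the two best.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

# Boolean mask over the original columns showing which were kept.
kept = selector.get_support()
```
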
  • Recursive Feature Elimination

sklearn.feature_selection.RFE
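
Recursive feature elimination needs an estimator that exposes coefficients or importances. This sketch assumes logistic regression and the bundled iris data purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# RFE fits the estimator, removes the weakest feature, and repeats
# until only n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

# support_ marks the surviving features; ranking_ is 1 for survivors
# and grows for features that were eliminated earlier.
selected_mask = rfe.support_
ranking = rfe.ranking_
```
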
  • Principal Component Analysis (PCA)

sklearn.decomposition.PCA
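
Strictly speaking, PCA builds new features rather than selecting existing ones. A minimal sketch on the bundled iris data, with standardization first since PCA is sensitive to feature scale:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize so no feature dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Project the four original features onto two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Share of the total variance captured by each component.
explained = pca.explained_variance_ratio_
```
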
  • Feature Importance

sklearn.ensemble.RandomForestClassifier
sklearn.ensemble.ExtraTreesClassifier
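
Tree ensembles expose a `feature_importances_` attribute after fitting. A rough sketch with a random forest on the bundled iris data (`ExtraTreesClassifier` can be swapped in the same way):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()

# Fit a forest; random_state is fixed only to make the run repeatable.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Impurity-based importances sum to 1; sort from most to least important.
ranked = sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```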

Data Augmentation

  • Image Data Augmentation

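keras.preprocessing.image.ImageDataGenerator (deprecated in recent TensorFlow releases; Keras preprocessing layers such as RandomFlip and RandomRotation are the recommended replacement)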
  • Text Data Augmentation

    • Synonym replacement

    • Random insertion, swap, and deletion
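
The text augmentation ideas above can be sketched with the standard library alone. The helper names and the tiny synonym dictionary are hypothetical; a real project would draw synonyms from a thesaurus or a dedicated augmentation library. Random insertion is left out to keep the sketch short.

```python
import random


def synonym_replacement(words, synonyms):
    """Replace each word that has an entry in the synonym dictionary."""
    return [synonyms.get(word, word) for word in words]


def random_swap(words, rng):
    """Return a copy with two randomly chosen positions swapped."""
    if len(words) < 2:
        return list(words)
    i, j = rng.sample(range(len(words)), 2)
    swapped = list(words)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one."""
    if not words:
        return []
    kept = [word for word in words if rng.random() > p]
    return kept or [rng.choice(words)]


# Hypothetical sentence and synonym table.
sentence = "the quick brown fox".split()
rng = random.Random(0)

replaced = synonym_replacement(sentence, {"quick": "fast"})
swapped = random_swap(sentence, rng)
deleted = random_deletion(sentence, 0.3, rng)
```
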

  • Audio Data Augmentation

    • Time stretching

    • Pitch shifting


Splitting Data

  • Train-Test Split

sklearn.model_selection.train_test_split
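
A minimal sketch of a stratified train-test split on made-up arrays. The 80/20 ratio and the fixed `random_state` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, balanced binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# stratify=y keeps the class ratio the same in both splits;
# random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
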
  • Cross-Validation Split

sklearn.model_selection.KFold
sklearn.model_selection.StratifiedKFold
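
The two splitters differ mainly in whether they look at the target. This sketch uses a made-up imbalanced target to show that `StratifiedKFold` spreads the minority class across folds.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical data: 6 samples, with 2 belonging to the minority class.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])

# KFold ignores the target and splits by position (after shuffling).
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
kfold_sizes = [(len(train), len(test)) for train, test in kfold.split(X)]

# StratifiedKFold keeps the class ratio roughly equal in every fold.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
minority_per_fold = [int(y[test].sum()) for _, test in skf.split(X, y)]
```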

Balancing Techniques

  • Over-sampling

imblearn.over_sampling.SMOTE
  • Under-sampling

imblearn.under_sampling.RandomUnderSampler
  • Hybrid Methods

imblearn.combine.SMOTEENN
imblearn.combine.SMOTETomek

Data Integration and Reduction

  • Merging and Joining

pandas.merge()
pandas.concat()
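
A short sketch of the difference between joining on a key and stacking rows, using two made-up tables:

```python
import pandas as pd

# Hypothetical tables sharing a customer_id key.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                       "amount": [20.0, 35.0, 15.0, 50.0]})

# Inner join: only keys present in both tables survive.
inner = pd.merge(customers, orders, on="customer_id", how="inner")

# Left join: every customer is kept; missing orders become NaN.
left = pd.merge(customers, orders, on="customer_id", how="left")

# pandas.concat(): stack tables with the same columns on top of each other.
more_customers = pd.DataFrame({"customer_id": [5], "name": ["Di"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)
```
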
  • Dimensionality Reduction

sklearn.decomposition.PCA
sklearn.decomposition.TruncatedSVD
sklearn.manifold.TSNE
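
`TruncatedSVD` is the usual choice when the input is a sparse matrix, since PCA would need to center (and therefore densify) the data. A minimal sketch on randomly generated sparse counts; the matrix shape and component count are arbitrary assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse count matrix: 50 rows, 20 mostly-zero columns.
rng = np.random.default_rng(0)
dense = rng.poisson(0.3, size=(50, 20)).astype(float)
X_sparse = csr_matrix(dense)

# Reduce 20 sparse columns to 5 dense components without centering.
svd = TruncatedSVD(n_components=5, random_state=0)
X_svd = svd.fit_transform(X_sparse)
```

`TSNE` follows the same `fit_transform` pattern but is meant for visualization rather than as a preprocessing step for models.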

Feature Scaling

  • Normalization

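sklearn.preprocessing.Normalizer (rescales each sample, i.e. each row, to unit norm; for per-feature min-max normalization see sklearn.preprocessing.MinMaxScaler)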
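
Because `Normalizer` works row by row rather than column by column, a tiny made-up example helps show what it does:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples; each row is one sample.
X = np.array([[3.0, 4.0], [1.0, 0.0]])

# Each row is divided by its own L2 norm, so every row ends up with length 1.
unit_rows = Normalizer(norm="l2").fit_transform(X)
row_lengths = np.linalg.norm(unit_rows, axis=1)
```
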
  • Standardization

sklearn.preprocessing.StandardScaler
  • Robust Scaling

sklearn.preprocessing.RobustScaler

 
 
