Data Preprocessing Roadmap
- Sairam Penjarla

- Jun 7, 2024
Data Cleaning
Handling Missing Values
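A minimal sketch of the three tools in this subsection, assuming a hypothetical toy DataFrame (the column names and fill values are illustrative, not from the roadmap):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; column names are illustrative only.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0], "city": ["Pune", None, "Delhi"]})

# fillna: replace missing values with a chosen value per column.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})

# dropna: drop every row that has at least one missing value.
dropped = df.dropna()

# SimpleImputer: learns the fill value in fit(), so the same value
# can later be applied to new data with transform().
imputer = SimpleImputer(strategy="median")
age_imputed = imputer.fit_transform(df[["age"]])
```

SimpleImputer is usually the better fit inside a scikit-learn pipeline, because the fill value learned on the training data is reused on the test data.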
pandas.DataFrame.fillna()
pandas.DataFrame.dropna()
sklearn.impute.SimpleImputer
Handling Outliers
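The outlier methods in this subsection (Z-score, IQR, numpy.clip()) can be sketched roughly as follows. The sample values, the Z-score cutoff of 2.5, and the 1.5 x IQR fences are assumptions chosen for illustration:

```python
import numpy as np

# Hypothetical sample with one obvious outlier (95.0).
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 13.0, 12.0, 95.0])

# Z-score method: distance from the mean in standard-deviation units.
# The cutoff is a convention; 2.5 is an assumption made for this small sample.
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 2.5

# IQR method: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (x < low) | (x > high)

# numpy.clip: cap extreme values at the fences instead of dropping them.
clipped = np.clip(x, low, high)
```

With only 10 points and the population standard deviation, |z| cannot exceed 3, so the commonly quoted cutoff of 3 would flag nothing in this sample. The IQR fences do not have that limitation.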
Z-score method
IQR method
numpy.clip()
Data Normalization and Scaling
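As a rough comparison of the three scalers in this subsection, here is a sketch on a single hypothetical column that contains one large value:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one large value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(X)      # rescaled to the range [0, 1]
robust = RobustScaler().fit_transform(X)      # (x - median) / IQR
```

Because RobustScaler centers on the median and divides by the interquartile range, the four ordinary values stay evenly spaced around 0, while MinMaxScaler squeezes them close to 0 to make room for the large value.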
sklearn.preprocessing.StandardScaler
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.RobustScaler
Data Transformation
Encoding Categorical Variables
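A short sketch of the three encoders in this subsection, assuming a hypothetical `color` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# OneHotEncoder: one binary column per category, categories sorted alphabetically.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(df[["color"]]).toarray()

# LabelEncoder: integer codes; scikit-learn documents it for target labels.
labels = LabelEncoder().fit_transform(df["color"])

# get_dummies: one-hot encoding done directly in pandas.
dummies = pd.get_dummies(df["color"], prefix="color")
```

OneHotEncoder remembers the categories it saw in `fit`, which makes it easier to apply the same encoding to new data than `get_dummies`.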
sklearn.preprocessing.OneHotEncoder
sklearn.preprocessing.LabelEncoder
pandas.get_dummies()
Binning
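The two binning functions in this subsection differ in how the bin edges are chosen. A sketch, assuming hypothetical ages and age bands:

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([5, 17, 30, 45, 62, 80])

# cut: bins with edges you choose (right edge inclusive by default).
bands = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# qcut: edges taken from quantiles, so each bin holds about the same number of rows.
terciles = pd.qcut(ages, q=3, labels=["low", "mid", "high"])
```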
pandas.cut()
pandas.qcut()
Log Transformation
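For the log transforms in this subsection, the practical difference is how zeros are handled. A small sketch with hypothetical income values:

```python
import numpy as np

# Hypothetical right-skewed values that include a zero.
incomes = np.array([0.0, 9.0, 99.0, 999.0])

# log1p computes log(1 + x), so a zero maps to 0 instead of -inf.
log_incomes = np.log1p(incomes)

# expm1 is the inverse of log1p, useful for mapping predictions back.
restored = np.expm1(log_incomes)

# Plain log is fine when every value is strictly positive.
log_positive = np.log(incomes[1:])
```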
numpy.log()
numpy.log1p()
Feature Engineering
Polynomial Features
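A minimal sketch of what PolynomialFeatures produces, assuming a single hypothetical row with two features a = 2 and b = 3:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical row with features a = 2 and b = 3.
X = np.array([[2.0, 3.0]])

# degree=2 without the bias column gives: a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```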
sklearn.preprocessing.PolynomialFeatures
Interaction Features
Custom interaction terms
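Custom interaction terms are usually just arithmetic between existing columns. A sketch, assuming hypothetical housing columns (`width`, `length`, `price`):

```python
import pandas as pd

# Hypothetical housing data; the column names are illustrative only.
df = pd.DataFrame({"width": [2.0, 3.0], "length": [5.0, 4.0], "price": [100.0, 240.0]})

# Product interaction: two raw features combined into one with domain meaning.
df["area"] = df["width"] * df["length"]

# Ratio interaction: one feature expressed relative to another.
df["price_per_area"] = df["price"] / df["area"]
```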
Datetime Features
Extracting year, month, day, etc. from datetime objects
pandas.DatetimeIndex
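The datetime extraction above can be sketched with the pandas `.dt` accessor and `DatetimeIndex`. The timestamps and the weekend flag are assumptions added for illustration:

```python
import pandas as pd

# Hypothetical timestamps.
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2024-06-07 14:30:00", "2023-12-31 08:00:00"])}
)

# The .dt accessor exposes calendar parts of a datetime column.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek  # Monday is 0, Sunday is 6

# DatetimeIndex offers the same attributes without the .dt accessor.
index = pd.DatetimeIndex(df["timestamp"])
is_weekend = index.dayofweek >= 5
```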
Text Features
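A small sketch of the two vectorizers in this subsection, assuming three hypothetical documents:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus.
docs = ["the cat sat", "the cat ran", "dogs ran"]

# CountVectorizer: one column per word, values are raw counts.
counter = CountVectorizer()
counts = counter.fit_transform(docs)
vocab = sorted(counter.vocabulary_)  # columns follow alphabetical order

# TfidfVectorizer: counts reweighted so words common to many documents matter less.
# Rows are L2-normalized by default.
tfidf = TfidfVectorizer().fit_transform(docs)
```

Both return sparse matrices, so call `.toarray()` only on small examples like this one.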
sklearn.feature_extraction.text.CountVectorizer
sklearn.feature_extraction.text.TfidfVectorizer
Feature Selection
Univariate Selection
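Univariate selection scores each feature on its own against the target. A sketch using SelectKBest with chi2, assuming the Iris dataset bundled with scikit-learn as example data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris is used only as convenient example data; chi2 needs non-negative features.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-squared score against the target.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)  # column indices that survived
```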
sklearn.feature_selection.SelectKBest
sklearn.feature_selection.chi2
Recursive Feature Elimination
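Recursive feature elimination repeatedly fits a model and drops the weakest feature. A sketch with RFE, where the LogisticRegression estimator and the Iris data are assumptions (any estimator exposing coefficients or importances would do):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Example data and estimator are assumptions, not part of the roadmap.
X, y = load_iris(return_X_y=True)

# Refit the model and drop the weakest feature until 2 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

selected_mask = rfe.support_  # True for the features that were kept
ranking = rfe.ranking_        # 1 for kept features, larger numbers were dropped earlier
```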
sklearn.feature_selection.RFE
Principal Component Analysis (PCA)
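A minimal PCA sketch, assuming the Iris data and a standardization step beforehand (PCA is sensitive to feature scale, so scaling first is a common choice rather than a requirement of the API):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example data is an assumption.
X, _ = load_iris(return_X_y=True)

# Standardize first so no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Share of the total variance captured by each component, largest first.
explained = pca.explained_variance_ratio_
```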
sklearn.decomposition.PCA
Feature Importance
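Tree ensembles expose a per-feature importance score after fitting. A sketch with RandomForestClassifier, assuming the Iris data and 100 trees:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Example data and hyperparameters are assumptions.
X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = forest.feature_importances_

# Feature indices ordered from most to least important.
ranked = importances.argsort()[::-1]
```

ExtraTreesClassifier exposes the same `feature_importances_` attribute and can be swapped in directly.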
sklearn.ensemble.RandomForestClassifier
sklearn.ensemble.ExtraTreesClassifier
Data Augmentation
Image Data Augmentation
keras.preprocessing.image.ImageDataGenerator
Text Data Augmentation
Synonym replacement
Random insertion, swap, and deletion
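The four text augmentation operations above can be sketched in plain Python. The `SYNONYMS` table and all function names here are hypothetical; real pipelines typically draw synonyms from a thesaurus such as WordNet:

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "cheerful"]}

def synonym_replace(words, rng):
    """Replace every word that has a known synonym with a random one."""
    return [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]

def random_insert(words, rng):
    """Insert a synonym of a random known word at a random position."""
    out = list(words)
    candidates = [w for w in words if w in SYNONYMS]
    if candidates:
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), synonym)
    return out

def random_swap(words, rng):
    """Swap the words at two distinct random positions."""
    out = list(words)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_delete(words, rng, p=0.2):
    """Drop each word with probability p, always keeping at least one word."""
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]

rng = random.Random(0)  # seeded so runs are repeatable
sentence = "the quick dog is happy".split()
augmented = [f(sentence, rng) for f in (synonym_replace, random_insert, random_swap, random_delete)]
```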
Audio Data Augmentation
Time stretching
Pitch shifting
Splitting Data
Train-Test Split
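A minimal train-test split sketch, assuming a hypothetical 10-row dataset with two balanced classes. The `stratify` argument is optional and keeps the class proportions the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, two balanced classes.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# Hold out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```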
sklearn.model_selection.train_test_split
Cross-Validation Split
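For cross-validation, the splitters in this subsection yield index arrays for each fold. A sketch on hypothetical imbalanced labels, showing that StratifiedKFold keeps the class mix the same in every fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical data: 12 rows, 8 of class 0 and 4 of class 1.
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)

# KFold: 4 folds of equal size, ignoring the labels.
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
kfold_sizes = [len(test_idx) for _, test_idx in kfold.split(X)]

# StratifiedKFold: each test fold keeps the 2:1 class ratio.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
```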
sklearn.model_selection.KFold
sklearn.model_selection.StratifiedKFold
Balancing Techniques
Over-sampling
imblearn.over_sampling.SMOTE
Under-sampling
imblearn.under_sampling.RandomUnderSampler
Hybrid Methods
imblearn.combine.SMOTEENN
imblearn.combine.SMOTETomek
Data Integration and Reduction
Merging and Joining
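The two pandas functions in this subsection cover different cases: `merge` joins tables on a key, while `concat` stacks them. A sketch with hypothetical customer and order tables:

```python
import pandas as pd

# Hypothetical tables; names and values are illustrative only.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [20, 35, 50]})

# merge: SQL-style join on a key. A left join keeps every customer,
# so a customer without orders gets NaN in the amount column.
joined = pd.merge(customers, orders, on="customer_id", how="left")

# concat: stack frames that share the same columns.
more_customers = pd.DataFrame({"customer_id": [4], "name": ["Di"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)
```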
pandas.merge()
pandas.concat()
Dimensionality Reduction
sklearn.decomposition.PCA
sklearn.decomposition.TruncatedSVD
sklearn.manifold.TSNE
Feature Scaling
Normalization
sklearn.preprocessing.Normalizer
Standardization
sklearn.preprocessing.StandardScaler
Robust Scaling
sklearn.preprocessing.RobustScaler
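One detail in the Feature Scaling entries above is easy to miss: Normalizer works on each row (sample), while StandardScaler and RobustScaler work on each column (feature). A sketch on a hypothetical 2 x 2 matrix:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical data: the second row is the first row doubled.
X = np.array([[3.0, 4.0], [6.0, 8.0]])

# Normalizer: rescales each ROW to unit L2 norm, so both rows become [0.6, 0.8].
row_normalized = Normalizer(norm="l2").fit_transform(X)

# StandardScaler: rescales each COLUMN to zero mean and unit variance.
col_standardized = StandardScaler().fit_transform(X)
```

After row normalization the two samples are identical, because only their direction is kept. After standardization they sit at -1 and +1 in every column.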

