Data Preprocessing Roadmap
- Sairam Penjarla
- Jun 7, 2024
Data Cleaning
Handling Missing Values
pandas.DataFrame.fillna()
pandas.DataFrame.dropna()
sklearn.impute.SimpleImputer
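To make the three options above concrete, here is a minimal sketch on a made-up two-column frame. The column names and values are illustrative assumptions, not data from this post.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap in each column.
df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 40.0],
                   "income": [50.0, 62.0, np.nan, 58.0]})

# Option 1: fill each gap with that column's mean.
filled = df.fillna(df.mean())

# Option 2: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 3: SimpleImputer learns the fill value in fit(), so the same
# value can later be reused on new data through transform().
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(df)  # returns a NumPy array
```

Dropping rows is the simplest choice but throws data away; filling keeps every row at the cost of inventing values.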
Handling Outliers
Z-score method
IQR method
numpy.clip()
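A rough sketch of the three outlier techniques listed above, applied to an invented sample. The thresholds (|z| > 3 and 1.5 × IQR) are common conventions assumed here, not rules stated in this post.

```python
import numpy as np

# Hypothetical sample: the values 10..28 plus one extreme reading.
values = np.concatenate([np.arange(10.0, 29.0), [200.0]])

# Z-score method: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR method: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# np.clip caps extreme values at the fences instead of removing them.
clipped = np.clip(values, lower, upper)
```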
Data Normalization and Scaling
sklearn.preprocessing.StandardScaler
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.RobustScaler
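The three scalers above differ mainly in which statistics they use. A small sketch on a made-up single column with one large value shows the difference:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature column containing one large value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(X)      # rescaled to the range [0, 1]
robust = RobustScaler().fit_transform(X)      # (x - median) / IQR
```

Because `RobustScaler` relies on the median and interquartile range, the large value has far less influence on how the other points are scaled.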
Data Transformation
Encoding Categorical Variables
sklearn.preprocessing.OneHotEncoder
sklearn.preprocessing.LabelEncoder
pandas.get_dummies()
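A short sketch of the three encoders above on invented category values. One point worth noting: scikit-learn documents `LabelEncoder` as a tool for target labels, so it is used on a label list here rather than on a feature column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical feature.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pandas.get_dummies: quick one-hot columns directly on a DataFrame.
dummies = pd.get_dummies(df["color"], prefix="color")

# OneHotEncoder: remembers the categories seen in fit(), and with
# handle_unknown="ignore" encodes unseen categories as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(df[["color"]]).toarray()

# LabelEncoder: maps each class to an integer (classes are sorted first).
labels = LabelEncoder().fit_transform(["cat", "dog", "cat", "bird"])
```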
Binning
pandas.cut()
pandas.qcut()
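The difference between the two binning functions is easiest to see side by side. The ages, bin edges, and labels below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([5, 17, 25, 42, 67, 80])

# pd.cut: bins come from explicit edges (right edge inclusive by default).
by_edges = pd.cut(ages, bins=[0, 18, 65, 120],
                  labels=["child", "adult", "senior"])

# pd.qcut: bins come from quantiles, so each holds roughly equal counts.
by_quantile = pd.qcut(ages, q=3, labels=["low", "mid", "high"])
```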
Log Transformation
numpy.log()
numpy.log1p()
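A minimal sketch of when to reach for each function, using a made-up right-skewed array that includes a zero:

```python
import numpy as np

# Hypothetical right-skewed values, including a zero.
skewed = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# np.log is not finite at 0, so apply it to strictly positive values only.
log_positive = np.log(skewed[skewed > 0])

# np.log1p computes log(1 + x), which maps 0 to 0 and handles zeros safely.
log1p_all = np.log1p(skewed)

# np.expm1 is the inverse of log1p, useful for mapping results back.
restored = np.expm1(log1p_all)
```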
Feature Engineering
Polynomial Features
sklearn.preprocessing.PolynomialFeatures
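As a small illustration, `PolynomialFeatures` with `degree=2` expands two input columns into their squares and their product. The input row is an arbitrary example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two features: x0 = 2, x1 = 3.
X = np.array([[2.0, 3.0]])

# include_bias=False leaves out the constant column of ones.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x0, x1, x0^2, x0*x1, x1^2
```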
Interaction Features
Custom interaction terms
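Custom interaction terms can be as simple as multiplying two columns. The sketch below shows a hand-written product next to the `interaction_only=True` option of `PolynomialFeatures`; the `width`, `height`, and `area` names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical measurements.
df = pd.DataFrame({"width": [2.0, 4.0], "height": [3.0, 5.0]})

# Hand-written interaction: the product of two existing columns.
df["area"] = df["width"] * df["height"]

# Automated alternative: keep only cross terms, no squared terms.
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(df[["width", "height"]])  # width, height, width*height
```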
Datetime Features
Extracting year, month, day, etc. from datetime objects
pandas.DatetimeIndex
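A quick sketch of pulling calendar parts out of timestamps, both through the `.dt` accessor on a column and through a `DatetimeIndex`. The timestamps and column names are examples only.

```python
import pandas as pd

# Hypothetical timestamps.
df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2024-06-07 14:30:00", "2023-12-25 08:00:00"])})

# The .dt accessor exposes calendar parts of a datetime column.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek  # Monday is 0

# A DatetimeIndex offers the same attributes directly on the index.
idx = pd.DatetimeIndex(df["timestamp"])
months = idx.month
```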
Text Features
sklearn.feature_extraction.text.CountVectorizer
sklearn.feature_extraction.text.TfidfVectorizer
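The two vectorizers share the same interface but weight terms differently. A small sketch on three invented sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]

# CountVectorizer: raw term counts per document (sparse matrix).
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
vocabulary = sorted(count_vec.vocabulary_)  # column order of the matrix

# TfidfVectorizer: counts reweighted so that terms appearing in many
# documents (such as "the") receive less weight than rarer terms.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()
```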
Feature Selection
Univariate Selection
sklearn.feature_selection.SelectKBest
sklearn.feature_selection.chi2
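`chi2` is a scoring function that gets passed to `SelectKBest`, not a selector on its own. A minimal sketch, assuming the bundled iris dataset as a stand-in (chi-squared scoring requires non-negative features, which iris satisfies):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris is used only as a convenient built-in example dataset.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-squared score against y.
selector = SelectKBest(score_func=chi2, k=2)
X_top = selector.fit_transform(X, y)

# Column indices of the features that were kept.
kept = selector.get_support(indices=True)
```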
Recursive Feature Elimination
sklearn.feature_selection.RFE
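RFE repeatedly fits a model and drops the weakest feature until the requested number remains. In this sketch the logistic regression estimator and the iris dataset are assumptions chosen for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Any estimator exposing coef_ or feature_importances_ can drive RFE.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

selected_mask = rfe.support_  # True for features that were kept
ranking = rfe.ranking_        # 1 marks a selected feature
X_reduced = rfe.transform(X)
```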
Principal Component Analysis (PCA)
sklearn.decomposition.PCA
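Unlike the selectors above, PCA does not pick original columns; it builds new components as linear combinations of them. A minimal sketch, assuming iris as example data and standardizing first because PCA is sensitive to feature scale:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize so that no feature dominates purely because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Share of the total variance captured by each component.
explained = pca.explained_variance_ratio_
```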
Feature Importance
sklearn.ensemble.RandomForestClassifier
sklearn.ensemble.ExtraTreesClassifier
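Both tree ensembles expose `feature_importances_` after fitting, which can be used to rank features. The dataset and hyperparameters below are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
extra = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; each array sums to 1 across the features.
rf_importances = forest.feature_importances_
et_importances = extra.feature_importances_

# Feature indices ordered from most to least important (random forest).
rf_ranking = rf_importances.argsort()[::-1]
```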
Data Augmentation
Image Data Augmentation
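keras.preprocessing.image.ImageDataGenerator (deprecated in recent TensorFlow/Keras releases; preprocessing layers such as keras.layers.RandomFlip and keras.layers.RandomRotation are the suggested replacement)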
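To show the idea without depending on Keras, here is a rough NumPy-only sketch of two basic image augmentations. It is a stand-in for illustration, not the Keras API, and the tiny array plays the role of an image.

```python
import numpy as np

# Hypothetical "image": height 2, width 3, 3 channels.
image = np.arange(18, dtype=np.uint8).reshape(2, 3, 3)

# Horizontal flip: reverse the width axis.
flipped = image[:, ::-1, :]

# 90-degree rotation in the height/width plane; the channel axis is untouched.
rotated = np.rot90(image, k=1, axes=(0, 1))
```

Real pipelines usually apply such transforms randomly per batch, which is what the library utilities handle.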
Text Data Augmentation
Synonym replacement
Random insertion, swap, and deletion
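The four text operations above can be sketched in plain Python. The helper names and the tiny synonym table are hypothetical; a real pipeline would draw synonyms from a lexical resource.

```python
import random

def synonym_replacement(words, synonyms, n, rng):
    """Replace up to n words that have an entry in the synonym table."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out

def random_insertion(words, synonyms, n, rng):
    """Insert n synonyms of randomly chosen words at random positions."""
    out = list(words)
    candidates = [w for w in words if w in synonyms]
    if not candidates:
        return out
    for _ in range(n):
        new_word = rng.choice(synonyms[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), new_word)
    return out

def random_swap(words, n_swaps, rng):
    """Exchange two distinct random positions, n_swaps times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

# Usage on a made-up sentence with a made-up synonym table.
rng = random.Random(0)
words = "the quick brown fox jumps".split()
table = {"quick": ["fast"]}

replaced = synonym_replacement(words, table, 1, rng)
inserted = random_insertion(words, table, 2, rng)
swapped = random_swap(words, 1, rng)
deleted = random_deletion(words, 0.3, rng)
```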
Audio Data Augmentation
Time stretching
Pitch shifting
Splitting Data
Train-Test Split
sklearn.model_selection.train_test_split
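A minimal sketch of a stratified 80/20 split. The dataset, split ratio, and seed are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both parts;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```

Any fitted preprocessing step (scalers, imputers, encoders) should be fit on `X_train` only and then applied to `X_test`.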
Cross-Validation Split
sklearn.model_selection.KFold
sklearn.model_selection.StratifiedKFold
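The sketch below contrasts the two splitters on a small, deliberately imbalanced toy set (8 samples of class 0, 2 of class 1). The data are invented.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical data: 10 samples, imbalanced labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

# KFold ignores the labels and only partitions the row indices.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_test_sizes = [len(test_idx) for _, test_idx in kf.split(X)]

# StratifiedKFold preserves the class ratio within every fold.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
```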
Balancing Techniques
Over-sampling
imblearn.over_sampling.SMOTE
Under-sampling
imblearn.under_sampling.RandomUnderSampler
Hybrid Methods
imblearn.combine.SMOTEENN
imblearn.combine.SMOTETomek
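As a dependency-light illustration of the over- and under-sampling ideas, the sketch below uses `sklearn.utils.resample` for plain random resampling. This is not SMOTE (which synthesizes new minority points by interpolation) and not the imblearn API; the toy data are invented.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced data: 10 majority rows, 2 minority rows.
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0] * 10 + [1] * 2)
X_major, X_minor = X[y == 0], X[y == 1]

# Random over-sampling: draw minority rows with replacement
# until they match the majority count.
X_minor_up = resample(X_minor, replace=True,
                      n_samples=len(X_major), random_state=0)
X_over = np.vstack([X_major, X_minor_up])
y_over = np.array([0] * len(X_major) + [1] * len(X_minor_up))

# Random under-sampling: keep a random subset of the majority rows.
X_major_down = resample(X_major, replace=False,
                        n_samples=len(X_minor), random_state=0)
X_under = np.vstack([X_major_down, X_minor])
y_under = np.array([0] * len(X_major_down) + [1] * len(X_minor))
```

Resampling should be applied to the training split only, so that evaluation data keep the original class balance.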
Data Integration and Reduction
Merging and Joining
pandas.merge()
pandas.concat()
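`merge` combines tables side by side on a key, while `concat` stacks them. A small sketch with invented customer and order tables:

```python
import pandas as pd

# Hypothetical tables.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                       "amount": [10.0, 20.0, 5.0, 7.5]})

# Inner join: only customer_ids present in both tables survive.
inner = pd.merge(customers, orders, on="customer_id", how="inner")

# Left join: every customer is kept; missing orders become NaN.
left = pd.merge(customers, orders, on="customer_id", how="left")

# concat: stack rows of frames that share the same columns.
more_customers = pd.DataFrame({"customer_id": [5], "name": ["Di"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)
```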
Dimensionality Reduction
sklearn.decomposition.PCA
sklearn.decomposition.TruncatedSVD
sklearn.manifold.TSNE
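PCA is sketched earlier in the roadmap, so this example covers the other two reducers. `TruncatedSVD` accepts sparse input directly, and `TSNE` is mainly a visualization tool. The random matrix and parameter values are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Hypothetical mostly-zero matrix: 30 samples, 10 features.
rng = np.random.default_rng(0)
dense = rng.random((30, 10))
dense[dense < 0.7] = 0.0
X_sparse = csr_matrix(dense)

# TruncatedSVD works on sparse matrices without densifying them.
svd = TruncatedSVD(n_components=3, random_state=0)
X_svd = svd.fit_transform(X_sparse)

# t-SNE embeds into 2-D; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
X_tsne = tsne.fit_transform(dense)
```

Note that `TSNE` has no `transform` method for new data, so it is not a drop-in preprocessing step for a model pipeline.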
Feature Scaling
Normalization
sklearn.preprocessing.Normalizer
Standardization
sklearn.preprocessing.StandardScaler
Robust Scaling
sklearn.preprocessing.RobustScaler
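One distinction in this last group is easy to miss: `Normalizer` rescales each row (sample) to unit norm, while `StandardScaler` and `RobustScaler` work column by column. A minimal sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical data: 3 samples, 2 features.
X = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])

# Normalizer: each ROW is divided by its own L2 norm.
row_normalized = Normalizer(norm="l2").fit_transform(X)

# StandardScaler: each COLUMN gets zero mean and unit variance.
col_standardized = StandardScaler().fit_transform(X)
```

After row normalization the first two samples become identical, because they point in the same direction and differ only in magnitude.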