Customer Segmentation Using Agglomerative Clustering
- Sairam Penjarla
- Jul 17, 2024
Updated: Jul 19, 2024
Customer segmentation is a critical task in marketing and business analytics that allows businesses to categorize customers into distinct groups based on shared characteristics. In this project, we'll explore how to perform customer segmentation using Principal Component Analysis (PCA) for dimensionality reduction and clustering techniques such as KMeans and Agglomerative Clustering.
Introduction
Customer segmentation helps businesses understand their customers better, enabling personalized marketing strategies, improved customer retention, and more effective product recommendations. By clustering customers based on demographic and behavioral data, businesses can target specific customer groups with tailored marketing campaigns.
Project Setup
To follow along with this project, clone the GitHub repository using the following command:
git clone https://github.com/sairam-penjarla/Customer-segmentation.git
cd Customer-segmentation
Project Structure
The project is structured into several key files:
preprocessing.py: Contains classes for data preprocessing steps, including feature engineering and data scaling.
model_training.py: Contains classes for training clustering models (KMeans and Agglomerative Clustering) and visualization.
main.py: Main script to orchestrate the preprocessing, model training, and prediction steps.
Step-by-Step Explanation
1. Data Preprocessing (preprocessing.py)
Data preprocessing is crucial to ensure that the data is clean, normalized, and ready for modeling. Let's dive into the FeatureEngineering class within preprocessing.py.
Feature Engineering
The FeatureEngineering class performs several key tasks:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
class FeatureEngineering:
    def run(self, data):
        # Calculate age of customers based on birth year
        data["Age"] = 2021 - data["Year_Birth"]
        # Calculate total spending across various items
        data["Spent"] = data[["MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds"]].sum(axis=1)
        # Determine living situation based on marital status
        living_situation_map = {
            "Married": "Partner",
            "Together": "Partner",
            "Absurd": "Alone",
            "Widow": "Alone",
            "YOLO": "Alone",
            "Divorced": "Alone",
            "Single": "Alone"
        }
        data["Living_With"] = data["Marital_Status"].replace(living_situation_map)
        # Calculate total number of children in the household
        data["Children"] = data["Kidhome"] + data["Teenhome"]
        # Calculate family size based on living situation and children count
        data["Family_Size"] = data["Living_With"].replace({"Alone": 1, "Partner": 2}) + data["Children"]
        # Create a binary indicator for parenthood
        data["Is_Parent"] = (data["Children"] > 0).astype(int)
        # Segment education levels into three groups
        education_map = {
            "Basic": "Undergraduate",
            "2n Cycle": "Undergraduate",
            "Graduation": "Graduate",
            "Master": "Postgraduate",
            "PhD": "Postgraduate"
        }
        data["Education"] = data["Education"].replace(education_map)
        # Rename spending columns for clarity
        data = data.rename(columns={
            "MntWines": "Wines",
            "MntFruits": "Fruits",
            "MntMeatProducts": "Meat",
            "MntFishProducts": "Fish",
            "MntSweetProducts": "Sweets",
            "MntGoldProds": "Gold"
        })
        # Drop redundant features
        features_to_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"]
        data = data.drop(features_to_drop, axis=1)
        return data
Explanation:
Age Calculation: Computes each customer's age from their birth year, using 2021 as a hard-coded reference year.
Total Spending: Aggregates spending across different product categories.
Living Situation: Maps marital status to living arrangements (Living_With).
Family Size: Calculates family size based on living situation and children count.
Parenthood Indicator: Creates a binary indicator (Is_Parent) based on the presence of children.
Education Segmentation: Segments education levels into categories for better analysis.
Column Renaming and Dropping: Renames columns for clarity and drops redundant features.
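To see what these derived features look like for one customer, here is a minimal plain-Python sketch that applies the same rules to a single record. The record values and the derive_features helper are hypothetical and not part of the repository. As a simplifying assumption, any marital status other than "Married" or "Together" is treated as "Alone".

```python
# Hypothetical customer record that reuses the dataset's column names
customer = {
    "Year_Birth": 1980,
    "Marital_Status": "Together",
    "Kidhome": 1,
    "Teenhome": 1,
    "MntWines": 200,
    "MntFruits": 20,
    "MntMeatProducts": 100,
    "MntFishProducts": 30,
    "MntSweetProducts": 10,
    "MntGoldProds": 40,
}

SPEND_COLUMNS = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
]
PARTNER_STATUSES = {"Married", "Together"}


def derive_features(record, reference_year=2021):
    """Derive the engineered features for one record (illustrative helper)."""
    # Assumption: every status outside PARTNER_STATUSES counts as living alone
    living_with = "Partner" if record["Marital_Status"] in PARTNER_STATUSES else "Alone"
    children = record["Kidhome"] + record["Teenhome"]
    adults = 2 if living_with == "Partner" else 1
    return {
        "Age": reference_year - record["Year_Birth"],
        "Spent": sum(record[col] for col in SPEND_COLUMNS),
        "Living_With": living_with,
        "Children": children,
        "Family_Size": adults + children,
        "Is_Parent": int(children > 0),
    }


features = derive_features(customer)
print(features)
```

The pandas version in FeatureEngineering does the same arithmetic column-wise across the whole DataFrame.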
Dimensionality Reduction using PCA
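The PreprocessingSteps class in preprocessing.py runs the full preprocessing pipeline: it calls FeatureEngineering, removes outliers, encodes and scales the features, and finally applies PCA for dimensionality reduction: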
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler
class PreprocessingSteps:
    def __init__(self):
        # Assumed initialization: the original snippet uses self.fe without showing where it is set
        self.fe = FeatureEngineering()
    def run(self, data):
        # Convert 'Dt_Customer' to datetime format and calculate days as a customer
        data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%d-%m-%Y")
        d1 = data['Dt_Customer'].max()
        data['Customer_For'] = (d1 - data['Dt_Customer']).dt.days
        data["Customer_For"] = pd.to_numeric(data["Customer_For"], errors="coerce")
        # Perform feature engineering
        data = self.fe.run(data)
        # Filter out outliers based on age and income (copy to avoid modifying a slice)
        data = data[(data["Age"] < 90) & (data["Income"] < 600000)].copy()
        # Encode categorical variables as integers
        object_cols = data.select_dtypes(include=['object']).columns.tolist()
        LE = LabelEncoder()
        data[object_cols] = data[object_cols].apply(lambda col: LE.fit_transform(col))
        # Drop campaign-response columns that are not used for clustering
        cols_to_drop = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response']
        data = data.drop(columns=cols_to_drop)
        # Scale features to zero mean and unit variance
        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(data)
        scaled_data = pd.DataFrame(scaled_data, columns=data.columns)
        # Perform PCA and keep the first three principal components
        pca = PCA(n_components=3).fit(scaled_data)
        PCA_data = pd.DataFrame(pca.transform(scaled_data), columns=["col1", "col2", "col3"])
        return PCA_data
Explanation:
Datetime Conversion: Converts the Dt_Customer column to datetime format and calculates the number of days each customer has been with the company (Customer_For).
Feature Engineering: Invokes the FeatureEngineering class to perform detailed feature transformations and data cleaning.
Outlier Removal: Keeps only customers younger than 90 with an income below 600,000, which removes a handful of extreme records.
Categorical Encoding: Encodes categorical variables using LabelEncoder for numerical modeling.
Feature Scaling: Standardizes numerical features using StandardScaler.
PCA Transformation: Applies PCA to reduce the dimensionality of the data to three principal components (col1, col2, col3).
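To build some intuition for why scaling comes before PCA, here is a small plain-Python sketch for the two-feature case. It is an illustration rather than the project's code, and the function names are hypothetical. It relies on one known fact: for two standardized features with correlation r, the covariance matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 - r, so the first principal component explains (1 + |r|) / 2 of the variance.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a column using the population standard deviation, as StandardScaler does."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def first_pc_variance_ratio(xs, ys):
    """Share of total variance captured by the first principal component of two features."""
    zx, zy = standardize(xs), standardize(ys)
    # Correlation of the two columns equals the mean product of their z-scores
    r = mean([a * b for a, b in zip(zx, zy)])
    # Eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r; the larger one is 1 + |r|
    return (1 + abs(r)) / 2


# Hypothetical income and spending values for four customers
income = [20000, 40000, 60000, 80000]
spent = [100, 400, 500, 1200]
print(first_pc_variance_ratio(income, spent))
```

When the two features are strongly correlated, one component carries almost all of the information; when they are uncorrelated, each component carries half. The project applies the same idea across all features and keeps three components.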
2. Model Training and Visualization (model_training.py)
The ClusteringModel class in model_training.py trains clustering models and visualizes the results:
KMeans Model Training and Elbow Method
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
import matplotlib.pyplot as plt
class ClusteringModel:
    def __init__(self):
        self.PCA_Data = None
        self.n_clusters = None
    def fit(self, PCA_data):
        self.PCA_Data = PCA_data
        # Initialize KMeans and the KElbowVisualizer
        visualizer = KElbowVisualizer(KMeans(), k=(1, 10))
        # Fit the visualizer to PCA data
        visualizer.fit(PCA_data)
        # Visualize the elbow plot
        visualizer.show()
        # elbow_value_ can be None when no clear elbow is detected, so check it before use
        self.n_clusters = visualizer.elbow_value_
        return self.n_clusters
Explanation:
Elbow Method: Utilizes the KElbowVisualizer from Yellowbrick to determine the optimal number of clusters (n_clusters) for KMeans. This method evaluates the sum of squared distances from each point to its assigned cluster centroid across different values of k.
Visualization: Displays the elbow plot to help select the appropriate number of clusters based on the point where the distortion begins to stabilize.
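The quantity behind the elbow plot can be sketched without any libraries. The example below is a simplified illustration, not what Yellowbrick runs internally: it applies a basic version of Lloyd's algorithm to hypothetical one-dimensional data with a deterministic starting point, then reports the distortion (sum of squared distances to the nearest centroid) for several values of k.

```python
def kmeans_distortion_1d(points, k, iterations=20):
    """Run a basic k-means on 1-D data and return the resulting distortion."""
    ordered = sorted(points)
    # Deterministic start: k evenly spaced picks from the sorted data (an assumption for reproducibility)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its points (keep it in place if the cluster is empty)
        centroids = [
            sum(members) / len(members) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


# Hypothetical data with three obvious groups around 1, 5 and 9
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
distortions = [kmeans_distortion_1d(points, k) for k in range(1, 5)]
for k, d in zip(range(1, 5), distortions):
    print(k, round(d, 2))
```

On this toy data the distortion falls steeply up to k = 3 and barely changes afterwards, which is the bend the elbow plot is meant to reveal.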
Agglomerative Clustering and 3D Visualization
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions
from matplotlib import colors
class ClusteringModel:
    # Continuation of the ClusteringModel class shown above; these methods belong to the same class
    def predict(self, PCA_data):
        # Initialize Agglomerative Clustering with the cluster count found by fit()
        AC = AgglomerativeClustering(n_clusters=self.n_clusters)
        # Fit on the three principal components only and store the labels in a new column
        PCA_data['Clusters'] = AC.fit_predict(PCA_data[["col1", "col2", "col3"]])
        # Visualize clusters in 3D plot
        self.plot_predictions_3D(PCA_data)
    def plot_predictions_3D(self, PCA_data):
        # Define the figure and axis for 3D plot
        fig = plt.figure(figsize=(10, 8))
        ax = fig.add_subplot(111, projection='3d')
        # Define colormap for clusters
        cmap = colors.ListedColormap(["#0077BB", "#33BEEB", "#EE7733", "#CC3311", "#EE3377", "#BBBBBB"])
        # Scatter plot with clusters colored by 'Clusters' column
        x = PCA_data["col1"]
        y = PCA_data["col2"]
        z = PCA_data["col3"]
        scatter = ax.scatter(x, y, z, s=30, c=PCA_data["Clusters"], marker='x', cmap=cmap)
        ax.set_title("Plot of Clusters")
        ax.set_xlabel('X Label')
        ax.set_ylabel('Y Label')
        ax.set_zlabel('Z Label')
        # Add colorbar with one tick per cluster label
        cbar = plt.colorbar(scatter, ticks=list(range(self.n_clusters)))
        cbar.set_label('Cluster')
        plt.show()
Explanation:
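Agglomerative Clustering: Implements hierarchical clustering with AgglomerativeClustering, which starts with every customer as its own cluster and repeatedly merges the closest pair until n_clusters remain. Since no linkage is specified, scikit-learn's default Ward linkage is used.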
3D Visualization: Plots the clustered data points in a 3D scatter plot, where each point's position is defined by its three principal components (col1, col2, col3). Each cluster is distinguished by a unique color, providing a visual representation of the clustering results.
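The merging procedure itself can be sketched in a few lines of plain Python. This is a naive illustration of Ward-style agglomerative clustering on hypothetical 2-D points, not scikit-learn's implementation, which is far more efficient. The function name and data are assumptions made for the example.

```python
def ward_agglomerative(points, n_clusters):
    """Naive agglomerative clustering with Ward's merge criterion; returns one label per point."""
    # Start with every point in its own cluster (clusters hold point indices)
    clusters = [[i] for i in range(len(points))]
    dims = len(points[0])

    def centroid(cluster):
        return [sum(points[i][d] for i in cluster) / len(cluster) for d in range(dims)]

    def merge_cost(a, b):
        # Ward's criterion: increase in within-cluster squared error if a and b are merged
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > n_clusters:
        # Find the pair of clusters that is cheapest to merge
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected

    labels = [0] * len(points)
    for label, cluster in enumerate(clusters):
        for idx in cluster:
            labels[idx] = label
    return labels


# Hypothetical points forming two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ward_agglomerative(points, n_clusters=2)
print(labels)
```

Unlike KMeans, this bottom-up approach needs no initial centroids; the cluster count only decides when the merging stops.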
3. Main Execution (main.py)
The main.py script ties everything together to preprocess the data, train clustering models, and visualize the results:
import pandas as pd
from preprocessing import PreprocessingSteps
from model_training import ClusteringModel
# Load and preprocess data
data = pd.read_csv("dataset/marketing_campaign.csv", sep="\t").dropna()
preprocessor = PreprocessingSteps()
PCA_data = preprocessor.run(data)
# Initialize ClusteringModel and train models
model = ClusteringModel()
model.fit(PCA_data)
model.predict(PCA_data)
Explanation:
Data Loading: Loads the marketing campaign dataset (marketing_campaign.csv) and removes any rows with missing values.
Preprocessing: Invokes the PreprocessingSteps class to preprocess the data, including feature engineering, outlier removal, categorical encoding, feature scaling, and PCA transformation.
Model Training: Creates an instance of ClusteringModel, uses the elbow method with KMeans to choose the number of clusters, then assigns customers to clusters with Agglomerative Clustering and visualizes the result in a 3D plot.
Conclusion
Customer segmentation is a powerful technique that empowers businesses to understand their customer base more deeply, leading to improved marketing strategies and operational efficiencies. By leveraging PCA for dimensionality reduction and clustering algorithms such as KMeans and Agglomerative Clustering, businesses can uncover actionable insights from their customer data.
This project demonstrates the step-by-step process of implementing customer segmentation using Python and popular machine learning libraries. For further exploration and adaptation to specific business use cases, refer to the project repository on GitHub.
Happy coding!