Customer Segmentation Using Agglomerative Clustering

Sairam Penjarla
Jul 17, 2024

Updated: Jul 19, 2024

Customer segmentation is a core task in marketing and business analytics: it groups customers into distinct segments based on shared characteristics. In this project, we'll perform customer segmentation using Principal Component Analysis (PCA) for dimensionality reduction, KMeans with the elbow method to choose the number of clusters, and Agglomerative Clustering to assign customers to the final segments.

Introduction

Customer segmentation helps businesses understand their customers better, enabling personalized marketing strategies, improved customer retention, and more effective product recommendations. By clustering customers based on demographic and behavioral data, businesses can target specific customer groups with tailored marketing campaigns.

Project Setup

To follow along with this project, clone the GitHub repository using the following command:


git clone https://github.com/sairam-penjarla/Customer-segmentation.git
cd Customer-segmentation

Project Structure

The project is structured into several key files:

preprocessing.py: Contains classes for data preprocessing steps, including feature engineering and data scaling.
model_training.py: Contains classes for training clustering models (KMeans and Agglomerative Clustering) and visualization.
main.py: Main script to orchestrate the preprocessing, model training, and prediction steps.

Step-by-Step Explanation

1. Data Preprocessing (preprocessing.py)

Data preprocessing is crucial to ensure that the data is clean, normalized, and ready for modeling. Let's dive into the FeatureEngineering class within preprocessing.py.

Feature Engineering

The FeatureEngineering class performs several key tasks:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

class FeatureEngineering:
    def run(self, data):
        # Calculate age of customers based on birth year
        data["Age"] = 2021 - data["Year_Birth"]

        # Calculate total spending across various items
        data["Spent"] = data[["MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds"]].sum(axis=1)

        # Determine living situation based on marital status
        living_situation_map = {
            "Married": "Partner",
            "Together": "Partner",
            "Absurd": "Alone",
            "Widow": "Alone",
            "YOLO": "Alone",
            "Divorced": "Alone",
            "Single": "Alone"
        }
        data["Living_With"] = data["Marital_Status"].replace(living_situation_map)

        # Calculate total number of children in the household
        data["Children"] = data["Kidhome"] + data["Teenhome"]

        # Calculate family size based on living situation and children count
        data["Family_Size"] = data["Living_With"].map({"Alone": 1, "Partner": 2}) + data["Children"]

        # Create a binary indicator for parenthood
        data["Is_Parent"] = (data["Children"] > 0).astype(int)

        # Segment education levels into three groups
        education_map = {
            "Basic": "Undergraduate",
            "2n Cycle": "Undergraduate",
            "Graduation": "Graduate",
            "Master": "Postgraduate",
            "PhD": "Postgraduate"
        }
        data["Education"] = data["Education"].replace(education_map)

        # Rename spending columns for clarity
        data = data.rename(columns={
            "MntWines": "Wines",
            "MntFruits": "Fruits",
            "MntMeatProducts": "Meat",
            "MntFishProducts": "Fish",
            "MntSweetProducts": "Sweets",
            "MntGoldProds": "Gold"
        })

        # Drop redundant features
        features_to_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"]
        data = data.drop(features_to_drop, axis=1)

        return data

Explanation:

Age Calculation: Computes the age of customers based on their birth year.
Total Spending: Aggregates spending across different product categories.
Living Situation: Maps marital status to living arrangements (Living_With).
Family Size: Calculates family size based on living situation and children count.
Parenthood Indicator: Creates a binary indicator (Is_Parent) based on the presence of children.
Education Segmentation: Segments education levels into categories for better analysis.
Column Renaming and Dropping: Renames columns for clarity and drops redundant features.
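
To make the arithmetic behind these derived features concrete, here is a minimal, dependency-free sketch that applies the same rules to a single customer. The record values are made up, and derive_features is a hypothetical helper written for this illustration, not a function from the repository. As a simplification, it treats every marital status other than "Married" and "Together" as living alone.

```python
# Hypothetical single-customer record: field names follow the dataset columns
# used in FeatureEngineering.run, the values are invented for illustration.
customer = {
    "Year_Birth": 1980,
    "Marital_Status": "Together",
    "Kidhome": 1,
    "Teenhome": 1,
    "MntWines": 120,
    "MntFruits": 10,
    "MntMeatProducts": 60,
    "MntFishProducts": 15,
    "MntSweetProducts": 5,
    "MntGoldProds": 20,
}

SPEND_COLUMNS = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
]
PARTNER_STATUSES = {"Married", "Together"}


def derive_features(row, reference_year=2021):
    """Illustrative helper (not part of the project) mirroring the derived columns."""
    living_with = "Partner" if row["Marital_Status"] in PARTNER_STATUSES else "Alone"
    children = row["Kidhome"] + row["Teenhome"]
    adults = 2 if living_with == "Partner" else 1
    return {
        "Age": reference_year - row["Year_Birth"],
        "Spent": sum(row[col] for col in SPEND_COLUMNS),
        "Living_With": living_with,
        "Children": children,
        "Family_Size": adults + children,
        "Is_Parent": int(children > 0),
    }


features = derive_features(customer)
print(features)
```

Note that Age is computed against a fixed reference year (2021, as in the class above), so it does not change with the date the script is run.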

Dimensionality Reduction using PCA

After feature engineering, the PreprocessingSteps class in preprocessing.py applies PCA for dimensionality reduction:


import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

class PreprocessingSteps:
    def __init__(self):
        # Feature engineering helper used inside run()
        self.fe = FeatureEngineering()

    def run(self, data):
        # Convert 'Dt_Customer' to datetime format and calculate days as a customer
        data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%d-%m-%Y")
        d1 = data['Dt_Customer'].max()
        data['Customer_For'] = (d1 - data['Dt_Customer']).dt.days
        data["Customer_For"] = pd.to_numeric(data["Customer_For"], errors="coerce")

        # Perform feature engineering
        data = self.fe.run(data)

        # Filter out outliers based on age and income; copy so later edits don't touch a view
        data = data[(data["Age"] < 90) & (data["Income"] < 600000)].copy()

        # Encode categorical variables
        object_cols = data.select_dtypes(include=['object']).columns.tolist()
        LE = LabelEncoder()
        data[object_cols] = data[object_cols].apply(lambda col: LE.fit_transform(col))

        # Drop unnecessary columns
        cols_to_drop = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response']
        data.drop(cols_to_drop, axis=1, inplace=True)

        # Scale features
        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(data)
        scaled_data = pd.DataFrame(scaled_data, columns=data.columns)

        # Perform PCA
        pca = PCA(n_components=3).fit(scaled_data)
        PCA_data = pd.DataFrame(pca.transform(scaled_data), columns=["col1", "col2", "col3"])

        return PCA_data

Explanation:

Datetime Conversion: Converts the Dt_Customer column to datetime format and calculates Customer_For, the number of days between each customer's enrollment date and the most recent enrollment date in the dataset.
Feature Engineering: Invokes the FeatureEngineering class to perform detailed feature transformations and data cleaning.
Outlier Removal: Filters out outliers based on age and income thresholds.
Categorical Encoding: Encodes categorical variables using LabelEncoder for numerical modeling.
Feature Scaling: Standardizes numerical features using StandardScaler.
PCA Transformation: Applies PCA to reduce the dimensionality of the data to three principal components (col1, col2, col3).
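
For readers who want to see what the encoding and scaling steps compute, here is a small sketch in plain Python. It is an illustration under stated assumptions, not the project's code: label_encode and standardize are hypothetical helpers, and the sample values are invented. The sketch follows the conventions scikit-learn uses: LabelEncoder numbers the classes in sorted order, and StandardScaler divides by the population standard deviation.

```python
from statistics import fmean, pstdev


def label_encode(values):
    """Map each category to an integer, numbering classes in sorted order."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes


def standardize(values):
    """Rescale a numeric column to zero mean and unit variance."""
    mean = fmean(values)
    # Population standard deviation; fall back to 1.0 for a constant column
    # so we never divide by zero.
    std = pstdev(values) or 1.0
    return [(v - mean) / std for v in values]


# Invented sample values for illustration
education = ["Graduate", "Postgraduate", "Undergraduate", "Graduate"]
income = [20000.0, 40000.0, 60000.0, 80000.0]

codes, classes = label_encode(education)
scaled_income = standardize(income)

print(classes, codes)
print(scaled_income)
```

One thing to keep in mind: label encoding assigns integers in alphabetical order, so the numeric distance between two categories carries no real meaning. Distance-based clustering will still treat those codes as ordered numbers.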

2. Model Training and Visualization (model_training.py)

The ClusteringModel class in model_training.py trains clustering models and visualizes the results:

KMeans Model Training and Elbow Method


from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
import matplotlib.pyplot as plt

class ClusteringModel:
    def __init__(self):
        self.PCA_Data = None

    def fit(self, PCA_data):
        self.PCA_Data = PCA_data

        # Initialize KMeans and the KElbowVisualizer
        visualizer = KElbowVisualizer(KMeans(), k=(1, 10))

        # Fit the visualizer to PCA data
        visualizer.fit(PCA_data)

        # Visualize the elbow plot
        visualizer.show()
        self.n_clusters = visualizer.elbow_value_

        return self.n_clusters

Explanation:

Elbow Method: Utilizes the KElbowVisualizer from Yellowbrick to determine the optimal number of clusters (n_clusters) for KMeans. This method evaluates the sum of squared distances from each point to its assigned cluster centroid across different values of k.
Visualization: Displays the elbow plot to help select the appropriate number of clusters based on the point where the distortion begins to stabilize.
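
The distortion curve behind the elbow plot can be sketched without any libraries. The following is a simplified illustration, not Yellowbrick's or scikit-learn's implementation: kmeans_distortion is a hypothetical helper that runs a basic k-means loop with a deterministic farthest-point initialization (an assumption made here so the result is repeatable), and the toy points are invented.

```python
import math


def kmeans_distortion(points, k, iterations=20):
    """Run a basic k-means and return the sum of squared distances to the nearest centroid."""
    # Deterministic farthest-point initialization (illustrative choice)
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]

    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


# Three tight, well-separated groups of invented 2D points
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

distortions = {k: kmeans_distortion(points, k) for k in range(1, 6)}
for k, value in distortions.items():
    print(k, round(value, 2))
```

With three clearly separated groups like these, the distortion drops sharply up to k = 3 and changes very little afterwards. That flattening point is the "elbow" the plot is meant to reveal.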

Agglomerative Clustering and 3D Visualization


from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import colors

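# Continuation of the ClusteringModel class shown above: predict() relies on
# self.n_clusters set by fit(), so these methods belong in the same class.
class ClusteringModel: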
    def predict(self, PCA_data):
        # Initialize Agglomerative Clustering model
        AC = AgglomerativeClustering(n_clusters=self.n_clusters)

        # Fit the model on the three principal components and store the cluster labels
        PCA_data['Clusters'] = AC.fit_predict(PCA_data[["col1", "col2", "col3"]])

        # Visualize clusters in 3D plot
        self.plot_predictions_3D(PCA_data)

    def plot_predictions_3D(self, PCA_data):
        # Define the figure and axis for 3D plot
        fig = plt.figure(figsize=(10, 8))
        ax = fig.add_subplot(111, projection='3d')

        # Define colormap for clusters
        cmap = colors.ListedColormap(["#0077BB", "#33BEEB", "#EE7733", "#CC3311", "#EE3377", "#BBBBBB"])

        # Scatter plot with clusters colored by 'Clusters' column
        x = PCA_data["col1"]
        y = PCA_data["col2"]
        z = PCA_data["col3"]
        scatter = ax.scatter(x, y, z, s=30, c=PCA_data["Clusters"], marker='x', cmap=cmap)
        ax.set_title("Plot of Clusters")
        ax.set_xlabel('col1')
        ax.set_ylabel('col2')
        ax.set_zlabel('col3')

        # Add colorbar with one tick per cluster label found in the data
        cluster_ids = sorted(PCA_data["Clusters"].unique())
        cbar = plt.colorbar(scatter, ticks=cluster_ids)
        cbar.set_label('Cluster')

        plt.show()

Explanation:

Agglomerative Clustering: Implements hierarchical clustering with AgglomerativeClustering to group data points into clusters based on their proximity.
3D Visualization: Plots the clustered data points in a 3D scatter plot, where each point's position is defined by its three principal components (col1, col2, col3). Each cluster is distinguished by a unique color, providing a visual representation of the clustering results.
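
To show the bottom-up merging idea behind hierarchical clustering, here is a minimal sketch in plain Python. It is not how scikit-learn implements AgglomerativeClustering: the agglomerative function is a hypothetical helper, the points are invented, and it uses average linkage for simplicity, whereas scikit-learn's default is Ward linkage.

```python
import math


def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    # Start with every point in its own cluster (clusters hold point indices)
    clusters = [[i] for i in range(len(points))]

    def average_linkage(a, b):
        # Mean pairwise distance between the members of two clusters
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i (j > i, so deleting j keeps i valid)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    # Convert the cluster membership lists into one label per point
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Two small, well-separated groups of invented 2D points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = agglomerative(points, n_clusters=2)
print(labels)
```

This brute-force version recomputes every pairwise linkage on each merge, so it only suits tiny examples. For real data, the scikit-learn class used in predict is the practical choice.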

3. Main Execution (main.py)

The main.py script ties everything together to preprocess the data, train clustering models, and visualize the results:


import pandas as pd
from preprocessing import PreprocessingSteps
from model_training import ClusteringModel

# Load and preprocess data
data = pd.read_csv("dataset/marketing_campaign.csv", sep="\t").dropna()
preprocessor = PreprocessingSteps()
PCA_data = preprocessor.run(data)

# Initialize ClusteringModel and train models
model = ClusteringModel()
model.fit(PCA_data)
model.predict(PCA_data)

Explanation:

Data Loading: Loads the marketing campaign dataset (marketing_campaign.csv) and removes any rows with missing values.
Preprocessing: Invokes the PreprocessingSteps class to preprocess the data, including feature engineering, outlier removal, categorical encoding, feature scaling, and PCA transformation.
Model Training: Creates an instance of ClusteringModel, calls fit to run the elbow method with KMeans and pick the number of clusters, then calls predict to assign customers to clusters with Agglomerative Clustering and display the result in a 3D plot.

Conclusion

Customer segmentation is a powerful technique that empowers businesses to understand their customer base more deeply, leading to improved marketing strategies and operational efficiencies. By leveraging PCA for dimensionality reduction and clustering algorithms such as KMeans and Agglomerative Clustering, businesses can uncover actionable insights from their customer data.

This project demonstrates the step-by-step process of implementing customer segmentation using Python and popular machine learning libraries. For further exploration and adaptation to specific business use cases, refer to the project repository on GitHub.

Happy coding!