Building a Random Forest Classifier from Scratch in Python
- Sairam Penjarla
- Jul 17, 2024
- 7 min read
Introduction
In this blog post, we will walk through the process of implementing a Random Forest classifier from scratch in Python. A Random Forest is a powerful machine learning algorithm that combines multiple decision trees to improve classification accuracy and reduce overfitting. By the end of this post, you will have a deep understanding of how each component of the Random Forest works and how they come together to make accurate predictions.
To follow along, you can clone the GitHub repository and try the code yourself. This hands-on approach will help solidify your understanding of the concepts discussed. Let's get started!
Cloning the Repository
To get started, clone the repository using the following commands:
# Clone the repository
git clone https://github.com/sairam-penjarla/Random-Forest-from-scratch.git
# Navigate to the project directory
cd Random-Forest-from-scratch
Understanding the Code Structure
The project is organized into four main files:
tree.py
decision_tree.py
random_forest.py
inference.py
Each file has a specific role in building the Random Forest classifier. Let's dive into each of these files and understand the functions and classes they contain.
tree.py
This file contains the Node class and the resample_data function, which are essential for creating decision trees and generating bootstrap samples.
resample_data
The resample_data function creates a bootstrap sample from the dataset. This means it randomly samples data points with replacement to create a new dataset of the same size as the original.
import numpy as np

def resample_data(features, labels):
    """Generate a bootstrap sample from the dataset."""
    num_samples = features.shape[0]
    sample_indices = np.random.choice(num_samples, size=num_samples, replace=True)
    return features[sample_indices], labels[sample_indices]
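As a quick illustration (toy data, hypothetical values), a bootstrap sample has the same size as the original dataset but usually repeats some rows and omits others:

import numpy as np
from tree import resample_data

features = np.arange(10).reshape(5, 2)   # 5 toy samples, 2 features
labels = np.array([0, 0, 1, 1, 1])

boot_features, boot_labels = resample_data(features, labels)
print(boot_features.shape)  # (5, 2) -- same size as the original
# On average only about 63% of the original rows appear in a given bootstrap
# sample; the remaining slots are filled by duplicates of rows drawn more than once.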
Node
The Node class represents a single node in a decision tree. It stores information about the depth of the node, the feature and threshold used for splitting, and pointers to the left and right child nodes.
class Node:
    def __init__(self, current_depth=0, depth_limit=None):
        """Initialize a tree node."""
        self.current_depth = current_depth
        self.depth_limit = depth_limit
        self.left_child = None
        self.right_child = None
        self.split_feature = None
        self.split_value = None
        self.leaf_value = None  # Value for leaf nodes
decision_tree.py
The decision_tree.py file contains the CustomDecisionTree class, which encapsulates the logic for building and using a single decision tree. This class includes methods for fitting the tree to the data, recursively growing the tree, finding the best splits, calculating Gini impurity, and making predictions.
CustomDecisionTree
The CustomDecisionTree class is initialized with an optional depth limit, which determines the maximum depth the tree can grow to.
import numpy as np
from tree import Node, resample_data

class CustomDecisionTree:
    def __init__(self, depth_limit=None):
        """Initialize the CustomDecisionTree with an optional depth limit."""
        self.depth_limit = depth_limit
fit
The fit method trains the decision tree using the provided features and labels. It calculates the number of unique classes and features in the dataset and then starts building the tree.
def fit(self, features, labels):
    """Fit the CustomDecisionTree to the provided features and labels."""
    self.num_classes = len(np.unique(labels))
    self.num_features = features.shape[1]
    self.root = self._build_tree(features, labels)
_build_tree
The _build_tree method is a recursive function that constructs the decision tree. It determines the majority class at each node and checks whether further splitting is allowed based on the current depth and the depth limit. At each split it considers only a random subset of features (of size √num_features), which is what distinguishes a Random Forest tree from a plain decision tree.
def _build_tree(self, features, labels, current_depth=0):
    """Recursively build the tree."""
    class_counts = [np.sum(labels == c) for c in range(self.num_classes)]
    majority_class = np.argmax(class_counts)
    node = Node(current_depth, self.depth_limit)
    # Check if we should split further (a depth_limit of None means unlimited depth)
    can_split = self.depth_limit is None or current_depth < self.depth_limit
    if can_split and len(features) > 1:
        selected_features = np.random.choice(self.num_features, size=int(np.sqrt(self.num_features)), replace=False)
        best_feature, best_split = self._find_optimal_split(features, labels, selected_features)
        if best_feature is not None:
            node.split_feature = best_feature
            node.split_value = best_split
            left_mask = features[:, best_feature] < best_split
            right_mask = ~left_mask
            features_left, labels_left = features[left_mask], labels[left_mask]
            features_right, labels_right = features[right_mask], labels[right_mask]
            if len(labels_left) > 0 and len(labels_right) > 0:
                node.left_child = self._build_tree(features_left, labels_left, current_depth + 1)
                node.right_child = self._build_tree(features_right, labels_right, current_depth + 1)
            else:
                node.leaf_value = majority_class
        else:
            node.leaf_value = majority_class
    else:
        node.leaf_value = majority_class
    return node
_find_optimal_split
The _find_optimal_split method identifies the best feature and threshold for splitting the data so as to minimize the Gini impurity.
def _find_optimal_split(self, features, labels, feature_indices):
    """Find the best feature and threshold for splitting the data."""
    lowest_gini = float('inf')
    optimal_feature, optimal_threshold = None, None
    if len(feature_indices) == 0:
        return optimal_feature, optimal_threshold
    for feature in feature_indices:
        sorted_thresholds = np.unique(features[:, feature])
        for threshold in sorted_thresholds:
            left_mask = features[:, feature] < threshold
            right_mask = features[:, feature] >= threshold
            if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
                continue
            gini_score = self._compute_gini(labels[left_mask], labels[right_mask])
            if gini_score < lowest_gini:
                lowest_gini = gini_score
                optimal_feature = feature
                optimal_threshold = threshold
    return optimal_feature, optimal_threshold
_compute_gini
The _compute_gini method calculates the weighted Gini impurity of a split, a measure of how mixed the class labels are on each side of the split (0 means both sides are pure).
def _compute_gini(self, left_labels, right_labels):
    """Calculate the Gini impurity of a split."""
    n_left, n_right = len(left_labels), len(right_labels)
    if n_left == 0 or n_right == 0:
        return 0
    total_samples = n_left + n_right
    p_left = n_left / total_samples
    p_right = n_right / total_samples
    gini_left = 1 - sum((np.sum(left_labels == c) / n_left) ** 2 for c in np.unique(left_labels))
    gini_right = 1 - sum((np.sum(right_labels == c) / n_right) ** 2 for c in np.unique(right_labels))
    return p_left * gini_left + p_right * gini_right
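To make the formula concrete, here is a small worked example on made-up labels, assuming the class above is saved in decision_tree.py:

import numpy as np
from decision_tree import CustomDecisionTree

left_labels = np.array([0, 0, 1])   # left branch: two class-0, one class-1
right_labels = np.array([1, 1])     # right branch: pure class-1

# Gini(left)  = 1 - ((2/3)^2 + (1/3)^2) = 4/9 ≈ 0.444
# Gini(right) = 1 - 1^2 = 0.0 (a pure node)
# Weighted    = (3/5) * 0.444 + (2/5) * 0.0 ≈ 0.267
tree = CustomDecisionTree()
print(tree._compute_gini(left_labels, right_labels))  # ≈ 0.2667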
predict
The predict method takes a set of data points and returns their predicted class labels by traversing the tree.
def predict(self, data):
    """Predict the class labels for the given data."""
    return np.array([self._predict_single(sample, self.root) for sample in data])
_predict_single
The _predict_single method is a recursive function that traverses the tree for a single data point to determine its predicted class label.
def _predict_single(self, sample, node):
    """Recursively predict the class label for a single sample."""
    if node.leaf_value is not None:
        return node.leaf_value
    if sample[node.split_feature] < node.split_value:
        return self._predict_single(sample, node.left_child)
    else:
        return self._predict_single(sample, node.right_child)
By breaking down each function, we can see how the decision tree is constructed, how it finds the best splits, and how it makes predictions. This detailed explanation provides a comprehensive understanding of the CustomDecisionTree class and its methods.
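Before moving on to the forest, here is a minimal sketch (on a made-up one-dimensional dataset) of training and querying a single CustomDecisionTree, assuming the class above is saved in decision_tree.py:

import numpy as np
from decision_tree import CustomDecisionTree

# Toy dataset: class 0 for small values, class 1 for large values
features = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

tree = CustomDecisionTree(depth_limit=3)
tree.fit(features, labels)
print(tree.predict(features))  # expected: [0 0 0 1 1 1]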
random_forest.py
The random_forest.py file contains the RandomForestClassifier class, which implements a Random Forest ensemble of decision trees. This class includes methods for fitting multiple trees to the data, making predictions using the ensemble, and generating bootstrap samples.
RandomForestClassifier
The RandomForestClassifier class is initialized with parameters for the number of trees (num_trees) and an optional depth limit (depth_limit) for individual decision trees.
import numpy as np
from decision_tree import CustomDecisionTree

class RandomForestClassifier:
    def __init__(self, num_trees=100, depth_limit=None):
        """Initialize the RandomForestClassifier with the number of trees and an optional depth limit."""
        self.num_trees = num_trees
        self.depth_limit = depth_limit
        self.trees = []
fit
The fit method trains the random forest by creating multiple decision trees (CustomDecisionTree instances) and fitting them to bootstrap samples of the data.
def fit(self, features, labels):
    """Fit the RandomForestClassifier to the provided features and labels."""
    self.trees = []
    for _ in range(self.num_trees):
        tree = CustomDecisionTree(depth_limit=self.depth_limit)
        sample_features, sample_labels = self._create_bootstrap_sample(features, labels)
        tree.fit(sample_features, sample_labels)
        self.trees.append(tree)
predict
The predict method combines the predictions from all trees in the forest into a single output. For each data point it averages the class labels predicted by the individual trees; this mean is then rounded to the nearest class label at inference time (see inference.py).
def predict(self, features):
    """Predict the class labels for the given features."""
    predictions = np.zeros((features.shape[0], self.num_trees))
    for idx, tree in enumerate(self.trees):
        predictions[:, idx] = tree.predict(features)
    return np.mean(predictions, axis=1)
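To make the aggregation concrete, here is a small sketch with hypothetical per-tree votes for two samples:

import numpy as np

# Hypothetical predictions from three trees for two samples (shape: samples x trees)
tree_votes = np.array([
    [1, 1, 0],   # sample 1: two trees vote class 1, one votes class 0
    [0, 1, 1],   # sample 2: two trees vote class 1, one votes class 0
])

mean_votes = np.mean(tree_votes, axis=1)
print(mean_votes)          # ≈ [0.667 0.667]
print(mean_votes.round())  # [1. 1.] -> final class labels after rounding

For more than two classes a per-sample majority vote is the more common way to aggregate, since averaging label values treats the classes as ordered; the averaging-plus-rounding approach used here keeps the code simple.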
_create_bootstrap_sample
The _create_bootstrap_sample method generates a bootstrap sample from the dataset. It randomly samples data points with replacement to create a new dataset for each tree.
def _create_bootstrap_sample(self, features, labels):
    """Generate a bootstrap sample from the dataset."""
    num_samples = features.shape[0]
    sample_indices = np.random.choice(num_samples, size=num_samples, replace=True)
    return features[sample_indices], labels[sample_indices]
By understanding each method in the RandomForestClassifier class, we gain insight into how a Random Forest is constructed, trained, and used for making predictions. This breakdown helps in grasping the mechanics behind ensemble learning and the role of individual decision trees within the ensemble.
inference.py
The inference.py file demonstrates the practical usage of the RandomForestClassifier for training on the Iris dataset and making predictions.
Example Usage with Iris Dataset
# Example usage with Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from random_forest import RandomForestClassifier
# Load the Iris dataset
iris_data = load_iris()
features, labels = iris_data.data, iris_data.target
Splitting the Dataset
# Split the dataset into training and testing sets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.2, random_state=42)
Initializing and Training the Random Forest Classifier
# Initialize and train the Random Forest classifier
random_forest = RandomForestClassifier(num_trees=100, depth_limit=5)
random_forest.fit(features_train, labels_train)
Making Predictions
# Make predictions on the test set
predicted_labels = random_forest.predict(features_test)
Evaluating Model Accuracy
# Evaluate the accuracy of the model
accuracy = accuracy_score(labels_test, predicted_labels.round())
print(f'Accuracy: {accuracy * 100:.2f}%')
This code snippet provides a complete workflow for using the RandomForestClassifier to train a model on the Iris dataset, make predictions, and evaluate its accuracy. It demonstrates the integration of machine learning libraries (scikit-learn) with custom implementations (RandomForestClassifier from random_forest.py) for ensemble learning.
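As an optional sanity check (scikit-learn is already a dependency for the dataset utilities and metrics above), you can compare the result with scikit-learn's built-in random forest on the same split:

# Optional: compare with scikit-learn's random forest on the same train/test split
from sklearn.ensemble import RandomForestClassifier as SklearnRandomForest

sklearn_forest = SklearnRandomForest(n_estimators=100, max_depth=5, random_state=42)
sklearn_forest.fit(features_train, labels_train)
sklearn_accuracy = accuracy_score(labels_test, sklearn_forest.predict(features_test))
print(f'scikit-learn accuracy: {sklearn_accuracy * 100:.2f}%')

The two accuracies should be in the same ballpark; exact numbers will vary with the random seeds used for bootstrapping and feature selection.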
Conclusion
In this blog post, we delved into the implementation and usage of a Random Forest classifier from scratch using Python. We explored how decision trees are integrated into an ensemble model to improve predictive performance and robustness. Here’s a brief summary of what we covered:
Understanding Random Forest: We introduced the concept of Random Forests, which are composed of multiple decision trees trained on bootstrap samples of the dataset with random feature subsets.
Implementation Details: We walked through the implementation of key components:
Decision Tree: Constructed using a custom implementation (CustomDecisionTree) capable of recursive splitting based on Gini impurity.
Random Forest: Defined as RandomForestClassifier, which trains multiple decision trees and aggregates their predictions to improve accuracy.
Practical Example: Demonstrated how to apply the Random Forest classifier to the classic Iris dataset. This included:
Loading the dataset and splitting it into training and testing sets.
Initializing and training the Random Forest model.
Making predictions on the test set and evaluating model accuracy using metrics like accuracy score.
Next Steps: Encouraged readers to explore the provided GitHub repository (Random-Forest-from-scratch) for detailed code implementations and further experimentation. Trying out different datasets or modifying parameters can deepen understanding and enhance practical skills in machine learning.
By understanding the workings of Random Forests and implementing them from scratch, you gain insight into ensemble methods and the nuances of decision tree-based models. We hope this blog post has equipped you with foundational knowledge and practical skills to apply Random Forests in your own projects. Feel free to clone the GitHub repository, explore the code firsthand, and experiment with different datasets and parameters to deepen your understanding. Happy coding!