Building a Movie Recommendation System from Scratch

Sairam Penjarla
Jul 17, 2024

Updated: Jul 19, 2024

Introduction

In the age of streaming services, recommendation systems have become a crucial feature to keep users engaged by suggesting content they are likely to enjoy. In this blog post, we will walk through the process of building a movie recommendation system from scratch using Python and machine learning libraries. We will cover data preprocessing, transformation, model training, and generating recommendations based on cosine similarity.

Before we dive in, I encourage you to clone the GitHub repository and try it out yourself to better understand the concepts discussed in this blog. Let's get started!

GitHub Link

You can find the complete code in the GitHub repository.

Data Loading and Preprocessing

DataLoader Class

In this section, we introduce the DataLoader class, which loads the movie dataset from a CSV file and performs the initial preprocessing. It renames columns for readability, fills gaps in the country, date-added, and rating columns, drops rows that lack a cast or director, and derives a few helper columns (category, year added, month added) for later steps.


import pandas as pd
import logging
import yaml

class DataLoader:
    def __init__(self, filepath, columns):
        """Initialize the DataLoader with the file path and columns to rename."""
        self.filepath = filepath
        self.columns = columns
        self.df = None

    def preprocess_data(self):
        """Load and preprocess the data."""
        logging.info("Loading data from file: %s", self.filepath)
        self.df = pd.read_csv(self.filepath)
        logging.info("Preprocessing data")

        # Rename columns
        self.df = self.df.rename(columns={
            "listed_in": self.columns['genre'],
            "director": self.columns['director'],
            "cast": self.columns['cast'],
            "description": self.columns['description'],
            "title": self.columns['title'],
            "date_added": self.columns['date_added'],
            "country": self.columns['country']
        })

        # Fill missing values
        self.df[self.columns['country']] = self.df[self.columns['country']].fillna(self.df[self.columns['country']].mode()[0])
        self.df[self.columns['date_added']] = self.df[self.columns['date_added']].fillna(self.df[self.columns['date_added']].mode()[0])
        # Fill missing ratings with the most frequent rating (not the most frequent country)
        self.df[self.columns['rating']] = self.df[self.columns['rating']].fillna(self.df[self.columns['rating']].mode()[0])
        self.df = self.df.dropna(how='any', subset=[self.columns['cast'], self.columns['director']])

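        # Further processing: keep the first listed genre and country, and split the
        # date (assumed to look like "September 9, 2019") into year and month.
        # strip() guards against leading whitespace, which would otherwise yield an empty month.
        self.df['category'] = self.df[self.columns['genre']].apply(lambda x: x.split(",")[0])
        self.df['YearAdded'] = self.df[self.columns['date_added']].apply(lambda x: x.strip().split(" ")[-1])
        self.df['MonthAdded'] = self.df[self.columns['date_added']].apply(lambda x: x.strip().split(" ")[0])
        self.df['country'] = self.df[self.columns['country']].apply(lambda x: x.split(",")[0])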

        return self.df
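
To make the missing-value handling concrete, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not taken from the real dataset; the sketch only shows mode-based filling and the date splitting, assuming dates look like "September 9, 2019".

```python
import pandas as pd

# Toy data standing in for the real CSV (illustrative values only).
toy = pd.DataFrame({
    "country": ["India", None, "India", "United States, Canada"],
    "listed_in": ["Comedies, Dramas", "Dramas", "Action & Adventure", "Comedies"],
    "date_added": ["September 9, 2019", " August 4, 2017", "September 9, 2019", None],
})

# Fill each column's gaps with that column's most frequent value (its mode).
for col in ["country", "date_added"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Keep the first listed genre, and split the date into year and month.
toy["category"] = toy["listed_in"].str.split(",").str[0].str.strip()
date_parts = toy["date_added"].str.strip().str.split()
toy["YearAdded"] = date_parts.str[-1]
toy["MonthAdded"] = date_parts.str[0]

print(toy[["country", "category", "YearAdded", "MonthAdded"]])
```

Filling with the mode keeps every row but biases the column toward its most common value, which is worth keeping in mind if these columns are later used as features.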

Data Transformation

DataTransformer Class

Next, we discuss the DataTransformer class. This class handles the transformation of the dataset by cleaning and combining relevant features into a single 'soup' of text for each movie. This transformed data will be used to train our machine learning model.


class DataTransformer:
    def __init__(self, df):
        """Initialize the DataTransformer with the DataFrame."""
        self.df = df
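        # These names assume the config renames the raw columns to
        # 'director_name', 'cast_members', 'summary' and 'movie_title'
        self.features = ['category', 'director_name', 'cast_members', 'summary', 'movie_title']
        # Work on a copy so later column assignments do not touch df
        # or trigger pandas' SettingWithCopyWarning
        self.filters = self.df[self.features].copy()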

    @staticmethod
    def clean_text(text):
        """Clean the text by converting to lowercase and removing spaces."""
        return text.replace(" ", "").lower()

    def apply_transformations(self):
        """Apply transformations to the data."""
        logging.info("Applying data transformations")
        for feature in self.features:
            self.filters[feature] = self.filters[feature].apply(self.clean_text)

        self.filters['Soup'] = self.filters.apply(self.create_soup, axis=1)
        return self.filters

    @staticmethod
    def create_soup(row):
        """Create a combined 'soup' of features for each movie."""
        return f"{row['director_name']} {row['cast_members']} {row['category']} {row['summary']}"
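
As a quick illustration of what the cleaning and soup steps produce, the sketch below applies the same two operations to a single hand-written row. The names and summary are invented placeholders, not rows from the dataset.

```python
def clean_text(text: str) -> str:
    """Lowercase the text and remove spaces, as DataTransformer.clean_text does."""
    return text.replace(" ", "").lower()

# Hypothetical row; every value here is a placeholder.
row = {
    "director_name": "Jane Doe",
    "cast_members": "John Smith, Ana Lopez",
    "category": "Comedies",
    "summary": "An alien lands on Earth",
}

cleaned = {key: clean_text(value) for key, value in row.items()}
soup = (
    f"{cleaned['director_name']} {cleaned['cast_members']} "
    f"{cleaned['category']} {cleaned['summary']}"
)
print(soup)  # janedoe johnsmith,analopez comedies analienlandsonearth
```

Removing spaces turns each person's name into a single token, so "Jane Doe" and "Jane Smith" no longer share the token "jane". The same step also collapses the whole summary into one long token, so in this setup the summary only contributes to similarity when two descriptions are identical.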

Model Training

ModelTrainer Class

The ModelTrainer class uses scikit-learn's CountVectorizer to convert each movie's soup into a row of token counts, dropping English stop words along the way. It then computes the pairwise cosine similarity matrix, which will be used to find similar movies. No predictive model is fitted here: "training" means learning the vocabulary and building the similarity matrix.


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ModelTrainer:
    def __init__(self, filters):
        """Initialize the ModelTrainer with the filtered DataFrame."""
        self.filters = filters
        self.count_vectorizer = CountVectorizer(stop_words='english')

    def train_model(self):
        """Train the model and compute the cosine similarity matrix."""
        logging.info("Training model")
        self.count_matrix = self.count_vectorizer.fit_transform(self.filters['Soup'])
        self.cosine_sim_matrix = cosine_similarity(self.count_matrix, self.count_matrix)
        return self.cosine_sim_matrix
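
To show what the similarity matrix contains, here is a minimal pure-Python sketch of cosine similarity over token counts. It is a simplification, not the scikit-learn implementation: it tokenizes on whitespace, skips stop-word removal, and uses three invented soups.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented soups: movies 0 and 1 share a director and a genre,
# movies 1 and 2 share one cast member, movies 0 and 2 share nothing.
soups = [
    "janedoe johnsmith comedies",
    "janedoe analopez comedies",
    "samlee analopez dramas",
]
counts = [Counter(soup.split()) for soup in soups]
sim = [[cosine(a, b) for b in counts] for a in counts]

for row in sim:
    print([round(value, 2) for value in row])
```

Each movie has three tokens, so two shared tokens give a similarity of 2/3, one shared token gives 1/3, and no overlap gives 0. The diagonal is 1 because every movie is identical to itself.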

Generating Recommendations

Recommender Class

In this section, we present the Recommender class. Given a movie title, it normalizes the title the same way the features were cleaned (lowercase, no spaces), looks up that movie's row in the cosine similarity matrix, sorts the other movies by similarity, and returns the titles of the ten most similar ones.


class Recommender:
    def __init__(self, df, filters, cosine_sim_matrix):
        """Initialize the Recommender with the DataFrame, filters, and cosine similarity matrix."""
        self.df = df
        self.cosine_sim_matrix = cosine_sim_matrix
        filters = filters.reset_index()
        self.indices = pd.Series(filters.index, index=filters['movie_title'])

    def get_recommendations(self, title):
        """Get movie recommendations based on the given title."""
        logging.info("Getting recommendations for title: %s", title)
        title = title.replace(' ', '').lower()
        idx = self.indices[title]
        sim_scores = list(enumerate(self.cosine_sim_matrix[idx]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]
        movie_indices = [i[0] for i in sim_scores]
        return self.df['movie_title'].iloc[movie_indices]
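
The ranking step can be sketched on its own as follows. This is a hypothetical helper, not part of the repository: `top_k_similar` and the sample scores are invented for illustration. It excludes the query movie by index instead of dropping the first sorted entry, which matters when another movie has an identical soup and therefore also scores 1.0.

```python
def top_k_similar(sim_row, query_idx, k=10):
    """Return the indices of the k most similar items, excluding the query itself."""
    scored = [(i, score) for i, score in enumerate(sim_row) if i != query_idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]

# Invented similarity row for movie 0; movie 3 has an identical soup.
sim_row = [1.0, 0.67, 0.0, 1.0, 0.33]
print(top_k_similar(sim_row, query_idx=0, k=3))  # [3, 1, 4]
```

With the slice-based approach, a tie at 1.0 can leave the query movie in the results if the tied movie appears earlier in the matrix, because Python's sort keeps tied entries in their original order.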

Putting It All Together

This section shows how to integrate all the classes and methods discussed so far. We demonstrate how to load the data, preprocess it, transform it, train the model, and finally generate movie recommendations using the Recommender class.


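# Load the configuration. "config.yaml" is an assumed file name;
# adjust it to match the file used in the repository.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

data_loader = DataLoader(config['data']['filepath'], config['data']['columns'])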
df = data_loader.preprocess_data()

data_transformer = DataTransformer(df)
filters = data_transformer.apply_transformations()

model_trainer = ModelTrainer(filters)
cosine_sim_matrix = model_trainer.train_model()

recommender = Recommender(df, filters, cosine_sim_matrix)
print(recommender.get_recommendations('PK'))
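
The snippet above reads its settings from a `config` object whose contents are not shown in this post. As a rough sketch, it could look like the dictionary below. The keys come from the code in the earlier sections; the file path and the values for genre, date_added, country, and rating are assumptions, while the other four values are inferred from the feature names used by DataTransformer.

```python
# Hypothetical configuration, written as the Python dict a YAML file would load into.
config = {
    "data": {
        "filepath": "path/to/movies.csv",  # placeholder path
        "columns": {
            "genre": "genres",              # assumed name
            "director": "director_name",    # inferred from DataTransformer
            "cast": "cast_members",         # inferred from DataTransformer
            "description": "summary",       # inferred from DataTransformer
            "title": "movie_title",         # inferred from DataTransformer
            "date_added": "date_added",     # assumed name
            "country": "country",           # assumed name
            "rating": "rating",             # assumed name
        },
    }
}

# Check up front that every key DataLoader reads is present.
required = {
    "genre", "director", "cast", "description",
    "title", "date_added", "country", "rating",
}
missing = required - set(config["data"]["columns"])
if missing:
    raise KeyError(f"Missing column mappings: {sorted(missing)}")
```

Validating the mapping before loading the data surfaces a misspelled key as a clear error rather than a KeyError in the middle of preprocessing.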

Conclusion

In this blog post, we walked through the process of building a movie recommendation system from scratch. We covered data loading and preprocessing, data transformation, model training, and generating recommendations. By following these steps, you can create your own recommendation system and apply it to different datasets and use cases. Happy coding!