Building a Movie Recommendation System from Scratch
- Sairam Penjarla
- Jul 17, 2024
Updated: Jul 19, 2024
Introduction
In the age of streaming services, recommendation systems have become a crucial feature to keep users engaged by suggesting content they are likely to enjoy. In this blog post, we will walk through the process of building a movie recommendation system from scratch using Python and machine learning libraries. We will cover data preprocessing, transformation, model training, and generating recommendations based on cosine similarity.
Before we dive in, I encourage you to clone the GitHub repository and try it out yourself to better understand the concepts discussed in this blog. Let's get started!
GitHub Link
The complete code for this post is available in the GitHub repository.
Data Loading and Preprocessing
DataLoader Class
In this section, we introduce the DataLoader class, which loads the movie dataset and performs the initial preprocessing. It renames columns for readability, fills missing country, date-added, and rating values with each column's most frequent value, and drops rows that have no cast or director. It also derives a few helper columns: the first listed genre, the first listed country, and the month and year the title was added.
import pandas as pd
import logging
import yaml


class DataLoader:
    def __init__(self, filepath, columns):
        """Initialize the DataLoader with the file path and columns to rename."""
        self.filepath = filepath
        self.columns = columns
        self.df = None

    def preprocess_data(self):
        """Load and preprocess the data."""
        logging.info("Loading data from file: %s", self.filepath)
        self.df = pd.read_csv(self.filepath)
        logging.info("Preprocessing data")

        # Rename columns from the raw CSV names to the names given in the config
        self.df = self.df.rename(columns={
            "listed_in": self.columns['genre'],
            "director": self.columns['director'],
            "cast": self.columns['cast'],
            "description": self.columns['description'],
            "title": self.columns['title'],
            "date_added": self.columns['date_added'],
            "country": self.columns['country']
        })

        # Fill missing values with each column's own most frequent value (mode).
        # 'rating' is not renamed above, so its config entry must match the CSV column name.
        for key in ('country', 'date_added', 'rating'):
            col = self.columns[key]
            self.df[col] = self.df[col].fillna(self.df[col].mode()[0])

        # Rows without a cast or a director cannot contribute to the similarity features
        self.df = self.df.dropna(how='any', subset=[self.columns['cast'], self.columns['director']])

        # Further processing: first genre, year and month added, first country
        self.df['category'] = self.df[self.columns['genre']].apply(lambda x: x.split(",")[0])
        # strip() guards against dates stored with leading or trailing whitespace
        self.df['YearAdded'] = self.df[self.columns['date_added']].apply(lambda x: x.strip().split(" ")[-1])
        self.df['MonthAdded'] = self.df[self.columns['date_added']].apply(lambda x: x.strip().split(" ")[0])
        self.df['country'] = self.df[self.columns['country']].apply(lambda x: x.split(",")[0])
        return self.df
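To see what the mode-based fill and the string splits do, here is a minimal sketch on a four-row toy frame. The rows are invented for illustration, and the columns keep their raw names instead of going through the config mapping.

```python
import pandas as pd

# Invented toy rows; None marks a missing value
df = pd.DataFrame({
    "country": ["India", None, "India", "Japan, India"],
    "date_added": ["September 9, 2019", "July 1, 2020", None, "July 1, 2020"],
})

# mode() returns the most frequent value(s); [0] picks the first of them
for col in ("country", "date_added"):
    df[col] = df[col].fillna(df[col].mode()[0])

# "July 1, 2020" splits into month "July" (first part) and year "2020" (last part)
parts = df["date_added"].str.strip().str.split(" ")
df["MonthAdded"] = parts.str[0]
df["YearAdded"] = parts.str[-1]

# Keep only the first country of a comma-separated list
df["country"] = df["country"].str.split(",").str[0]

print(df)
```

Filling with the mode keeps every row, at the cost of nudging the data toward the most common value. For columns with many missing entries, a separate "Unknown" label may be a safer choice.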
Data Transformation
DataTransformer Class
Next, we discuss the DataTransformer class. It lowercases each selected feature and removes its spaces, so that a multi-word name becomes a single token, and then joins the director, cast, category, and summary into a single 'soup' of text for each movie. This soup is the input to the vectorizer in the next step.
class DataTransformer:
    def __init__(self, df):
        """Initialize the DataTransformer with the DataFrame."""
        self.df = df
        # These names must match the renamed columns produced by DataLoader
        self.features = ['category', 'director_name', 'cast_members', 'summary', 'movie_title']
        # Work on a copy so the assignments below do not write into a slice of df
        self.filters = self.df[self.features].copy()

    @staticmethod
    def clean_text(text):
        """Clean the text by converting to lowercase and removing spaces."""
        return text.replace(" ", "").lower()

    def apply_transformations(self):
        """Apply transformations to the data."""
        logging.info("Applying data transformations")
        for feature in self.features:
            self.filters[feature] = self.filters[feature].apply(self.clean_text)
        # Build one text string per movie from the cleaned features
        self.filters['Soup'] = self.filters.apply(self.create_soup, axis=1)
        return self.filters

    @staticmethod
    def create_soup(row):
        """Create a combined 'soup' of features for each movie."""
        return f"{row['director_name']} {row['cast_members']} {row['category']} {row['summary']}"
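As a rough illustration of what the cleaning and the soup produce, the sketch below applies the same two steps to a single invented row and then splits the result with the regular expression that CountVectorizer uses as its default token pattern. The row values are made up, and the standalone `clean_text` function simply mirrors the method above.

```python
import re

def clean_text(text):
    """Lowercase the text and remove every space, mirroring DataTransformer.clean_text."""
    return text.replace(" ", "").lower()

# Invented example row; the keys assume the renamed column names used above
row = {
    "director_name": "Jane Doe",
    "cast_members": "John Smith, Mary Major",
    "category": "Comedies",
    "summary": "A stranded alien looks for a way home",
}

cleaned = {key: clean_text(value) for key, value in row.items()}
soup = (
    f"{cleaned['director_name']} {cleaned['cast_members']} "
    f"{cleaned['category']} {cleaned['summary']}"
)

# CountVectorizer's default token pattern: runs of two or more word characters
tokens = re.findall(r"(?u)\b\w\w+\b", soup)

print(soup)
print(tokens)
```

Removing spaces fuses each person's name into one token, which keeps two people who share a first name apart. Note that it also fuses the whole summary into one long token, so under the default tokenizer a summary only counts toward similarity when two descriptions are identical.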
Model Training
ModelTrainer Class
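The ModelTrainer class is introduced in this section. It fits scikit-learn's CountVectorizer on the soup column, which converts each movie's text into a row of token counts with English stop words removed. It then computes the cosine similarity between every pair of rows. The resulting matrix is the "model": entry (i, j) measures how similar movie i is to movie j, and we use it to find similar movies.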
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class ModelTrainer:
    def __init__(self, filters):
        """Initialize the ModelTrainer with the filtered DataFrame."""
        self.filters = filters
        self.count_vectorizer = CountVectorizer(stop_words='english')

    def train_model(self):
        """Train the model and compute the cosine similarity matrix."""
        logging.info("Training model")
        self.count_matrix = self.count_vectorizer.fit_transform(self.filters['Soup'])
        self.cosine_sim_matrix = cosine_similarity(self.count_matrix, self.count_matrix)
        return self.cosine_sim_matrix
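To make the similarity score concrete, here is a small sketch that computes cosine similarity by hand on three invented soups, using only the standard library. It splits on whitespace instead of using CountVectorizer, so it illustrates the formula (dot product divided by the product of the vector lengths) and is not a replacement for the class above.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # A vector with no tokens has no direction; treat its similarity as 0
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented soups: the first two share a director and a category
soups = [
    "janedoe johnsmith comedies",
    "janedoe marymajor comedies",
    "alexroe samlee dramas",
]
counts = [Counter(soup.split()) for soup in soups]

# Pairwise similarity matrix, one row and one column per soup
sim = [[cosine(a, b) for b in counts] for a in counts]

for row in sim:
    print([round(score, 3) for score in row])
```

The first two soups share two of their three tokens, so their score is 2/3. The third shares nothing with either and scores 0. Each soup scores 1 against itself, up to floating-point rounding.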
Generating Recommendations
Recommender Class
In this section, we present the Recommender class. This class uses the cosine similarity matrix to generate movie recommendations based on a given movie title. It looks up the row for that title, ranks all other movies by their similarity to it, and returns the ten most similar titles.
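class Recommender:
    def __init__(self, df, filters, cosine_sim_matrix):
        """Initialize the Recommender with the DataFrame, filters, and cosine similarity matrix."""
        self.df = df
        self.cosine_sim_matrix = cosine_sim_matrix
        # Reset the index so positions line up with the rows of the similarity matrix
        filters = filters.reset_index(drop=True)
        self.indices = pd.Series(filters.index, index=filters['movie_title'])
        # If a cleaned title appears more than once, keep its first row
        self.indices = self.indices[~self.indices.index.duplicated(keep='first')]

    def get_recommendations(self, title):
        """Get movie recommendations based on the given title."""
        logging.info("Getting recommendations for title: %s", title)
        # Clean the query the same way DataTransformer.clean_text cleaned the titles
        title = title.replace(' ', '').lower()
        idx = self.indices[title]  # raises KeyError if the title is not in the dataset
        sim_scores = list(enumerate(self.cosine_sim_matrix[idx]))
        # Drop the query movie itself, then keep the 10 highest-scoring movies
        sim_scores = [score for score in sim_scores if score[0] != idx]
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:10]
        movie_indices = [i[0] for i in sim_scores]
        # iloc is positional, so it matches the matrix rows even though df keeps its original index
        return self.df['movie_title'].iloc[movie_indices]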
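The ranking step inside get_recommendations can be sketched on its own with plain lists. The helper name `top_n_similar`, the matrix values, and the titles below are all hypothetical. The sketch removes the query row by its index, which stays correct even when another movie ties with the query at a score of 1.0.

```python
def top_n_similar(sim_matrix, idx, n=10):
    """Return the indices of the n rows most similar to row idx, excluding idx itself."""
    scores = [(j, score) for j, score in enumerate(sim_matrix[idx]) if j != idx]
    # Highest similarity first; sort() is stable, so ties keep their original order
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [j for j, _ in scores[:n]]

# Invented symmetric similarity matrix for four movies
sim_matrix = [
    [1.0, 0.7, 0.0, 0.3],
    [0.7, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.5],
    [0.3, 0.0, 0.5, 1.0],
]
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]

recommended = [titles[j] for j in top_n_similar(sim_matrix, 0, n=2)]
print(recommended)
```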
Putting It All Together
This section shows how to integrate all the classes and methods discussed so far. We demonstrate how to load the data, preprocess it, transform it, train the model, and finally generate movie recommendations using the Recommender class.
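# Assumption: the settings are stored in a YAML file; the name 'config.yaml' is illustrative
with open('config.yaml') as f:
    config = yaml.safe_load(f)

# Show the logging.info messages emitted by each class
logging.basicConfig(level=logging.INFO)

data_loader = DataLoader(config['data']['filepath'], config['data']['columns'])
df = data_loader.preprocess_data()

data_transformer = DataTransformer(df)
filters = data_transformer.apply_transformations()

model_trainer = ModelTrainer(filters)
cosine_sim_matrix = model_trainer.train_model()

recommender = Recommender(df, filters, cosine_sim_matrix)
print(recommender.get_recommendations('PK'))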
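The pipeline reads `config['data']['filepath']` and `config['data']['columns']`, but the post does not show the configuration itself. Below is a minimal sketch of a compatible structure, written as a Python dictionary. The names director_name, cast_members, summary, and movie_title come from the feature list in DataTransformer; the file path and the remaining column names are assumptions, and `check_config` is a hypothetical helper.

```python
# Hypothetical configuration; the path and several column names are assumptions
config = {
    "data": {
        "filepath": "path/to/movies.csv",
        "columns": {
            "genre": "genre",
            "director": "director_name",
            "cast": "cast_members",
            "description": "summary",
            "title": "movie_title",
            "date_added": "date_added",
            "country": "country",
            "rating": "rating",
        },
    }
}

# Keys that DataLoader.preprocess_data looks up in config['data']['columns']
REQUIRED_KEYS = {
    "genre", "director", "cast", "description",
    "title", "date_added", "country", "rating",
}
# Column names that DataTransformer hard-codes in its feature list
EXPECTED_NAMES = {"director_name", "cast_members", "summary", "movie_title"}

def check_config(cfg):
    """Return (missing keys, missing column names) for a config dictionary."""
    columns = cfg["data"]["columns"]
    missing_keys = REQUIRED_KEYS - set(columns.keys())
    missing_names = EXPECTED_NAMES - set(columns.values())
    return sorted(missing_keys), sorted(missing_names)

print(check_config(config))
```

Running a check like this before the pipeline turns a mismatch between the config and the hard-coded feature names into a readable message instead of a KeyError deep inside pandas.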
Conclusion
In this blog post, we walked through the process of building a movie recommendation system from scratch. We covered data loading and preprocessing, data transformation, model training, and generating recommendations. By following these steps, you can create your own recommendation system and apply it to different datasets and use cases. Happy coding!