Text Summarization Using BERT: A Comprehensive Guide for Intermediate Practitioners
- Sairam Penjarla
- Jun 27, 2024
In this blog, we'll walk through a text summarization project using BERT, focusing on how to create a custom text summarizer and process data blocks for summarization. We will explain the theory behind text summarization and then delve into the code, block by block, to provide a thorough understanding of the process.
GitHub URL:
It is highly recommended to go through the repo below to get the full version of the code, along with the necessary files such as the requirements.txt file and the CSV file.
Theory Behind Text Summarization
What is Text Summarization?
Text summarization is the task of producing a concise, coherent summary of a longer text document. It matters wherever readers need the key information from large documents quickly, and it improves readability by reducing a text to its essential points.
BERT for Text Summarization
BERT, a transformer-based model developed by Google, reads each word in the context of the words on both sides of it. This bidirectional context lets it represent what each sentence is about, which makes it an effective basis for picking out the parts of a document that best capture its essence.
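To make the idea of "capturing the essence" concrete, here is a minimal sketch of extractive summarization: represent every sentence as a vector, then keep the sentences closest to the document as a whole. This is not the BERT pipeline itself. As an assumption for illustration, simple word counts stand in for BERT sentence embeddings and a single centroid stands in for clustering, and the helper names (split_sentences, bag_of_words, extractive_summary) are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Toy stand-in for a BERT sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extractive_summary(text, num_sentences=1):
    """Return the sentences most similar to the document centroid."""
    sentences = split_sentences(text)
    vectors = [bag_of_words(s) for s in sentences]

    # The centroid is the sum of all sentence vectors (scale does not
    # affect cosine similarity).
    centroid = Counter()
    for vector in vectors:
        centroid.update(vector)

    # Rank sentence indices by similarity to the centroid, best first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )

    # Keep the top sentences, restored to their original order.
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


text = "Rivers carry pollution. Pollution harms rivers and fish. Cats sleep."
print(extractive_summary(text, num_sentences=1))
```

The output is always made of sentences copied from the input, which is the defining property of an extractive summarizer. Swapping the word counts for BERT embeddings changes how similarity is measured, not the overall shape of the procedure.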
Code Explanation
1. Importing Required Libraries
import warnings

import pandas as pd
from summarizer import Summarizer
In this step, we import the libraries the project needs: pandas for data manipulation and Summarizer, a BERT-based summarizer from the summarizer module. The summarizer module is usually installed as the bert-extractive-summarizer package; check the requirements.txt file in the repo to confirm the exact package and version.
2. Defining the TextSummarizer Class
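class TextSummarizer:
    def __init__(self, max_length=400, min_length_text=40):
        # Number of characters at which a block of concatenated rows is full
        self.MAX_LENGTH = max_length
        # Forwarded to the summarizer as min_length
        self.MIN_LENGTH_TEXT = min_length_text
        # BERT-based summarizer from the summarizer module
        self.summarizer = Summarizer()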
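Here, we define a custom class, TextSummarizer, to encapsulate the summarization functionality. The constructor stores two configurable parameters: max_length, the number of characters at which a block of text is considered full, and min_length_text, which is forwarded to the summarizer as min_length. Note that, as far as we can tell from the bert-extractive-summarizer documentation, min_length is the minimum number of characters a sentence needs in order to be considered for the summary, not a minimum length for the summary itself.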
3. Summarizing Text
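    def summarize_text(self, text):
        # Run the BERT-based summarizer on one block of text
        summary = self.summarizer(text, min_length=self.MIN_LENGTH_TEXT)
        # ''.join leaves a string unchanged and flattens a list of pieces
        return ''.join(summary)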
The summarize_text method of the TextSummarizer class takes a text input and returns its summary from the BERT-based summarizer, passing MIN_LENGTH_TEXT as min_length. Every block of text in the following steps is summarized through this method.
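To show what the min_length argument is likely doing, here is a small sketch of a sentence filter. It rests on an assumption: that the summarizer treats min_length as a per-sentence character threshold and discards shorter sentences before choosing which ones to keep. The function candidate_sentences and its naive sentence splitter are hypothetical helpers written for this post, not part of the library.

```python
import re


def candidate_sentences(text, min_length=40):
    """Keep only sentences longer than min_length characters.

    Assumption: this mirrors how the summarizer is understood to
    pre-filter sentences; the real library may differ in detail.
    """
    # Naive split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if len(s) > min_length]


text = "Short one. This sentence is comfortably longer than forty characters in total."
for sentence in candidate_sentences(text):
    print(sentence)
```

If this reading is right, raising min_length_text in the constructor removes short sentences such as headings or fragments from consideration, while lowering it lets them compete for a place in the summary.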
4. Processing Data for Summarization
    def process_data(self, data_path, num_blocks=5):
        # Expected columns and dtypes of the semicolon-delimited CSV
        DATA_COLUMNS = {'TEXT': str, 'ENV_PROBLEMS': int, 'POLLUTION': int,
                        'TREATMENT': int, 'CLIMATE': int, 'BIOMONITORING': int}
        warnings.filterwarnings('ignore')

        df = pd.read_csv(data_path, delimiter=';', header=0)
        df = df.astype(DATA_COLUMNS)
        df = df[:num_blocks]  # keep only the first num_blocks rows

        # Concatenate consecutive rows until a block reaches MAX_LENGTH characters
        bodies = []
        i = 0
        while i < len(df):
            body = ""
            body_empty = True
            while (len(body) < self.MAX_LENGTH) and (i < len(df)):
                if body_empty:
                    body = df.loc[i, 'TEXT']
                    body_empty = False
                else:
                    body += " " + df.loc[i, 'TEXT']
                i += 1
            bodies.append(body)

        # Summarize each block
        bert_summary = []
        for body in bodies:
            bert_summary.append(self.summarize_text(body))

        # Print each block followed by its summary
        for i in range(len(bodies)):
            print("ORIGINAL TEXT:")
            print(bodies[i])
            print("\nBERT Summarizing Result:")
            print(bert_summary[i])
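In this step, we define the process_data method, which handles data loading, block building, and summarization. It reads the semicolon-delimited CSV file and keeps the first num_blocks rows (despite its name, this parameter limits the number of rows read, not the number of blocks produced). It then concatenates consecutive TEXT values until a block reaches MAX_LENGTH characters, so a finished block can be longer than MAX_LENGTH. Finally, each block is summarized and printed together with its summary. Grouping rows this way gives the summarizer several sentences to choose from rather than a single short row.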
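The block-building loop is easier to follow in isolation. The sketch below reimplements the same grouping rule on a plain list of strings, without pandas or the summarizer. The function name build_blocks is hypothetical and the sample strings are invented purely to show the behavior.

```python
def build_blocks(texts, max_length=400):
    """Group consecutive texts into blocks of at least max_length characters.

    A block is closed as soon as its length reaches max_length, so blocks
    can exceed max_length. The last block may be shorter if the input
    runs out first.
    """
    blocks = []
    current = []
    for text in texts:
        current.append(text)
        joined = " ".join(current)
        if len(joined) >= max_length:
            blocks.append(joined)
            current = []
    # Flush whatever is left over as a final, possibly short, block.
    if current:
        blocks.append(" ".join(current))
    return blocks


# Invented sample rows with a small threshold so the grouping is visible.
rows = ["aaaa", "bbbb", "cc", "dddddddddd"]
for block in build_blocks(rows, max_length=8):
    print(len(block), block)
```

With a threshold of 8, the first two rows form one block of 9 characters and the last two form a block of 13, which shows that max_length acts as a "stop adding once reached" threshold rather than a hard cap.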
5. Usage Example
text_summarizer = TextSummarizer()

# Summarize blocks built from the first rows of the CSV file
text_summarizer.process_data('water_problem_nlp_en_for_Kaggle_100.csv')

# Summarize a single piece of text through the same method
body = "Despite the similar volumes of discharged wastewater major part of pollutants comes with communal WWTPs. They bring 84% of organic pollution 86% of phosphate ions and 84% of mineral nitrogen 91% of ammonia nitrogen 87% nitrate nitrogen and 79% nitrite nitrogen. The input of the industry is between 7-21% and agriculture has the lowest impact on water bodies - 0-6%. Of the 92 urban areas only 51 localities (55%) have centralized collection of communal waste waters and their monitoring. Among the 2878 villages 6 of them (0.2%) have such a monitoring."
summary = text_summarizer.summarize_text(body)
print(summary)
Finally, this example shows how to use the TextSummarizer class: create an instance, process a sample data file, and summarize a single text block. The same calls work for new data, as long as the CSV file has the same columns and delimiter.
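If you want to try the class on your own data, the file needs the layout that process_data expects: semicolon-delimited, with a header row, a TEXT column, and five integer label columns. The sketch below parses that layout with the standard-library csv module so it runs without pandas. The two sample rows are invented for illustration and are not taken from the real dataset, and load_rows is a hypothetical helper.

```python
import csv
import io

# Invented rows, only to illustrate the expected layout.
SAMPLE_CSV = (
    "TEXT;ENV_PROBLEMS;POLLUTION;TREATMENT;CLIMATE;BIOMONITORING\n"
    "Example sentence about wastewater.;1;1;0;0;0\n"
    "Another example sentence about monitoring.;0;0;0;0;1\n"
)

LABEL_COLUMNS = ["ENV_PROBLEMS", "POLLUTION", "TREATMENT", "CLIMATE", "BIOMONITORING"]


def load_rows(handle):
    """Read semicolon-delimited rows and convert label columns to int."""
    reader = csv.DictReader(handle, delimiter=";")
    rows = []
    for raw in reader:
        row = {"TEXT": raw["TEXT"]}
        for name in LABEL_COLUMNS:
            # int() raises ValueError on a malformed label, which surfaces
            # bad rows early, much like astype does in process_data.
            row[name] = int(raw[name])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["TEXT"], row["POLLUTION"])
```

A file in this shape can be passed straight to process_data, since pd.read_csv with delimiter=';' and header=0 reads the same structure.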
Conclusion
In this blog, we built and used a text summarization workflow with BERT: loading the data and grouping it into text blocks, wrapping the summarizer in a custom class, and generating a summary for each block. This workflow provides a foundation for further exploration and for applying BERT-based summarization in other domains.