Tesseract OCR with Python

Sairam Penjarla
Jun 27, 2024

Ever come across a scanned document or an image with text and wished you could easily extract it? Look no further than Optical Character Recognition (OCR)! In this blog, we'll explore the magic of Tesseract OCR, walk through the accompanying Python code, and equip you to extract text from images like a pro.

GitHub Link:

https://github.com/sairam-penjarla/Tesseract-OCR

The Enthralling History of Tesseract OCR

Tesseract boasts a rich history. It was originally developed at Hewlett-Packard (HP) between 1985 and 1994 and was open-sourced in 2005, paving the way for a vibrant community of developers to contribute. Google, recognizing its potential, sponsored the project's development from 2006 for more than a decade, and it is now maintained by the open-source community. Today, Tesseract is a leading open-source OCR engine that supports over 100 languages and is used in many applications that convert images to text.

The Inner Workings of Tesseract OCR

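Tesseract is a powerful tool, but how does it work? At a high level, it follows a multi-stage approach. The stages below are a simplified, classical view; since version 4, the recognition step is handled by an LSTM neural network that reads whole lines of text rather than matching isolated characters: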

Preprocessing: The image undergoes adjustments to enhance the clarity of text, such as noise reduction and thresholding (converting the image to black and white).
Segmentation: Individual characters are isolated from the background.
Feature Extraction: Key characteristics of each character are identified.
Pattern Recognition: These features are compared to a built-in database of character patterns, leading to character recognition.
Text Reconstruction: Recognized characters are combined to form words and sentences.
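
To make the preprocessing stage a little more concrete, here is a minimal sketch of global thresholding. It is only an illustration: the function names and the cutoff of 128 are assumptions made for this example, and Tesseract performs its own binarization internally, so a step like this is optional. The second function applies the same rule to an image file with Pillow, which is assumed to be installed.

```python
def to_black_and_white(gray_values, cutoff=128):
    """Global threshold: values at or above the cutoff become white (255), the rest black (0)."""
    return [255 if value >= cutoff else 0 for value in gray_values]


def binarize_image(image_path, cutoff=128):
    """Apply the same rule to an image file (hypothetical helper, needs Pillow)."""
    # Imported lazily so the pure-Python function above works without Pillow.
    from PIL import Image

    gray = Image.open(image_path).convert("L")  # "L" is 8-bit grayscale
    return gray.point(lambda value: 255 if value >= cutoff else 0)


# A tiny row of grayscale pixels, from dark to bright.
print(to_black_and_white([12, 200, 127, 128]))  # [0, 255, 0, 255]
```

A fixed cutoff works for clean scans; unevenly lit photos usually need an adaptive method instead.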

Unveiling the Python Code (Part by Part):

The provided code offers a convenient command-line tool for text extraction using Tesseract OCR. Let's dissect it step by step:

1. Dependencies (requirements.txt):

pytesseract
argparse

This file lists the required libraries:

pytesseract: Provides a Python wrapper for interacting with Tesseract OCR.
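argparse: Enables parsing command-line arguments, making the script user-friendly. Note that argparse has shipped with Python's standard library since version 3.2, so this entry is not strictly required on a modern interpreter.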
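
Because pytesseract is only a wrapper, it also needs the Tesseract engine itself to be installed on the system. The following is a small sketch for checking both pieces before running the script. The function name check_dependencies is made up for this example, and it assumes the engine is exposed as a tesseract executable on your PATH, which is the usual setup.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report whether the Python wrapper and the Tesseract executable can be found."""
    return {
        # True if the pytesseract package is importable in this environment.
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        # True if an executable named "tesseract" is on the PATH (assumed name).
        "tesseract": shutil.which("tesseract") is not None,
    }


for name, found in check_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If the tesseract line reports missing, installing the Python packages alone will not be enough for text extraction to work.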

2. Extracting Text Function (main.py):

Python

def extract_text(image_path):
  # ... code for processing and saving extracted text ...
  ...

This function is the heart of the script. It takes the image path as input and performs the following tasks:

Folder Management (Optional): Checks for the existence of inputs and outputs folders for image organization (you can modify this behavior).
Image Handling: Reads the image using Pillow's Image module (optional: move the image to inputs if needed).
Error Handling: Catches potential FileNotFoundError for a missing image.
Text Extraction: Calls pytesseract.image_to_string to extract text from the image.
Output Generation: Creates a unique filename (optional) to avoid conflicts and saves the extracted text to a .txt file in the outputs folder.
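
The tasks above could be sketched roughly as follows. This is not the repository's actual implementation, just one way to put the pieces together: the helper unique_output_path, the random suffix in the filename, and the outputs_dir parameter are all assumptions made for illustration. The calls Image.open and pytesseract.image_to_string are the real Pillow and pytesseract entry points.

```python
import uuid
from pathlib import Path


def unique_output_path(image_path, outputs_dir="outputs"):
    """Build outputs/<image name>_<random suffix>.txt (the naming scheme is hypothetical)."""
    stem = Path(image_path).stem
    return Path(outputs_dir) / f"{stem}_{uuid.uuid4().hex[:8]}.txt"


def extract_text(image_path, outputs_dir="outputs"):
    """Run OCR on one image and save the text; returns the output path, or None if the image is missing."""
    # Imported lazily so the helper above stays usable without the OCR stack.
    import pytesseract
    from PIL import Image

    try:
        with Image.open(image_path) as image:
            text = pytesseract.image_to_string(image)
    except FileNotFoundError:
        print(f"Image not found: {image_path}")
        return None

    output_path = unique_output_path(image_path, outputs_dir)
    output_path.parent.mkdir(parents=True, exist_ok=True)  # create outputs/ if needed
    output_path.write_text(text, encoding="utf-8")
    return output_path
```

Calling extract_text("inputs/scan.png") would then write a file such as outputs/scan_1a2b3c4d.txt, with a different suffix on every run.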

3. Command-Line Arguments (main.py):

Python

import argparse

if __name__ == "__main__":
    # Build the command-line interface and run the extraction on the given image.
    parser = argparse.ArgumentParser(description="Image Text Extraction Tool")
    parser.add_argument("--image_path", required=True, help="Path to the image file")
    args = parser.parse_args()
    extract_text(args.image_path)

This section defines how the script interacts with the user through the command line:

Argument Parser: Creates an argparse.ArgumentParser object to handle arguments.
Image Path Argument: Defines a required argument named --image_path for specifying the image file location.
Parsing Arguments: Uses parser.parse_args() to retrieve the provided arguments.
Function Call: Calls the extract_text function with the parsed image path, initiating the extraction process.
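
To see how the parsing behaves without touching a real image, the same parser can be fed a list of arguments directly. This is a small sketch: wrapping the parser in a build_parser function and the path inputs/sample.png are choices made for this example only.

```python
import argparse


def build_parser():
    """Recreate the script's parser so it can be exercised on its own."""
    parser = argparse.ArgumentParser(description="Image Text Extraction Tool")
    parser.add_argument("--image_path", required=True, help="Path to the image file")
    return parser


# parse_args accepts an explicit list instead of reading sys.argv.
args = build_parser().parse_args(["--image_path", "inputs/sample.png"])
print(args.image_path)  # inputs/sample.png
```

Because the argument is marked as required, running the script without --image_path makes argparse print a usage message and exit instead of calling extract_text.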

Getting Started (It's Time to Extract Text!)

Here's how to use this code:

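Clone the Repository:

Bash

git clone https://github.com/sairam-penjarla/Tesseract-OCR.git

Install Dependencies:

Bash

cd Tesseract-OCR
pip install -r requirements.txt

Ensure Tesseract OCR is installed: pytesseract only wraps the Tesseract engine, so the engine itself must be installed separately. This might involve additional system-specific steps (refer to the Tesseract documentation for details).

Run the Script:

Bash

python main.py --image_path <path_to_your_image.jpg>

Replace <path_to_your_image.jpg> with the actual path to your image, without the angle brackets.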