How To Extract Text From PDF With Python 3

Answer :

In this tutorial, we look at the most popular libraries for extracting data from PDF files with Python. PDF is a great format for reading, but we often need to pull out some of its content for further processing.

I tested numerous packages, each with its own strengths and weaknesses. The packages most people use for PDF processing and text extraction are Textract, Apache Tika, pdfplumber, PyMuPDF, and PyPDF2.

Note: PyPDF2 was not maintained at the time of writing, so I ignore it.

Let's look at each of these libraries in turn.

pdfplumber #

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer.six.

Currently tested on Python 3.6, 3.7, and 3.8, and works on macOS, Windows, and Linux.

pip install pdfminer.six

Install pdfplumber #

pip install pdfplumber

Basic usage #

import pdfplumber

with pdfplumber.open("pdffile.pdf") as pdf:
    page = pdf.pages[0]
    # page.chars is a list of dicts, one per character on the page
    first_char = page.chars[0]
    print(first_char)

To start working with a PDF, call pdfplumber.open(x), where x can be a:

path to your PDF file
file object, loaded as bytes
file-like object, loaded as bytes

The open method returns an instance of the pdfplumber.PDF class.
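The basic usage prints a single character record, whereas most readers want the text of the whole document. A minimal sketch, assuming every page object exposes pdfplumber's `extract_text()` method; the helper names `extract_text_from_pages` and `extract_pdf_text` are hypothetical and not part of pdfplumber.

```python
def extract_text_from_pages(pages):
    """Join the text of every page, separated by a blank line."""
    parts = []
    for page in pages:
        # extract_text() can return None or "" for a page without a text layer
        parts.append(page.extract_text() or "")
    return "\n\n".join(parts)


def extract_pdf_text(source):
    """Open a PDF (path or binary file object) and return all of its text."""
    import pdfplumber  # imported here so the helper above has no dependency

    with pdfplumber.open(source) as pdf:
        return extract_text_from_pages(pdf.pages)
```

Calling `extract_pdf_text("pdffile.pdf")` would then return one string for the whole file. Pages with no extractable text show up as empty sections rather than raising an error.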

Tika #

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Install tika #

Installing the Python library is simple enough, but it will not work unless Java is installed, so make sure you have Java available first.
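Because a missing Java runtime tends to surface as a confusing error later on, it can help to check for it up front. A small sketch, assuming the `java` executable is expected on the PATH; `java_available` is a hypothetical helper name.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        print("Java found, Tika should be able to start.")
    else:
        print("Java not found on PATH, install it before using tika.")
```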

pip install tika

tika basic usage #

import tika
tika.initVM()
from tika import parser

parsed = parser.from_file('sample.pdf')
print(parsed["metadata"])
print(parsed["content"])
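The text Tika returns often contains runs of blank lines, and the `content` value may be `None` when nothing could be extracted. A hedged sketch of a cleanup step, assuming `parsed` is the dictionary returned by `parser.from_file`; `clean_tika_content` is a hypothetical helper name.

```python
def clean_tika_content(parsed):
    """Return the parsed text with blank lines and edge whitespace removed."""
    # Treat a missing or None "content" value as empty text
    content = parsed.get("content") or ""
    stripped = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in stripped if line)
```

With the example above, `print(clean_tika_content(parsed))` would print the text without the padding lines.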

pdftotext #

OS Dependencies #

Debian, Ubuntu, and friends #

sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev

Fedora, Red Hat, and friends #

sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel

macOS #

brew install pkg-config poppler python

Windows #

Currently tested only when using conda:

Install the Microsoft Visual C++ Build Tools
Install poppler through conda:
```
conda install -c conda-forge poppler
```

Install pdftotext #

pip install pdftotext

pdftotext basic usage #

import pdftotext

# Load PDF file
with open("pdffile.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure_pdffile.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# Iterate over all the pages
for page in pdf:
    # text content in pdf page
    print(page)

# Read all the text into one string
print("\n\n".join(pdf))
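For further processing it is often handier to keep one text file per page than to print everything. A minimal sketch, assuming `pages` is any iterable of strings (iterating a `pdftotext.PDF` object yields one string per page); `write_pages` and the `page_001.txt` naming scheme are hypothetical choices.

```python
from pathlib import Path


def write_pages(pages, out_dir):
    """Write each page's text to out_dir/page_NNN.txt and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(pages, start=1):
        target = out / f"page_{number:03d}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With the `pdf` object from the example above, `write_pages(pdf, "pages")` would create `pages/page_001.txt`, `pages/page_002.txt`, and so on.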

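PyMuPDF #

Install PyMuPDF #

pip install PyMuPDF

PyMuPDF basic usage #

import fitz  # this is pymupdf

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        # get_text() is the current name; older releases called it getText()
        text += page.get_text()

print(text)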
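Scanned pages usually carry no text layer, so text extraction returns an empty string for them. A sketch for spotting such pages, assuming each page offers PyMuPDF's `get_text()` method; `pages_without_text` and `report_empty_pages` are hypothetical helper names.

```python
def pages_without_text(doc):
    """Return the 1-based numbers of pages whose extracted text is empty."""
    empty = []
    for number, page in enumerate(doc, start=1):
        if not page.get_text().strip():
            empty.append(number)
    return empty


def report_empty_pages(path):
    """Open a PDF with PyMuPDF and list the pages that yield no text."""
    import fitz  # PyMuPDF, imported here so the helper above has no dependency

    with fitz.open(path) as doc:
        return pages_without_text(doc)
```

Pages reported here would likely need OCR before any of the libraries in this tutorial can return text for them.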

Conclusion #

The textract library was not considered because it uses the same algorithm as pdftotext (textract is a wrapper for Poppler's pdftotext): https://pypi.org/project/textract/

How well each library extracts text depends on the PDF file, how it was encoded, and the variety of non-text elements it contains, such as images and tables.

Main features found:

Summary:

In this experiment, the best choices were the PyMuPDF and Tika-Python libraries. pdftotext is a great library, but it preserves the layout of the original text, which is not appropriate in some situations.
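When layout-preserving output gets in the way, the padding can be collapsed after extraction. A rough sketch, assuming the layout is kept with runs of spaces or tabs inside each line; `collapse_layout` is a hypothetical helper, and it also discards column alignment, which may not be wanted for tables.

```python
import re


def collapse_layout(text):
    """Collapse runs of spaces or tabs to one space and trim each line."""
    lines = []
    for line in text.splitlines():
        lines.append(re.sub(r"[ \t]{2,}", " ", line).strip())
    return "\n".join(lines)
```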
