py-pdf / awesome-pdf

A curated list of resources around PDF files

☆149

Alternatives and similar repositories for awesome-pdf

Users who are interested in awesome-pdf compare it to the libraries listed below.

pymupdf / PyMuPDF-Utilities
Demos, examples and utilities using PyMuPDF
☆706 Updated last month
ShayHill / docx2python
Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.
☆201 Updated last week
py-pdf / pdfly
CLI tool to extract (meta)data from PDF and manipulate PDF files
☆534 Updated this week
jstockwin / py-pdf-parser
A Python tool to help extract information from structured PDFs.
☆427 Updated 3 weeks ago
ocropus / hocr-tools
Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.
☆407 Updated last year
microsoft / Simplify-Docx
Simplify DOCX files to JSON
☆256 Updated last year
maxpmaxp / pdfreader
Python API for PDF documents
☆124 Updated last year
pdf-association / pdf-corpora
An index of PDF-centric corpora
☆161 Updated 7 months ago
pd3f / pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
☆328 Updated 2 years ago
ExtractTable / ExtractTable-py
Python library to extract tabular data from images and scanned PDFs
☆285 Updated last year
py-pdf / benchmarks
Benchmarking PDF libraries
☆321 Updated 7 months ago
kermitt2 / pdfalto
PDF to XML ALTO file converter
☆261 Updated this week
pypdfium2-team / pypdfium2
Python bindings to PDFium, reasonably cross-platform.
☆721 Updated this week
LeoFCardoso / pdf2pdfocr
A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip!
☆303 Updated 8 months ago
Xyntopia / pydoxtools
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable p…
☆87 Updated last year
writecrow / ocr2text
Convert a PDF via OCR to a TXT file in UTF-8 encoding
☆156 Updated 2 years ago
ahmedkhemiri95 / PDFs-TextExtract
Multiple and Large PDF Documents Text Extraction.
☆131 Updated last year
datalab-to / pdftext
Extract structured text from PDFs quickly
☆661 Updated 8 months ago
ocrmypdf / OCRmyPDF-EasyOCR
OCRmyPDF EasyOCR plugin
☆98 Updated 4 months ago
stanfordnlp / pdf-struct
Logical structure analysis for visually structured documents
☆93 Updated 3 years ago
cbrunet / python-poppler
Python binding to Poppler-cpp PDF library
☆113 Updated last year
adobe / pdfservices-python-sdk-samples
Adobe PDFServices python SDK Samples
☆161 Updated 6 months ago
parkerhancock / patent_client
A collection of ORM-style clients to public patent data
☆123 Updated last month
marieai / marie-ai
Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip…
☆80 Updated this week
JoshData / pdf-redactor
A general purpose PDF text-layer redaction tool for Python 2/3.
☆209 Updated last year
neuml / txtmarker
🖍️ Highlight text in documents
☆111 Updated 9 months ago
explosion / weasel
🦦 weasel: A small and easy workflow system
☆90 Updated 2 months ago
HazyResearch / pdftotree
A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
☆461 Updated 2 years ago
x-tabdeveloping / neofuzz
Blazing fast fuzzy text search for Python.
☆51 Updated 9 months ago
ad-freiburg / pdfact
A basic tool that extracts the structure from PDF files of scientific articles.
☆76 Updated 4 years ago