pymupdf / PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
☆6,826 · Updated this week
Alternatives and similar repositories for PyMuPDF:
Users who are interested in PyMuPDF are comparing it to the libraries listed below.
- Plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables. ☆7,483 · Updated last week
- A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files ☆8,896 · Updated this week
- A python module that wraps the pdftoppm utility to convert PDF to PIL Image object ☆1,749 · Updated 8 months ago
- Demos, examples and utilities using PyMuPDF ☆644 · Updated 9 months ago
- Community maintained fork of pdfminer - we fathom PDF ☆6,325 · Updated this week
- A Python library to extract tabular data from PDFs ☆3,235 · Updated this week
- Python bindings to PDFium ☆552 · Updated 2 weeks ago
- A Python library for reading and writing PDF, powered by QPDF ☆2,305 · Updated last week
- extract text from any document. no muss. no fuss. ☆4,023 · Updated 4 months ago
- A machine learning software for extracting information from scholarly documents ☆3,901 · Updated this week
- A Python wrapper for Google Tesseract ☆6,061 · Updated last month
- Python PDF Parser (Not actively maintained). Check out pdfminer.six. ☆5,282 · Updated 2 years ago
- Create and modify Word documents with Python ☆4,891 · Updated 7 months ago
- Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame ☆2,244 · Updated 3 months ago
- Python Imaging Library (Fork) ☆12,660 · Updated this week
- Open source Python library for converting PDF to DOCX. ☆2,856 · Updated 6 months ago
- OCR, layout analysis, reading order, table recognition in 90+ languages ☆17,023 · Updated this week
- OCR & Document Extraction using vision models ☆10,760 · Updated this week
- Convert HTML to Markdown ☆1,504 · Updated this week
- pdfrw is a pure Python library that reads and writes PDFs ☆1,885 · Updated 11 months ago
- Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model ☆7,346 · Updated last month
- A Repo For Document AI ☆2,768 · Updated this week
- Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines. ☆10,692 · Updated this week
- Convert PDF to markdown + JSON quickly with high accuracy ☆23,601 · Updated this week
- A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents. ☆2,233 · Updated 2 years ago
- Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community. ☆1,570 · Updated last week
- Retrying library for Python ☆7,249 · Updated this week
- Create Open XML PowerPoint documents in Python ☆2,668 · Updated 7 months ago
- borb is a library for reading, creating and manipulating PDF files in python. ☆3,459 · Updated 4 months ago
- Using GPT to parse PDF ☆3,336 · Updated 7 months ago