pymupdf/PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

☆10,253

Alternatives and similar repositories for PyMuPDF

Users who are interested in PyMuPDF are comparing it to the libraries listed below.

py-pdf / pypdf
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
☆10,120 · Jun 30, 2026 · Updated 2 weeks ago
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
☆10,564 · Jun 17, 2026 · Updated last month
pymupdf / PyMuPDF-Utilities
Demos, examples and utilities using PyMuPDF
☆722 · Jan 8, 2026 · Updated 6 months ago
pdfminer / pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
☆7,001 · Mar 13, 2026 · Updated 4 months ago
Unstructured-IO / unstructured
Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean…
☆15,155 · Updated this week
datalab-to / marker
Convert PDF to markdown + JSON quickly with high accuracy
☆37,621 · Jul 7, 2026 · Updated last week
datalab-to / surya
OCR, layout analysis, reading order, table recognition in 90+ languages
☆21,113 · Updated this week
pymupdf / pymupdf4llm
PyMuPDF4LLM
☆1,981 · Updated this week
run-llama / llama_index
LlamaIndex is the leading document agent and OCR platform
☆50,928 · Updated this week
pikepdf / pikepdf
A Python library for reading and writing PDF, powered by QPDF
☆2,763 · Updated this week
Belval / pdf2image
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
☆1,975 · Jul 23, 2024 · Updated last year
docling-project / docling
Get your documents ready for gen AI
☆63,447 · Updated this week
camelot-dev / camelot
A Python library to extract tabular data from PDFs
☆3,786 · Updated this week
opendatalab / PDF-Extract-Kit
A Comprehensive Toolkit for High-Quality PDF Content Extraction
☆9,793 · Jan 3, 2025 · Updated last year
PaddlePaddle / PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/…
☆85,753 · Updated this week
ocrmypdf / OCRmyPDF
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
☆34,217 · Updated this week
langchain-ai / langchain
The agent engineering platform.
☆142,053 · Updated this week
pypdfium2-team / pypdfium2
Python bindings to PDFium, reasonably cross-platform.
☆796 · Updated this week
vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆86,566 · Updated this week
gradio-app / gradio
Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
☆43,164 · Updated this week
deepset-ai / haystack
Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and a…
☆25,935 · Updated this week
streamlit / streamlit
Streamlit — A faster way to build and share data apps.
☆45,269 · Updated this week
Layout-Parser / layout-parser
A Unified Toolkit for Deep Learning Based Document Image Analysis
☆5,762 · Aug 15, 2024 · Updated last year
opendatalab / MinerU
Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.
☆75,015 · Updated this week
BerriAI / litellm
Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing a…
☆53,941 · Updated this week
JaidedAI / EasyOCR
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and …
☆29,777 · Dec 5, 2025 · Updated 7 months ago
stanfordnlp / dspy
DSPy: The framework for programming—not prompting—language models
☆36,202 · Updated this week
facebookresearch / nougat
Implementation of Nougat Neural Optical Understanding for Academic Documents
☆10,047 · Feb 21, 2025 · Updated last year
microsoft / graphrag
A modular graph-based Retrieval-Augmented Generation (RAG) system
☆34,495 · Updated this week
astral-sh / uv
An extremely fast Python package and project manager, written in Rust.
☆87,635 · Updated this week
tesseract-ocr / tesseract
Tesseract Open Source OCR Engine (main repository)
☆75,420 · Updated this week
ArtifexSoftware / mupdf
mupdf mirror
☆2,864 · Updated this week
chroma-core / chroma
Search infrastructure for AI
☆28,821 · Updated this week
pydantic / pydantic
Data validation using Python type hints
☆28,312 · Updated this week
facebookresearch / faiss
A library for efficient similarity search and clustering of dense vectors.
☆40,536 · Updated this week
ArtifexSoftware / pdf2docx
Open source Python library for converting PDF to DOCX.
☆3,469 · May 1, 2026 · Updated 2 months ago
allenai / olmocr
Toolkit for linearizing PDFs for LLM datasets/training
☆19,112 · Mar 25, 2026 · Updated 3 months ago
ollama / ollama
Get up and running with Kimi-K2.6, GLM-5.2, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
☆176,416 · Updated this week
pgvector / pgvector
Open-source vector similarity search for Postgres
☆22,241 · Jul 11, 2026 · Updated last week