jalan/pdftotext

☆1,063

Alternatives and similar repositories for pdftotext

Users that are interested in pdftotext are comparing it to the libraries listed below.

pdfminer / pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
☆7,002 · Mar 13, 2026 · Updated 4 months ago
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
☆10,570 · Updated this week
py-pdf / pypdf
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
☆10,121 · Jun 30, 2026 · Updated 3 weeks ago
pymupdf / PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
☆10,272 · Updated this week
chezou / tabula-py
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
☆2,315 · Dec 5, 2024 · Updated last year
pikepdf / pikepdf
A Python library for reading and writing PDF, powered by QPDF
☆2,765 · Updated this week
chrismattmann / tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
☆1,661 · Jul 1, 2026 · Updated 2 weeks ago
deanmalmgren / textract
extract text from any document. no muss. no fuss.
☆4,669 · Jul 11, 2026 · Updated last week
euske / pdfminer
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
☆5,283 · Dec 7, 2022 · Updated 3 years ago
jstockwin / py-pdf-parser
A Python tool to help extracting information from structured PDFs.
☆425 · Jul 13, 2026 · Updated last week
camelot-dev / camelot
A Python library to extract tabular data from PDFs
☆3,786 · Updated this week
atlanhq / camelot
Camelot: PDF Table Extraction for Humans
☆3,716 · Jan 5, 2023 · Updated 3 years ago
pmaupin / pdfrw
pdfrw is a pure Python library that reads and writes PDFs
☆1,908 · Apr 29, 2024 · Updated 2 years ago
HazyResearch / pdftotree
A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
☆460 · Aug 3, 2023 · Updated 2 years ago
cbrunet / python-poppler
Python binding to Poppler-cpp pdf library
☆115 · Sep 6, 2024 · Updated last year
Belval / pdf2image
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
☆1,975 · Jul 23, 2024 · Updated last year
grobidOrg / grobid
A machine learning software for extracting information from scholarly documents
☆5,005 · Updated this week
maxpmaxp / pdfreader
Python API for PDF documents
☆124 · Sep 5, 2024 · Updated last year
metachris / pdfx
Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
☆1,076 · Jun 15, 2023 · Updated 3 years ago
grobidOrg / grobid-client-python
Python client for GROBID Web services
☆410 · Mar 5, 2026 · Updated 4 months ago
jcushman / pdfquery
A fast and friendly PDF scraping library.
☆781 · Oct 17, 2023 · Updated 2 years ago
flairNLP / flair
A very simple framework for state-of-the-art Natural Language Processing (NLP)
☆14,384 · Oct 27, 2025 · Updated 8 months ago
ankushshah89 / python-docx2txt
A pure python based utility to extract text and images from docx files.
☆586 · Mar 24, 2025 · Updated last year
ocrmypdf / OCRmyPDF
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
☆34,234 · Updated this week
doccano / doccano
Open source annotation tool for machine learning practitioners.
☆10,705 · Apr 14, 2026 · Updated 3 months ago
camelot-dev / excalibur
A web interface to extract tabular data from PDFs
☆1,810 · May 20, 2026 · Updated 2 months ago
explosion / spaCy
💫 Industrial-strength Natural Language Processing (NLP) in Python
☆33,756 · May 19, 2026 · Updated 2 months ago
microsoft / Simplify-Docx
Simplify DOCX files to JSON
☆265 · Sep 26, 2024 · Updated last year
UW-xDD / table-extract
Locate and extract tables and figures in PDFs
☆43 · Mar 19, 2021 · Updated 5 years ago
datadesk / web-map-maker
Use Natural Earth and OpenStreetMap data to export an image or a vector file.
☆98 · Jun 24, 2021 · Updated 5 years ago
DerwenAI / pytextrank
Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
☆2,218 · Jun 24, 2026 · Updated 3 weeks ago
tabulapdf / tabula
Tabula is a tool for liberating data tables trapped inside PDF files
☆7,446 · Mar 14, 2025 · Updated last year
DavidWBressler / adaptivesoftmax
☆12 · Nov 25, 2018 · Updated 7 years ago
stanfordnlp / stanza
Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
☆7,845 · Updated this week
mindee / doctr
docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning. Ongo…
☆6,186 · Updated this week
Layout-Parser / layout-parser
A Unified Toolkit for Deep Learning Based Document Image Analysis
☆5,763 · Aug 15, 2024 · Updated last year
invoice-x / invoice2data
Extract structured data from PDF invoices
☆2,178 · Jul 14, 2026 · Updated last week
ecatkins / xpdf_python
Python wrapper for xpdf
☆19 · Nov 28, 2019 · Updated 6 years ago
timClicks / slate
The simplest way to extract text from PDFs in Python
☆427 · Jul 7, 2022 · Updated 4 years ago