datalab-to/pdftext

Extract structured text from pdfs quickly

☆708

Alternatives and similar repositories for pdftext

Users who are interested in pdftext are comparing it to the libraries listed below.

datalab-to / surya
OCR, layout analysis, reading order, table recognition in 90+ languages
☆21,149 · Updated this week
VikParuchuri / texify
Math OCR model that outputs LaTeX and markdown
☆1,127 · Jan 29, 2025 · Updated last year
pypdfium2-team / pypdfium2
Python bindings to PDFium, reasonably cross-platform.
☆801 · Updated this week
datalab-to / marker
Convert PDF to markdown + JSON quickly with high accuracy
☆37,843 · Updated this week
VikParuchuri / tabled
Detect and extract tables to markdown and csv
☆748 · Jan 24, 2025 · Updated last year
jimexist / surya-rs
Rust implementation of Surya
☆66 · Mar 1, 2025 · Updated last year
VikParuchuri / textbook_quality
Generate textbook-quality synthetic LLM pretraining data
☆508 · Oct 19, 2023 · Updated 2 years ago
teknium1 / ShareGPT-Builder
☆126 · Dec 18, 2024 · Updated last year
cohere-ai / BinaryVectorDB
Efficient vector database for hundred millions of embeddings.
☆215 · May 17, 2024 · Updated 2 years ago
Unstructured-IO / unstructured
Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean…
☆15,197 · Updated this week
AnswerDotAI / rerankers
A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models.
☆1,626 · Dec 20, 2025 · Updated 7 months ago
AnswerDotAI / RAGatouille
Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-…
☆3,943 · May 17, 2025 · Updated last year
opendatalab / PDF-Extract-Kit
A Comprehensive Toolkit for High-Quality PDF Content Extraction
☆9,806 · Jan 3, 2025 · Updated last year
Filimoa / open-parse
Improved file parsing for LLM’s
☆3,162 · May 17, 2026 · Updated 2 months ago
swyxio / openlangmem
☆47 · Apr 6, 2024 · Updated 2 years ago
pymupdf / PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
☆10,315 · Updated this week
AnswerDotAI / byaldi
Use late-interaction multi-modal models such as ColPali in just a few lines of code.
☆851 · Jan 28, 2025 · Updated last year
ai8hyf / TF-ID
TF-ID: Table/Figure IDentifier for academic papers
☆247 · Jul 12, 2024 · Updated 2 years ago
VikParuchuri / zero_to_gpt
Go from no deep learning knowledge to implementing GPT.
☆1,308 · May 30, 2024 · Updated 2 years ago
docling-project / docling-parse
Simple package to extract text with coordinates from programmatic PDFs
☆326 · Updated this week
facebookresearch / nougat
Implementation of Nougat Neural Optical Understanding for Academic Documents
☆10,050 · Feb 21, 2025 · Updated last year
enoch3712 / ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
☆1,586 · Aug 27, 2025 · Updated 10 months ago
microsoft / table-transformer
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the o…
☆2,931 · Jun 24, 2024 · Updated 2 years ago
davanstrien / awesome-synthetic-datasets
awesome synthetic (text) datasets
☆335 · Jan 8, 2026 · Updated 6 months ago
allenai / olmocr
Toolkit for linearizing PDFs for LLM datasets/training
☆19,182 · Mar 25, 2026 · Updated 4 months ago
dottxt-ai / outlines
Structured Outputs
☆15,331 · Updated this week
migtissera / Sensei
Generate Synthetic Data Using OpenAI, MistralAI or AnthropicAI
☆221 · Apr 29, 2024 · Updated 2 years ago
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
☆10,593 · Updated this week
docling-project / docling-ibm-models
☆207 · Updated this week
pymupdf / pymupdf4llm
PyMuPDF4LLM
☆2,021 · Updated this week
xhluca / bm25s
Fast BM25 search in Python, powered by Numpy and Numba
☆1,746 · Updated this week
katanaml / sparrow
Structured data extraction, instruction calling and agentic workflows with ML, LLM and Vision LLM
☆5,185 · Jun 30, 2026 · Updated 3 weeks ago
poloclub / unitable
UniTable: Towards a Unified Table Foundation Model
☆534 · Apr 21, 2026 · Updated 3 months ago
Ucas-HaoranWei / GOT-OCR2.0
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
☆8,158 · Feb 10, 2025 · Updated last year
opendatalab / DocLayout-YOLO
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
☆2,235 · Apr 14, 2025 · Updated last year
cohere-ai / cohere-terrarium
A simple Python sandbox for helpful LLM data agents
☆312 · Apr 22, 2026 · Updated 3 months ago
QuivrHQ / MegaParse
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
☆7,407 · Feb 21, 2025 · Updated last year
mindee / doctr
docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning. Ongo…
☆6,190 · Updated this week
qdrant / fastembed
Fast, Accurate, Lightweight Python library to make State of the Art Embedding
☆3,104 · Updated this week