HazyResearch/pdftotree

HazyResearch / pdftotree

A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.

☆460

Alternatives and similar repositories for pdftotree

Users that are interested in pdftotree are comparing it to the libraries listed below.

HazyResearch / fonduer
A knowledge base construction engine for richly formatted data
☆412 · Jun 23, 2021 · Updated 5 years ago
HazyResearch / fonduer-tutorials
A collection of simple tutorials for using Fonduer
☆101 · Oct 27, 2020 · Updated 5 years ago
pdfminer / pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
☆7,002 · Mar 13, 2026 · Updated 4 months ago
kermitt2 / pdfalto
PDF to XML ALTO file converter
☆272 · Updated this week
atlanhq / camelot
Camelot: PDF Table Extraction for Humans
☆3,716 · Jan 5, 2023 · Updated 3 years ago
grobidOrg / grobid
A machine learning software for extracting information from scholarly documents
☆5,005 · Updated this week
dpapathanasiou / pdfminer-layout-scanner
A more complete example of programming with PDFMiner, which continues where the default documentation stops
☆216 · Dec 3, 2019 · Updated 6 years ago
xigt / freki
Analyze XML extracted from PDFs (e.g. from TET or PDFMiner)
☆20 · Jan 11, 2018 · Updated 8 years ago
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
☆10,570 · Updated this week
DS3Lab / DocParser
☆83 · Apr 12, 2022 · Updated 4 years ago
jcushman / pdfquery
A fast and friendly PDF scraping library.
☆781 · Oct 17, 2023 · Updated 2 years ago
ExtractTable / ExtractTable-py
Python library to extract tabular data from images and scanned PDFs
☆285 · Jul 30, 2024 · Updated last year
Layout-Parser / layout-parser
A Unified Toolkit for Deep Learning Based Document Image Analysis
☆5,763 · Aug 15, 2024 · Updated last year
tamirhassan / dataset-tools
Java command-line tools for comparing results to ground truth for table location and structure detection as used in the ICDAR 2013 Table …
☆33 · May 31, 2020 · Updated 6 years ago
HazyResearch / babble
A system for generating training labels via natural language explanations
☆146 · Jun 24, 2019 · Updated 7 years ago
lquirosd / P2PaLA
Page to PAGE Layout Analysis Tool
☆192 · Jan 17, 2022 · Updated 4 years ago
HazyResearch / reef
Automatically labeling training data
☆108 · Jan 8, 2019 · Updated 7 years ago
camelot-dev / camelot
A Python library to extract tabular data from PDFs
☆3,786 · Updated this week
HazyResearch / TreeStructure
Table Extraction Tool
☆90 · Feb 28, 2018 · Updated 8 years ago
chezou / tabula-py
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
☆2,315 · Dec 5, 2024 · Updated last year
madhav1ag / CDeCNet
CDeC-Net: Composite Deformable Cascade Network for Table Detection in Document Images
☆134 · Sep 11, 2025 · Updated 10 months ago
tfmorris / pdf2table
PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz
☆40 · Mar 15, 2024 · Updated 2 years ago
Layout-Parser / layout-model-training
The scripts for training Detectron2-based Layout Models on popular layout analysis datasets
☆220 · Sep 26, 2023 · Updated 2 years ago
jalan / pdftotext
☆1,063 · Jun 28, 2026 · Updated 3 weeks ago
doc-analysis / TableBank
TableBank: A Benchmark Dataset for Table Detection and Recognition
☆1,080 · Aug 12, 2024 · Updated last year
euske / pdfminer
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
☆5,283 · Dec 7, 2022 · Updated 3 years ago
blester125 / iobes
Tool for parsing and converting various span encoding schemes.
☆23 · Jan 13, 2024 · Updated 2 years ago
allenai / pdffigures2
Given a scholarly PDF, extract figures, tables, captions, and section titles.
☆750 · Mar 10, 2024 · Updated 2 years ago
CODAIT / Identifying-Incorrect-Labels-In-CoNLL-2003
Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.
☆12 · May 11, 2021 · Updated 5 years ago
eLifePathways / sciencebeam-parser
A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools…
☆297 · Jul 8, 2026 · Updated last week
mawanda-jun / TableTrainNet
Table recognition inside documents using neural networks
☆92 · Sep 11, 2018 · Updated 7 years ago
kba / hocr-spec
The hOCR Embedded OCR Workflow and Output Format
☆74 · Aug 12, 2024 · Updated last year
allenai / vila
Incorporating VIsual LAyout Structures for Scientific Text Classification
☆180 · Mar 18, 2023 · Updated 3 years ago
ocropus / hocr-tools
Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.
☆416 · Aug 10, 2024 · Updated last year
ismail-mebsout / Parsing-PDFs-using-YOLOV3
Parsing pdf tables using YOLOV3
☆120 · Jun 25, 2026 · Updated 3 weeks ago
cellsrg / tabbypdf
A tool for extracting arbitrary tables from untagged PDF documents
☆40 · Jan 8, 2021 · Updated 5 years ago
UW-xDD / table-extract
Locate and extract tables and figures in PDFs
☆43 · Mar 19, 2021 · Updated 5 years ago
WZBSocialScienceCenter / pdftabextract
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
☆2,255 · Jun 24, 2022 · Updated 4 years ago
sachinraja13 / TabStructNet
☆132 · Mar 24, 2023 · Updated 3 years ago