A tool for converting PDF into hOCR, with text, tables, and figures recognized and preserved.
☆459 · Aug 3, 2023 · Updated 2 years ago
Alternatives and similar repositories for pdftotree
Users who are interested in pdftotree are comparing it to the libraries listed below.
- A knowledge base construction engine for richly formatted data ☆412 · Jun 23, 2021 · Updated 4 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆216 · Dec 3, 2019 · Updated 6 years ago
- A collection of simple tutorials for using Fonduer ☆101 · Oct 27, 2020 · Updated 5 years ago
- Community maintained fork of pdfminer - we fathom PDF ☆6,909 · Feb 24, 2026 · Updated last week
- A machine learning software for extracting information from scholarly documents ☆4,670 · Updated this week
- Camelot: PDF Table Extraction for Humans ☆3,716 · Jan 5, 2023 · Updated 3 years ago
- Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables. ☆9,800 · Jan 28, 2026 · Updated last month
- ☆82 · Apr 12, 2022 · Updated 3 years ago
- The scripts for training Detectron2-based Layout Models on popular layout analysis datasets ☆218 · Sep 26, 2023 · Updated 2 years ago
- Table Extraction Tool ☆90 · Feb 28, 2018 · Updated 8 years ago
- A fast and friendly PDF scraping library. ☆783 · Oct 17, 2023 · Updated 2 years ago
- Java command-line tools for comparing results to ground truth for table location and structure detection as used in the ICDAR 2013 Table … ☆33 · May 31, 2020 · Updated 5 years ago
- Python library to extract tabular data from images and scanned PDFs ☆284 · Jul 30, 2024 · Updated last year
- Tool for parsing and converting various span encoding schemes. ☆23 · Jan 13, 2024 · Updated 2 years ago
- A system for generating training labels via natural language explanations ☆146 · Jun 24, 2019 · Updated 6 years ago
- Page to PAGE Layout Analysis Tool ☆191 · Jan 17, 2022 · Updated 4 years ago
- extract text from any document. no muss. no fuss. ☆4,471 · Feb 4, 2026 · Updated last month
- A Python library to extract tabular data from PDFs ☆3,602 · Updated this week
- 📜 Dehyphenation of broken text (mainly German), i.e., extracted from a PDF ☆39 · Mar 8, 2022 · Updated 3 years ago
- TableBank: A Benchmark Dataset for Table Detection and Recognition ☆1,082 · Aug 12, 2024 · Updated last year
- ☆15 · Feb 5, 2019 · Updated 7 years ago