HazyResearch / pdftotree
A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
☆434 Updated last year
Alternatives and similar repositories for pdftotree:
Users who are interested in pdftotree are comparing it to the libraries listed below:
- Table Detection and Extraction Using Deep Learning (It is built in Python, using Luminoth, TensorFlow<2.0 and Sonnet.) ☆198 Updated 2 years ago
- Software that makes labeling PDFs easy. ☆404 Updated 8 months ago
- PDF to XML ALTO file converter ☆222 Updated 2 weeks ago
- The scripts for training Detectron2-based Layout Models on popular layout analysis datasets ☆205 Updated last year
- Document Layout Analysis ☆359 Updated last week
- Companion code to the paper "Extracting Scientific Figures with Distantly Supervised Neural Networks" 🤖 ☆138 Updated 2 years ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆379 Updated 5 months ago
- Parsing pdf tables using YOLOV3 ☆114 Updated 3 years ago
- DocBank: A Benchmark Dataset for Document Layout Analysis ☆592 Updated 5 months ago
- A knowledge base construction engine for richly formatted data ☆408 Updated 3 years ago
- Science-parse version 2 ☆234 Updated 5 years ago
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆66 Updated 4 years ago
- Page to PAGE Layout Analysis Tool ☆191 Updated 3 years ago
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆175 Updated last year
- Python library to extract tabular data from images and scanned PDFs ☆270 Updated 5 months ago
- Library used to deskew a scanned document ☆434 Updated 3 weeks ago
- A Deep Learning Framework for Text https://delft.readthedocs.io/ ☆390 Updated 3 weeks ago
- Extract tables from scanned image PDFs using Optical Character Recognition. ☆271 Updated 4 years ago
- PDF parser and converter to HTML ☆85 Updated 3 months ago
- Fuzzy matching and more functionality for spaCy. ☆255 Updated 6 months ago
- Pre-Recognize Library - library with algorithms for improving OCR quality. ☆104 Updated last year
- 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based ☆308 Updated last year
- CORD: A Consolidated Receipt Dataset for Post-OCR Parsing ☆411 Updated 2 years ago
- Document Layout Analysis resources repos for development with PdfPig. ☆599 Updated last year
- Python port of SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm… ☆810 Updated last month
- Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community. ☆1,535 Updated 9 months ago
- Detectron2 for Document Layout Analysis ☆185 Updated 5 months ago
- Table Extraction Tool ☆90 Updated 6 years ago
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆308 Updated last year
- PYthon Automated Term Extraction ☆310 Updated last year