pd3f / pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
☆314 · Updated last year
Alternatives and similar repositories for pd3f:
Users that are interested in pd3f are comparing it to the libraries listed below
- Software that makes labeling PDFs easy. ☆410 · Updated 11 months ago
- Python library to extract tabular data from images and scanned PDFs ☆277 · Updated 8 months ago
- Document Layout Analysis ☆365 · Updated this week
- PDF to XML ALTO file converter ☆236 · Updated this week
- An easy way to extract information from documents ☆1,748 · Updated last year
- Logical structure analysis for visually structured documents ☆89 · Updated 2 years ago
- 📄 ⚙️ ETL processes for medical and scientific papers ☆384 · Updated 3 months ago
- Document Layout Analysis resources repos for development with PdfPig. ☆611 · Updated last year
- A high performance bibliographic information service: https://biblio-glutton.readthedocs.io ☆136 · Updated 7 months ago
- The scripts for training Detectron2-based Layout Models on popular layout analysis datasets ☆209 · Updated last year
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆389 · Updated 8 months ago
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆74 · Updated 3 years ago
- ☆357 · Updated last year
- `pdfstructure` detects, splits and organizes the documents text content into its natural structure as envisioned by the author. ☆103 · Updated last year
- Python bindings to PDFium ☆558 · Updated this week
- Extract structured text from pdfs quickly ☆464 · Updated last month
- A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip! ☆284 · Updated last year
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆175 · Updated 2 years ago
- Extracting Semi-Structured Data from PDFs on a large scale ☆51 · Updated 2 years ago
- multimodal document analysis ☆164 · Updated 10 months ago
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆161 · Updated 2 years ago
- howdoi.ai ☆255 · Updated last year
- 📚 Process PDFs, Word documents and more with spaCy ☆551 · Updated last month
- Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON) ☆363 · Updated last year
- Repository for the EM German Model ☆109 · Updated last year
- A Dataset of German Legal Documents for Named Entity Recognition ☆166 · Updated 2 years ago
- This project aims to extract text from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging t… ☆30 · Updated 2 months ago
- OCR engine for all the languages ☆813 · Updated last week
- Python client for GROBID Web services ☆321 · Updated last month
- Developer APIs to Accelerate LLM Projects ☆1,632 · Updated 6 months ago