pd3f / pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
☆319Updated last year
Alternatives and similar repositories for pd3f
Users who are interested in pd3f are comparing it to the libraries listed below.
- Software that makes labeling PDFs easy.☆416Updated last year
- PDF to XML ALTO file converter☆238Updated this week
- Document Layout Analysis☆376Updated 2 weeks ago
- Extract structured text from PDFs quickly☆481Updated this week
- Python library to extract tabular data from images and scanned PDFs☆278Updated 10 months ago
- 📄 ⚙️ ETL processes for medical and scientific papers☆384Updated last month
- The scripts for training Detectron2-based Layout Models on popular layout analysis datasets☆211Updated last year
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac…☆280Updated 2 weeks ago
- Custom recipe and utilities for document processing☆199Updated 2 years ago
- A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Uses only open source tools. Please tip!☆288Updated last week
- Document Layout Analysis resources repos for development with PdfPig.☆616Updated last year
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis☆344Updated 2 years ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.☆392Updated 9 months ago
- ☆367Updated last year
- A web-based document annotation tool, powered by GPT-4☆260Updated last year
- Python client for GROBID Web services☆330Updated 2 months ago
- Annotate entities directly onto a PDF with automatic OCR for scanned PDFs☆60Updated 2 years ago
- This project aims to extract text from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging t…☆33Updated 3 months ago
- A basic tool that extracts the structure from the PDF files of scientific articles.☆74Updated 3 years ago
- A high performance bibliographic information service: https://biblio-glutton.readthedocs.io☆139Updated 8 months ago
- ☆183Updated last week
- Logical structure analysis for visually structured documents☆89Updated 2 years ago
- A curated list of resources for Document Understanding (DU) topic☆1,408Updated 2 years ago
- Python bindings to PDFium☆578Updated this week
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip…☆70Updated this week
- Parsing PDF tables using YOLOv3☆117Updated 4 years ago
- OCR engine for all the languages☆832Updated this week
- A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.☆448Updated last year
- A Prodigy plugin for PDF annotation☆30Updated 2 months ago
- ✨ Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3☆322Updated last year