pd3f / pd3f
PDF text extraction pipeline: self-hosted, local-first, Docker-based
☆304 · Updated last year
Alternatives and similar repositories for pd3f:
Users who are interested in pd3f compare it to the libraries listed below.
- Software that makes labeling PDFs easy. ☆402 · Updated 8 months ago
- Python library to extract tabular data from images and scanned PDFs ☆270 · Updated 5 months ago
- Document Layout Analysis ☆359 · Updated 3 weeks ago
- PDF to XML ALTO file converter ☆222 · Updated last week
- ETL processes for medical and scientific papers ☆372 · Updated last week
- A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved. ☆435 · Updated last year
- A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip! ☆279 · Updated 11 months ago
- Logical structure analysis for visually structured documents ☆85 · Updated 2 years ago
- A curated list of resources for Document Understanding (DU) topic ☆1,344 · Updated last year
- A python based HTML to text conversion library, command line client and Web service. ☆281 · Updated last week
- `pdfstructure` detects, splits and organizes the documents text content into its natural structure as envisioned by the author. ☆102 · Updated 9 months ago
- Python client for GROBID Web services ☆301 · Updated 2 weeks ago
- Demos, examples and utilities using PyMuPDF ☆612 · Updated 6 months ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆375 · Updated 5 months ago
- An easy way to extract information from documents ☆1,729 · Updated last year
- A Repo For Document AI ☆2,659 · Updated this week
- A high performance bibliographic information service: https://biblio-glutton.readthedocs.io ☆131 · Updated 4 months ago
- ☆340 · Updated last year
- This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats. ☆1,154 · Updated 3 months ago
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆74 · Updated 3 years ago
- PDF parser and converter to HTML ☆84 · Updated 3 months ago
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 · Updated 6 years ago
- Document Layout Analysis resources repos for development with PdfPig. ☆598 · Updated last year
- Provides OCR (Optical Character Recognition) services through web applications ☆245 · Updated 11 months ago
- The scripts for training Detectron2-based Layout Models on popular layout analysis datasets ☆204 · Updated last year
- A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools… ☆293 · Updated 2 years ago
- Open Legal Data Platform ☆102 · Updated last week
- OCR engine for all the languages ☆767 · Updated this week
- Python bindings to PDFium ☆473 · Updated this week
- A general purpose PDF text-layer redaction tool for Python 2/3. ☆186 · Updated 7 months ago