pd3f / pd3f
PDF text extraction pipeline: self-hosted, local-first, Docker-based
☆311, updated last year
Alternatives and similar repositories for pd3f:
Users who are interested in pd3f compare it to the libraries listed below.
- Python library to extract tabular data from images and scanned PDFs (☆271, updated 6 months ago)
- PDF to XML ALTO file converter (☆224, updated this week)
- Software that makes labeling PDFs easy. (☆405, updated 9 months ago)
- A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip! (☆280, updated last year)
- Document Layout Analysis (☆359, updated last month)
- A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved. (☆436, updated last year)
- (☆349, updated last year)
- This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats. (☆1,185, updated 4 months ago)
- ETL processes for medical and scientific papers (☆375, updated last month)
- Document Layout Analysis resources repos for development with PdfPig. (☆602, updated last year)
- Logical structure analysis for visually structured documents (☆86, updated 2 years ago)
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis (☆316, updated 2 years ago)
- An easy way to extract information from documents (☆1,738, updated last year)
- Extract structured text from pdfs quickly (☆418, updated this week)
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built in metadata extraction all in one pac… (☆264, updated last year)
- A repository of instructions in French to fine-tune LLMs (☆17, updated last year)
- img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing (☆651, updated last week)
- OCR engine for all the languages (☆791, updated this week)
- A basic tool that extracts the structure from the PDF files of scientific articles. (☆74, updated 3 years ago)
- ✨ Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3 (☆322, updated last year)
- Benchmarking PDF libraries (☆257, updated last year)
- A python based HTML to text conversion library, command line client and Web service. (☆290, updated last month)
- Scripts and results from our OCR roundup, available on Source (☆150, updated 6 years ago)
- The scripts for training Detectron2-based Layout Models on popular layout analysis datasets (☆205, updated last year)
- Python bindings to PDFium (☆522, updated this week)
- Extracting Semi-Structured Data from PDFs on a large scale (☆51, updated 2 years ago)
- Demos, examples and utilities using PyMuPDF (☆626, updated 7 months ago)
- (☆173, updated this week)
- Retrieval augmented generation (RAG) and language model powered search applications (☆286, updated 2 months ago)
- Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Tex… (☆998, updated last year)