pd3f / pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
☆321 Updated last year
Alternatives and similar repositories for pd3f
Users who are interested in pd3f compare it to the libraries listed below.
- Document Layout Analysis ☆376 Updated last week
- Logical structure analysis for visually structured documents ☆90 Updated 2 years ago
- Software that makes labeling PDFs easy. ☆415 Updated last year
- Python library to extract tabular data from images and scanned PDFs ☆278 Updated 10 months ago
- 📄 ⚙️ ETL processes for medical and scientific papers ☆385 Updated last week
- Demos, examples and utilities using PyMuPDF ☆664 Updated 11 months ago
- Extract structured text from pdfs quickly ☆497 Updated last week
- PDF to XML ALTO file converter ☆242 Updated last week
- Extracting Semi-Structured Data from PDFs on a large scale ☆52 Updated 2 years ago
- ☆369 Updated last year
- 📚 Process PDFs, Word documents and more with spaCy ☆644 Updated 3 months ago
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆349 Updated 2 years ago
- `pdfstructure` detects, splits and organizes the document's text content into its natural structure as envisioned by the author. ☆104 Updated last year
- A python based HTML to text conversion library, command line client and Web service. ☆311 Updated 3 weeks ago
- A post-processing tool for scanned sheets of paper. ☆1,083 Updated 11 months ago
- A Repo For Document AI ☆2,859 Updated this week
- A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip! ☆290 Updated 3 weeks ago
- The scripts for training Detectron2-based Layout Models on popular layout analysis datasets ☆211 Updated last year
- Document Layout Analysis resources repos for development with PdfPig. ☆619 Updated last year
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆74 Updated 3 years ago
- Scripts and results from our OCR roundup, available on Source ☆150 Updated 6 years ago
- Benchmarking PDF libraries ☆287 Updated last year
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆68 Updated 4 years ago
- This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats. ☆1,246 Updated 2 months ago
- Master repository which includes most other OCR-D repositories as submodules ☆73 Updated last month
- A curated list of resources for Document Understanding (DU) topic ☆1,423 Updated 2 years ago
- Python bindings to PDFium ☆585 Updated last week
- Lightweight, performant, deep table extraction ☆473 Updated 3 weeks ago
- A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books. ☆187 Updated 3 weeks ago
- Repository for the EM German Model ☆109 Updated last year