pd3f/pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

☆334

Alternatives and similar repositories for pd3f

Users who are interested in pd3f are comparing it to the libraries listed below.

slub / docsa
SLUB Document Classification and Similarity Analysis
☆10 · Aug 31, 2023 · Updated 2 years ago
opensanctions / datapatch
A Python library for defining rule-based overrides on messy data
☆18 · Nov 24, 2025 · Updated 7 months ago
guardian / giant
Platform for journalists to search, analyse, categorise and share unstructured data
☆59 · Updated this week
axa-group / Parsr
Transforms PDF, Documents and Images into Enriched Structured Data
☆6,177 · Mar 20, 2026 · Updated 3 months ago
basti-schr / eu-wahlprogramme
Machine-readable election manifestos for the 2019 European election
☆13 · May 14, 2019 · Updated 7 years ago
dbmdz / historic-ner
Repository for "Towards Robust Named Entity Recognition for Historic German"
☆18 · Dec 11, 2020 · Updated 5 years ago
mysociety / bluetail
An alpha project combining beneficial ownership and contracting data
☆13 · Jun 9, 2021 · Updated 5 years ago
alephdata / ingest-file
Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
☆66 · Dec 19, 2025 · Updated 6 months ago
gambolputty / newscorpus
A Python scraping module that extracts text from articles found in RSS feeds. Uses SQLite as its database.
☆20 · Jul 5, 2024 · Updated 2 years ago
openredact / openredact-app
This is a prototype of a semi-automatic data anonymization app for German documents. ➡️ The project has moved to: https://gitlab.opencode…
☆24 · Mar 20, 2026 · Updated 3 months ago
alephdata / pdflib
Binary Python bindings for poppler utils for content extraction
☆42 · May 12, 2021 · Updated 5 years ago
pedrohavay / followthemoney
A Go port of FollowTheMoney (FtM), a pragmatic data model for people, companies, assets, relationships and documents used in investigati…
☆22 · Sep 8, 2025 · Updated 10 months ago
alephdata / languagecodes
A Python helper library to convert between ISO 639 two- and three-letter codes.
☆11 · Nov 13, 2024 · Updated last year
codeforberlin / tickets
Collecting good beginner tasks and project ideas.
☆16 · Apr 23, 2018 · Updated 8 years ago
okfde / offeneregister.de
OffeneRegister.de – Open data for the commercial register (Handelsregister)
☆36 · Feb 2, 2026 · Updated 5 months ago
UB-Mannheim / reichsanzeiger-nlp
Reichsanzeiger-NLP: NER/NEL corpus for the German historical newspaper "Deutscher Reichsanzeiger und Preußischer Staatsanzeiger" (1819–19…
☆16 · Oct 18, 2024 · Updated last year
opensanctions / storyweb
Extract networks of entities from journalistic reporting
☆49 · Jul 17, 2023 · Updated 2 years ago
okfde / transparenzranking.de
Transparenzranking.de compares all of Germany's transparency regulations
☆12 · Mar 26, 2026 · Updated 3 months ago
CivOmega / civomega
Ask questions about government data.
☆38 · Jan 17, 2019 · Updated 7 years ago
opensanctions / qarin
How can we improve name matching in screening tools?
☆16 · Aug 13, 2025 · Updated 10 months ago
opensemanticsearch / open-semantic-etl
Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & N…
☆280 · Oct 9, 2022 · Updated 3 years ago
deepdoctection / deepdoctection
A Repo For Document AI
☆3,186 · Jun 20, 2026 · Updated 2 weeks ago
balzer82 / PegidaSprache
Analysis of the Pegida Facebook corpus
☆10 · Jan 31, 2015 · Updated 11 years ago
ndrplz / eurlex-toolbox
Python toolbox to load, parse and process Official Journals of the European Union (EU).
☆24 · May 3, 2024 · Updated 2 years ago
janlelis / characteristics
Character info under different encodings
☆27 · Sep 12, 2025 · Updated 9 months ago
VRI-UFPR / ocrd-gbn
OCR-D compliant toolset for optical layout recognition on historical German-language documents published in Brazil
☆11 · Sep 24, 2021 · Updated 4 years ago
n-waves / ulmfit4de
ULMFiT Method for German Language
☆15 · May 10, 2019 · Updated 7 years ago
DARIAH-DE / DARIAH-DKPro-Wrapper
Wrapper for DKPro Core to extract linguistic information from books.
☆16 · Feb 26, 2022 · Updated 4 years ago
OCR-D / ocrd_all
Master repository which includes most other OCR-D repositories as submodules
☆73 · Jul 4, 2025 · Updated last year
openredact / anonymizer
A Python module that provides multiple anonymization techniques for text (This is only a prototype) ➡️ The project has moved to: https://…
☆26 · Mar 20, 2026 · Updated 3 months ago
sdockray / dat-syllabus
Peer-to-peer markdown syllabus platform for Beaker Browser.
☆14 · Dec 11, 2017 · Updated 8 years ago
LanguageMachines / LuigiNLP
A workflow system for Natural Language Processing.
☆21 · Oct 17, 2019 · Updated 6 years ago
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables.
☆10,513 · Jun 17, 2026 · Updated 3 weeks ago
mrapp-ke / Boomer
A scikit-learn implementation of BOOMER - An Algorithm for Learning Gradient Boosted Multi-label Classification Rules
☆21 · Mar 27, 2024 · Updated 2 years ago
opensanctions / fingerprints
Now included in rigour
☆150 · Nov 24, 2025 · Updated 7 months ago
intranda / goobi-viewer-core
Goobi viewer - Presentation software for digital libraries, museums, archives and galleries. Open Source.
☆27 · Jul 1, 2026 · Updated last week
openlegaldata / legal-reference-extraction
Legal Reference Extraction
☆49 · Jun 15, 2026 · Updated 3 weeks ago
ICIJ / datashare-installer
☆12 · Updated this week
stefan-it / europeana-bert
BERT and ELECTRA models trained on Europeana Newspapers
☆39 · Dec 14, 2021 · Updated 4 years ago