pdf-association / pdf-corpora
An index of PDF-centric corpora
☆161 · Jul 4, 2025 · Updated 7 months ago
Alternatives and similar repositories for pdf-corpora
Users who are interested in pdf-corpora compare it to the repositories listed below.
- Repository for "Towards Robust Named Entity Recognition for Historic German" ☆18 · Dec 11, 2020 · Updated 5 years ago
- ☆14 · Jan 3, 2024 · Updated 2 years ago
- ☆10 · Apr 28, 2025 · Updated 9 months ago
- Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and te… ☆44 · Jan 18, 2024 · Updated 2 years ago
- multimodal document analysis ☆166 · Nov 13, 2025 · Updated 3 months ago
- Viewer for text datasets in formats like HuggingFace, JSONL, etc. ☆15 · Feb 25, 2025 · Updated 11 months ago
- Node.js port of EtherCalc with widgets ☆10 · Nov 11, 2017 · Updated 8 years ago
- GloSAT Historical Measurement Table Dataset ☆11 · Dec 3, 2025 · Updated 2 months ago
- This software (prototype) extracts values of Excel spreadsheet properties and calculates a tentative spreadsheet complexity assessment ba… ☆13 · Dec 13, 2022 · Updated 3 years ago
- AFL source code analysis ☆13 · Aug 9, 2018 · Updated 7 years ago
- Repository for revision of PREMIS OWL ontology group ☆13 · May 12, 2022 · Updated 3 years ago
- Node.js library for the OpenAI API ☆16 · Jun 23, 2023 · Updated 2 years ago
- Wrapper around hfsutils to generate DFXML for HFS-formatted disk images ☆11 · Apr 20, 2018 · Updated 7 years ago
- A persistent repository for PRONOM Research Week activities ☆12 · May 26, 2021 · Updated 4 years ago
- Win32 Differential Update Library ☆14 · Dec 30, 2019 · Updated 6 years ago
- Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes, break the export into its components and store … ☆31 · Dec 17, 2025 · Updated last month
- Django app for managing PREMIS Events ☆14 · Updated this week
- Tools for Creating Universal Numeric Fingerprints for Data ☆22 · Apr 12, 2022 · Updated 3 years ago
- Digital Forensics Essentials (DFE) ☆13 · Mar 18, 2024 · Updated last year
- A tool for detecting viruses and NSFW material in WARC files ☆17 · Dec 16, 2025 · Updated last month
- A visual tool to interpret and understand PyTorch machine learning models ☆17 · Feb 11, 2024 · Updated 2 years ago
- A package dedicated for running benchmark agreement testing ☆17 · Sep 18, 2025 · Updated 4 months ago
- Manuals, lexica, OCR test data for PoCoTo and the profiler ☆15 · Jul 2, 2021 · Updated 4 years ago
- ISCC - Software Development Kit ☆19 · Sep 15, 2025 · Updated 4 months ago
- Rank items in weighted lists ☆18 · Jun 29, 2025 · Updated 7 months ago
- The most relaxed testing framework of Kubernetes in the world ☆15 · Nov 26, 2019 · Updated 6 years ago
- Build visualizations live! ☆22 · Jan 5, 2023 · Updated 3 years ago
- A single header open/save dialog box with local/googledrive/onedrive WinRT interface for Windows ☆17 · Jun 2, 2019 · Updated 6 years ago
- A Python scraping module, that extracts text from articles found in RSS feeds. Uses SQLite as database. ☆20 · Jul 5, 2024 · Updated last year
- A machine learning model explainer that works on top of Apache Spark ☆18 · Jul 14, 2017 · Updated 8 years ago
- OCR Annotations from Amazon Textract for Industry Documents Library ☆103 · Aug 20, 2022 · Updated 3 years ago
- 📜 Dehyphenation of broken text (mainly German), i.e., extracted from a PDF ☆39 · Mar 8, 2022 · Updated 3 years ago
- Python bindings to PDFium, reasonably cross-platform. ☆721 · Updated this week
- a dataset for camera-based table detection ☆16 · Jul 30, 2021 · Updated 4 years ago
- a makefile linter ☆20 · Sep 19, 2016 · Updated 9 years ago
- Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task… ☆287 · Feb 13, 2023 · Updated 3 years ago
- Efficient hOCR tooling ☆55 · Aug 18, 2025 · Updated 5 months ago
- Presentations, tutorials and data for the OCR workshop at LMU ☆16 · Jun 2, 2017 · Updated 8 years ago
- SCAlable Preservation Environments ☆40 · Jul 7, 2022 · Updated 3 years ago