An index of PDF-centric corpora
☆163Jul 4, 2025Updated 8 months ago
Alternatives and similar repositories for pdf-corpora
Users who are interested in pdf-corpora are comparing it to the repositories listed below
- A vendor- and implementation-independent specification-derived, machine-readable model of PDF.☆99Feb 21, 2026Updated 2 weeks ago
- Targeted PDFs demonstrating commonly seen PDF differentials and interoperability issues☆13May 8, 2025Updated 9 months ago
- Industry-based resolutions for issues and errata reported against any PDF-related specification☆86Feb 21, 2026Updated 2 weeks ago
- PDF Name Registry☆22Jan 16, 2026Updated last month
- veraPDF test corpus for ISO 19005 (PDF/A) and ISO 14289 (PDF/UA)☆85Feb 9, 2026Updated 3 weeks ago
- Repository for "Towards Robust Named Entity Recognition for Historic German"☆18Dec 11, 2020Updated 5 years ago
- Named entity recognition for the legal domain☆43Jun 1, 2021Updated 4 years ago
- A fork of the disktype disk and disk image format detection tool☆11Nov 16, 2016Updated 9 years ago
- ☆14Jan 3, 2024Updated 2 years ago
- Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and te…☆44Jan 18, 2024Updated 2 years ago
- Viewer for text datasets in formats like HuggingFace, JSONL, etc.☆15Feb 25, 2025Updated last year
- A Python library for defining rule-based overrides on messy data☆18Nov 24, 2025Updated 3 months ago
- Node.js port of EtherCalc with widgets☆10Nov 11, 2017Updated 8 years ago
- GloSAT Historical Measurement Table Dataset☆11Dec 3, 2025Updated 3 months ago
- Wrapper around hfsutils to generate DFXML for HFS-formatted disk images☆11Apr 20, 2018Updated 7 years ago
- Win32 Differential Update Library☆14Dec 30, 2019Updated 6 years ago
- Win16 Display Calculator☆11Dec 19, 2018Updated 7 years ago
- A persistent repository for PRONOM Research Week activities☆12May 26, 2021Updated 4 years ago
- Repository for revision of PREMIS OWL ontology group☆13May 12, 2022Updated 3 years ago
- Node.js library for the OpenAI API☆16Jun 23, 2023Updated 2 years ago
- This software (prototype) extracts values of Excel spreadsheet properties and calculates a tentative spreadsheet complexity assessment ba…☆13Dec 13, 2022Updated 3 years ago
- Write Datasette canned queries as plain SQL files☆14Jul 2, 2022Updated 3 years ago
- AFL source code analysis☆13Aug 9, 2018Updated 7 years ago
- Extract metadata from a video to an sqlite database☆20May 23, 2024Updated last year
- Django app for managing PREMIS Events☆14Feb 26, 2026Updated last week
- Tools for Creating Universal Numeric Fingerprints for Data☆22Apr 12, 2022Updated 3 years ago
- A tool for detecting viruses and NSFW material in WARC files☆17Dec 16, 2025Updated 2 months ago
- Digital Forensics Essentials (DFE)☆13Mar 18, 2024Updated last year
- A package dedicated for running benchmark agreement testing☆17Sep 18, 2025Updated 5 months ago
- "checkit_tiff" is an incredibly fast conformance checker for baseline TIFFs (with various extensions)☆14Nov 30, 2021Updated 4 years ago
- ISCC - Software Development Kit☆19Updated this week
- Rank items in weighted lists☆18Jun 29, 2025Updated 8 months ago
- Test files for conformance testing and benchmarking Jpylyzer.☆18Apr 2, 2024Updated last year
- CTE: Contextualized Table Extraction Dataset☆17Feb 23, 2023Updated 3 years ago
- The most relaxed testing framework of Kubernetes in the world☆15Nov 26, 2019Updated 6 years ago
- A curated list of helpful information on open data☆20Jun 12, 2022Updated 3 years ago
- An auto-scoring capture-the-flag game focusing on TOCTOU vulnerabilities☆21Oct 28, 2020Updated 5 years ago
- A machine learning model explainer that works on top of Apache Spark☆18Jul 14, 2017Updated 8 years ago
- A Python scraping module, that extracts text from articles found in RSS feeds. Uses SQLite as database.☆20Jul 5, 2024Updated last year