An index of PDF-centric corpora
☆166 · Jul 4, 2025 · Updated 8 months ago
Alternatives and similar repositories for pdf-corpora
Users who are interested in pdf-corpora are comparing it to the libraries listed below.
- A vendor- and implementation-independent specification-derived, machine-readable model of PDF. ☆102 · Mar 20, 2026 · Updated last week
- Industry-based resolutions for issues and errata reported against any PDF-related specification ☆87 · Mar 13, 2026 · Updated 2 weeks ago
- Targeted PDFs demonstrating commonly seen PDF differentials and interoperability issues ☆13 · Mar 20, 2026 · Updated last week
- PDF 2.0 example files ☆108 · Jan 13, 2025 · Updated last year
- Digital preservation policies and strategies ☆12 · Mar 29, 2024 · Updated last year
- Code for processing and archiving emails ☆18 · Jul 30, 2024 · Updated last year
- ☆37 · Jan 21, 2026 · Updated 2 months ago
- ☆14 · Jan 3, 2024 · Updated 2 years ago
- ☆10 · Apr 28, 2025 · Updated 10 months ago
- A persistent repository for PRONOM Research Week activities ☆12 · May 26, 2021 · Updated 4 years ago
- An interpretation of the content stream structure as described by ISO 32000 ☆12 · Jan 31, 2023 · Updated 3 years ago
- A tool for detecting viruses and NSFW material in WARC files ☆18 · Dec 16, 2025 · Updated 3 months ago
- Digital Forensics Essentials (DFE) ☆13 · Mar 18, 2024 · Updated 2 years ago
- Django app for managing PREMIS Events ☆14 · Mar 9, 2026 · Updated 2 weeks ago
- An openly-licensed corpus of small example files, covering a wide range of formats and creation tools. ☆202 · Feb 16, 2026 · Updated last month
- This software (prototype) extracts values of Excel spreadsheet properties and calculates a tentative spreadsheet complexity assessment ba… ☆13 · Dec 13, 2022 · Updated 3 years ago
- Prototype wikidata portal project. ☆10 · May 3, 2024 · Updated last year
- Wrapper around hfsutils to generate DFXML for HFS-formatted disk images ☆11 · Apr 20, 2018 · Updated 7 years ago
- Write Datasette canned queries as plain SQL files ☆14 · Jul 2, 2022 · Updated 3 years ago
- Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and te… ☆44 · Jan 18, 2024 · Updated 2 years ago
- This repository is a concise collection of well known deep learning based document binarization models. ☆29 · Dec 24, 2022 · Updated 3 years ago
- VSCode extension for understanding and learning PDF syntax ☆33 · Mar 18, 2026 · Updated last week
- Format Identification for Digital Objects (FIDO) is a Python command-line tool to identify the file formats of digital objects. It is des… ☆160 · Jan 22, 2026 · Updated 2 months ago
- Named entity recognition for the legal domain ☆43 · Jun 1, 2021 · Updated 4 years ago
- XML Schema for Digital Forensics XML ☆35 · Feb 7, 2025 · Updated last year
- Tools for Creating Universal Numeric Fingerprints for Data ☆22 · Apr 12, 2022 · Updated 3 years ago
- Data and code: "Answering legal questions from laymen in German civil law system", Büttner & Habernal, EACL'24 ☆14 · Mar 2, 2024 · Updated 2 years ago
- Library and tools to access optical disc (split) RAW image files (bin/cue, iso/cue) ☆27 · Dec 24, 2025 · Updated 3 months ago
- A Python scraping module that extracts text from articles found in RSS feeds. Uses SQLite as its database. ☆20 · Jul 5, 2024 · Updated last year
- A smart file fuzzer. ☆26 · Aug 19, 2016 · Updated 9 years ago
- Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents" ☆37 · Jul 13, 2023 · Updated 2 years ago
- Analyze and help extract older "hidden" versions of a PDF from the current PDF. ☆100 · Sep 10, 2022 · Updated 3 years ago
- Manuals, lexica, OCR test data for PoCoTo and the profiler ☆15 · Jul 2, 2021 · Updated 4 years ago
- Fast PDF generation and compression. Deals with millions of pages daily. ☆137 · Mar 2, 2026 · Updated 3 weeks ago
- 📜 Dehyphenation of broken text (mainly German), i.e., extracted from a PDF ☆39 · Mar 8, 2022 · Updated 4 years ago
- Command line $MFT record decoder ☆12 · May 20, 2017 · Updated 8 years ago
- A curated list of helpful information about open data ☆20 · Jun 12, 2022 · Updated 3 years ago
- This repository collects proposals to enhance Schematron beyond the ISO specification ☆10 · Feb 8, 2022 · Updated 4 years ago
- Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets. ☆161 · Apr 3, 2024 · Updated last year