pdf-association/pdf-corpora

An index of PDF-centric corpora

☆183

Alternatives and similar repositories for pdf-corpora

Users interested in pdf-corpora are comparing it to the repositories listed below.

pdf-association / pdf-differences
Targeted PDFs demonstrating commonly seen PDF differentials and interoperability issues
☆15 · Mar 20, 2026 · Updated 4 months ago
pdf-association / arlington-pdf-model
A vendor- and implementation-independent specification-derived, machine-readable model of PDF.
☆108 · Jul 13, 2026 · Updated 2 weeks ago
adobe / pdf-names-list
PDF Name Registry
☆23 · Updated this week
tballison / file-observatory
Single server/laptop grade file-observatory
☆10 · Mar 30, 2023 · Updated 3 years ago
pdf-association / pdf20examples
PDF 2.0 example files
☆115 · Jan 13, 2025 · Updated last year
digipres / policies
Digital preservation policies and strategies
☆12 · Mar 29, 2024 · Updated 2 years ago
digital-preservation / PRONOM_Research
☆38 · Jun 16, 2026 · Updated last month
steffenfritz / filedriller
☆14 · Jan 3, 2024 · Updated 2 years ago
digital-preservation / pronom-research-week
A persistent repository for PRONOM Research Week activities
☆12 · May 26, 2021 · Updated 5 years ago
apache / opennlp-site
Website sources for the Apache OpenNLP website
☆10 · Updated this week
NLNZDigitalPreservation / rosetta_sip_factory
☆10 · Apr 28, 2025 · Updated last year
SLUB-digitalpreservation / checkit_tiff
"checkit_tiff" is an incredibly fast conformance checker for baseline TIFFs (with various extensions)
☆14 · Nov 30, 2021 · Updated 4 years ago
PREMIS-OWL-Revision-Team / premis-owl
Repository for revision of PREMIS OWL ontology group
☆13 · May 12, 2022 · Updated 4 years ago
natliblux / warc-safe
A tool for detecting viruses and NSFW material in WARC files
☆18 · Jul 15, 2026 · Updated 2 weeks ago
RvanVeenendaal / Spreadsheet-Complexity-Analyser
This software (prototype) extracts values of Excel spreadsheet properties and calculates a tentative spreadsheet complexity assessment ba…
☆13 · May 15, 2026 · Updated 2 months ago
AILab-UniFI / cte-dataset
CTE: Contextualized Table Extraction Dataset
☆17 · Feb 23, 2023 · Updated 3 years ago
WikiDP / wikidp-portal
Prototype wikidata portal project.
☆10 · May 3, 2024 · Updated 2 years ago
kamwoods / disktype
A fork of the disktype disk and disk image format detection tool
☆11 · Nov 16, 2016 · Updated 9 years ago
stuartemiddleton / glosat_table_dataset
GloSAT Historical Measurement Table Dataset
☆11 · Dec 3, 2025 · Updated 7 months ago
cul-it / hfs2dfxml
Wrapper around hfsutils to generate DFXML for HFS-formatted disk images
☆11 · Apr 20, 2018 · Updated 8 years ago
pdf-association / techniques-for-accessible-pdf
Techniques for Accessible PDF
☆17 · Jun 4, 2026 · Updated last month
another1024 / afl_analyses
AFL source code analysis
☆13 · Aug 9, 2018 · Updated 7 years ago
RichSu95 / Document_Binarization_Collection
This repository is a concise collection of well known deep learning based document binarization models.
☆30 · Dec 24, 2022 · Updated 3 years ago
keeps / commons-ip
Commons IP is project that provide a command-line tool and Java Library to validate and manipulate E-ARK Information Packages, so to cr…
☆15 · Jul 22, 2026 · Updated last week
openpreserve / jpylyzer-test-files
Test files for conformance testing and benchmarking Jpylyzer.
☆18 · Apr 2, 2024 · Updated 2 years ago
gambolputty / newscorpus
A Python scraping module, that extracts text from articles found in RSS feeds. Uses SQLite as database.
☆20 · Jul 5, 2024 · Updated 2 years ago
digipres / awesome-digital-preservation
Carefully curated list of awesome digital preservation resources.
☆137 · Aug 1, 2025 · Updated 11 months ago
PRImA-Research-Lab / prima-core-libs
Core libraries by the PRImA Research Lab
☆16 · Jul 30, 2024 · Updated last year
openlegaldata / legal-ner
Named entity recognition for the legal domain
☆43 · Jun 1, 2021 · Updated 5 years ago
dfxml-working-group / dfxml_schema
XML Schema for Digital Forensics XML
☆35 · Feb 7, 2025 · Updated last year
marhop / literate-binary
Integrate handcrafted binary and documentation
☆36 · Oct 20, 2025 · Updated 9 months ago
mfontanini / sloth-fuzzer
A smart file fuzzer.
☆26 · Aug 19, 2016 · Updated 9 years ago
AILab-UniFI / GNN-TableExtraction
Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"
☆37 · Jul 13, 2023 · Updated 3 years ago
iscc / iscc-sdk
ISCC - Software Development Kit
☆22 · Updated this week
furkanbiten / idl_data
OCR Annotations from Amazon Textract for Industry Documents Library
☆103 · Aug 20, 2022 · Updated 3 years ago
YebowenHu / MeetingBank-utils
☆20 · Apr 5, 2024 · Updated 2 years ago
H-Ambrose / NTable
a dataset for camera-based table detection
☆16 · Jul 30, 2021 · Updated 4 years ago
ClimSocAna / tecb-de
German Text Embedding Clustering Benchmark
☆19 · Mar 15, 2024 · Updated 2 years ago
transpose-publishing / policies-database
Database of journal policies: TRANsparency in Scholarly Publishing for Open Scholarship Evolution
☆20 · Feb 5, 2021 · Updated 5 years ago