pdf-association / pdf-corpora
An index of PDF-centric corpora
☆128 Updated 2 months ago
Alternatives and similar repositories for pdf-corpora
Users who are interested in pdf-corpora compare it to the repositories listed below
- A vendor- and implementation-independent specification-derived, machine-readable model of PDF. ☆85 Updated this week
- PDF to XML ALTO file converter ☆238 Updated this week
- A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books. ☆186 Updated this week
- Logical structure analysis for visually structured documents ☆89 Updated 2 years ago
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆177 Updated 2 years ago
- multimodal document analysis ☆164 Updated 11 months ago
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆67 Updated 4 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆188 Updated last week
- Artifacts from the DARPA-funded SafeDocs research program ☆24 Updated 2 years ago
- An openly-licensed corpus of small example files, covering a wide range of formats and creation tools. ☆196 Updated this week
- An OCR evaluation tool ☆66 Updated 2 weeks ago
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆74 Updated 3 years ago
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆344 Updated 2 years ago
- Document Layout Analysis ☆376 Updated 2 weeks ago
- veraPDF test corpus for ISO 19005 (PDF/A) and ISO 14289 (PDF/UA) ☆78 Updated last week
- Master repository which includes most other OCR-D repositories as submodules ☆73 Updated last week
- Conversions between various OCR formats ☆77 Updated 2 years ago
- ☆80 Updated 3 years ago
- OCR & Ground Truth Resources ☆75 Updated 3 years ago
- PDF Name Registry ☆21 Updated last week
- Collection of OCR-related python tools and wrappers from @OCR-D ☆128 Updated last week
- ReadingBank: A Benchmark Dataset for Reading Order Detection ☆105 Updated 9 months ago
- PAGE XML format collection for document image page content and more ☆67 Updated 3 years ago
- Layout analysis to find layout elements in documents (similar to P2PaLA) ☆19 Updated last week
- A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents ☆25 Updated 2 years ago
- Targeted PDFs demonstrating commonly seen PDF differentials and interoperability issues ☆12 Updated 3 weeks ago
- Layout Analysis Dataset with Segmonto (LADaS) ☆20 Updated this week
- METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL) ☆53 Updated 2 years ago
- CERberus -- guardian against character errors ☆29 Updated last year
- A suite of batches and tools for OCR tasks. ☆71 Updated 2 years ago