pdf-association / pdf-corpora
An index of PDF-centric corpora
☆107 · Updated last month
Related projects
Alternatives and complementary repositories for pdf-corpora
- A vendor- and implementation-independent specification-derived, machine-readable model of PDF. ☆77 · Updated last week
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆173 · Updated last year
- A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents ☆19 · Updated last year
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆64 · Updated 4 years ago
- PDF to XML ALTO file converter ☆215 · Updated last month
- Logical structure analysis for visually structured documents ☆82 · Updated 2 years ago
- ReadingBank: A Benchmark Dataset for Reading Order Detection ☆91 · Updated 2 months ago
- OCR & Ground Truth Resources ☆74 · Updated 2 years ago
- Conversions between various OCR formats ☆71 · Updated last year
- Update of the ISRI Analytic Tools for OCR Evaluation with UTF-8 support ☆56 · Updated 3 years ago
- An openly-licensed corpus of small example files, covering a wide range of formats and creation tools. ☆185 · Updated 10 months ago
- A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books. ☆179 · Updated 3 months ago
- Document Layout Analysis ☆345 · Updated this week
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆265 · Updated last year
- Artifacts from the DARPA-funded SafeDocs research program ☆22 · Updated last year
- METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL) ☆52 · Updated last year
- OCR evaluation brought to you by University of Alicante ☆67 · Updated 2 years ago
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆74 · Updated 2 years ago
- An OCR evaluation tool ☆64 · Updated 3 weeks ago
- Collection of OCR-related Python tools and wrappers from @OCR-D ☆119 · Updated this week
- Source code for the paper "Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models" ☆35 · Updated 11 months ago
- Documentation and use cases for ALTO XML ☆39 · Updated 6 years ago
- Command line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as wel… ☆23 · Updated 3 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆181 · Updated 3 weeks ago
- A suite of batches and tools for OCR tasks. ☆71 · Updated last year
- A deep learning toolkit specialized for handwritten document analysis ☆207 · Updated 2 months ago
- Multimodal document analysis ☆159 · Updated 5 months ago
- Guides and test data for OCR4all ☆30 · Updated 2 years ago
- Industry-based resolutions for issues and errata reported against any PDF-related specification ☆66 · Updated 2 weeks ago