pdf-association / pdf-corpora
An index of PDF-centric corpora
☆127 Updated this week
Alternatives and similar repositories for pdf-corpora:
Users interested in pdf-corpora are comparing it to the repositories listed below.
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆66 Updated 4 years ago
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆175 Updated 2 years ago
- A vendor- and implementation-independent specification-derived, machine-readable model of PDF. ☆81 Updated last week
- An openly-licensed corpus of small example files, covering a wide range of formats and creation tools. ☆190 Updated last year
- multimodal document analysis ☆164 Updated 9 months ago
- A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents ☆23 Updated 2 years ago
- A high performance bibliographic information service: https://biblio-glutton.readthedocs.io ☆136 Updated 6 months ago
- A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books. ☆185 Updated 3 months ago
- An OCR evaluation tool ☆65 Updated last month
- Industry-based resolutions for issues and errata reported against any PDF-related specification ☆69 Updated last month
- Ergonomic line-by-line transcription of scanned text. ☆51 Updated 4 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆187 Updated last month
- OCR & Ground Truth Resources ☆74 Updated 2 years ago
- PDF to XML ALTO file converter ☆233 Updated last week
- ReadingBank: A Benchmark Dataset for Reading Order Detection ☆104 Updated 7 months ago
- Logical structure analysis for visually structured documents ☆86 Updated 2 years ago
- Conversions between various OCR formats ☆74 Updated last year
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆73 Updated 3 years ago
- veraPDF test corpus for ISO 19005 (PDF/A) and ISO 14289 (PDF/UA) ☆76 Updated last month
- Layout analysis to find layout elements in documents (similar to P2PaLA) ☆18 Updated this week
- Simplified version of a common crawl fetcher ☆13 Updated last week
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆327 Updated 2 years ago
- Documentation and use cases for ALTO XML ☆41 Updated 6 years ago
- Legal document classification with EuroVoc descriptors on 22 languages. ☆25 Updated last year
- Document Layout Analysis ☆361 Updated this week
- ☆32 Updated 2 years ago
- Convert between Tesseract hOCR and ALTO XML using XSL stylesheets ☆55 Updated 8 months ago
- Web based JavaScript GUI library for proofreading/editing hOCR ☆95 Updated 6 years ago
- CERberus -- guardian against character errors ☆28 Updated last year
- Artifacts from the DARPA-funded SafeDocs research program ☆23 Updated last year