pdf-association / pdf-corpora
An index of PDF-centric corpora
☆110 · Updated last month
Related projects
Alternatives and complementary repositories for pdf-corpora
- Logical structure analysis for visually structured documents ☆83 · Updated 2 years ago
- A vendor- and implementation-independent specification-derived, machine-readable model of PDF. ☆77 · Updated this week
- PDF to XML ALTO file converter ☆216 · Updated 2 months ago
- Conversions between various OCR formats ☆71 · Updated last year
- An OCR evaluation tool ☆64 · Updated last month
- METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL) ☆52 · Updated last year
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆173 · Updated last year
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆180 · Updated last month
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆65 · Updated 4 years ago
- PAGE XML format collection for document image page content and more ☆66 · Updated 3 years ago
- OCR & Ground Truth Resources ☆74 · Updated 2 years ago
- Documentation and use cases for ALTO XML ☆39 · Updated 6 years ago
- Master repository which includes most other OCR-D repositories as submodules ☆72 · Updated last month
- CERberus -- guardian against character errors ☆26 · Updated 9 months ago
- Industry-based resolutions for issues and errata reported against any PDF-related specification ☆66 · Updated this week
- A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents ☆19 · Updated last year
- The hOCR Embedded OCR Workflow and Output Format ☆74 · Updated 3 months ago
- Collection of OCR-related python tools and wrappers from @OCR-D ☆119 · Updated this week
- Layout Analysis Dataset with Segmonto (LADaS) ☆18 · Updated 2 weeks ago
- ☆32 · Updated 2 years ago
- ReadingBank: A Benchmark Dataset for Reading Order Detection ☆91 · Updated 2 months ago
- Layout analysis to find layout elements in documents (similar to P2PaLA) ☆17 · Updated this week
- OCR-D python tools ☆33 · Updated 3 months ago
- Fast PDF generation and compression. Deals with millions of pages daily. ☆102 · Updated 3 months ago
- multimodal document analysis ☆160 · Updated 5 months ago
- A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books. ☆180 · Updated last week
- Update of the ISRI Analytic Tools for OCR Evaluation with UTF-8 support ☆57 · Updated 3 years ago
- Recognize text using Calamari OCR and the OCR-D framework ☆13 · Updated 3 weeks ago
- ☆74 · Updated 2 years ago
- A deep learning toolkit specialized for handwritten document analysis ☆207 · Updated 2 months ago