fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆28 Updated 2 months ago
Alternatives and similar repositories for PDFtranscript:
Users who are interested in PDFtranscript are comparing it to the libraries listed below.
- Ergonomic line-by-line transcription of scanned text. ☆51 Updated 4 years ago
- PDF to XML ALTO file converter ☆224 Updated this week
- Conversion and validation for JATS XML ☆52 Updated 11 months ago
- The hOCR Embedded OCR Workflow and Output Format ☆74 Updated 6 months ago
- Run pdf2htmlEX in a Docker container. ☆24 Updated last year
- An expandable and scalable OCR pipeline ☆87 Updated 7 years ago
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 Updated 6 years ago
- Binary Python bindings for poppler utils for content extraction ☆42 Updated 3 years ago
- PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz ☆38 Updated 11 months ago
- Named-Entity Recognition extension for Google Refine / OpenRefine ☆72 Updated 7 years ago
- A small Docker built for the OCRopus OCR system. ☆19 Updated 7 years ago
- Crop And Splice Segments (of scanned pages) ☆14 Updated 5 years ago
- A simple viewer and inspection tool for text boxes in PDF documents ☆94 Updated 2 years ago
- ALTO XML schema - latest and all former versions ☆52 Updated 7 months ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆186 Updated 2 weeks ago
- The CIS OCR PostCorrectionTool ☆41 Updated 2 years ago
- High-level build project for all LAPDF-Text submodules ☆103 Updated 9 years ago
- This is a side project from 2008. This package contains a tool for automatically cropping and deskewing images of book pages captured by … ☆28 Updated 11 years ago
- Working with hOCR in Javascript ☆124 Updated last year
- Tool for visualizing hOCR output from Tesseract (or other OCR engines that support hOCR). ☆23 Updated 10 years ago
- Easy extraction of keywords and engines from search engine results pages (SERPs). ☆90 Updated 3 years ago
- A Python library to detect and extract listing data from HTML pages. ☆108 Updated 7 years ago
- PAGE XML format collection for document image page content and more ☆67 Updated 3 years ago
- Extract tables from PDF pages. ☆283 Updated 4 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 Updated 2 years ago
- View HOCR files with Mirador ☆26 Updated 7 years ago
- Part of eMOP: Franken+ tool for creating font training for Tesseract OCR engine from page images. ☆24 Updated 9 years ago
- Process, enhance and evaluate multiple OCR output. ☆22 Updated 3 months ago
- yael (Yet Another EPUB Library) is a Python library for reading, manipulating, and writing EPUB 2/3 files ☆17 Updated 9 years ago
- Conversions between various OCR formats ☆74 Updated last year