fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆31 Updated 5 months ago
Alternatives and similar repositories for PDFtranscript:
Users who are interested in PDFtranscript are comparing it to the libraries listed below:
- Web based JavaScript GUI library for proofreading/editing hOCR ☆95 Updated 6 years ago
- Ergonomic line-by-line transcription of scanned text. ☆51 Updated 4 years ago
- Named-Entity Recognition extension for Google Refine / OpenRefine ☆72 Updated 7 years ago
- Exploring extracting tables from a PDF to CSV using PDF.JS ☆103 Updated 8 years ago
- Conversion and validation for JATS XML ☆53 Updated last year
- An expandable and scalable OCR pipeline ☆87 Updated 7 years ago
- Web service to generate citations and bibliographies using citeproc-js ☆62 Updated last year
- Working with hOCR in Javascript ☆127 Updated 2 years ago
- An online annotation platform for teaching and learning in the humanities. ☆107 Updated 2 months ago
- Experiments mining image collections using OpenCV ☆64 Updated 9 years ago
- Command line OAI-PMH harvester and client with built-in cache. ☆124 Updated this week
- View HOCR files with Mirador ☆29 Updated 7 years ago
- The hOCR Embedded OCR Workflow and Output Format ☆74 Updated 8 months ago
- Process, enhance and evaluate multiple OCR output. ☆22 Updated 6 months ago
- CaSSius: a CSS-regions-based PDF typesetter for scholarly communications ☆41 Updated 2 years ago
- JS for overlaying OCR on image using HOCR formatted HTML ☆26 Updated 8 years ago
- Convert marc to BIBFRAME 1.0 - see lcnetdev/marc2bibframe2 for current release ☆68 Updated 8 years ago
- ONIX validation library and commandline tool ☆24 Updated last month
- neonion is a user-centered collaborative semantic annotation webapp developed at the Human-Centered Computing group at Freie Universität … ☆68 Updated 6 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆188 Updated 2 months ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 Updated 2 years ago
- A vendor- and implementation-independent specification-derived, machine-readable model of PDF. ☆84 Updated 2 weeks ago
- RAIS: A IIIF-compliant, 100% open source image server for blazing-fast deep zooming ☆78 Updated 2 weeks ago
- The CIS OCR PostCorrectionTool ☆42 Updated 2 years ago
- Convert between Tesseract hOCR and ALTO XML using XSL stylesheets ☆55 Updated 9 months ago
- ALTO XML schema - latest and all former versions ☆52 Updated 9 months ago
- Open Access PDF harvester ☆40 Updated 11 months ago
- A library for extracting tables from PDF files ☆90 Updated 11 years ago
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 Updated 7 years ago
- HOCR Specification Python Parser ☆13 Updated 9 years ago