fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆34 Updated 10 months ago
Alternatives and similar repositories for PDFtranscript
Users who are interested in PDFtranscript are comparing it to the libraries listed below.
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆396 Updated last year
- Ergonomic line-by-line transcription of scanned text. ☆53 Updated 4 years ago
- Working with hOCR in Javascript ☆135 Updated 2 years ago
- Web based JavaScript GUI library for proofreading/editing hOCR ☆97 Updated 7 years ago
- PDF to XML ALTO file converter ☆254 Updated 3 weeks ago
- An online annotation platform for teaching and learning in the humanities. ☆108 Updated last month
- High-level build project for all LAPDF-Text submodules ☆103 Updated 10 years ago
- Conversion and validation for JATS XML ☆54 Updated 3 months ago
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆81 Updated 7 years ago
- gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf. ☆107 Updated 4 years ago
- An expandable and scalable OCR pipeline ☆87 Updated 7 years ago
- Run pdf2htmlEX in a Docker container. ☆25 Updated last year
- Web service for implementing a large-scale translation memory ☆90 Updated 4 years ago
- The hOCR Embedded OCR Workflow and Output Format ☆74 Updated last year
- Fast Word Segmentation with Triangular Matrix ☆82 Updated 3 years ago
- ALTO XML schema - latest and all former versions ☆54 Updated last year
- Extract tables from PDF pages. ☆296 Updated 5 years ago
- A full-stack publishing solution involving different technologies to power digital archives ☆158 Updated 5 years ago
- Apache Annotator provides annotation enabling code for browsers, servers, and humans. ☆240 Updated last year
- A backend store for the Annotator ☆180 Updated 9 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆196 Updated 4 months ago
- PDF parser and converter to HTML ☆88 Updated last year
- A Python library to detect and extract listing data from an HTML page. ☆108 Updated 8 years ago
- Facilitating the global conversation on academic literature ☆267 Updated 8 years ago
- Crop And Splice Segments (of scanned pages) ☆14 Updated 6 years ago
- A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved. ☆453 Updated 2 years ago
- A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools… ☆293 Updated 3 years ago
- Conversions between various OCR formats ☆80 Updated 2 years ago
- Quickly turn command-line applications into RESTful webservices with a web-application front-end. You provide a specification of your com… ☆133 Updated 6 months ago
- Extended Date Time Format (ISO 8601-2 / EDTF) Parser for JavaScript ☆71 Updated last month