fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆36 · Updated last year
Alternatives and similar repositories for PDFtranscript
Users who are interested in PDFtranscript are comparing it to the libraries listed below.
- High-level build project for all LAPDF-Text submodules ☆103 · Updated 10 years ago
- Working with hOCR in Javascript ☆136 · Updated 2 years ago
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆81 · Updated 7 years ago
- Web based JavaScript GUI library for proofreading/editing hOCR ☆100 · Updated 7 years ago
- An expandable and scalable OCR pipeline ☆89 · Updated 8 years ago
- Ergonomic line-by-line transcription of scanned text. ☆54 · Updated 5 years ago
- Conversion and validation for JATS XML ☆55 · Updated 5 months ago
- Exploring extracting tables from a PDF to CSV using PDF.JS ☆104 · Updated 9 years ago
- PDF to XML ALTO file converter ☆257 · Updated last month
- A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools… ☆296 · Updated last week
- A simple viewer and inspection tool for text boxes in PDF documents ☆96 · Updated 3 years ago
- Extract tables from PDF pages. ☆298 · Updated 5 years ago
- Textricator is a tool to extract text from documents and generate structured data. ☆350 · Updated 9 months ago
- A full-stack publishing solution involving different technologies to power digital archives ☆159 · Updated 5 years ago
- Apache Annotator provides annotation enabling code for browsers, servers, and humans. ☆240 · Updated last year
- A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved. ☆456 · Updated 2 years ago
- Named-Entity Recognition extension for Google Refine / OpenRefine ☆73 · Updated 8 years ago
- An online annotation platform for teaching and learning in the humanities. ☆108 · Updated 3 weeks ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆196 · Updated 6 months ago
- The hOCR Embedded OCR Workflow and Output Format ☆75 · Updated last year
- View HOCR files with Mirador ☆29 · Updated 8 years ago
- Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & N… ☆276 · Updated 3 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆49 · Updated 3 years ago
- Command line OAI-PMH harvester and client with built-in cache. ☆128 · Updated last week
- Extended Date Time Format (ISO 8601-2 / EDTF) Parser for JavaScript ☆75 · Updated 2 months ago
- A backend store for the Annotator ☆180 · Updated 9 years ago
- Ocular is a state-of-the-art historical OCR system. ☆265 · Updated last year
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆216 · Updated 6 years ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆99 · Updated 3 years ago
- Quickly turn command-line applications into RESTful webservices with a web-application front-end. You provide a specification of your com… ☆134 · Updated last month