fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆33 Updated 9 months ago
Alternatives and similar repositories for PDFtranscript
Users who are interested in PDFtranscript are comparing it to the libraries listed below.
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆81 Updated 7 years ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆395 Updated last year
- Working with hOCR in Javascript ☆131 Updated 2 years ago
- Ergonomic line-by-line transcription of scanned text. ☆53 Updated 4 years ago
- PDF to XML ALTO file converter ☆252 Updated 3 weeks ago
- Exploring extracting tables from a PDF to CSV using PDF.JS ☆105 Updated 8 years ago
- Extract tables from PDF pages. ☆295 Updated 5 years ago
- Conversion and validation for JATS XML ☆54 Updated last month
- a pure javascript frontend for ElasticSearch search indices. ☆80 Updated 7 years ago
- A simple viewer and inspection tool for text boxes in PDF documents ☆95 Updated 3 years ago
- An online annotation platform for teaching and learning in the humanities. ☆108 Updated last week
- Web based JavaScript GUI library for proofreading/editing hOCR ☆95 Updated 6 years ago
- A full-stack publishing solution involving different technologies to power digital archives ☆158 Updated 5 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆195 Updated 3 months ago
- An expandable and scalable OCR pipeline ☆87 Updated 7 years ago
- gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf. ☆106 Updated 4 years ago
- A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools… ☆294 Updated 3 years ago
- Apache Tika Server as a Docker Image ☆172 Updated 3 years ago
- A Python library to detect and extract listing data from an HTML page. ☆108 Updated 8 years ago
- Apache Annotator provides annotation enabling code for browsers, servers, and humans. ☆240 Updated last year
- A library for extracting tables from PDF files ☆92 Updated 5 years ago
- Textricator is a tool to extract text from documents and generate structured data. ☆348 Updated 5 months ago
- CaSSius: a CSS-regions-based PDF typesetter for scholarly communications ☆41 Updated 3 years ago
- Binary Python bindings for poppler utils for content extraction ☆42 Updated 4 years ago
- A backend store for the Annotator ☆179 Updated 9 years ago
- Ocular is a state-of-the-art historical OCR system. ☆264 Updated last year
- Quickly turn command-line applications into RESTful webservices with a web-application front-end. You provide a specification of your com… ☆131 Updated 5 months ago
- A high performance bibliographic information service: https://biblio-glutton.readthedocs.io ☆143 Updated 2 months ago
- Command line OAI-PMH harvester and client with built-in cache. ☆127 Updated last week
- FacetView is a pure javascript frontend for ElasticSearch. ☆291 Updated 10 years ago