fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆35 Updated last year
Alternatives and similar repositories for PDFtranscript
Users that are interested in PDFtranscript are comparing it to the libraries listed below
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆407 Updated last year
- Ergonomic line-by-line transcription of scanned text. ☆54 Updated this week
- Working with hOCR in Javascript ☆136 Updated 2 years ago
- Exploring extracting tables from a PDF to CSV using PDF.JS ☆105 Updated 9 years ago
- A simple viewer and inspection tool for text boxes in PDF documents ☆96 Updated 3 years ago
- Extract tables from PDF pages. ☆298 Updated 5 years ago
- Web based JavaScript GUI library for proofreading/editing hOCR ☆101 Updated 7 years ago
- Conversion and validation for JATS XML ☆55 Updated 7 months ago
- PDF to XML ALTO file converter ☆261 Updated 2 weeks ago
- A library for extracting tables from PDF files ☆92 Updated 5 years ago
- An expandable and scalable OCR pipeline ☆89 Updated 8 years ago
- A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools… ☆296 Updated last month
- Textricator is a tool to extract text from documents and generate structured data. ☆351 Updated 10 months ago
- An online annotation platform for teaching and learning in the humanities. ☆108 Updated 2 weeks ago
- Hyperaudio Lite - a Super-lightweight Interactive Transcript Player ☆160 Updated last year
- gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf. ☆106 Updated 5 years ago
- NB is an online collaborative PDF/HTML/Video annotation webapp. ☆106 Updated 6 months ago
- A full-stack publishing solution involving different technologies to power digital archives ☆158 Updated 5 years ago
- Detect and visualize text reuse ☆119 Updated last year
- This is a module to extract RDF from an HTML5 page annotated with microdata. The module implements the algorithm defined and published by th… ☆45 Updated 3 years ago
- Binary Python bindings for poppler utils for content extraction ☆42 Updated 4 years ago
- Apache Tika Server as a Docker Image ☆174 Updated 3 years ago
- A Python library to detect and extract listing data from HTML pages. ☆108 Updated 8 years ago
- Web service to generate citations and bibliographies using citeproc-js ☆64 Updated 6 months ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆199 Updated 8 months ago
- Web service for implementing a large-scale translation memory ☆92 Updated 4 years ago
- Convert a PDF via OCR to a TXT file in UTF-8 encoding ☆156 Updated 2 years ago
- Ocular is a state-of-the-art historical OCR system. ☆266 Updated last year
- Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & N… ☆277 Updated 3 years ago
- Use visual programming to build data tables based on text data within the Orange data mining software environment ☆30 Updated 3 months ago