fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆28 · Updated this week
Related projects
Alternatives and complementary repositories for PDFtranscript
- Exploring extracting tables from a PDF to CSV using PDF.JS ☆103 · Updated 8 years ago
- Web based JavaScript GUI library for proofreading/editing hOCR ☆92 · Updated 6 years ago
- An expandable and scalable OCR pipeline ☆86 · Updated 7 years ago
- NB is an online collaborative PDF/HTML/Video annotation webapp. ☆104 · Updated last year
- Ergonomic line-by-line transcription of scanned text. ☆48 · Updated 3 years ago
- Web service for implementing a large-scale translation memory ☆90 · Updated 3 years ago
- Experiments mining image collections using OpenCV ☆64 · Updated 9 years ago
- Code to remove "noise" from hOCR output of Tesseract OCR. ☆14 · Updated 8 years ago
- Conversion and validation for JATS XML ☆51 · Updated 8 months ago
- Python interface to the Saxon processor for XQuery, XML Schema, and XSLT ☆36 · Updated 4 years ago
- An online annotation platform for teaching and learning in the humanities. ☆106 · Updated 3 weeks ago
- A Python canonicalizer to disambiguate and recognize known names from a poor-quality data entry list. ☆20 · Updated 8 years ago
- Create validating ePub publications from journal articles marked up using the NISO JATS XML tagset. ☆13 · Updated 5 years ago
- Working with hOCR in JavaScript ☆122 · Updated last year
- CaSSius: a CSS-regions-based PDF typesetter for scholarly communications ☆41 · Updated 2 years ago
- The hOCR Embedded OCR Workflow and Output Format ☆74 · Updated 3 months ago
- Pure-python Z39.50 implementation ☆52 · Updated 3 years ago
- Web application for managing formal representations of knowledge, like controlled vocabularies, taxonomies, thesauri and glossaries ☆124 · Updated last month
- AWS Transcribe evaluation pipeline: bulk-process audio files and view the results ☆17 · Updated last year
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 · Updated 6 years ago
- Extract meaningful content from PDF and PSD files, such as text and images, both linked into a common JSON string ☆36 · Updated 6 years ago
- Process, enhance and evaluate multiple OCR outputs. ☆20 · Updated 3 weeks ago
- Solrstrap is a Query-Result interface for Solr written in JavaScript, HTML and CSS ☆86 · Updated 7 years ago
- A Python library to detect and extract listing data from HTML pages. ☆109 · Updated 7 years ago
- JS for overlaying OCR on image using hOCR-formatted HTML ☆24 · Updated 8 years ago
- Crop And Splice Segments (of scanned pages) ☆14 · Updated 5 years ago
- Gathering point for open source OCR scripts and diffs ☆43 · Updated 10 years ago
- Extract postal addresses from the DOM ☆66 · Updated 12 years ago
- A library for extracting tables from PDF files ☆90 · Updated 11 years ago
- Analyze standard numbers like ARK, DOI, EAN, GTIN, IBAN, ISAN, ISBN, ISMN, ISNI, ISSN, ISTC, ISWC, ORCID, PPN, SICI, UPC, ZDB with Elasti… ☆23 · Updated 8 years ago