fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆31Updated 6 months ago
Alternatives and similar repositories for PDFtranscript
Users who are interested in PDFtranscript are comparing it to the libraries listed below.
- A library for extracting tables from PDF files☆90Updated 4 years ago
- Wrapper for pdftohtml that tries to extract paragraph structure☆50Updated 6 years ago
- Binary Python bindings for poppler utils for content extraction☆42Updated 4 years ago
- A simple viewer and inspection tool for text boxes in PDF documents☆95Updated 3 years ago
- NB is an online collaborative PDF/HTML/Video annotation webapp.☆105Updated last week
- A more complete example of programming with PDFMiner, which continues where the default documentation stops☆214Updated 5 years ago
- Conversion and validation for JATS XML☆54Updated last year
- Named-Entity Recognition extension for Google Refine / OpenRefine☆72Updated 8 years ago
- Ergonomic line-by-line transcription of scanned text.☆52Updated 4 years ago
- A python implementation of DEPTA☆83Updated 8 years ago
- LIME (Language Independent Markup Editor)☆31Updated 6 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation☆49Updated 2 years ago
- [DEPRECATED - please use rups instead] RUPS is an abbreviation for Reading and Updating PDF Syntax. RUPS is a tool built on top of iText®…☆110Updated 6 years ago
- Web service for implementing a large-scale translation memory☆90Updated 4 years ago
- An online annotation platform for teaching and learning in the humanities.☆108Updated this week
- An expandable and scalable OCR pipeline☆87Updated 7 years ago
- Specification of NAF, the NLP annotation format☆21Updated 4 years ago
- Web based JavaScript GUI library for proofreading/editing hOCR☆95Updated 6 years ago
- Run pdf2htmlEX in a Docker container.☆25Updated last year
- This is a module to extract RDF from an HTML5 page annotated with microdata. The module implements the algorithm defined and published by th…☆44Updated 3 years ago
- Wandora is a general purpose information extraction, management and publishing application based on Topic Maps and Java.☆132Updated last year
- A natural language date parser. (Python version of chrono.js)☆25Updated 3 weeks ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a…☆99Updated 2 years ago
- Python search module for fast approximate string matching☆54Updated 2 years ago
- Converts Microsoft docx to flat hub XML☆27Updated 2 months ago
- View HOCR files with Mirador☆29Updated 7 years ago
- Extract tables from PDF pages.☆292Updated 5 years ago
- Convert text from PDF to XML.☆45Updated 6 years ago
- PDF to XML ALTO file converter☆244Updated 2 weeks ago
- EPUB samples that conform to the EPUB for Education profile.☆28Updated 4 years ago