fmalina / unilex-transcript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆28 Updated 10 months ago
Related projects
Alternatives and complementary repositories for unilex-transcript
- Ergonomic line-by-line transcription of scanned text. ☆47 Updated 3 years ago
- PDF to XML ALTO file converter ☆215 Updated last month
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆369 Updated 3 months ago
- An expandable and scalable OCR pipeline ☆86 Updated 6 years ago
- Web service for implementing a large-scale translation memory ☆90 Updated 3 years ago
- Exploring extracting tables from a PDF to CSV using PDF.JS ☆103 Updated 8 years ago
- MOVED TO https://gitlab.com/crossref/pdfmark ☆33 Updated 5 years ago
- Conversions between various OCR formats ☆71 Updated last year
- A simple viewer and inspection tool for text boxes in PDF documents ☆92 Updated 2 years ago
- A step-by-step C# implementation of the Docstrum algorithm ☆23 Updated 3 years ago
- Extract tables from PDF pages. ☆276 Updated 4 years ago
- Web based JavaScript GUI library for proofreading/editing hOCR ☆92 Updated 6 years ago
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 Updated 6 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆181 Updated 3 weeks ago
- High-level build project for all LAPDF-Text submodules ☆103 Updated 9 years ago
- This is a side project from 2008. This package contains a tool for automatically cropping and deskewing images of book pages captured by … ☆28 Updated 11 years ago
- The hOCR Embedded OCR Workflow and Output Format ☆73 Updated 2 months ago
- CaSSius: a CSS-regions-based PDF typesetter for scholarly communications ☆41 Updated 2 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆215 Updated 4 years ago
- The CIS OCR PostCorrectionTool ☆40 Updated 2 years ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆94 Updated 2 years ago
- Conversion and validation for JATS XML ☆51 Updated 8 months ago
- A library for extracting tables from PDF files ☆87 Updated 4 years ago
- PDF parser and converter to HTML ☆83 Updated last month
- Wrapper for pdftohtml that tries to extract paragraph structure ☆49 Updated 5 years ago
- Working with hOCR in Javascript ☆120 Updated last year
- Convert text from PDF to XML. ☆45 Updated 6 years ago
- A Python library to detect and extract listing data from HTML pages. ☆109 Updated 7 years ago
- View HOCR files with Mirador ☆26 Updated 7 years ago
- A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools… ☆294 Updated 2 years ago