ukwa / docker-pdf2htmlex
Run pdf2htmlEX in a Docker container.
☆24 · Updated last year
Alternatives and similar repositories for docker-pdf2htmlex:
Users who are interested in docker-pdf2htmlex are comparing it to the repositories listed below.
- PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz ☆38 · Updated 11 months ago
- Get semantic HTML from PDFs, recover lost text, tables, data... in bulk. ☆28 · Updated 2 months ago
- Wrapper for pdftohtml that tries to extract paragraph structure ☆50 · Updated 6 years ago
- An expandable and scalable OCR pipeline ☆87 · Updated 7 years ago
- A small framework taking over the manual training process described in the Tesseract3 Wiki: https://code.google.com/p/tesseract-ocr/wiki/… ☆130 · Updated last year
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 · Updated 6 years ago
- PAGE XML format collection for document image page content and more ☆67 · Updated 3 years ago
- Liberate all kinds of data from PDF and other unstructured formats and make the information machine-readable and visualizable for popul… ☆30 · Updated 6 years ago
- Ergonomic line-by-line transcription of scanned text. ☆51 · Updated 4 years ago
- Docker image for creating a single Apache Nutch server, with MongoDB as crawl storage and Elasticsearch for indexing ☆17 · Updated 9 years ago
- RAIS: A IIIF-compliant, 100% open source image server for blazing-fast deep zooming ☆78 · Updated 3 months ago
- Python API for Various DB-Backed Simhash Clusters ☆64 · Updated 7 years ago
- Table understanding dataset for comparative evaluation of different table understanding algorithms ☆14 · Updated 6 years ago
- Text classifier for Go, aka document categorization. ☆41 · Updated 9 years ago
- Master repository which includes most other OCR-D repositories as submodules ☆72 · Updated last week
- Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts ☆59 · Updated 12 years ago
- Distributed text analysis suite based on Celery ☆95 · Updated 2 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆214 · Updated 5 years ago
- High-level build project for all LAPDF-Text submodules ☆103 · Updated 9 years ago
- Table Extraction Tool ☆90 · Updated 6 years ago
- Extract data from HTML tables ☆86 · Updated 4 years ago
- A workflow system for Natural Language Processing. ☆21 · Updated 5 years ago
- A Python implementation of DEPTA ☆83 · Updated 8 years ago
- Command line OAI-PMH harvester and client with built-in cache. ☆120 · Updated 3 weeks ago
- Functional and structural analysis of tables in research papers (Table disentangling) ☆20 · Updated 7 years ago
- Knowledge extraction from web data ☆92 · Updated 6 years ago
- Neural Elastic Inference and Search ☆19 · Updated 5 years ago
- The CIS OCR PostCorrectionTool ☆41 · Updated 2 years ago
- Python search module for fast approximate string matching ☆54 · Updated 2 years ago
- METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL) ☆54 · Updated last year