ukwa / docker-pdf2htmlex
Run pdf2htmlEX in a Docker container.
☆25 Updated last year
Alternatives and similar repositories for docker-pdf2htmlex
Users who are interested in docker-pdf2htmlex compare it to the repositories listed below.
- Wrapper for pdftohtml that tries to extract paragraph structure ☆51 Updated 6 years ago
- Apache Tika Server as a Docker Image ☆172 Updated 3 years ago
- PDF to XML ALTO file converter ☆254 Updated 3 weeks ago
- ☆32 Updated 2 years ago
- Functional and structural analysis of tables in research papers (Table disentangling) ☆20 Updated 8 years ago
- Command-line tool to extract a ranked list of relevant keywords from a corpus with the option of using either topic modeling or tf-idf sc… ☆40 Updated 8 years ago
- Python API for Various DB-Backed Simhash Clusters ☆64 Updated 8 years ago
- Get semantic HTML from PDFs, recover lost text, tables, data... in bulk. ☆34 Updated 10 months ago
- Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri ☆47 Updated 3 years ago
- Command line OAI-PMH harvester and client with built-in cache. ☆127 Updated 2 weeks ago
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g… ☆113 Updated 8 months ago
- An expandable and scalable OCR pipeline ☆87 Updated 7 years ago
- PDF parser and converter to HTML ☆88 Updated last year
- High-level build project for all LAPDF-Text submodules ☆103 Updated 10 years ago
- Export SOLR documents efficiently with cursors. ☆38 Updated 6 months ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆99 Updated 2 years ago
- RAIS: A IIIF-compliant, 100% open source image server for blazing-fast deep zooming ☆80 Updated 5 months ago
- PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz ☆39 Updated last year
- Ergonomic line-by-line transcription of scanned text. ☆53 Updated 4 years ago
- Humanities Entity Recognition: robust, practical, efficient Named Entity Recognition for today's digital humanist ☆37 Updated 6 years ago
- A Named-Entity Recogniser based on Grobid. ☆54 Updated 4 months ago
- Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & N… ☆273 Updated 2 years ago
- Highlighting various OCR formats directly in Solr ☆86 Updated last week
- Machine translation for the real world ☆23 Updated 5 years ago
- Semanticizest: dump parser and client ☆20 Updated 9 years ago
- A python implementation of DEPTA ☆83 Updated 8 years ago
- WordNet Domains, WordNet Affect and SentiWords ☆47 Updated 9 years ago
- Python package for harvesting records from OAI-PMH provider(s). ☆64 Updated 3 years ago
- A simple viewer and inspection tool for text boxes in PDF documents ☆95 Updated 3 years ago
- Process, enhance and evaluate multiple OCR output. ☆24 Updated 11 months ago