ukwa / docker-pdf2htmlex
Run pdf2htmlEX in a Docker container.
☆24 · Updated 9 months ago
Related projects
Alternatives and complementary repositories for docker-pdf2htmlex
- Wrapper for pdftohtml that tries to extract paragraph structure ☆50 · Updated 5 years ago
- Functional and structural analysis of tables in research papers (Table disentangling) ☆20 · Updated 7 years ago
- An expandable and scalable OCR pipeline ☆86 · Updated 7 years ago
- The CIS OCR PostCorrectionTool ☆40 · Updated 2 years ago
- ☆32 · Updated 2 years ago
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 · Updated 6 years ago
- An efficient data structure for fast string similarity searches ☆23 · Updated 3 years ago
- High-level build project for all LAPDF-Text submodules ☆103 · Updated 9 years ago
- An efficient simhash implementation for python ☆125 · Updated 5 years ago
- A Named-Entity Recogniser based on Grobid. ☆49 · Updated 2 months ago
- Python API for Various DB-Backed Simhash Clusters ☆64 · Updated 7 years ago
- PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz ☆38 · Updated 8 months ago
- Python package for harvesting records from OAI-PMH provider(s). ☆62 · Updated 2 years ago
- METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL) ☆52 · Updated last year
- GROBID extension for identifying and normalizing physical quantities. ☆75 · Updated 2 months ago
- NLP pipeline software using common workflow language ☆34 · Updated 5 years ago
- RAIS: A IIIF-compliant, 100% open source image server for blazing-fast deep zooming ☆78 · Updated last month
- Service for converting and enhancing heterogeneous publisher XML formats into TEI ☆44 · Updated 2 months ago
- PAGE XML format collection for document image page content and more ☆66 · Updated 3 years ago
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g… ☆110 · Updated 4 months ago
- Python search module for fast approximate string matching ☆53 · Updated last year
- Layout Analysis Dataset with Segmonto (LADaS) ☆18 · Updated 2 weeks ago
- clone of https://code.google.com/p/splitta/ so it can be a git submodule ☆34 · Updated 11 years ago
- A curated list of software, tools, resources and projects by and for libraries. ☆16 · Updated 4 years ago
- Python tools for performing various operations on ALTO XML files ☆39 · Updated last year
- Python module (C extension and plain python) implementing DAWG ☆20 · Updated 2 years ago
- Ergonomic line-by-line transcription of scanned text. ☆48 · Updated 3 years ago
- A context-based spellchecker for correcting OCR output. ☆18 · Updated last year
- Create a Geonames gazetteer index in Elasticsearch ☆74 · Updated last year
- Get semantic HTML from PDFs, recover lost text, tables, data... in bulk. ☆28 · Updated this week