bitextor / pdf-extract
PDF parser and converter to HTML
☆85 Updated 8 months ago
Alternatives and similar repositories for pdf-extract
Users who are interested in pdf-extract are comparing it to the libraries listed below.
- Extract dates from text ☆64 Updated 4 years ago
- A Named-Entity Recogniser based on Grobid. ☆53 Updated last month
- PDF to XML ALTO file converter ☆244 Updated 2 weeks ago
- gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf. ☆106 Updated 4 years ago
- Functional and structural analysis of tables in research papers (Table disentangling) ☆20 Updated 7 years ago
- High-level build project for all LAPDF-Text submodules ☆103 Updated 9 years ago
- Ergonomic line-by-line transcription of scanned text. ☆52 Updated 4 years ago
- Multi Tier Annotation Search ☆12 Updated last year
- Convert between Tesseract hOCR and ALTO XML using XSL stylesheets ☆55 Updated last month
- Master repository which includes most other OCR-D repositories as submodules ☆73 Updated this week
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆189 Updated last month
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆68 Updated 4 years ago
- METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL) ☆53 Updated 2 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆214 Updated 5 years ago
- Program used to split text into segments ☆27 Updated 7 months ago
- Conversions between various OCR formats ☆78 Updated 2 years ago
- OCR-D python tools ☆33 Updated 10 months ago
- Multi Tier Annotation Search ☆26 Updated 4 years ago
- Language Tool style grammar handling with spaCy 2.0 ☆42 Updated 6 years ago
- Run tesseract with the tesserocr bindings with @OCR-D's interfaces ☆39 Updated last month
- Framework for information extraction from tables ☆41 Updated 6 years ago
- A simple text reuse detection CLI tool. ☆134 Updated last year
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 Updated 7 years ago
- Post-processing OCR errors with seq2seq models ☆28 Updated 4 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆49 Updated 2 years ago
- A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved. ☆450 Updated last year
- PAGE XML format collection for document image page content and more ☆67 Updated 3 years ago
- 🚀GUI for training spaCy models ☆55 Updated 4 years ago
- OCR evaluation brought to you by University of Alicante ☆67 Updated 2 years ago
- A context-based spellchecker for correcting OCR output. ☆19 Updated 2 years ago