elacin / PDFExtract
My take on a PDF text extraction utility
☆13Updated 9 years ago
Related projects
Alternatives and complementary repositories for PDFExtract
- an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction☆31Updated last month
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst…☆21Updated 2 years ago
- Rust python bindings for symspell☆18Updated 10 months ago
- Rust bindings for CTranslate2☆13Updated last year
- universal tokenizer☆15Updated 2 years ago
- Given a text, wrap it into phrases and send them to Yandex's search engine. If it yields a "did you mean:", substitute the original phras…☆11Updated 5 years ago
- Corpus Build OCR platform☆8Updated last year
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an…☆22Updated 4 years ago
- A C++ library implementing fast language model estimation using the 1-Sort algorithm.☆17Updated last year
- GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning☆27Updated 3 years ago
- NERD and wiKIData (NERD KID) is a machine learning application for classifying Wikidata items into 27 classes (as defined by the Grobid-…☆8Updated last year
- A utility for labeling clusters of text data.☆28Updated 3 years ago
- An example of how to use spaCy for extremely large files without running into memory issues☆36Updated 2 years ago
- Some CSS experiments for arXiv HTML documents converted via latexml☆16Updated last month
- Seed Machine Translation Data☆30Updated last week
- Generic Environment for Context-Aware Correction of Orthography☆22Updated 2 years ago
- A file utility for accessing both local and remote files through a unified interface.☆36Updated this week
- 🦀 A Rust implementation of a RoBERTa classification model for the SNLI dataset☆13Updated 3 years ago
- Indri search implementation on top of Lucene search engine☆33Updated 8 months ago
- 🧬 A VS Code extension for annotating data with Prodigy☆30Updated 2 years ago
- Targeted language identifier, based on FastText and Hunspell.☆29Updated last month
- Translation demonstrator☆27Updated 4 years ago
- DFKI Layout Detection for OCR-D☆47Updated 2 weeks ago
- Locality Sensitive Hashing☆70Updated last year
- Reasoning by Communicating with Agents☆21Updated last month
- Efficiently computing & storing token n-grams from large corpora☆15Updated last month
- YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training☆38Updated 4 years ago
- spaCy entry points for Curated Transformers☆25Updated last month