elacin / PDFExtract
My take on a PDF text extraction utility
☆14 Updated 9 years ago
Alternatives and similar repositories for PDFExtract:
Users who are interested in PDFExtract are comparing it to the libraries listed below.
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an… ☆22 Updated 4 years ago
- 🦀 A Rust implementation of a RoBERTa classification model for the SNLI dataset ☆13 Updated 3 years ago
- Rust bindings for CTranslate2 ☆14 Updated last year
- An efficient data structure for fast string similarity searches ☆22 Updated 3 years ago
- DFKI Layout Detection for OCR-D ☆47 Updated 2 months ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆16 Updated 2 years ago
- Python-based research framework for developing, organizing, and deploying Deep Learning models powered by Tensorflow. ☆12 Updated 2 years ago
- an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction ☆34 Updated 3 months ago
- Finds linguistic patterns effortlessly ☆35 Updated last year
- Tool for the Automatic Assessment of Lexical Diversity ☆11 Updated 4 years ago
- Post-processing OCR errors with seq2seq models ☆28 Updated 4 years ago
- PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz ☆38 Updated 10 months ago
- Generic Environment for Context-Aware Correction of Orthography ☆22 Updated 2 years ago
- Framework for information extraction from tables ☆41 Updated 5 years ago
- Given a text, wrap it into phrases and send them to Yandex's search engine. If it yields a "did you mean:", substitute the original phras… ☆11 Updated 6 years ago
- Rust implementation of Surya ☆56 Updated 3 weeks ago
- Node starter kit for semantic-search. Uses Mighty Inference Server with Qdrant vector search. ☆15 Updated last year
- Converter from UD-trees to BART representation ☆36 Updated 10 months ago
- A step-by-step C# implementation of the Docstrum algorithm ☆23 Updated 4 years ago
- A selection of test lines of several early printed books as well as the corresponding individual OCRopus models and mixed models. ☆10 Updated 7 years ago
- Tools for evaluating OCR performance relative to ground truth. ☆10 Updated last year
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆66 Updated 4 years ago
- Many Natural Language Processing tasks rely on sentence boundary detection (SBD). Although amazing libraries like spacy provide state of … ☆61 Updated 4 years ago
- Layout Analysis Dataset with Segmonto (LADaS) ☆19 Updated 2 months ago
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst… ☆22 Updated 3 years ago
- Indri search implementation on top of Lucene search engine ☆34 Updated 10 months ago
- liberate all kinds of data from PDF and other unstructural format and make the information machine-readable and visualizeable for popul… ☆28 Updated 6 years ago
- FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu… ☆61 Updated 8 months ago
- ☆20 Updated 5 years ago
- A workflow system for Natural Language Processing. ☆21 Updated 5 years ago