usnistgov / ocr-pipeline
Convert a corpus of PDF to clean text files on a distributed architecture
☆38 · Updated last year
Alternatives and similar repositories for ocr-pipeline:
Users who are interested in ocr-pipeline compare it to the repositories listed below:
- spaCy-to-naf converter ☆21 · Updated 9 months ago
- Generic Environment for Context-Aware Correction of Orthography ☆22 · Updated 2 years ago
- A DeepWalk implementation for ontologies using NetworkX and Gensim ☆18 · Updated 7 years ago
- An expandable and scalable OCR pipeline ☆87 · Updated 7 years ago
- Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri ☆47 · Updated 3 years ago
- Next generation OCR engine based on LSTMs. ☆52 · Updated 6 years ago
- A workflow system for Natural Language Processing. ☆21 · Updated 5 years ago
- Specification of NAF, the NLP annotation format ☆21 · Updated 4 years ago
- LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio… ☆68 · Updated last year
- Pikes is a Knowledge Extraction Suite ☆23 · Updated last year
- Semanticizest: dump parser and client ☆20 · Updated 8 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 · Updated 2 years ago
- PDF Extraction Toolkit ☆41 · Updated 4 years ago
- Named-Entity Recognition extension for Google Refine / OpenRefine ☆72 · Updated 7 years ago
- Wrapper to pocketsphinx phoneme labeling tools ☆18 · Updated 8 years ago
- Parser for KAF NAF files written in Python ☆16 · Updated 3 years ago
- IXA pipes Named Entity Tagger (http://ixa2.si.ehu.es/ixa-pipes). ☆32 · Updated 5 years ago
- A repository for the "Combining DBpedia and Topic Modeling" GSoC 2016 idea ☆13 · Updated 8 years ago
- WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi… ☆48 · Updated 3 years ago
- Tools for TICCL ☆14 · Updated 2 months ago
- Pipeline for distributed Natural Language Processing, made in Python ☆65 · Updated 8 years ago
- Meta-repository for the open-source version of the SUMMA Platform ☆16 · Updated 11 months ago
- Ergonomic line-by-line transcription of scanned text. ☆51 · Updated 4 years ago
- Excel Integration with spaCy. Training NER using Excel/XLSX from PDF, DOCX, PPT, PNG or JPG. ☆105 · Updated 2 years ago
- Quill's library of open source NLP algorithms and data sets. ☆52 · Updated 11 months ago
- python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. Wi… ☆18 · Updated last month
- Build a deep learning model for predicting the named entities from text. ☆56 · Updated 6 years ago
- Functional and structural analysis of tables in research papers (Table disentangling) ☆20 · Updated 7 years ago
- Entity Linking for the masses ☆56 · Updated 9 years ago
- An efficient data structure for fast string similarity searches ☆22 · Updated 4 years ago