usnistgov / ocr-pipeline
Convert a corpus of PDF to clean text files on a distributed architecture
☆38 Updated last year
Alternatives and similar repositories for ocr-pipeline
Users who are interested in ocr-pipeline are comparing it to the libraries listed below.
- An expandable and scalable OCR pipeline ☆89 Updated 8 years ago
- Next generation OCR engine based on LSTMs. ☆52 Updated 7 years ago
- 🚀 GUI for training spaCy models ☆55 Updated 4 years ago
- Named Entities Recognition Annotator Tool for Europeana Newspapers ☆61 Updated 8 years ago
- Specification of NAF, the NLP annotation format ☆21 Updated 5 years ago
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g… ☆113 Updated last year
- Ergonomic line-by-line transcription of scanned text. ☆54 Updated this week
- Parser for KAF NAF files written in Python ☆16 Updated 4 years ago
- An intelligent reading agent that understands text and translates it into Wikidata statements. ☆116 Updated 9 years ago
- Ocular is a state-of-the-art historical OCR system. ☆266 Updated last year
- Build tables of information by extracting facts from indexed text corpora via a simple and effective query language. ☆56 Updated 6 years ago
- Tool that does layout analysis and/or text recognition using tesseract and outputs the result in Page XML format ☆46 Updated 10 months ago
- Information Extraction System can perform NLP tasks like Named Entity Recognition, Sentence Simplification, Relation Extraction etc. ☆27 Updated 11 years ago
- Homebase of the IPTC EXTRA project about rule-based text categorization ☆13 Updated 8 years ago
- Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic pr… ☆70 Updated last month
- Fast and robust NLP components implemented in Java. ☆53 Updated 5 years ago
- Semantic Web related concepts converted to Natural language ☆44 Updated 8 years ago
- PDF Extraction Toolkit ☆42 Updated 5 years ago
- Soundex Phonetic Code Algorithm Demo for Indian Languages. Supports all indian languages and English. Provides intra-indic string compari… ☆59 Updated 6 years ago
- FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu… ☆66 Updated last month
- Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri ☆47 Updated 4 years ago
- python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. Wi… ☆18 Updated 9 months ago
- Wikidata embedding ☆51 Updated last year
- A visualisation tool for Spacy using Hierplane. ☆65 Updated 3 years ago
- LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio… ☆69 Updated 2 years ago
- spaCy-to-naf converter ☆21 Updated 7 months ago
- HOCR Specification Python Parser ☆12 Updated 10 years ago
- 🤹‍♀️ Query spaCy's linguistic annotations using GraphQL ☆86 Updated 7 years ago
- Presentations, tutorials and data for the OCR workshop at LMU ☆16 Updated 8 years ago
- A toolkit for clustering web pages based on various similarity measures. ☆34 Updated 4 years ago