LanguageMachines / ucto
Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules…
☆68 Updated this week
Alternatives and similar repositories for ucto:
Users who are interested in ucto are comparing it to the libraries listed below:
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g… ☆111 Updated this week
- A tool for automatic spelling normalization ☆20 Updated 4 years ago
- Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipg… ☆126 Updated last month
- Multi Tier Annotation Search ☆26 Updated 3 years ago
- TiMBL implements several memory-based learning algorithms. ☆46 Updated last month
- A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages. ☆23 Updated last year
- Various utilities for processing the data. ☆205 Updated this week
- Hierarchical phrase-based machine translation system ☆32 Updated 10 years ago
- A web-based, token-level annotation tool for non-standard language data ☆10 Updated 4 years ago
- Thot toolkit for statistical machine translation ☆50 Updated 2 years ago
- ConllEditor is a tool to edit dependency syntax trees in CoNLL-U format. ☆55 Updated last month
- German Morphological Analyzer ☆47 Updated 3 years ago
- FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu… ☆61 Updated 8 months ago
- ANNIS is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with… ☆72 Updated last week
- SMOR (Stuttgart Morphology) with alternative lemmatization component ☆12 Updated last year
- finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests ☆41 Updated 2 years ago
- A Named-Entity Recogniser based on Grobid. ☆50 Updated 4 months ago
- This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet… ☆29 Updated last month
- spaCy-to-naf converter ☆21 Updated 7 months ago
- Named Entity Recognition data for Europeana Newspapers ☆171 Updated last year
- Wiktionary parser tool for many language editions. ☆54 Updated 2 years ago
- Named Entities Recognition Annotator Tool for Europeana Newspapers ☆60 Updated 7 years ago
- A part-of-speech tagger with support for domain adaptation and external resources. ☆22 Updated 2 years ago
- Excitement Open Platform for Recognizing Textual Entailments ☆86 Updated 7 years ago
- A simple configurable tool for manipulating dependency trees. ☆13 Updated last month
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 Updated 2 years ago
- A highly extensible platform for conversion and manipulation of linguistic data between an unbound set of formats. Pepper can be used st… ☆24 Updated 3 weeks ago
- Extension of the mate-tools NLP pipeline ☆67 Updated 8 years ago
- eXternally configurable REference and Non Named Entity Recognizer ☆17 Updated 7 months ago
- Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl,… ☆75 Updated last month