LanguageMachines / ucto
Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules…
☆68 Updated 2 months ago
Alternatives and similar repositories for ucto:
Users who are interested in ucto compare it to the libraries listed below.
- A tool for automatic spelling normalization ☆20 Updated 4 years ago
- Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipg… ☆126 Updated 4 months ago
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g… ☆112 Updated 2 months ago
- Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl,… ☆76 Updated 4 months ago
- Multi Tier Annotation Search ☆26 Updated 3 years ago
- Framework for creating and accessing UBY resources – sense-linked lexical resources in standard UBY-LMF format ☆22 Updated 6 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 Updated 2 years ago
- LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio… ☆68 Updated last year
- Thot toolkit for statistical machine translation ☆53 Updated 2 years ago
- Finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests ☆41 Updated 2 years ago
- A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages. ☆23 Updated last year
- FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu… ☆63 Updated 11 months ago
- TiMBL implements several memory-based learning algorithms. ☆51 Updated 4 months ago
- Various utilities for processing the data. ☆208 Updated this week
- Open-source tools for morphological tagging, segmentation and stemming. ☆40 Updated 5 years ago
- Humanities Entity Recognition: robust, practical, efficient Named Entity Recognition for today's digital humanist ☆36 Updated 6 years ago
- Ukb: graph-based WSD and similarity ☆106 Updated 11 months ago
- Hierarchical phrase-based machine translation system ☆32 Updated 10 years ago
- FoLiA library for C++ ☆16 Updated last month
- General-Purpose Neural Networks for Sentence Boundary Detection ☆73 Updated 2 years ago
- Fast Word Clustering Software ☆78 Updated 2 months ago
- eXternally configurable REference and Non Named Entity Recognizer ☆17 Updated 10 months ago
- Python bindings to the Dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser… ☆49 Updated last month
- Excitement Open Platform for Recognizing Textual Entailments ☆89 Updated 7 years ago
- Extension of the mate-tools NLP pipeline ☆67 Updated 9 years ago
- Named Entity Recognition data for Europeana Newspapers ☆171 Updated 2 years ago
- A Corpus Data Retrieval Index using Lucene for Look-Ups ☆17 Updated this week
- Wiktionary parser tool for many language editions. ☆54 Updated 2 years ago
- A powerful, tagset-independent and theory-neutral meta model and API for storing, manipulating, and representing nearly all types of ling… ☆15 Updated 2 years ago
- A tool for text normalisation via character-level machine translation ☆13 Updated 4 years ago