LanguageMachines / ucto
Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules…
☆68 Updated last month
Alternatives and similar repositories for ucto:
Users who are interested in ucto are comparing it to the libraries listed below:
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g… ☆112 Updated 2 months ago
- A tool for automatic spelling normalization ☆20 Updated 4 years ago
- Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipg… ☆126 Updated 3 months ago
- Multi Tier Annotation Search ☆26 Updated 3 years ago
- Various utilities for processing the data. ☆208 Updated this week
- Framework for creating and accessing UBY resources – sense-linked lexical resources in standard UBY-LMF format ☆22 Updated 6 years ago
- The Global WordNet Association Collaborative Inter-Lingual Index ☆41 Updated 4 months ago
- Python framework for processing Universal Dependencies data ☆55 Updated this week
- Ukb: graph-based WSD and similarity ☆106 Updated 10 months ago
- ConllEditor is a tool to edit dependency syntax trees in CoNLL-U format. ☆56 Updated last week
- Named Entity Recognition data for Europeana Newspapers ☆171 Updated last year
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 Updated 2 years ago
- ANNIS is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with… ☆75 Updated last month
- FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu… ☆63 Updated 10 months ago
- Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl,… ☆75 Updated 3 months ago
- Hierarchical phrase-based machine translation system ☆32 Updated 10 years ago
- General-Purpose Neural Networks for Sentence Boundary Detection ☆72 Updated 2 years ago
- LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio… ☆68 Updated last year
- Machine translation for the real world ☆23 Updated 5 years ago
- A part-of-speech tagger with support for domain adaptation and external resources. ☆22 Updated 2 years ago
- Program used to split text into segments ☆25 Updated 5 months ago
- Open-source tools for morphological tagging, segmentation and stemming. ☆41 Updated 5 years ago
- German Morphological Analyzer ☆47 Updated 3 years ago
- eXternally configurable REference and Non Named Entity Recognizer ☆17 Updated 9 months ago
- A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages. ☆23 Updated last year
- Excitement Open Platform for Recognizing Textual Entailments ☆89 Updated 7 years ago
- SMOR (Stuttgart Morphology) with alternative lemmatization component ☆12 Updated last year
- A Named-Entity Recogniser based on Grobid. ☆51 Updated 6 months ago
- Text-Induced Corpus Clean-up ☆20 Updated last year
- A simple configurable tool for manipulating dependency trees. ☆13 Updated 3 months ago