LanguageMachines / ucto
Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers other basic preprocessing steps, such as changing case, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules…
☆67 · Updated 3 weeks ago
Alternatives and similar repositories for ucto:
Users who are interested in ucto compare it to the libraries listed below:
- A tool for automatic spelling normalization ☆20 · Updated 4 years ago
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g… ☆111 · Updated last month
- Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipg… ☆126 · Updated 2 months ago
- Multi Tier Annotation Search ☆26 · Updated 3 years ago
- SMOR (Stuttgart Morphology) with alternative lemmatization component ☆12 · Updated last year
- eXternally configurable REference and Non Named Entity Recognizer ☆17 · Updated 8 months ago
- FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu… ☆62 · Updated 9 months ago
- Framework for creating and accessing UBY resources – sense-linked lexical resources in standard UBY-LMF format ☆22 · Updated 6 years ago
- Ukb: graph-based WSD and similarity ☆106 · Updated 9 months ago
- German Morphological Analyzer ☆47 · Updated 3 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 · Updated 2 years ago
- A web-based, token-level annotation tool for non-standard language data ☆10 · Updated 4 years ago
- Python framework for processing Universal Dependencies data ☆55 · Updated 3 weeks ago
- spaCy-to-naf converter ☆21 · Updated 8 months ago
- This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet… ☆29 · Updated 2 months ago
- The NLG tool for Finnish ☆22 · Updated last year
- LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio… ☆68 · Updated last year
- Ontolex modules ☆33 · Updated this week
- Learning by Reading pipeline of NLP and Entity Linking tools ☆84 · Updated 2 years ago
- Extension of the mate-tools NLP pipeline ☆67 · Updated 8 years ago
- FoLiA library for C++ ☆16 · Updated this week
- A highly extensible platform for conversion and manipulation of linguistic data between an unbound set of formats. Pepper can be used st… ☆24 · Updated 2 months ago
- This repository contains the Framester resource, the main outcome of the framester project. ☆33 · Updated 4 years ago
- Shalmaneser is a Shallow Semantic Parser. ☆11 · Updated 8 years ago
- Text-Induced Corpus Clean-up ☆20 · Updated last year
- PurePos is an open source hybrid morphological tagger. ☆16 · Updated 4 years ago
- A tool for text normalisation via character-level machine translation ☆13 · Updated 4 years ago
- Search back-end for dependency tree search. See the docs at https://fginter.github.io/dep_search/ ☆17 · Updated 6 years ago
- Various utilities for processing the data. ☆208 · Updated this week
- Machine translation for the real world ☆23 · Updated 5 years ago