Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules…
☆70 · Feb 9, 2026 · Updated 2 weeks ago
Alternatives and similar repositories for ucto
Users who are interested in ucto are comparing it to the libraries listed below.
- FoLiA library for C++ ☆17 · Dec 11, 2025 · Updated 2 months ago
- This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet… ☆31 · Feb 2, 2026 · Updated 3 weeks ago
- python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. Wi… ☆18 · May 2, 2025 · Updated 9 months ago
- Text-Induced Corpus Clean-up ☆20 · Jun 20, 2023 · Updated 2 years ago
- LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio… ☆69 · Sep 11, 2023 · Updated 2 years ago
- An extensive Python library for dealing with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic a… ☆18 · Nov 18, 2024 · Updated last year
- Guidelines for software quality & sustainability (CLARIAH WP2 task 54.100) ☆18 · May 29, 2022 · Updated 3 years ago
- Digital Humanities course site ☆21 · Nov 22, 2021 · Updated 4 years ago
- Visual Text Analytics for Digital Humanities ☆17 · Apr 22, 2015 · Updated 10 years ago
- pronunciation LEXicons for Any Low-resource Language ☆21 · Jul 14, 2020 · Updated 5 years ago
- finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests ☆41 · Oct 14, 2022 · Updated 3 years ago
- Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl,… ☆80 · Dec 11, 2025 · Updated 2 months ago
- scalding powered machine learning ☆109 · Nov 18, 2014 · Updated 11 years ago
- Juxta Web Service ☆33 · Jul 7, 2022 · Updated 3 years ago
- Simple CORPORA list crawler ☆10 · Dec 2, 2016 · Updated 9 years ago
- The GitHub repository containing all the material related to the Computational Thinking and Programming course of the Digital Humanities … ☆20 · May 11, 2018 · Updated 7 years ago
- Wikipedia Citations in Wikidata ☆10 · May 6, 2021 · Updated 4 years ago
- Expected edit distance implementation using OpenFst tools ☆11 · May 13, 2015 · Updated 10 years ago
- TiMBL implements several memory-based learning algorithms. ☆54 · Dec 11, 2025 · Updated 2 months ago
- A bunch of modules that use/extend CLTK in order to work with Greek and Latin corpora maintained by the Perseus DL ☆12 · Oct 26, 2019 · Updated 6 years ago
- Turn CTS TEI corpora into CEX collection files ☆12 · Jun 16, 2021 · Updated 4 years ago
- A data management tool for humans ☆119 · Oct 31, 2016 · Updated 9 years ago
- Use spaCy for NLP and output to the FoLiA XML format. ☆12 · Feb 27, 2024 · Updated 2 years ago
- Miscellaneous Jupyter notebooks and slides for public talks ☆11 · Jan 7, 2019 · Updated 7 years ago
- Models and training scripts for the English, German and Russian MAGEC systems described in R. Grundkiewicz, M. Junczys-Dowmunt: Minimally… ☆12 · Jul 7, 2021 · Updated 4 years ago
- Graph-based tool for disambiguation and linking of named entities to Linked Data sets for Digital Humanities and heritage texts ☆28 · Sep 20, 2021 · Updated 4 years ago
- Polytonic Greek OCR tool suite based on Ocropus 0.7 ☆13 · Jul 5, 2023 · Updated 2 years ago
- A LaTeX cheat sheet with IPython commands and shortcuts ☆10 · Mar 10, 2014 · Updated 11 years ago
- A KALDI/C++ implementation of GoogleBrain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition ☆14 · Sep 4, 2019 · Updated 6 years ago
- Greek texts (eventually) with linguistic annotation (for Greek Learner Texts Project) ☆15 · Jun 16, 2023 · Updated 2 years ago
- Sense Disambiguation of Connectives for PDTB-Style Discourse Parsing ☆14 · Jan 13, 2017 · Updated 9 years ago
- DBpedia Neural Question Answering Dataset ☆18 · Jun 28, 2020 · Updated 5 years ago
- Deep Learning for Speech Recognition based on Theano ☆15 · Jul 28, 2017 · Updated 8 years ago
- Digital edition (TEI XML) of the Arabic monthly journal *al-Muqtabas* (مجلة المقتبس), published by Muḥammad Kurd ʿAlī in Cairo and Damasc… ☆18 · Oct 19, 2025 · Updated 4 months ago
- Humanities Entity Recognition: robust, practical, efficient Named Entity Recognition for today's digital humanist ☆37 · Mar 26, 2019 · Updated 6 years ago
- A set of (string) distance functions written in JavaScript / Python / PHP. ☆18 · Feb 2, 2026 · Updated 3 weeks ago
- Python bot framework for Lexemes on Wikidata ☆19 · Feb 6, 2021 · Updated 5 years ago
- A tool for automatic spelling normalization ☆21 · Jan 18, 2021 · Updated 5 years ago
- Implementation of Needleman-Wunsch algorithm in Python Using Nested Functions. ☆13 · Jul 10, 2018 · Updated 7 years ago