Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules…
☆70May 8, 2026Updated last week
Alternatives and similar repositories for ucto
Users that are interested in ucto are comparing it to the libraries listed below.
- This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet…☆31Feb 2, 2026Updated 3 months ago
- Tools for TICCL☆14Dec 12, 2025Updated 5 months ago
- An end-user environment for working with data in the CITE environment—browsing and analyzing texts, viewing objects and images, visualizi…☆15May 5, 2020Updated 6 years ago
- An extensive Python library for dealing with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic a…☆18Nov 18, 2024Updated last year
- Visual Text Analytics for Digital Humanities☆17Apr 22, 2015Updated 11 years ago
- Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipg…☆130Feb 5, 2026Updated 3 months ago
- Juxta Web Service☆33Jul 7, 2022Updated 3 years ago
- Turn CTS TEI corpora into CEX collection files☆12Jun 16, 2021Updated 4 years ago
- Miscellaneous Jupyter notebooks and slides for public talks☆11Jan 7, 2019Updated 7 years ago
- A bunch of modules that use/extend CLTK in order to work with Greek and Latin corpora maintained by the Perseus DL☆12Oct 26, 2019Updated 6 years ago
- Graph-based tool for disambiguation and linking of named entities to Linked Data sets for Digital Humanities and heritage texts☆28Sep 20, 2021Updated 4 years ago
- Polytonic Greek OCR tool suite based on Ocropus 0.7☆13Jul 5, 2023Updated 2 years ago
- TiMBL implements several memory-based learning algorithms.☆55Mar 12, 2026Updated 2 months ago
- Training files for Greek cursive script (in early print)☆15May 26, 2021Updated 4 years ago
- LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio…☆69Sep 11, 2023Updated 2 years ago
- Greek texts (eventually) with linguistic annotation (for Greek Learner Texts Project)☆16Jun 16, 2023Updated 2 years ago
- Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl,…☆81May 8, 2026Updated last week
- Humanities Entity Recognition: robust, practical, efficient Named Entity Recognition for today's digital humanist☆37Mar 26, 2019Updated 7 years ago
- Polytonic Greek OCR engine derived from Gamera and based on the work of Dalitz and Brandt☆33Nov 25, 2014Updated 11 years ago
- Implementation of Needleman-Wunsch algorithm in Python Using Nested Functions.☆13Jul 10, 2018Updated 7 years ago
- A set of (string) distance functions written in JavaScript / Python / PHP.☆18Feb 2, 2026Updated 3 months ago
- Search back-end for dependency tree search. See the docs at https://fginter.github.io/dep_search/☆17Apr 11, 2018Updated 8 years ago
- utilities for validating and normalising Ancient Greek text☆23Jul 8, 2020Updated 5 years ago
- resources for the Homeric Epics☆22Oct 8, 2025Updated 7 months ago
- ☆25Sep 17, 2025Updated 8 months ago
- finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests☆41Oct 14, 2022Updated 3 years ago
- Liddell-Scott-Jones Greek-English Lexicon in JavaScript☆26Feb 8, 2021Updated 5 years ago
- Wikipedia Citations in Wikidata☆10May 6, 2021Updated 5 years ago
- Expected edit distance implementation using OpenFst tools☆11May 13, 2015Updated 11 years ago
- Original 2016 take at what is now Linked Paths, the demonstrator for GeoJSON-T developed under a Pelagios micro-grant☆90Feb 26, 2017Updated 9 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation☆49Sep 7, 2022Updated 3 years ago
- LSJ as edited for Logeion at Chicago; please report corrections☆28May 5, 2026Updated 2 weeks ago
- The what (and how) digital humanities and news nerds want to explore together☆64Nov 6, 2015Updated 10 years ago
- A language-independent post-correction app for POS-tagging and lemmatization☆30Updated this week
- pronunciation LEXicons for Any Low-resource Language☆21Jul 14, 2020Updated 5 years ago
- Code for "Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding" (EMNLP 2020).☆11May 1, 2025Updated last year
- A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of…☆22Nov 28, 2017Updated 8 years ago
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g…☆113Jan 24, 2025Updated last year
- snf-image is a Ganeti OS definition. It allows Ganeti to launch instances from predefined or untrusted custom Images. The whole process o…☆12Feb 27, 2018Updated 8 years ago