LanguageMachines/ucto

Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules…

☆72

Alternatives and similar repositories for ucto

Users who are interested in ucto are comparing it to the libraries listed below.

proycon / python-ucto
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet…
☆32 · Feb 2, 2026 · Updated 5 months ago
cite-architecture / CITE-App
An end-user environment for working with data in the CITE environment—browsing and analyzing texts, viewing objects and images, visualizi…
☆15 · May 5, 2020 · Updated 6 years ago
CentreForDigitalHumanities / tscan
T-scan: an analysis tool for Dutch texts to assess the complexity of the text, based on original work by Rogier Kraf
☆19 · May 28, 2025 · Updated last year
ecomp-shONgit / text-normalisation
JS / Python3 / PHP library for working with UTF-8 polytonic Greek and Latin
☆10 · Sep 11, 2024 · Updated last year
vita-us / ViTA
Visual Text Analytics for Digital Humanities
☆17 · Apr 22, 2015 · Updated 11 years ago
martinreynaert / TICCL
Text-Induced Corpus Clean-up
☆20 · Jun 20, 2023 · Updated 3 years ago
proycon / colibri-core
Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipg…
☆131 · Feb 5, 2026 · Updated 5 months ago
essepuntato / comp-think
The GitHub repository containing all the material related to the Computational Thinking and Programming course of the Digital Humanities …
☆20 · May 11, 2018 · Updated 8 years ago
amandavisconti / digitalhumanities
Digital humanities things!
☆21 · Mar 17, 2026 · Updated 4 months ago
brobertson / ciaconna
Polytonic Greek OCR tool suite based on Ocropus 0.7
☆13 · Jul 5, 2023 · Updated 3 years ago
cvbrandoe / REDEN
Graph-based tool for disambiguation and linking of named entities to Linked Data sets for Digital Humanities and heritage texts
☆28 · Sep 20, 2021 · Updated 4 years ago
kylepjohnson / notebooks
Miscellaneous Jupyter notebooks and slides for public talks
☆11 · Jan 7, 2019 · Updated 7 years ago
ayoshiaki / tops
☆37 · Jun 10, 2024 · Updated 2 years ago
CLARIAH / software-quality-guidelines
Guidelines for software quality & sustainability (CLARIAH WP2 task 54.100)
☆18 · May 29, 2022 · Updated 4 years ago
ecomp-shONgit / string-distance
A set of (string) distance functions written in JavaScript / Python / PHP.
☆18 · Feb 2, 2026 · Updated 5 months ago
LanguageMachines / timbl
TiMBL implements several memory-based learning algorithms.
☆55 · Jul 6, 2026 · Updated 2 weeks ago
proycon / LaMachine
LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio…
☆69 · Sep 11, 2023 · Updated 2 years ago
LanguageMachines / frog
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl,…
☆82 · Jun 19, 2026 · Updated last month
xerial / xerial
Data management utilities for Scala
☆19 · Dec 13, 2016 · Updated 9 years ago
alexerdmann / HER
Humanities Entity Recognition: robust, practical, efficient Named Entity Recognition for today's digital humanist
☆37 · Mar 26, 2019 · Updated 7 years ago
mmtechslv / nwunch
Implementation of the Needleman-Wunsch algorithm in Python using nested functions.
☆13 · Jul 10, 2018 · Updated 8 years ago
fginter / dep_search
Search back-end for dependency tree search. See the docs at https://fginter.github.io/dep_search/
☆17 · Apr 11, 2018 · Updated 8 years ago
opencitations / wcw
Wikipedia Citations in Wikidata
☆10 · May 6, 2021 · Updated 5 years ago
jtauber / greek-normalisation
utilities for validating and normalising Ancient Greek text
☆24 · Jul 8, 2020 · Updated 6 years ago
gregorycrane / Homerica
resources for the Homeric Epics
☆22 · Oct 8, 2025 · Updated 9 months ago
OpenArabicPE / journal_al-muqtabas
Digital edition (TEI XML) of the Arabic monthly journal *al-Muqtabas* (مجلة المقتبس), published by Muḥammad Kurd ʿAlī in Cairo and Damasc…
☆18 · Oct 19, 2025 · Updated 9 months ago
proycon / python-timbl
python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. Wi…
☆18 · May 2, 2025 · Updated last year
perseids-project / lsj-js
Liddell-Scott-Jones Greek-English Lexicon in JavaScript
☆28 · Feb 8, 2021 · Updated 5 years ago
graehl / carmel
finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests
☆41 · Oct 14, 2022 · Updated 3 years ago
cisocrgroup / PoCoTo
The CIS OCR PostCorrectionTool
☆45 · Nov 7, 2022 · Updated 3 years ago
dogancan / expected-edit-distance
Expected edit distance implementation using OpenFst tools
☆11 · May 13, 2015 · Updated 11 years ago
UB-Mannheim / malibu
Mannheim library utilities
☆27 · Dec 29, 2025 · Updated 6 months ago
brobertson / rigaudon
Polytonic Greek OCR engine derived from Gamera and based on the work of Dalitz and Brandt
☆33 · Nov 25, 2014 · Updated 11 years ago
instituutnederlandsetaal / OpenConvert
Text conversion tool (from e.g. Word, HTML, txt) to the corpus formats TEI or FoLiA
☆23 · Feb 11, 2022 · Updated 4 years ago
helmadik / LSJLogeion
LSJ as edited for Logeion at Chicago; please report corrections
☆29 · Updated this week
hipster-philology / pyrrha
A language-independent post-correction app for POS-tagging and lemmatization
☆30 · Jun 17, 2026 · Updated last month
tastyminerals / ccrawl
Simple CORPORA list crawler
☆11 · Dec 2, 2016 · Updated 9 years ago
lex4all / lex4all
pronunciation LEXicons for Any Low-resource Language
☆21 · Jul 14, 2020 · Updated 6 years ago
proycon / flat
FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g…
☆113 · Jan 24, 2025 · Updated last year