Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits text into sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules…
☆70Mar 13, 2026Updated last week
Alternatives and similar repositories for ucto
Users that are interested in ucto are comparing it to the libraries listed below
- This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet…☆31Feb 2, 2026Updated last month
- Tools for TICCL☆14Dec 12, 2025Updated 3 months ago
- An end-user environment for working with data in the CITE environment—browsing and analyzing texts, viewing objects and images, visualizi…☆15May 5, 2020Updated 5 years ago
- T-scan: an analysis tool for Dutch texts that assesses text complexity, based on original work by Rogier Kraf☆19May 28, 2025Updated 9 months ago
- Text-Induced Corpus Clean-up☆20Jun 20, 2023Updated 2 years ago
- Digital Humanities course site☆21Nov 22, 2021Updated 4 years ago
- Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipg…☆129Feb 5, 2026Updated last month
- Digital humanities things!☆21Updated this week
- Juxta Web Service☆33Jul 7, 2022Updated 3 years ago
- A bunch of modules that use/extend CLTK in order to work with Greek and Latin corpora maintained by the Perseus DL☆12Oct 26, 2019Updated 6 years ago
- Graph-based tool for disambiguation and linking of named entities to Linked Data sets for Digital Humanities and heritage texts☆28Sep 20, 2021Updated 4 years ago
- Guidelines for software quality & sustainability (CLARIAH WP2 task 54.100)☆18May 29, 2022Updated 3 years ago
- Polytonic Greek OCR tool suite based on Ocropus 0.7☆13Jul 5, 2023Updated 2 years ago
- ☆37Jun 10, 2024Updated last year
- JS / Python3 / PHP Lib to work with UTF-8 polytonic Greek and Latin☆10Sep 11, 2024Updated last year
- Training files for Greek cursive script (in early print)☆15May 26, 2021Updated 4 years ago
- LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio…☆69Sep 11, 2023Updated 2 years ago
- Greek texts (eventually) with linguistic annotation (for Greek Learner Texts Project)☆15Jun 16, 2023Updated 2 years ago
- Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl,…☆80Mar 2, 2026Updated 2 weeks ago
- Data management utilities for Scala☆19Dec 13, 2016Updated 9 years ago
- Humanities Entity Recognition: robust, practical, efficient Named Entity Recognition for today's digital humanist☆37Mar 26, 2019Updated 6 years ago
- Polytonic Greek OCR engine derived from Gamera and based on the work of Dalitz and Brandt☆33Nov 25, 2014Updated 11 years ago
- Implementation of the Needleman-Wunsch algorithm in Python using nested functions.☆13Jul 10, 2018Updated 7 years ago
- A set of (string) distance functions written in JavaScript / Python / PHP.☆18Feb 2, 2026Updated last month
- Search back-end for dependency tree search. See the docs at https://fginter.github.io/dep_search/☆17Apr 11, 2018Updated 7 years ago
- Utilities for validating and normalising Ancient Greek text☆22Jul 8, 2020Updated 5 years ago
- Debates in the Digital Humanities☆38Oct 1, 2020Updated 5 years ago
- python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. Wi…☆18May 2, 2025Updated 10 months ago
- Tutorial materials to teach Racket/Scribble to people without a math or CS background☆23Apr 2, 2018Updated 7 years ago
- Resources for the Homeric Epics☆22Oct 8, 2025Updated 5 months ago
- ☆23Sep 17, 2025Updated 6 months ago
- Finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests☆41Oct 14, 2022Updated 3 years ago
- Liddell-Scott-Jones Greek-English Lexicon in JavaScript☆25Feb 8, 2021Updated 5 years ago
- Wikipedia Citations in Wikidata☆10May 6, 2021Updated 4 years ago
- The CIS OCR PostCorrectionTool☆44Nov 7, 2022Updated 3 years ago
- Python library for automatic analysis of Ancient Greek hexameter. The algorithm uses linguistic rules and finite-state technology.☆22Feb 13, 2024Updated 2 years ago
- Text conversion tool (from e.g. Word, HTML, txt) to corpus formats (TEI or FoLiA)☆23Feb 11, 2022Updated 4 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation☆49Sep 7, 2022Updated 3 years ago
- LSJ as edited for Logeion at Chicago; please report corrections☆26Updated this week