Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules…
☆70May 8, 2026Updated last month
Alternatives and similar repositories for ucto
Users who are interested in ucto are comparing it to the libraries listed below.
- This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet…☆31Feb 2, 2026Updated 4 months ago
- T-scan: an analysis tool for Dutch texts to assess the complexity of the text, based on original work by Rogier Kraf☆19May 28, 2025Updated last year
- An extensive Python library for dealing with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic a…☆18Nov 18, 2024Updated last year
- Text-Induced Corpus Clean-up☆20Jun 20, 2023Updated 2 years ago
- Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipg…☆130Feb 5, 2026Updated 4 months ago
- Digital humanities things!☆21Mar 17, 2026Updated 2 months ago
- Juxta Web Service☆33Jul 7, 2022Updated 3 years ago
- The GitHub repository containing all the material related to the Computational Thinking and Programming course of the Digital Humanities …☆20May 11, 2018Updated 8 years ago
- Turn CTS TEI corpora into CEX collection files☆12Jun 16, 2021Updated 4 years ago
- Miscellaneous Jupyter notebooks and slides for public talks☆11Jan 7, 2019Updated 7 years ago
- A bunch of modules that use/extend CLTK in order to work with Greek and Latin corpora maintained by the Perseus DL☆12Oct 26, 2019Updated 6 years ago
- Graph-based tool for disambiguation and linking of named entities to Linked Data sets for Digital Humanities and heritage texts☆28Sep 20, 2021Updated 4 years ago
- Guidelines for software quality & sustainability (CLARIAH WP2 task 54.100)☆18May 29, 2022Updated 4 years ago
- Polytonic Greek OCR tool suite based on Ocropus 0.7☆13Jul 5, 2023Updated 2 years ago
- ☆37Jun 10, 2024Updated last year
- Training files for Greek cursive script (in early print)☆15May 26, 2021Updated 5 years ago
- Greek texts (eventually) with linguistic annotation (for Greek Learner Texts Project)☆16Jun 16, 2023Updated 2 years ago
- Data management utilities for Scala☆19Dec 13, 2016Updated 9 years ago
- Humanities Entity Recognition: robust, practical, efficient Named Entity Recognition for today's digital humanist☆37Mar 26, 2019Updated 7 years ago
- Polytonic Greek OCR engine derived from Gamera and based on the work of Dalitz and Brandt☆33Nov 25, 2014Updated 11 years ago
- Implementation of Needleman-Wunsch algorithm in Python Using Nested Functions.☆13Jul 10, 2018Updated 7 years ago
- A set of (string) distance functions written in JavaScript / Python / PHP.☆18Feb 2, 2026Updated 4 months ago
- Search back-end for dependency tree search. See the docs at https://fginter.github.io/dep_search/☆17Apr 11, 2018Updated 8 years ago
- Debates in the Digital Humanities☆38Oct 1, 2020Updated 5 years ago
- python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. Wi…☆18May 2, 2025Updated last year
- Tutorial materials to teach Racket/Scribble to people without a math or CS background☆23Apr 2, 2018Updated 8 years ago
- resources for the Homeric Epics☆22Oct 8, 2025Updated 8 months ago
- finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests☆41Oct 14, 2022Updated 3 years ago
- The CIS OCR PostCorrectionTool☆44Nov 7, 2022Updated 3 years ago
- Expected edit distance implementation using OpenFst tools☆11May 13, 2015Updated 11 years ago
- Original 2016 take at what is now Linked Paths, the demonstrator for GeoJSON-T developed under a Pelagios micro-grant☆90Feb 26, 2017Updated 9 years ago
- eComparatio: text diff and support for digital edition☆22Feb 3, 2021Updated 5 years ago
- Simple CORPORA list crawler☆11Dec 2, 2016Updated 9 years ago
- A language-independent post-correction app for POS-tagging and lemmatization☆30May 29, 2026Updated last week
- pronunciation LEXicons for Any Low-resource Language☆21Jul 14, 2020Updated 5 years ago
- Related language translation editor☆12Updated this week
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g…☆113Jan 24, 2025Updated last year
- Complete set of English dialect transformation rules and evaluation code☆17Jun 7, 2024Updated 2 years ago
- A fast, simple, multilingual tokenizer☆29May 24, 2017Updated 9 years ago