OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

☆334

Alternatives and similar repositories for Tokenizer

Users who are interested in Tokenizer are comparing it to the libraries listed below.

OpenNMT / CTranslate
Lightweight C++ translator for OpenNMT Torch models (deprecated)
☆80 · Apr 7, 2020 · Updated 6 years ago
OpenNMT / OpenNMT-tf
Neural machine translation and sequence learning using TensorFlow
☆1,484 · Oct 14, 2023 · Updated 2 years ago
SYSTRAN / fuzzy-match
Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target.
☆54 · Apr 22, 2025 · Updated last year
OpenNMT / CTranslate2
Fast inference engine for Transformer models
☆4,579 · Jul 3, 2026 · Updated 2 weeks ago
OpenNMT / OpenNMT-py
Open Source Neural Machine Translation and (Large) Language Models in PyTorch
☆7,007 · Oct 14, 2025 · Updated 9 months ago
marian-nmt / sotastream
A library for data streaming and augmentation
☆22 · May 5, 2025 · Updated last year
Unbabel / word-level-qe-corpus-builder
Builds a WMT18-like corpus for word-level QE with annotations in the source and target words.
☆10 · Sep 19, 2022 · Updated 3 years ago
mlc-ai / tokenizers-cpp
Universal cross-platform tokenizers binding to HF and sentencepiece
☆497 · May 20, 2026 · Updated 2 months ago
marian-nmt / marian
Fast Neural Machine Translation in C++
☆1,460 · Aug 25, 2023 · Updated 2 years ago
OpenNMT / nmt-wizard-docker
Dockerized NMT frameworks for nmt-wizard
☆39 · Apr 18, 2023 · Updated 3 years ago
facebookresearch / mlqe
We release a dataset based on Wikipedia sentences and the corresponding translations in 6 different languages along with the scores (scal…
☆81 · Aug 31, 2021 · Updated 4 years ago
google / sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
☆11,972 · Updated this week
fyvo / WMT-Biomed-Test
☆13 · Aug 23, 2024 · Updated last year
ictnlp / DiverseNMT
Source code for the AAAI 2020 long paper <Modeling Fluency and Faithfulness for Diverse Neural Machine Translation>.
☆19 · Mar 10, 2020 · Updated 6 years ago
lilt / alignment-scripts
Scripts to preprocess training and test data and to run fast_align and giza
☆107 · Nov 2, 2021 · Updated 4 years ago
magesh-technovator / awesome-ai-applications
A comprehensive survey of business use cases of AI that help businesses thrive in the digital economy
☆13 · Oct 7, 2020 · Updated 5 years ago
clab / fast_align
Simple, fast unsupervised word aligner
☆769 · Jul 19, 2022 · Updated 4 years ago
glample / fastBPE
Fast BPE
☆677 · Jun 18, 2024 · Updated 2 years ago
hplt-project / sacremoses
Python port of Moses tokenizer, truecaser and normalizer
☆497 · Feb 6, 2026 · Updated 5 months ago
Helsinki-NLP / OpusFilter
OpusFilter - Parallel corpus processing toolkit
☆115 · Jul 1, 2026 · Updated 2 weeks ago
thammegowda / mtdata
A tool that locates, downloads, and extracts machine translation corpora
☆165 · Apr 13, 2026 · Updated 3 months ago
Unbabel / OpenKiwi
Open-Source Machine Translation Quality Estimation in PyTorch
☆233 · Jun 23, 2022 · Updated 4 years ago
deadshot465 / novelcrafter-mcp
An experimental desktop client for using Claude Desktop's MCP with Novelcrafter codices.
☆11 · Dec 3, 2024 · Updated last year
VKCOM / YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
☆979 · Mar 29, 2024 · Updated 2 years ago
rsennrich / subword-nmt
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
☆2,271 · Aug 7, 2024 · Updated last year
modernmt / DataCollection
Data collection, alignment and TAUS repository
☆24 · Nov 30, 2017 · Updated 8 years ago
wangkuiyi / huggingface-tokenizer-in-cxx
☆72 · Feb 27, 2023 · Updated 3 years ago
facebookresearch / LASER
Language-Agnostic SEntence Representations
☆3,661 · May 2, 2024 · Updated 2 years ago
Sorrow321 / huggingface_tokenizer_cpp
HuggingFace Transformers WordPiece Tokenizer in C++
☆22 · Mar 14, 2025 · Updated last year
kpu / fasterText
Library for fast text representation and classification.
☆31 · Jan 9, 2024 · Updated 2 years ago
scosman / voicebox
Exploration: using technology to aid people who lack both the ability to speak and fine motor control.
☆21 · Oct 24, 2024 · Updated last year
yyxxrr739 / autosar-rag
An AUTOSAR document-specific retriever based on LLM and RAG.
☆16 · Nov 12, 2024 · Updated last year
WeblateOrg / hello
Hello world demonstration for Weblate
☆15 · Jan 20, 2026 · Updated 6 months ago
vale-cli / SubVale
A Sublime Text 3 client for Vale Server.
☆13 · Dec 7, 2020 · Updated 5 years ago
bitextor / bicleaner
Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
☆160 · Jun 18, 2024 · Updated 2 years ago
jhclark / tercom
Translation Error Rate (TER)
☆44 · May 25, 2018 · Updated 8 years ago
marian-nmt / marian-dev
Fast Neural Machine Translation in C++ - development repository
☆288 · Jul 9, 2025 · Updated last year
esnme / landscape
A Stylus-powered frontend CSS toolkit for building rich and beautiful web apps.
☆16 · Apr 2, 2012 · Updated 14 years ago
yannvgn / laserembeddings
LASER multilingual sentence embeddings as a pip package
☆225 · Aug 11, 2023 · Updated 2 years ago