cimeister/tokenizer-analysis-suite



☆46

Alternatives and similar repositories for tokenizer-analysis-suite

Users who are interested in tokenizer-analysis-suite are comparing it to the repositories listed below.


pchizhov / picky_bpe
BPE modification that implements removing of the intermediate tokens during tokenizer training.
☆27 · Nov 25, 2024 · Updated last year
bgub / tokka-bench
benchmarks for LLM tokenizers
☆18 · Mar 25, 2026 · Updated 3 months ago
MeLeLBGU / SaGe
Code for SaGe subword tokenizer (EACL 2023)
☆28 · Nov 30, 2024 · Updated last year
timoschick / form-context-model
This repository contains the code for the Form-Context Model and its Attentive Mimicking variant.
☆31 · May 11, 2020 · Updated 6 years ago
neuml / staticvectors
🔢 Work with static vector models
☆39 · Apr 21, 2025 · Updated last year
owos / flexitokens
FlexiTokens
☆23 · Dec 27, 2025 · Updated 6 months ago
PythonNut / superbpe
Official code release for "SuperBPE: Space Travel for Language Models"
☆96 · May 28, 2026 · Updated last month
LuisaMaerz / KnowMAN
KnowMAN: Weakly Supervised Multinomial Adversarial Networks
☆12 · Nov 9, 2021 · Updated 4 years ago
cisnlp / MEXA
[ACL 2025] 🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
☆11 · Apr 6, 2025 · Updated last year
cisnlp / GlotWeb
[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages
☆17 · Apr 14, 2026 · Updated 2 months ago
zouharvi / tokenization-scorer
Simple-to-use scoring function for arbitrarily tokenized texts.
☆48 · Feb 19, 2025 · Updated last year
kensho-technologies / pathpiece
PathPiece tokenizer
☆14 · Nov 10, 2024 · Updated last year
MaLA-LM / GlotEval
GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way
☆18 · Nov 4, 2025 · Updated 8 months ago
ltgoslo / ltg-bert
LTG-Bert
☆34 · Jan 8, 2024 · Updated 2 years ago
cindyxinyiwang / SDE
Source code for the paper "Multilingual Neural Machine Translation with Soft Decoupled Encoding"
☆29 · Jun 2, 2021 · Updated 5 years ago
DunZhang / Jasper-Token-Compression-Training
The training codes of Jasper-Token-Compression-600M
☆20 · Nov 19, 2025 · Updated 7 months ago
vered1986 / panic
PANiC - PAraphrasing Noun-Compounds
☆15 · Apr 6, 2018 · Updated 8 years ago
yuvalpinter / LiveQAServerDemo
Demo server for TREC LiveQA competition
☆11 · Dec 7, 2016 · Updated 9 years ago
jmichaelov / PsychFormers
Tools for calculating psycholinguistically-relevant metrics of language statistics using transformer language models
☆13 · Nov 11, 2022 · Updated 3 years ago
qdrant / miniCOIL
Contextualized per-token embeddings
☆37 · Jun 23, 2026 · Updated last week
terrierteam / pyterrier_t5
☆17 · Apr 30, 2026 · Updated 2 months ago
webis-de / mturk-manager
An alternative front end for Amazon Mechanical Turk
☆12 · May 13, 2024 · Updated 2 years ago
MeLeLBGU / tokenizers_intrinsic_benchmark
Code for the paper "Greed is All You Need: An Evaluation of Tokenizer Inference Methods"
☆13 · Nov 26, 2024 · Updated last year
alonmln / ILNewsDiff
Code for the ILNewsDiff Twitter account
☆10 · May 23, 2023 · Updated 3 years ago
epfl-dlab / llm-latent-language
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
☆85 · Mar 11, 2024 · Updated 2 years ago
nateraw / spaces-docker-templates
🚀🤗 A collection of templates for Hugging Face Spaces
☆34 · Oct 9, 2023 · Updated 2 years ago
technion-cs-nlp / parametric-faithfulness
☆24 · Aug 30, 2025 · Updated 10 months ago
markusdr / transducersaurus
Automatically exported from code.google.com/p/transducersaurus
☆11 · Apr 1, 2015 · Updated 11 years ago
mayhewsw / multilingual-data-stats
Statistics on multilingual datasets
☆17 · Jul 12, 2022 · Updated 3 years ago
adapter-hub / hgiyt
Research code for the paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models"
☆28 · Oct 3, 2021 · Updated 4 years ago
hipe-eval / HIPE-2022-data
Data for the HIPE 2022 shared task.
☆23 · May 15, 2026 · Updated last month
MurrayTom / SG-Bench
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
☆26 · Nov 29, 2024 · Updated last year
pentagonalize / Transformer-Cookbook
☆18 · Feb 4, 2025 · Updated last year
alexub / jax-meta-learning
Simple, extensible implementations of some meta-learning algorithms in Jax
☆11 · Oct 6, 2020 · Updated 5 years ago
Yinghao-Li / CHMM-ALT
Code for "BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition"
☆32 · Jun 20, 2023 · Updated 3 years ago
cindyxinyiwang / expand-via-lexicon-based-adaptation
Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation"
☆29 · Apr 2, 2022 · Updated 4 years ago
rctatman / data-prep-minitoolkit
Three little Python scripts for data preparation: remove commas, add commas, concatenate files
☆16 · Jul 26, 2017 · Updated 8 years ago
PiotrNawrot / dynamic-pooling
Efficient Transformers with Dynamic Token Pooling
☆68 · May 20, 2023 · Updated 3 years ago
imirzadeh / MC-SGD
Linear Mode Connectivity in Multitask and Continual Learning: https://arxiv.org/abs/2010.04495
☆13 · Oct 12, 2020 · Updated 5 years ago