zouharvi/tokenization-scorer

zouharvi / tokenization-scorer

Simple-to-use scoring function for arbitrarily tokenized texts.

☆51

Alternatives and similar repositories for tokenization-scorer

Users that are interested in tokenization-scorer are comparing it to the libraries listed below.

catherinearnett / morphscore
This is the repository for MorphScore, a tokenizer evaluation framework for morphological alignment.
☆17 · Jul 10, 2025 · Updated last year
valentinhofmann / superbizarre
Code and data for "Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words"
☆18 · Aug 17, 2021 · Updated 4 years ago
orevaahia / magnet-tokenization
☆11 · Mar 17, 2026 · Updated 4 months ago
cimeister / tokenizer-intrinsic-evals
TokEval: intrinsic quality metrics for tokenizers across natural language, code, and math
☆46 · Jul 4, 2026 · Updated 3 weeks ago
lsj2408 / URPE
[NeurIPS 2022] Your Transformer May Not be as Powerful as You Expect (official implementation)
☆35 · Aug 6, 2023 · Updated 2 years ago
Betswish / Cross-Lingual-Consistency
Easy-to-use framework for evaluating cross-lingual consistency of factual knowledge (Supported LLaMA, BLOOM, mT5, RoBERTa, etc.) Paper he…
☆28 · Aug 8, 2025 · Updated 11 months ago
stefan-it / gc4lm
GC4LM: A Colossal (Biased) language model for German
☆13 · May 2, 2021 · Updated 5 years ago
concepticon / norare-data
Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts
☆16 · Updated this week
rloganiv / kglm-data
Code used to create the Linked WikiText-2 dataset
☆16 · May 22, 2023 · Updated 3 years ago
NathanHerr / LLM-First-Search
☆17 · Jun 9, 2025 · Updated last year
thunlp / SubCharTokenization
☆47 · Feb 5, 2023 · Updated 3 years ago
rycolab / kl-rb
This repository contains code for the paper "Better Estimation of the KL Divergence Between Language Models"
☆19 · May 30, 2025 · Updated last year
cisnlp / MEXA
[ACL 2025] 🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
☆11 · Apr 6, 2025 · Updated last year
hadasah / btm
☆79 · Apr 29, 2024 · Updated 2 years ago
ertsiger / induction-subgoal-automata-rl
Code for the papers "Induction of Subgoal Automata for Reinforcement Learning" (AAAI-20) and "Induction and Exploitation of Subgoal Autom…
☆14 · Aug 15, 2023 · Updated 2 years ago
trusthlt / eacl24-german-legal-questions
Data and code: "Answering legal questions from laymen in German civil law system", Büttner & Habernal, EACL'24
☆17 · Mar 2, 2024 · Updated 2 years ago
optimatch / optimatch
☆10 · May 14, 2024 · Updated 2 years ago
benbogin / obverter
Compositional Obverter Communication Learning From Raw Visual Input - Pytorch Implementation
☆19 · Jul 8, 2021 · Updated 5 years ago
malteos / awesome-prompt-optimization
A curated collection of resources for prompt engineering, optimization, and automatic prompt generation across text, image, video, and mu…
☆18 · Sep 24, 2025 · Updated 10 months ago
marcotchen / SimpleGPT
[ICML 2026] Improving GPT via a simple normalization strategy
☆15 · May 22, 2026 · Updated 2 months ago
yangyu12 / lagm
Learning to Annotate Part Segmentation with Gradient Matching (ICLR 2022)
☆12 · Apr 26, 2022 · Updated 4 years ago
ltgoslo / simple_elmo_training
Minimal code to train ELMo models in recent versions of TensorFlow
☆14 · Jun 16, 2026 · Updated last month
cosmaadrian / psymo
Repository for the WACV 2024 paper "PsyMo: A Dataset for Estimating Self-Reported Psychological Traits from Gait"
☆14 · Feb 22, 2024 · Updated 2 years ago
alexa / ramen
A software for transferring pre-trained English models to foreign languages
☆20 · Mar 20, 2023 · Updated 3 years ago
boschresearch / adversarial_meta_embeddings
Resources related to EMNLP 2021 paper "FAME: Feature-Based Adversarial Meta-Embeddings for Robust Input Representations"
☆13 · Dec 14, 2021 · Updated 4 years ago
malteos / awesome-anonymization-for-llms
A collection of resources for PII detection, anonymization, privacy-preserving techniques, and GDPR compliance in Large Language Model (L…
☆19 · Sep 24, 2025 · Updated 10 months ago
dykang / xslue
ACL 2021 paper "Style is NOT a single variable: Case Studies for Cross-Style Language Understanding" by Dongyeop Kang and Eduard Hovy
☆15 · Jul 19, 2021 · Updated 5 years ago
bayartsogt-ya / albert-mongolian
ALBERT trained on Mongolian text corpus
☆19 · Jan 10, 2021 · Updated 5 years ago
RobertCsordas / moe
Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers"
☆39 · Jun 11, 2025 · Updated last year
apartresearch / Integer_Addition
✱ Understanding the underlying learning dynamics of simple tasks in Transformer networks
☆19 · Aug 16, 2024 · Updated last year
bozheng-hit / VoCapXLM
Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"
☆20 · Nov 12, 2021 · Updated 4 years ago
malteos / clp-transfer
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
☆30 · Jan 25, 2023 · Updated 3 years ago
C-J-Cundy / gpt4-tokenizer
Hosting the JSON for the GPT4 Tokenizer
☆63 · Apr 6, 2023 · Updated 3 years ago
cisnlp / ofa
[NAACL 2024] A Framework aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining
☆18 · Nov 26, 2023 · Updated 2 years ago
sbera7 / Dialogue-act-classification
Dialogue Act classification
☆18 · Jan 15, 2024 · Updated 2 years ago
ZhenlanJi / DL_CC
CC: Causality-Aware Coverage Criterion for Deep Neural Networks
☆12 · Feb 15, 2023 · Updated 3 years ago
balzer82 / PegidaSprache
Analysis of the Pegida Facebook corpus
☆10 · Jan 31, 2015 · Updated 11 years ago
t-systems-on-site-services-gmbh / german-elmo-model
This is a german ELMo deep contextualized word representation. It is trained on a special German Wikipedia Text Corpus.
☆28 · Dec 15, 2019 · Updated 6 years ago
dvruette / barrel-rec-pytorch
☆53 · Jan 18, 2024 · Updated 2 years ago