MeLeLBGU/tokenizers_intrinsic_benchmark


Code for the paper "Greed is All You Need: An Evaluation of Tokenizer Inference Methods"

☆13

Alternatives and similar repositories for tokenizers_intrinsic_benchmark

Users who are interested in tokenizers_intrinsic_benchmark are comparing it to the repositories listed below.

catherinearnett / morphscore
This is the repository for MorphScore, a tokenizer evaluation framework for morphological alignment.
☆17 · Jul 10, 2025 · Updated last year
kensho-technologies / pathpiece
PathPiece tokenizer
☆14 · Nov 10, 2024 · Updated last year
cisnlp / MEXA
[ACL 2025] 🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
☆11 · Apr 6, 2025 · Updated last year
edaaydinea / Pneumonia-Detection-on-Chest-Xray-Images-with-Deep-Leaning
This repository includes pneumonia detection on chest X-ray images using deep learning (Keras).
☆23 · Nov 6, 2022 · Updated 3 years ago
salesforce / bite
Code for "Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding" (EMNLP 2020).
☆11 · May 1, 2025 · Updated last year
wbkd / leaflet-mapshot
🌍 A simple script for taking automated screenshots from a Leaflet map
☆15 · Mar 29, 2018 · Updated 8 years ago
SALT-NLP / multi-value
Complete set of English dialect transformation rules and evaluation code
☆16 · Jun 7, 2024 · Updated 2 years ago
baixianghuang / authorship-llm
Can Large Language Models Identify Authorship? (EMNLP 2024 Findings)
☆13 · Feb 4, 2025 · Updated last year
samarth-robo / deepnav_cvpr17
Code and models for the CVPR 2017 paper "DeepNav: Learning to Navigate Large Cities"
☆13 · Feb 16, 2020 · Updated 6 years ago
anilshanbhag / pleasebuyless
☆10 · Nov 8, 2023 · Updated 2 years ago
vered1986 / panic
PANiC - PAraphrasing Noun-Compounds
☆15 · Apr 6, 2018 · Updated 8 years ago
yuvalpinter / LiveQAServerDemo
Demo server for TREC LiveQA competition
☆11 · Dec 7, 2016 · Updated 9 years ago
elnino9ykl / DS-PASS
Detail-Sensitive Panoramic Annular Semantic Segmentation
☆12 · May 19, 2022 · Updated 4 years ago
mbollmann / sonnet-finder
Finds snippets in iambic pentameter in English-language text and tries to combine them into a rhyming sonnet.
☆13 · Jan 5, 2023 · Updated 3 years ago
devrimcavusoglu / acl-bib-overleaf
Split bib files of the ACL Anthology bibliography for Overleaf
☆11 · Aug 25, 2024 · Updated last year
alisawuffles / tokenizer-attack
Official implementation of "Data Mixture Inference: What do BPE tokenizers reveal about their training data?"
☆23 · May 15, 2025 · Updated last year
alonmln / ILNewsDiff
Code for the ILNewsDiff Twitter account
☆10 · May 23, 2023 · Updated 3 years ago
aidos-lab / magnipy
Metric Space Magnitude Computations
☆15 · Jun 30, 2026 · Updated 3 weeks ago
dykang / xslue
ACL 2021 paper "Style is NOT a single variable: Case Studies for Cross-Style Language Understanding" by Dongyeop Kang and Eduard Hovy
☆15 · Jul 19, 2021 · Updated 5 years ago
zouharvi / tokenization-scorer
Simple-to-use scoring function for arbitrarily tokenized texts.
☆51 · Feb 19, 2025 · Updated last year
ryokamoi / original_textvae
TensorFlow implementation of "Generating Sentences from a Continuous Space"
☆11 · Sep 16, 2019 · Updated 6 years ago
sbera7 / Dialogue-act-classification
Dialogue Act classification
☆18 · Jan 15, 2024 · Updated 2 years ago
pentagonalize / Transformer-Cookbook
☆18 · Feb 4, 2025 · Updated last year
dolphin-Dang / Deformable-Conformer
Deep learning model for EEG-MI signal classification.
☆14 · Apr 26, 2024 · Updated 2 years ago
kanishkamisra / wugs-and-daxes
Collection of academic works in natural language processing, computational linguistics, and computational cognitive science that study th…
☆22 · Mar 20, 2024 · Updated 2 years ago
sanderland / script_tok
Code for the papers "BPE stays on SCRIPT" and "Which Pieces Does Unigram Tokenization Really Need?", and for MinGram
☆18 · Updated this week
mayhewsw / multilingual-data-stats
Statistics on multilingual datasets
☆17 · Jul 12, 2022 · Updated 4 years ago
dayeonki / mt_feedback
Code for "Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations" [NAACL Findings 2024]
☆14 · Apr 3, 2026 · Updated 3 months ago
mcfrank / lot-language-learning-2023
Materials for LOT School 2023, "Language Learning: A Data-Driven Approach"
☆14 · Aug 14, 2024 · Updated last year
pchizhov / picky_bpe
A BPE modification that removes intermediate tokens during tokenizer training.
☆27 · Nov 25, 2024 · Updated last year
Amazingren / CrossMLP
(BMVC 2021, Oral) The official PyTorch implementation of our BMVC 2021 paper.
☆18 · Apr 22, 2022 · Updated 4 years ago
rctatman / data-prep-minitoolkit
Three little Python scripts for data preparation: remove commas, add commas, concatenate files
☆16 · Jul 26, 2017 · Updated 9 years ago
tdozat / Parser-v2
An updated version of the Parser-v1 repo, used for Stanford's submission in the CoNLL17 shared task.
☆45 · Aug 15, 2018 · Updated 7 years ago
MeLeLBGU / SaGe
Code for SaGe subword tokenizer (EACL 2023)
☆28 · Nov 30, 2024 · Updated last year
Event-AHU / OpenESL
Event-stream-based sign language translation
☆20 · May 9, 2026 · Updated 2 months ago
retroflexivity / typst-eggs
Typst linguistic examples with minimalist syntax
☆18 · Updated this week
simonkrauter / Open-EV-Charts
Tracking battery electric car adoption by sales and market share
☆23 · Updated this week
pdufter / staticlama
☆13 · Apr 16, 2021 · Updated 5 years ago
agrija9 / Avalinguo-Dataset-Speaker-Fluency-Level-Classification-Paper-
Code for the paper "Speaker Fluency Level Classification using Machine Learning Techniques."
☆19 · Jun 17, 2020 · Updated 6 years ago