MeLeLBGU/SaGe

Code for SaGe subword tokenizer (EACL 2023)

☆28

Alternatives and similar repositories for SaGe

Users that are interested in SaGe are comparing it to the libraries listed below.

pchizhov / picky_bpe
A BPE modification that removes intermediate tokens during tokenizer training.
☆27 · Nov 25, 2024 · Updated last year
sanderland / script_tok
Code for the paper "BPE stays on SCRIPT", "Which Pieces Does Unigram Tokenization Really Need?" and MinGram
☆18 · Jun 26, 2026 · Updated 3 weeks ago
cisnlp / multypo
A Multilingual Keyboard Layout-Based Typo Generator
☆17 · Nov 23, 2025 · Updated 7 months ago
cimeister / tokenizer-intrinsic-evals
TokEval: intrinsic quality metrics for tokenizers across natural language, code, and math
☆46 · Jul 4, 2026 · Updated 2 weeks ago
owos / flexitokens
FlexiTokens
☆23 · Dec 27, 2025 · Updated 6 months ago
nateraw / spaces-docker-templates
🚀🤗 A collection of templates for Hugging Face Spaces
☆35 · Oct 9, 2023 · Updated 2 years ago
huggingface / ember
ANE accelerated embedding models!
☆20 · Dec 11, 2024 · Updated last year
cindyxinyiwang / SDE
Source code for the paper "Multilingual Neural Machine Translation with Soft Decoupled Encoding"
☆29 · Jun 2, 2021 · Updated 5 years ago
PythonNut / superbpe
Official code release for "SuperBPE: Space Travel for Language Models"
☆97 · May 28, 2026 · Updated last month
castorini / NanoKnow
☆15 · Jun 26, 2026 · Updated 3 weeks ago
machine-intelligence-laboratory / OptimalNumberOfTopics
A set of methods for finding an appropriate number of topics in a text collection
☆15 · Apr 13, 2026 · Updated 3 months ago
mliarakos / lagom-scalajs-example
Example Lagom.js application
☆10 · Jul 10, 2021 · Updated 5 years ago
gautierdag / tokenizer-bench
Code for the paper "Getting the most out of your tokenizer for pre-training and domain adaptation"
☆22 · Feb 14, 2024 · Updated 2 years ago
timoschick / form-context-model
This repository contains the code for the Form-Context Model and its Attentive Mimicking variant.
☆30 · May 11, 2020 · Updated 6 years ago
project-mandolin / mandolin
Large-scale Machine Learning using Apache Spark
☆14 · May 6, 2019 · Updated 7 years ago
wikit-ai / chunknorris
ChunkNorris is a black belt in document chunking to feed your LLMs and RAG apps 🥋🔪
☆28 · Jun 26, 2026 · Updated 3 weeks ago
vered1986 / panic
PANiC - PAraphrasing Noun-Compounds
☆15 · Apr 6, 2018 · Updated 8 years ago
LilithHafner / Jokes
Jokes
☆13 · Jul 11, 2024 · Updated 2 years ago
yuvalpinter / LiveQAServerDemo
Demo server for TREC LiveQA competition
☆11 · Dec 7, 2016 · Updated 9 years ago
Knowledgator / FlashDeBERTa
Truly flash implementation of DeBERTa disentangled attention mechanism.
☆90 · Feb 10, 2026 · Updated 5 months ago
mbollmann / sonnet-finder
Finds snippets in iambic pentameter in English-language text and tries to combine them into a rhyming sonnet.
☆13 · Jan 5, 2023 · Updated 3 years ago
marian-nmt / sotastream
A library for data streaming and augmentation
☆22 · May 5, 2025 · Updated last year
OpenNLPLab / HGRN2
HGRN2: Gated Linear RNNs with State Expansion
☆58 · Aug 20, 2024 · Updated last year
AI21Labs / pmi-masking
This repository includes the masking vocabulary used in the ICLR 2021 spotlight PMI-Masking paper
☆14 · Aug 9, 2021 · Updated 4 years ago
davanstrien / hub-semantic-search-mcp
☆20 · Jun 9, 2025 · Updated last year
yuweihao / LV-BERT
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)
☆18 · May 10, 2023 · Updated 3 years ago
AnswerDotAI / ModernBERT-Instruct-mini-cookbook
☆53 · Feb 10, 2025 · Updated last year
jouniluoma / bert-ner-cmv
☆13 · Dec 17, 2021 · Updated 4 years ago
orevaahia / magnet-tokenization
☆11 · Mar 17, 2026 · Updated 4 months ago
DunZhang / Jasper-Token-Compression-Training
Training code for Jasper-Token-Compression-600M
☆20 · Nov 19, 2025 · Updated 8 months ago
ivanleomk / modal-grpo
☆19 · Mar 16, 2025 · Updated last year
wroberts / nltk_tgrep
tgrep2 Searching for NLTK Trees
☆15 · Oct 28, 2016 · Updated 9 years ago
iamtrask / python-paillier
A library for Partially Homomorphic Encryption in Python
☆12 · May 30, 2017 · Updated 9 years ago
bigscience-workshop / lam
Libraries, Archives and Museums (LAM)
☆89 · Oct 4, 2022 · Updated 3 years ago
huggingface / hub-sync
A GitHub Action that syncs your GitHub repository to Hugging Face Hub 🤗
☆21 · Updated this week
insait-institute / ritranslation
[ACL'26 Findings] Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
☆20 · Jun 27, 2026 · Updated 3 weeks ago
maharshi95 / submititnow
A toolkit to create, launch and monitor SLURM jobs over existing python scripts.
☆12 · May 13, 2024 · Updated 2 years ago
cisnlp / ofa
[NAACL 2024] A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining
☆18 · Nov 26, 2023 · Updated 2 years ago
facebookresearch / CCQA
CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training
☆33 · Jul 20, 2022 · Updated 4 years ago