asahi417/lm-vocab-trimmer

Vocabulary Trimming (VT) is a model compression technique that reduces a multilingual LM's vocabulary to a target language by deleting irrelevant tokens. This repository contains a Python library, vocabtrimmer, that removes tokens irrelevant to the target language from a multilingual LM's vocabulary.
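
The idea of deleting irrelevant tokens can be illustrated with a minimal, dependency-free sketch. This is not the vocabtrimmer API: the function `trim_vocab`, its `min_count` and `keep_ids` parameters, and the toy vocabulary are all hypothetical, chosen only to show the general shape of the technique (keep the token ids that occur in target-language text, renumber them, and keep only the matching embedding rows).

```python
from collections import Counter


def trim_vocab(vocab, embeddings, corpus_token_ids, keep_ids=(), min_count=1):
    """Toy vocabulary trimming (hypothetical helper, not the vocabtrimmer API).

    vocab: dict mapping token string -> old id
    embeddings: list of embedding rows, indexed by old id
    corpus_token_ids: old ids obtained by tokenizing target-language text
    keep_ids: ids that are always kept (e.g. special tokens)
    min_count: minimum corpus frequency for a token to be kept
    """
    counts = Counter(corpus_token_ids)
    frequent = {i for i, c in counts.items() if c >= min_count}
    kept = sorted(set(keep_ids) | frequent)

    # Renumber surviving ids so they are contiguous from 0.
    old_to_new = {old: new for new, old in enumerate(kept)}
    new_vocab = {tok: old_to_new[i] for tok, i in vocab.items() if i in old_to_new}
    # Keep only the embedding rows of surviving tokens, in the new id order.
    new_embeddings = [embeddings[old] for old in kept]
    return new_vocab, new_embeddings, old_to_new


# Toy multilingual vocabulary; the "corpus" is French-only token ids.
vocab = {"<pad>": 0, "hello": 1, "bonjour": 2, "monde": 3, "world": 4}
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
new_vocab, new_embeddings, old_to_new = trim_vocab(
    vocab, embeddings, corpus_token_ids=[2, 3, 2], keep_ids=(0,)
)
```

In this toy setup the English-only tokens are dropped and the embedding table shrinks from five rows to three, which is where the compression comes from.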

☆67

Alternatives and similar repositories for lm-vocab-trimmer

Users that are interested in lm-vocab-trimmer are comparing it to the libraries listed below.

CreaLabs / Enhanced-BGE-M3-with-CLP-and-MoE
This repository provides the code for applying Contrastive Learning Penalty Loss (CLPL) and Mixture of Experts (MoE) to the BGE-M3 text e…
☆11 Dec 27, 2024 Updated last year
ielab / Starbucks
Starbucks: Improved Training for 2D Matryoshka Embeddings
☆25 Jun 30, 2025 Updated last year
dadelani / sib-200
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
☆26 May 20, 2026 Updated 2 months ago
kensho-technologies / pathpiece
PathPiece tokenizer
☆14 Nov 10, 2024 Updated last year
nixiesearch / negminer
A hard negative mining tool for embedding model training
☆15 Sep 27, 2024 Updated last year
mrpeerat / SCT
SCT: An Efficient Self-Supervised Cross-View Training For Sentence Embedding (TACL)
☆16 Jul 27, 2024 Updated last year
worldbank / GISTEmbed
GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embeddings
☆45 Mar 6, 2024 Updated 2 years ago
sanderland / script_tok
Code for the papers "BPE stays on SCRIPT" and "Which Pieces Does Unigram Tokenization Really Need?", and for MinGram
☆18 Jun 26, 2026 Updated 3 weeks ago
jfkback / hypencoder-paper
Official Repository for "Hypencoder: Hypernetworks for Information Retrieval"
☆40 Sep 20, 2025 Updated 10 months ago
masakhane-io / afriqa
Crosslingual Question Answering for African Languages
☆31 Sep 27, 2024 Updated last year
ottowg / gsap-ner
☆10 Oct 2, 2024 Updated last year
owos / flexitokens
FlexiTokens
☆23 Dec 27, 2025 Updated 6 months ago
feyninc / tokie
🍡 30x faster tokenization for every HuggingFace model
☆47 May 28, 2026 Updated last month
knowledgeable-embedding / knowledgeable-embedding
Knowledgeable Embedding: Injecting dynamically updatable entity knowledge into embeddings to enhance RAG
☆15 Aug 31, 2025 Updated 10 months ago
hotchpotch / yasem
YASEM - Yet Another Splade|Sparse Embedder - A simple and efficient library for SPLADE embeddings
☆13 May 22, 2025 Updated last year
huggingface / peft-pytorch-conference
Code for the examples presented in the talk "Training a Llama in your backyard: fine-tuning very large models on consumer hardware" given…
☆15 Oct 16, 2023 Updated 2 years ago
cisnlp / multypo
A Multilingual Keyboard Layout-Based Typo Generator
☆17 Nov 23, 2025 Updated 7 months ago
konstantinjdobler / focus
[EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
☆37 Jun 7, 2025 Updated last year
datologyai / luxical
☆81 Dec 12, 2025 Updated 7 months ago
JHU-CLSP / mmBERT
A massively multilingual modern encoder language model
☆145 Jan 20, 2026 Updated 6 months ago
dadelani / africanlp-resources
List of all the resources I developed in collaboration with LSV and Masakhane during my doctoral studies and beyond
☆13 Aug 15, 2022 Updated 3 years ago
JHU-CLSP / ettin-encoder-vs-decoder
State-of-the-art paired encoder and decoder models (17M-1B params)
☆74 Aug 6, 2025 Updated 11 months ago
StephAO / olfmlm
☆18 Nov 25, 2022 Updated 3 years ago
alvarobartt / opentrain
🚂 Fine-tune OpenAI models for text classification, question answering, and more
☆17 May 1, 2023 Updated 3 years ago
cisnlp / ofa
[NAACL 2024] A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining
☆18 Nov 26, 2023 Updated 2 years ago
latynt / ans
Arabic News Stance Corpus
☆11 Feb 5, 2021 Updated 5 years ago
bminixhofer / tokenkit
A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.
☆69 Jul 6, 2025 Updated last year
MeLeLBGU / SaGe
Code for SaGe subword tokenizer (EACL 2023)
☆28 Nov 30, 2024 Updated last year
nateraw / spaces-docker-templates
🚀🤗 A collection of templates for Hugging Face Spaces
☆35 Oct 9, 2023 Updated 2 years ago
L-Zhe / BTmPG
Code for the paper "Pushing Paraphrase Away from Original Sentence: A Multi-Round Paraphrase Generation Approach" by Zhe Lin, Xiaojun Wan. This…
☆14 Aug 10, 2021 Updated 4 years ago
ehsanasgari / 1000Langs
Creating super-parallel corpora of 1500+ unique languages for NLP research
☆33 Dec 8, 2022 Updated 3 years ago
cisnlp / GlotCC
[NeurIPS 2024] 🕸 GlotCC Dataset and Pipeline
☆21 Apr 6, 2025 Updated last year
Geotrend-research / smaller-transformers
Load What You Need: Smaller Multilingual Transformers for Pytorch and TensorFlow 2.0.
☆108 May 20, 2022 Updated 4 years ago
rasyosef / splade-index
Fast search index for SPLADE sparse retrieval models implemented in Python using NumPy and Numba
☆38 Oct 16, 2025 Updated 9 months ago
ariG23498 / smart-commit
Smart commit messages
☆18 Oct 25, 2024 Updated last year
raphaelsty / LeNLP
NLP with Rust for Python 🦀🐍
☆72 Jun 9, 2026 Updated last month
castorini / hf-spacerini
Plug-and-play Search Interfaces with Pyserini and Hugging Face
☆31 Aug 5, 2023 Updated 2 years ago
Liyan06 / AggreFact
Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors (ACL 2023)
☆28 Mar 26, 2024 Updated 2 years ago
malteos / clp-transfer
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
☆30 Jan 25, 2023 Updated 3 years ago