pchizhov / picky_bpe

A BPE modification that removes intermediate tokens during tokenizer training.

☆25

Alternatives and similar repositories for picky_bpe

Users who are interested in picky_bpe are comparing it to the libraries listed below.

Knowledgator / FlashDeBERTa
Truly flash implementation of the DeBERTa disentangled attention mechanism.
☆66 Updated last month
MeLeLBGU / SaGe
Code for SaGe subword tokenizer (EACL 2023)
☆27 Updated 11 months ago
bminixhofer / zett
Code for Zero-Shot Tokenizer Transfer
☆140 Updated 9 months ago
bminixhofer / tokenkit
A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.
☆47 Updated 3 months ago
davanstrien / haiku-dpo
Using open source LLMs to build synthetic datasets for direct preference optimization
☆68 Updated last year
LAGoM-NLP / transtokenizer
☆52 Updated 9 months ago
Pleias / Pleias-RAG-Library
Python library to use Pleias-RAG models
☆64 Updated 6 months ago
Aleph-Alpha-Research / trigrams
☆57 Updated last month
Hannibal046 / nanoColBERT
Simple replication of [ColBERT-v1](https://arxiv.org/abs/2004.12832).
☆79 Updated last year
EleutherAI / improved-t5
Experiments for efforts to train a new and improved t5
☆75 Updated last year
AnswerDotAI / ModernBERT-Instruct-mini-cookbook
☆50 Updated 8 months ago
ContextualAI / CLAIR_and_APO
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
☆60 Updated last year
castorini / hf-spacerini
Plug-and-play Search Interfaces with Pyserini and Hugging Face
☆32 Updated 2 years ago
jxmorris12 / bm25_pt
minimal pytorch implementation of bm25 (with sparse tensors)
☆104 Updated this week
allenai / infinigram-api
☆80 Updated 2 weeks ago
mungg / FABLES
☆58 Updated last year
AnswerDotAI / fastkmeans
☆83 Updated 3 months ago
sileod / tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
☆188 Updated 3 months ago
JHU-CLSP / ettin-encoder-vs-decoder
State-of-the-art paired encoder and decoder models (17M-1B params)
☆53 Updated 2 months ago
luyug / magix
Supercharge huggingface transformers with model parallelism.
☆77 Updated 3 months ago
cisnlp / GlotCC
🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024
☆20 Updated 6 months ago
argilla-io / distilabel-spin-dibt
Repository containing the SPIN experiments on the DIBT 10k ranked prompts
☆24 Updated last year
hamishivi / EasyLM
Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Fl…
☆75 Updated last year
Pleias / Quest-Best-Tokens
An introduction to LLM Sampling
☆79 Updated 10 months ago
allenai / bff
☆39 Updated last year
Weyaxi / scrape-open-llm-leaderboard
Scrape and export data from the Open LLM Leaderboard.
☆47 Updated 10 months ago
malteos / llm-datasets
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
☆62 Updated last year
jxmorris12 / cde
code for training & evaluating Contextual Document Embedding models
☆199 Updated 5 months ago
liujch1998 / infini-gram
☆73 Updated 2 months ago
jxmorris12 / embzip
lossily compress representation vectors using product quantization
☆59 Updated 6 months ago