kensho-technologies / pathpiece

PathPiece tokenizer

☆12

Alternatives and similar repositories for pathpiece

Users who are interested in pathpiece are comparing it to the libraries listed below.

ahmetustun / hyperx
☆20 Updated 2 years ago
cisnlp / ofa
A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining
☆18 Updated last year
JHU-CLSP / ettin-encoder-vs-decoder
State-of-the-art paired encoder and decoder models (17M-1B params)
☆25 Updated this week
gautierdag / tokenizer-bench
Code for the paper "Getting the most out of your tokenizer for pre-training and domain adaptation"
☆19 Updated last year
CPJKU / wechsel
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.
☆82 Updated 10 months ago
PythonNut / superbpe
Official code release for "SuperBPE: Space Travel for Language Models"
☆61 Updated last week
google-research-datasets / swim-ir
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la…
☆48 Updated last year
ielab / Starbucks
Starbucks: Improved Training for 2D Matryoshka Embeddings
☆21 Updated 3 weeks ago
allenai / bff
☆38 Updated last year
orevaahia / magnet-tokenization
☆12 Updated 7 months ago
Leukas / CUTE
☆14 Updated 2 months ago
juletx / self-translate
Do Multilingual Language Models Think Better in English?
☆42 Updated last year
frankxu2004 / knnlm-why
Repo for ICML23 "Why do Nearest Neighbor Language Models Work?"
☆58 Updated 2 years ago
ltgoslo / gpt-bert
Official implementation of "GPT or BERT: why not both?"
☆55 Updated this week
monologg / EncT5
PyTorch implementation of EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks
☆63 Updated 3 years ago
microsoft / encoder-decoder-slm
Efficient encoder-decoder architecture for small language models (≤1B parameters) with cross-architecture knowledge distillation and visi…
☆29 Updated 5 months ago
tatHi / maxmatch_dropout
☆10 Updated 2 years ago
worldbank / GISTEmbed
GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embeddings
☆43 Updated last year
yuzhaouoe / pretraining-data-packing
[ACL'24 Oral] Analysing The Impact of Sequence Composition on Language Model Pre-Training
☆22 Updated 11 months ago
uiuctml / MergeBench
MergeBench: A Benchmark for Merging Domain-Specialized LLMs
☆17 Updated 2 months ago
tigerchen52 / LOVE
ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost
☆41 Updated last year
bigscience-workshop / multilingual-modeling
BLOOM+1: Adapting BLOOM model to support a new unseen language
☆73 Updated last year
kaistAI / InstructIR
InstructIR, a novel benchmark specifically designed to evaluate the instruction-following ability of information retrieval models. Our foc…
☆32 Updated last year
NathanGodey / headless-lm
Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…
☆27 Updated last year
bminixhofer / zett
Code for Zero-Shot Tokenizer Transfer
☆133 Updated 6 months ago
martiansideofthemoon / longeval-summarization
Official repository for our EACL 2023 paper "LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization" (https…
☆44 Updated 11 months ago
kaistAI / LangBridge
[ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision
☆92 Updated 8 months ago
ZurichNLP / mbr
Minimum Bayes Risk Decoding for Hugging Face Transformers
☆58 Updated last year
salesforce / simplification
☆22 Updated 5 months ago
MeLeLBGU / SaGe
Code for SaGe subword tokenizer (EACL 2023)
☆25 Updated 7 months ago