cisnlp / GlotCC

🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024

☆20

Alternatives and similar repositories for GlotCC

Users who are interested in GlotCC are comparing it to the repositories listed below.

bminixhofer / tokenkit
A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.
☆49 Updated 4 months ago
cisnlp / MEXA
🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
☆11 Updated 7 months ago
salesforce / simplification
☆22 Updated 9 months ago
Leukas / CUTE
☆15 Updated last month
LAGoM-NLP / transtokenizer
☆53 Updated 9 months ago
bigscience-workshop / multilingual-modeling
BLOOM+1: Adapting BLOOM model to support a new unseen language
☆74 Updated last year
davanstrien / haiku-dpo
Using open source LLMs to build synthetic datasets for direct preference optimization
☆69 Updated last year
google-research-datasets / QAmeleon
QAmeleon introduces synthetic multilingual QA data using PaLM, a 540B large language model. This dataset was generated by prompt tuning P…
☆35 Updated 2 years ago
dadelani / sib-200
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
☆22 Updated 9 months ago
nbroad1881 / strideformer
Using short models to classify long texts
☆21 Updated 2 years ago
orevaahia / magnet-tokenization
☆13 Updated 11 months ago
malteos / llm-datasets
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
☆62 Updated last year
facebookresearch / mexma
MEXMA: Token-level objectives improve sentence representations
☆42 Updated 10 months ago
ContextualAI / CLAIR_and_APO
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
☆60 Updated last year
argilla-io / distilabel-spin-dibt
Repository containing the SPIN experiments on the DIBT 10k ranked prompts
☆24 Updated last year
NathanGodey / headless-lm
Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…
☆27 Updated last year
pchizhov / picky_bpe
A BPE modification that removes intermediate tokens during tokenizer training.
☆25 Updated 11 months ago
castorini / hf-spacerini
Plug-and-play Search Interfaces with Pyserini and Hugging Face
☆32 Updated 2 years ago
huggingface / olm-training
Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data; it should also work with any Hugging Face text dataset.
☆96 Updated 2 years ago
ielab / Starbucks
Starbucks: Improved Training for 2D Matryoshka Embeddings
☆22 Updated 4 months ago
MeLeLBGU / SaGe
Code for SaGe subword tokenizer (EACL 2023)
☆27 Updated 11 months ago
lucy3 / whos_filtered
☆14 Updated last year
r-three / fib
☆26 Updated 2 years ago
Knowledgator / FlashDeBERTa
Truly flash implementation of the DeBERTa disentangled attention mechanism.
☆66 Updated last month
bminixhofer / zett
Code for Zero-Shot Tokenizer Transfer
☆140 Updated 10 months ago
McGill-NLP / CHASE
Synthetic Data Generation for Evaluation
☆13 Updated 8 months ago
ottowg / gsap-ner
☆10 Updated last year
ZurichNLP / mbr
Minimum Bayes Risk Decoding for Hugging Face Transformers
☆60 Updated last year
google-research-datasets / swim-ir
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la…
☆49 Updated 2 years ago
JHU-CLSP / ettin-encoder-vs-decoder
State-of-the-art paired encoder and decoder models (17M-1B params)
☆53 Updated 3 months ago