r-three/common-pile
Repo to hold code and track issues for the collection of permissively licensed data
Related projects:
- [EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data; it should work with any Hugging Face text dataset.
- Code for SaGe subword tokenizer (EACL 2023)
- Tools for managing datasets for governance and training.
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models (see the embedding-initialization sketch after this list).
- Minimum Bayes Risk Decoding for Hugging Face Transformers (see the MBR sketch after this list)
- Simple-to-use scoring function for arbitrarily tokenized texts.
- LTG-Bert
- Plug-and-play Search Interfaces with Pyserini and Hugging Face
- Library for fast text representation and classification.
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages.
- Repository for "Attribute First, then Generate: Locally-attributable Grounded Text Generation", ACL 2024
- Minimal PyTorch implementation of BM25 (with sparse tensors); see the BM25 sketch after this list.
- Efficiently find the best-suited language model (LM) for your NLP task
- Code repo for "Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers" (ACL 2023)
- Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
- BPE modification that removes intermediate tokens during tokenizer training.
- Shared code for training sentence embeddings with Flax / JAX
- BLOOM+1: Adapting BLOOM model to support a new unseen language
- Code for Zero-Shot Tokenizer Transfer
- M2D2: A Massively Multi-domain Language Modeling Dataset (EMNLP 2022) by Machel Reid, Victor Zhong, Suchin Gururangan, Luke Zettlemoyer
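
Several of the repos above (FOCUS, WECHSEL, Zero-Shot Tokenizer Transfer) deal with initializing embeddings for a new tokenizer. Below is a naive sketch of the general idea only, not any of those repos' actual algorithms: tokens shared between the source and target vocabularies copy their source embeddings, and the rest are sampled from the source embeddings' mean and standard deviation. The model and tokenizer names are placeholder assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Source model to transfer embeddings from, and a target-language tokenizer;
# both names are placeholder assumptions for this sketch.
src_model = AutoModel.from_pretrained("roberta-base")
src_tok = AutoTokenizer.from_pretrained("roberta-base")
tgt_tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

src_emb = src_model.get_input_embeddings().weight.detach()
mean, std = src_emb.mean(dim=0), src_emb.std(dim=0)

# Start every target embedding as a sample from the source distribution ...
new_emb = mean + std * torch.randn(len(tgt_tok), src_emb.size(1))

# ... then copy embeddings directly for tokens the two vocabularies share.
src_vocab = src_tok.get_vocab()
shared = 0
for token, tgt_id in tgt_tok.get_vocab().items():
    src_id = src_vocab.get(token)
    if src_id is not None:
        new_emb[tgt_id] = src_emb[src_id]
        shared += 1
print(f"copied {shared} overlapping token embeddings")
```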
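The MBR repo above implements Minimum Bayes Risk decoding; here is a minimal sketch of the technique itself, not that repo's API: sample several candidates from a model, estimate each candidate's expected utility against the other samples, and return the maximizer. The model name and the chrF utility function are assumptions.

```python
import torch
from sacrebleu.metrics import CHRF
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"  # placeholder; any seq2seq LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
chrf = CHRF()

def mbr_decode(prompt: str, num_samples: int = 8) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Draw candidate outputs by sampling rather than beam search.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,
            num_return_sequences=num_samples,
            max_new_tokens=64,
        )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Expected utility of candidate i, using the other samples as
    # pseudo-references (a Monte Carlo estimate over the model's outputs).
    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(chrf.sentence_score(candidates[i], [o]).score for o in others) / len(others)

    return candidates[max(range(num_samples), key=expected_utility)]

print(mbr_decode("translate English to German: The cat sits on the mat."))
```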
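Similarly, a toy sketch of BM25 with PyTorch sparse tensors, in the spirit of the BM25 repo above but not its actual code: precompute the BM25 weight of every nonzero (document, term) pair, store the weights in one sparse matrix, and score queries as a sparse matrix-vector product. The whitespace tokenizer, toy corpus, and k1/b values are assumptions.

```python
import torch

corpus = ["the cat sat on the mat", "dogs chase cats", "the mat is red"]
k1, b = 1.5, 0.75  # standard BM25 parameters (assumed values)

docs = [doc.split() for doc in corpus]  # whitespace tokenization for the toy example
vocab = {t: i for i, t in enumerate(sorted({t for d in docs for t in d}))}
doc_lens = torch.tensor([len(d) for d in docs], dtype=torch.float)
avgdl = doc_lens.mean()

# Term frequencies for nonzero (doc, term) pairs only.
entries = {}
for di, doc in enumerate(docs):
    for t in doc:
        entries[(di, vocab[t])] = entries.get((di, vocab[t]), 0) + 1

# Document frequency and IDF per term.
df = torch.zeros(len(vocab))
for (_, ti) in entries:
    df[ti] += 1
idf = torch.log((len(docs) - df + 0.5) / (df + 0.5) + 1.0)

# Precompute the full BM25 weight of each nonzero pair and store the result
# as one sparse matrix, so scoring reduces to a sparse matrix-vector product.
idx = torch.tensor(list(entries.keys())).T  # shape (2, nnz)
tf = torch.tensor(list(entries.values()), dtype=torch.float)
norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lens[idx[0]] / avgdl))
bm25 = torch.sparse_coo_tensor(idx, idf[idx[1]] * norm, (len(docs), len(vocab)))

def bm25_scores(query: str) -> torch.Tensor:
    q = torch.zeros(len(vocab), 1)
    for t in query.split():
        if t in vocab:
            q[vocab[t]] += 1.0
    return torch.sparse.mm(bm25, q).squeeze(1)

print(bm25_scores("cat mat"))  # one relevance score per document
```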