r-three / common-pile
Code for collecting, processing, and preparing datasets for the Common Pile
☆247Updated 3 months ago
Alternatives and similar repositories for common-pile
Users that are interested in common-pile are comparing it to the libraries listed below
- ☆59Updated last month
- ☆212Updated last month
- ☆88Updated last week
- ☆87Updated last week
- code for training & evaluating Contextual Document Embedding models☆201Updated 7 months ago
- Small python package to measure OCR quality and other related metrics.☆25Updated last year
- Python library to use Pleias-RAG models☆67Updated 7 months ago
- ☆258Updated 8 months ago
- Datamodels for hugging face tokenizers☆86Updated 3 weeks ago
- ☆53Updated 10 months ago
- Code for SaGe subword tokenizer (EACL 2023)☆27Updated last year
- ☆58Updated last year
- A massively multilingual modern encoder language model☆116Updated 2 months ago
- ☆67Updated last year
- Alice in Wonderland code base for experiments and raw experiments data☆131Updated 3 months ago
- Datasets collection and preprocessings framework for NLP extreme multitask learning☆189Updated 5 months ago
- Code for "Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs"☆85Updated 9 months ago
- Code and data to support "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4"☆69Updated 2 years ago
- ☆53Updated last year
- An introduction to LLM Sampling☆79Updated last year
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ …☆66Updated 2 weeks ago
- A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.☆58Updated 5 months ago
- BPE modification that removes intermediate tokens during tokenizer training.☆25Updated last year
- An attribution library for LLMs☆46Updated last year
- ☆144Updated 3 months ago
- ☆29Updated 5 months ago
- Truly flash implementation of DeBERTa disentangled attention mechanism.☆67Updated 2 months ago
- Multi-Domain Expert Learning☆67Updated last year
- A robust web archive analytics toolkit☆124Updated 2 months ago
- ☆98Updated 6 months ago