r-three / common-pile
Code for collecting, processing, and preparing datasets for the Common Pile
☆234Updated last month
Alternatives and similar repositories for common-pile
Users that are interested in common-pile are comparing it to the libraries listed below
- Python library to use Pleias-RAG models☆63Updated 5 months ago
- ☆72Updated 2 months ago
- ☆196Updated 3 months ago
- Code for SaGe subword tokenizer (EACL 2023)☆26Updated 10 months ago
- Datamodels for Hugging Face tokenizers☆77Updated 3 weeks ago
- ☆57Updated last year
- ☆254Updated 6 months ago
- ☆80Updated this week
- An introduction to LLM Sampling☆79Updated 10 months ago
- code for training & evaluating Contextual Document Embedding models☆197Updated 5 months ago
- ☆67Updated last year
- Truly flash implementation of DeBERTa disentangled attention mechanism.☆66Updated 3 weeks ago
- BPE modification that removes intermediate tokens during tokenizer training.☆25Updated 10 months ago
- State-of-the-art paired encoder and decoder models (17M-1B params)☆50Updated 2 months ago
- Small Python package to measure OCR quality and other related metrics.☆25Updated last year
- ☆57Updated 2 weeks ago
- ☆49Updated 8 months ago
- ☆53Updated 11 months ago
- Minimal PyTorch implementation of BM25 (with sparse tensors)☆104Updated last year
- ☆52Updated 8 months ago
- Dataset collection and preprocessing framework for NLP extreme multitask learning☆188Updated 3 months ago
- ☆83Updated 4 months ago
- A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.☆46Updated 3 months ago
- Learning to route instances for Human vs AI Feedback (ACL Main '25)☆24Updated 2 months ago
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ …☆58Updated last week
- Pre-train Static Word Embeddings☆87Updated last month
- A massively multilingual modern encoder language model☆97Updated this week
- Alice in Wonderland code base for experiments and raw experimental data☆131Updated last month
- An attribution library for LLMs☆43Updated last year
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens☆145Updated 7 months ago