allenai / duplodocus
Tooling for exact and MinHash deduplication of large-scale text datasets
☆26 Updated 2 weeks ago
Alternatives and similar repositories for duplodocus
Users who are interested in duplodocus are comparing it to the repositories listed below.
- DPO, but faster ☆46 Updated 11 months ago
- ☆78 Updated 2 weeks ago
- Supercharge huggingface transformers with model parallelism. ☆77 Updated 4 months ago
- Verifiers for LLM Reinforcement Learning ☆80 Updated 7 months ago
- A collection of reproducible inference engine benchmarks ☆38 Updated 7 months ago
- Implementation of the paper: "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google in pyTO… ☆57 Updated last week
- ☆39 Updated last year
- ☆48 Updated last year
- ☆52 Updated 9 months ago
- [COLM 2024] Early Weight Averaging meets High Learning Rates for LLM Pre-training ☆18 Updated last year
- ☆35 Updated 2 months ago
- ☆65 Updated last year
- Improving Text Embedding of Language Models Using Contrastive Fine-tuning ☆65 Updated last year
- ☆66 Updated 8 months ago
- A massively multilingual modern encoder language model ☆113 Updated last month
- In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning ☆35 Updated 2 years ago
- ☆58 Updated 2 weeks ago
- Benchmark for machine learning model online serving (LLM, embedding, Stable-Diffusion, Whisper) ☆28 Updated 2 years ago
- Open sourced backend for Martian's LLM Inference Provider Leaderboard ☆19 Updated last year
- Lightweight toolkit package to train and fine-tune 1.58bit Language models ☆100 Updated 6 months ago
- ☆95 Updated 6 months ago
- Simple and efficient DeepSeek V3 SFT using pipeline parallel and expert parallel, with both FP8 and BF16 trainings ☆98 Updated 4 months ago
- A repository for research on medium sized language models. ☆78 Updated last year
- Model implementation for the contextual embeddings project ☆36 Updated 6 months ago
- Aioli: A unified optimization framework for language model data mixing ☆31 Updated 10 months ago
- Efficient non-uniform quantization with GPTQ for GGUF ☆53 Updated 2 months ago
- Code for KaLM-Embedding models ☆101 Updated 5 months ago
- ☆48 Updated last month
- ☆30 Updated 4 months ago
- ☆52 Updated last year