huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆2,026 Updated last week
Related projects
Alternatives and complementary repositories for datatrove
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆1,609 Updated this week
- Minimalistic large language model 3D-parallelism training ☆1,220 Updated this week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆787 Updated last week
- ReFT: Representation Finetuning for Language Models ☆1,145 Updated this week
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ☆2,178 Updated this week
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆792 Updated 2 months ago
- TextGrad: Automatic "Differentiation" via Text -- using large language models to backpropagate textual gradients. ☆1,797 Updated last week
- Tools for merging pretrained large language models. ☆4,788 Updated this week
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆1,548 Updated 2 months ago
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,745 Updated 9 months ago
- Enforce the output format (JSON Schema, Regex etc) of a language model ☆1,514 Updated 3 weeks ago
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆871 Updated last week
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,035 Updated this week
- Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-… ☆3,031 Updated 2 months ago
- Implementation of the training framework proposed in Self-Rewarding Language Model, from MetaAI ☆1,333 Updated 6 months ago
- Efficient Retrieval Augmentation and Generation Framework ☆1,326 Updated this week
- High-quality datasets, tools, and concepts for LLM fine-tuning. ☆1,965 Updated 2 weeks ago
- Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders' ☆1,255 Updated last month
- Training LLMs with QLoRA + FSDP ☆1,419 Updated this week
- DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤 ☆838 Updated 3 months ago
- Scalable data preprocessing and curation toolkit for LLMs ☆576 Updated this week
- ☆2,732 Updated last month
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ☆764 Updated this week
- ☆1,263 Updated this week
- Automatically evaluate your LLMs in Google Colab ☆556 Updated 6 months ago
- Robust recipes to align language models with human and AI preferences ☆4,663 Updated last month
- PyTorch native finetuning library ☆4,267 Updated this week
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models ☆1,385 Updated 8 months ago