huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆2,660 · Updated last week
Alternatives and similar repositories for datatrove
Users who are interested in datatrove are comparing it to the libraries listed below.
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,973 · Updated this week
- Minimalistic large language model 3D-parallelism training ☆2,239 · Updated last month
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆2,899 · Updated last week
- Scalable data preprocessing and curation toolkit for LLMs ☆1,165 · Updated this week
- Data and tools for generating and inspecting OLMo pre-training data. ☆1,321 · Updated last week
- Stanford NLP Python library for Representation Finetuning (ReFT) ☆1,514 · Updated 8 months ago
- Bringing BERT into modernity via both architecture changes and scaling ☆1,529 · Updated 3 months ago
- Doing simple retrieval from LLM models at various context lengths to measure accuracy