🤗 Benchmark Large Language Models Reliably On Your Data
★429 Dec 30, 2025 Updated last month
Alternatives and similar repositories for yourbench
Users who are interested in yourbench compare it to the libraries listed below.
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ★2,311 Feb 20, 2026 Updated last week
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ★2,063 Dec 3, 2025 Updated 2 months ago
- ★15 Apr 26, 2025 Updated 10 months ago
- A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers. ★63 Jul 6, 2025 Updated 7 months ago
- Tool for generating high quality Synthetic datasets ★1,508 Oct 28, 2025 Updated 3 months ago
- X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains ★50 Feb 4, 2026 Updated 3 weeks ago
- Train your own SOTA deductive reasoning model ★107 Mar 6, 2025 Updated 11 months ago
- Build datasets using natural language ★567 Sep 19, 2025 Updated 5 months ago
- A lightweight adjustment tool for smoothing token probabilities in the Qwen models to encourage balanced multilingual generation. ★104 Jul 9, 2025 Updated 7 months ago
- Everything about the SmolLM and SmolVLM family of models ★3,627 Jan 13, 2026 Updated last month
- Fast Multimodal Semantic Deduplication & Filtering ★890 Jan 20, 2026 Updated last month
- A framework for pitting LLMs against each other in an evolving library of games ★34 Apr 20, 2025 Updated 10 months ago
- ★162 Dec 2, 2024 Updated last year
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ★2,903 Updated this week
- ★17 Dec 2, 2025 Updated 2 months ago
- Let's build better datasets, together! ★271 Dec 20, 2024 Updated last year
- ★23 Jan 5, 2026 Updated last month
- Python library to use Pleias-RAG models ★68 May 1, 2025 Updated 9 months ago
- Async RL Training at Scale ★1,096 Updated this week
- Exploring Applications of GRPO ★250 Aug 25, 2025 Updated 6 months ago
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ★3,100 Feb 16, 2026 Updated last week
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ★84 Oct 29, 2024 Updated last year
- moodist ★24 Feb 20, 2026 Updated last week
- Synthetic data curation for post-training and structured data extraction ★1,637 Jan 24, 2026 Updated last month
- ★14 Jun 25, 2025 Updated 8 months ago
- The official implementation of the paper "Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models". ★87 Mar 25, 2025 Updated 11 months ago
- 🤗 smolagents: a barebones library for agents that think in code. ★25,615 Feb 21, 2026 Updated last week
- ★65 Feb 9, 2026 Updated 2 weeks ago
- The LLM Evaluation Framework ★13,787 Updated this week
- A course on aligning smol models. ★6,587 Feb 6, 2026 Updated 3 weeks ago
- Datamodels for hugging face tokenizers ★99 Feb 20, 2026 Updated last week
- Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc on task… ★184 Sep 23, 2024 Updated last year
- Robust recipes to align language models with human and AI preferences ★5,506 Sep 8, 2025 Updated 5 months ago
- Automatically evaluate your LLMs in Google Colab ★686 May 7, 2024 Updated last year
- Democratizing Reinforcement Learning for LLMs ★5,135 Feb 20, 2026 Updated last week
- An advanced research assistant that utilizes AI agents to generate novel research directions and analyze scientific literature. This plat… ★16 Feb 26, 2025 Updated last year
- A framework for few-shot evaluation of language models. ★11,478 Feb 15, 2026 Updated last week
- Synthetic Text Dataset Generation for LLM projects ★56 Feb 19, 2026 Updated last week
- Evaluation framework for document processing models and services. ★63 Feb 12, 2026 Updated 2 weeks ago