huggingface / evaluation-guidebook
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
★1,385 Updated 4 months ago
Alternatives and similar repositories for evaluation-guidebook
Users who are interested in evaluation-guidebook are comparing it to the libraries listed below.
- A reading list on LLM-based Synthetic Data Generation 🔥 ★1,280 Updated last week
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ★2,712 Updated this week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ★1,563 Updated last week
- ★656 Updated last month
- Curated list of datasets and tools for post-training. ★3,096 Updated 4 months ago
- System 2 Reasoning Link Collection ★834 Updated 2 months ago
- Best practices for distilling large language models. ★542 Updated last year
- Automatically evaluate your LLMs in Google Colab ★629 Updated last year
- Synthetic data curation for post-training and structured data extraction ★1,352 Updated last week
- ★1,735 Updated this week
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ★2,390 Updated this week
- Textbook on reinforcement learning from human feedback ★938 Updated this week
- Recipes to scale inference-time compute of open models ★1,073 Updated last week
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ★1,436 Updated this week
- Verifiers for LLM Reinforcement Learning ★1,019 Updated last week
- Evaluate your LLM's response with Prometheus and GPT4 💯 ★948 Updated last month
- Deep learning for dummies. All the practical details and useful utilities that go into working with real models. ★793 Updated last month
- Minimalistic large language model 3D-parallelism training ★1,888 Updated last week
- Tool for generating high quality Synthetic datasets ★878 Updated last week
- Benchmark Large Language Models Reliably On Your Data ★315 Updated this week
- ★1,020 Updated 5 months ago
- Stanford NLP Python library for Representation Finetuning (ReFT) ★1,476 Updated 3 months ago
- Use late-interaction multi-modal models such as ColPali in just a few lines of code. ★787 Updated 4 months ago
- Build datasets using natural language ★479 Updated 2 weeks ago
- AIDE: AI-Driven Exploration in the Space of Code. State-of-the-art machine learning engineering agents that automate AI R&D. ★912 Updated last month
- A collection of notebooks/recipes showcasing use cases of open-source models with Together AI. ★920 Updated this week
- Optimizing inference proxy for LLMs ★2,427 Updated this week
- Automated Evaluation of RAG Systems ★596 Updated 2 months ago
- Implementing the 4 agentic patterns from scratch ★1,338 Updated 2 months ago
- DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. ★1,017 Updated 3 months ago