huggingface / evaluation-guidebook
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
☆971 · Updated 3 weeks ago
Alternatives and similar repositories for evaluation-guidebook:
Users interested in evaluation-guidebook are comparing it to the libraries listed below.
- A reading list on LLM-based Synthetic Data Generation 🔥 ☆993 · Updated 2 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,022 · Updated this week
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models (see the reranking sketch after this list) ☆1,260 · Updated last week
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers (see the pipeline sketch after this list) ☆2,064 · Updated this week
- Evaluate your LLM's response with Prometheus and GPT4 🎯 ☆854 · Updated 3 weeks ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,167 · Updated this week
- Use late-interaction multi-modal models such as ColPali in just a few lines of code (see the byaldi sketch after this list) ☆705 · Updated this week
- Recipes for shrinking, optimizing, customizing cutting-edge vision models. ☆1,115 · Updated last month
- Implementing the 4 agentic patterns from scratch ☆995 · Updated this week
- Curated list of datasets and tools for post-training. ☆2,560 · Updated 2 weeks ago
- The code used to train and run inference with the ColPali architecture. ☆1,415 · Updated last week
- System 2 Reasoning Link Collection ☆751 · Updated this week
- Automated Evaluation of RAG Systems ☆532 · Updated 2 months ago
- Framework for enhancing LLMs for RAG tasks using fine-tuning. ☆522 · Updated last month
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy (see the bm25s sketch after this list) ☆987 · Updated last week
- TextGrad: Automatic "Differentiation" via Text -- using large language models to backpropagate textual gradients (see the TextGrad sketch after this list). ☆2,011 · Updated this week
- Automatically evaluate your LLMs in Google Colab ☆583 · Updated 8 months ago
- LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team ☆365 · Updated 3 months ago
- Recipes to scale inference-time compute of open models ☆971 · Updated last week
- awesome synthetic (text) datasets ☆256 · Updated 3 months ago
- Synthetic Data curation for post-training and structured data extraction ☆539 · Updated this week
- Bringing BERT into modernity via both architecture changes and scaling ☆1,108 · Updated last week
- Dynamiq is an orchestration framework for agentic AI and LLM applications ☆679 · Updated this week
- LOTUS: A semantic query engine for fast and easy LLM-powered data processing ☆993 · Updated this week
- The Fastest State-of-the-Art Static Embeddings in the World (see the model2vec sketch after this list) ☆777 · Updated this week
- A collection of notebooks/recipes showcasing use cases of open-source models with Together AI. ☆607 · Updated this week
- A library for prompt engineering and optimization (SAMMO = Structure-aware Multi-Objective Metaprompt Optimization) ☆630 · Updated last month
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆601 · Updated 2 weeks ago
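The unified reranking entry above matches the rerankers library. A minimal sketch, assuming the package is installed and the cross-encoder checkpoint named below is available on the Hugging Face Hub; the result attribute names follow one reading of the library and may differ across versions:

```python
# Hedged sketch of a unified reranking API (rerankers); the checkpoint
# name and result attributes are assumptions based on the README.
from rerankers import Reranker

ranker = Reranker("cross-encoder/ms-marco-MiniLM-L-6-v2", model_type="cross-encoder")

# Score each candidate document against the query.
results = ranker.rank(
    query="How do I evaluate an LLM?",
    docs=["Evaluation measures model quality.", "Bananas are yellow."],
)

# Inspect the best-scoring document (attribute names may vary by version).
best = results.top_k(1)[0]
print(best.document.text, best.score)
```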
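The distilabel entry describes synthetic-data pipelines. A minimal two-step sketch, assuming the distilabel 1.x module layout (`distilabel.llms`, `distilabel.steps`) and an `OPENAI_API_KEY` in the environment; module paths have moved between releases, so treat this as illustrative:

```python
# Hedged sketch of a distilabel 1.x pipeline: load seed instructions,
# then generate one response per instruction with an OpenAI model.
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="synthetic-demo") as pipeline:
    load = LoadDataFromDicts(data=[{"instruction": "Explain LLM evaluation briefly."}])
    generate = TextGeneration(llm=OpenAILLM(model="gpt-4o-mini"))
    load >> generate  # connect the steps

if __name__ == "__main__":
    distiset = pipeline.run()  # returns the generated dataset(s)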
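The late-interaction entry matches the byaldi wrapper around ColPali. A sketch under the assumption that `docs/` is a local folder of PDFs (hypothetical path) and the `vidore/colpali` checkpoint is accessible; method names follow the byaldi README:

```python
# Hedged sketch of byaldi's index/search flow over PDF pages.
from byaldi import RAGMultiModalModel

model = RAGMultiModalModel.from_pretrained("vidore/colpali")

# Index every document in the (hypothetical) docs/ folder.
model.index(input_path="docs/", index_name="demo_index", overwrite=True)

# Retrieve the 3 pages most relevant to a natural-language query.
results = model.search("Which page shows the evaluation table?", k=3)
for r in results:
    print(r.doc_id, r.page_num, r.score)
```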
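The BM25 entry (bm25s) follows a tokenize/index/retrieve flow. A minimal sketch based on its README; the toy corpus and query are invented:

```python
# Hedged sketch of bm25s: tokenize a corpus, build the sparse index,
# then retrieve the top-k documents for a query.
import bm25s

corpus = [
    "LLM evaluation measures model quality on benchmarks.",
    "BM25 is a classic lexical ranking function.",
    "Synthetic data can bootstrap post-training.",
]

retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus, stopwords="en"))

# Passing corpus= returns the documents themselves alongside scores.
results, scores = retriever.retrieve(
    bm25s.tokenize("lexical ranking with BM25"), corpus=corpus, k=2
)
print(results[0], scores[0])
```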
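TextGrad treats pieces of text as differentiable variables and uses an LLM to "backpropagate" textual feedback. A sketch adapted from its README, assuming an `OPENAI_API_KEY` in the environment and access to a `gpt-4o` engine:

```python
# Hedged sketch of TextGrad's textual gradient descent loop.
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)

model = tg.BlackboxLLM("gpt-4o")
question = tg.Variable(
    "What is 7 * 8 + 3?",
    role_description="question to the LLM",
    requires_grad=False,
)

answer = model(question)
answer.set_role_description("concise and accurate answer to the question")

# Treat the answer as a parameter, compute a textual loss, and update it.
optimizer = tg.TGD(parameters=[answer])
loss_fn = tg.TextLoss("Evaluate the answer for correctness and brevity.")
loss = loss_fn(answer)
loss.backward()
optimizer.step()
print(answer.value)
```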
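The static-embeddings entry matches model2vec, which encodes text with precomputed token vectors rather than a transformer forward pass. A minimal sketch, assuming the `minishlab/potion-base-8M` checkpoint on the Hugging Face Hub:

```python
# Hedged sketch of model2vec: load a distilled static model and encode.
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8M")
embeddings = model.encode(["LLM evaluation", "lexical search with BM25"])
print(embeddings.shape)  # (2, embedding_dim) for the two input strings
```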