openai / simple-evals
☆4,287 · Updated 5 months ago
Alternatives and similar repositories for simple-evals
Users interested in simple-evals are comparing it to the libraries listed below.
- AllenAI's post-training codebase ☆3,523 · Updated this week
- A unified evaluation framework for large language models ☆2,771 · Updated 3 months ago
- Tools for merging pretrained large language models. ☆6,680 · Updated last week
- Doing simple retrieval from LLMs at various context lengths to measure accuracy ☆2,137 · Updated last year
- Robust recipes to align language models with human and AI preferences ☆5,473 · Updated 4 months ago
- PyTorch native post-training library ☆5,642 · Updated this week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,251 · Updated this week
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,933 · Updated 5 months ago
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆3,039 · Updated 3 weeks ago
- A library for advanced large language model reasoning ☆2,319 · Updated 7 months ago
- A framework for few-shot evaluation of language models. ☆11,177 · Updated this week
- Freeing data processing from scripting madness by providing a set of platform-agnostic, customizable pipeline processing blocks. ☆2,812 · Updated last week
- Democratizing Reinforcement Learning for LLMs ☆4,965 · Updated this week
- Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models … ☆2,623 · Updated this week
- ☆4,110 · Updated last year
- Our library for RL environments + evals ☆3,730 · Updated this week
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,537 · Updated 2 years ago
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ☆4,115 · Updated last week
- [EMNLP'23, ACL'24] To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, which ach… ☆5,755 · Updated 2 months ago
- DataComp for Language Models ☆1,406 · Updated 4 months ago
- Minimalistic large language model 3D-parallelism training ☆2,411 · Updated last month
- Arena-Hard-Auto: An automatic LLM benchmark. ☆978 · Updated 6 months ago
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆1,005 · Updated this week
- ☆1,343 · Updated last year
- This includes the original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through Self-Reflection by Akari Asai,… ☆2,289 · Updated last year
- ☆2,548 · Updated last year
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ☆3,068 · Updated last month
- Stanford NLP Python library for Representation Finetuning (ReFT) ☆1,551 · Updated this week
- Sky-T1: Train your own O1-preview model within $450 ☆3,367 · Updated 6 months ago
- Curated list of datasets and tools for post-training. ☆4,149 · Updated 2 months ago