openai / simple-evals
☆4,192 · Updated 4 months ago
Alternatives and similar repositories for simple-evals
Users interested in simple-evals are comparing it to the repositories listed below.
- AllenAI's post-training codebase ☆3,373 · Updated this week
- Doing simple retrieval from LLMs at various context lengths to measure accuracy ☆2,091 · Updated last year
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆2,956 · Updated this week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,141 · Updated last week
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ☆3,873 · Updated 2 weeks ago
- An Open Large Reasoning Model for Real-World Solutions ☆1,528 · Updated 6 months ago
- Democratizing Reinforcement Learning for LLMs ☆4,792 · Updated this week
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,918 · Updated 3 months ago
- Curated list of datasets and tools for post-training. ☆4,026 · Updated 3 weeks ago
- A library for advanced large language model reasoning ☆2,313 · Updated 5 months ago
- Tools for merging pretrained large language models. ☆6,494 · Updated last week
- Robust recipes to align language models with human and AI preferences ☆5,431 · Updated 2 months ago
- Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models … ☆2,554 · Updated last week
- PyTorch native post-training library ☆5,604 · Updated last week
- TextGrad: Automatic "Differentiation" via Text -- using large language models to backpropagate textual gradients. Published in Nature. ☆3,120 · Updated 4 months ago
- DataComp for Language Models ☆1,394 · Updated 2 months ago
- Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLMs, based on veRL ☆3,580 · Updated 2 weeks ago
- A framework for few-shot evaluation of language models. ☆10,776 · Updated last week
- A reading list on LLM-based Synthetic Data Generation 🔥 ☆1,465 · Updated 5 months ago
- Agentless🐱: an agentless approach to automatically solve software development problems ☆1,978 · Updated 11 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆1,199 · Updated last week
- Scalable RL solution for advanced reasoning of language models ☆1,779 · Updated 8 months ago
- Arena-Hard-Auto: An automatic LLM benchmark. ☆963 · Updated 5 months ago
- Fully open data curation for reasoning models ☆2,152 · Updated 3 months ago
- Modeling, training, eval, and inference code for OLMo ☆6,197 · Updated last week
- Recipes to scale inference-time compute of open models ☆1,118 · Updated 6 months ago
- Synthetic data curation for post-training and structured data extraction ☆1,564 · Updated 4 months ago
- Code for the paper "Evaluating Large Language Models Trained on Code" ☆3,034 · Updated 10 months ago