openai / simple-evals
☆2,475 Updated last week
Alternatives and similar repositories for simple-evals:
Users who are interested in simple-evals are comparing it to the libraries listed below:
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,336 Updated this week
- SWE-bench [Multimodal]: Can Language Models Resolve Real-world Github Issues? ☆2,680 Updated this week
- PyTorch native post-training library ☆5,026 Updated this week
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆1,767 Updated 7 months ago
- ☆4,070 Updated 9 months ago
- AllenAI's post-training codebase ☆2,840 Updated this week
- Recipes to scale inference-time compute of open models ☆1,044 Updated last month
- Robust recipes to align language models with human and AI preferences ☆5,090 Updated 4 months ago
- Tools for merging pretrained large language models. ☆5,478 Updated this week
- Data and tools for generating and inspecting OLMo pre-training data. ☆1,170 Updated 2 weeks ago
- Arena-Hard-Auto: An automatic LLM benchmark. ☆765 Updated last week
- Minimalistic large language model 3D-parallelism training ☆1,715 Updated this week
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,699 Updated 3 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆652 Updated 2 months ago
- Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09… ☆2,137 Updated this week
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,322 Updated this week
- DataComp for Language Models ☆1,267 Updated last week
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ☆2,842 Updated 3 weeks ago
- The official implementation of Self-Play Fine-Tuning (SPIN) ☆1,138 Updated 10 months ago
- [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling ☆1,640 Updated 8 months ago
- Agentless🐱: an agentless approach to automatically solve software development problems ☆1,584 Updated 3 months ago
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆2,578 Updated this week
- A bibliography and survey of the papers surrounding o1 ☆1,183 Updated 4 months ago
- ☆1,011 Updated 3 months ago
- A library for advanced large language model reasoning ☆2,065 Updated last month
- A PyTorch native library for large model training ☆3,488 Updated this week
- Modeling, training, eval, and inference code for OLMo ☆5,429 Updated last week
- Democratizing Reinforcement Learning for LLMs ☆2,113 Updated last month
- Code for the paper "Evaluating Large Language Models Trained on Code" ☆2,652 Updated 2 months ago
- [EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which ach… ☆4,969 Updated 2 weeks ago