mlfoundations/evalchemy
Automatic Evals for LLMs
☆201 · Updated this week

Alternatives and similar repositories for evalchemy:
Users interested in evalchemy are comparing it to the repositories listed below.
- A simple unified framework for evaluating LLMs ☆195 · Updated last week
- The official evaluation suite and dynamic data release for MixEval. ☆231 · Updated 3 months ago
- awesome synthetic (text) datasets ☆259 · Updated 3 months ago
- ☆496 · Updated 2 months ago
- Manage scalable open LLM inference endpoints in Slurm clusters ☆252 · Updated 7 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆167 · Updated last month
- EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language Models. ☆200 · Updated 3 months ago
- Attribute (or cite) statements generated by LLMs back to in-context information. ☆197 · Updated 4 months ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆182 · Updated last week
- Reproducible, flexible LLM evaluations ☆158 · Updated 2 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆296 · Updated last year
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆160 · Updated this week
- An Open Source Toolkit For LLM Distillation ☆486 · Updated last month
- Benchmarking LLMs with Challenging Tasks from Real Users ☆215 · Updated 3 months ago
- ☆146 · Updated last week
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆451 · Updated 10 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆347 · Updated 5 months ago
- Evaluating LLMs with fewer examples ☆145 · Updated 10 months ago
- A comprehensive repository of reasoning tasks for LLMs (and beyond) ☆407 · Updated 4 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆270 · Updated this week
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆172 · Updated 3 months ago
- ☆158 · Updated last month
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym ☆325 · Updated last month
- RewardBench: the first evaluation tool for reward models. ☆503 · Updated this week
- Code for training & evaluating Contextual Document Embedding models ☆173 · Updated last month
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆128 · Updated 3 months ago
- Banishing LLM Hallucinations Requires Rethinking Generalization ☆270 · Updated 7 months ago
- AWM: Agent Workflow Memory ☆239 · Updated 2 weeks ago
- Code for the paper 🌳 Tree Search for Language Model Agents ☆175 · Updated 6 months ago