mlfoundations / evalchemy
Automatic Evals for Instruction-Tuned Models
☆100 Updated this week
Alternatives and similar repositories for evalchemy:
Users who are interested in evalchemy are comparing it to the libraries listed below.
- A simple unified framework for evaluating LLMs ☆164 Updated 3 weeks ago
- Functional Benchmarks and the Reasoning Gap ☆82 Updated 3 months ago
- ☆115 Updated this week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆154 Updated 2 months ago
- Attribute (or cite) statements generated by LLMs back to in-context information. ☆184 Updated 3 months ago
- ☆115 Updated 3 months ago
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆101 Updated 7 months ago
- Manage scalable open LLM inference endpoints in Slurm clusters ☆247 Updated 6 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆157 Updated this week
- Evaluating LLMs with fewer examples ☆141 Updated 9 months ago
- Code for PHATGOOSE introduced in "Learning to Route Among Specialized Experts for Zero-Shot Generalization" ☆80 Updated 10 months ago
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆182 Updated this week
- Code for the paper "Fishing for Magikarp" ☆139 Updated this week
- awesome synthetic (text) datasets ☆253 Updated 2 months ago
- Textbook on reinforcement learning from human feedback ☆111 Updated this week
- ☆135 Updated this week
- Synthetic Data curation for post-training and structured data extraction ☆316 Updated this week
- Codebase accompanying the Summary of a Haystack paper. ☆75 Updated 3 months ago
- PyTorch library for Active Fine-Tuning ☆52 Updated last week
- A toolkit for describing model features and intervening on those features to steer behavior. ☆149 Updated 2 months ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym ☆202 Updated this week
- ☆108 Updated 3 months ago
- ☆93 Updated 6 months ago
- Just a bunch of benchmark logs for different LLMs ☆116 Updated 5 months ago
- Simple replication of [ColBERT-v1](https://arxiv.org/abs/2004.12832). ☆79 Updated 9 months ago
- ☆89 Updated this week
- Discovering Data-driven Hypotheses in the Wild ☆51 Updated last month
- code for training & evaluating Contextual Document Embedding models ☆160 Updated this week
- Red-Teaming Language Models with DSPy ☆153 Updated 9 months ago
- ☆135 Updated 3 months ago