WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆197 · Updated 2 weeks ago
Alternatives and similar repositories for ZeroEval:
Users interested in ZeroEval are comparing it to the libraries listed below.
- Benchmarking LLMs with Challenging Tasks from Real Users · ☆215 · Updated 3 months ago
- OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. · ☆194 · Updated last week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. · ☆161 · Updated last week
- ☆149 · Updated last week
- The official evaluation suite and dynamic data release for MixEval. · ☆231 · Updated 3 months ago
- Reproducible, flexible LLM evaluations · ☆160 · Updated 2 months ago
- ☆108 · Updated 3 weeks ago
- Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper · ☆145 · Updated 2 months ago
- Evaluating LLMs with fewer examples · ☆145 · Updated 10 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Lengths (ICLR 2024) · ☆204 · Updated 9 months ago
- Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context" · ☆451 · Updated 11 months ago
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" · ☆172 · Updated 6 months ago
- Code for the NeurIPS'24 paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" · ☆182 · Updated 2 months ago
- ☆305 · Updated 8 months ago
- EvolKit is a framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language Models. · ☆203 · Updated 3 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. · ☆167 · Updated last month
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" · ☆296 · Updated last year
- Code and example data for the paper "Rule Based Rewards for Language Model Safety" · ☆178 · Updated 7 months ago
- ☆95 · Updated 7 months ago
- The HELMET Benchmark · ☆115 · Updated this week
- LOFT: A 1 Million+ Token Long-Context Benchmark · ☆172 · Updated 3 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] · ☆135 · Updated 3 months ago
- RewardBench: the first evaluation tool for reward models. · ☆505 · Updated this week
- BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach. · ☆185 · Updated 3 months ago
- Homepage for ProLong (Princeton long-context language models) and the paper "How to Train Long-Context Language Models (Effectively)" · ☆153 · Updated 2 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore" · ☆157 · Updated this week
- Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens" (https://arxiv.org/abs/2402.13718) · ☆307 · Updated 4 months ago