JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆235 · Updated 5 months ago
Alternatives and similar repositories for MixEval:
Users interested in MixEval are comparing it to the libraries listed below.
- Reproducible, flexible LLM evaluations ☆191 · Updated 3 weeks ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆220 · Updated 5 months ago
- A simple unified framework for evaluating LLMs ☆209 · Updated last week
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆458 · Updated last year
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆187 · Updated 2 weeks ago
- RewardBench: the first evaluation tool for reward models. ☆555 · Updated last month
- A project to improve skills of large language models ☆283 · Updated this week
- ☆512 · Updated 5 months ago
- Official repository for ORPO ☆448 · Updated 10 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆354 · Updated 7 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024) ☆205 · Updated 11 months ago
- Manage scalable open LLM inference endpoints in Slurm clusters ☆254 · Updated 9 months ago
- EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language Models ☆211 · Updated 5 months ago
- ☆282 · Updated last month
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆139 · Updated 5 months ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆325 · Updated this week
- ☆166 · Updated this week
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆170 · Updated 3 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆133 · Updated 5 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆299 · Updated last year
- ☆114 · Updated 2 months ago
- ☆96 · Updated 9 months ago
- BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach. ☆198 · Updated last week
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆187 · Updated 4 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆196 · Updated last week
- Evaluating LLMs with fewer examples ☆151 · Updated last year
- ☆148 · Updated 4 months ago
- Automatic evals for LLMs ☆370 · Updated this week
- ☆308 · Updated 10 months ago
- ☆267 · Updated 8 months ago