JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆235 · Updated 5 months ago
Alternatives and similar repositories for MixEval:
Users interested in MixEval are comparing it to the libraries listed below.
- Reproducible, flexible LLM evaluations ☆191 · Updated 3 weeks ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆220 · Updated 5 months ago
- A simple unified framework for evaluating LLMs ☆209 · Updated last week
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆458 · Updated last year
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆187 · Updated 2 weeks ago
- RewardBench: the first evaluation tool for reward models. ☆555 · Updated last month
- A project to improve skills of large language models ☆283 · Updated this week
- ☆512 · Updated 5 months ago
- Official repository for ORPO ☆448 · Updated 10 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆354 · Updated 7 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024) ☆205 · Updated 11 months ago
- Manage scalable open LLM inference endpoints in Slurm clusters ☆254 · Updated 9 months ago
- EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language Models ☆211 · Updated 5 months ago
- ☆282 · Updated last month
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆139 · Updated 5 months ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆325 · Updated this week
- ☆166 · Updated this week
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆170 · Updated 3 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆133 · Updated 5 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆299 · Updated last year
- ☆114 · Updated 2 months ago
- ☆96 · Updated 9 months ago
- BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach. ☆198 · Updated last week
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆187 · Updated 4 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆196 · Updated last week
- Evaluating LLMs with fewer examples ☆151 · Updated last year
- ☆148 · Updated 4 months ago
- Automatic evals for LLMs ☆370 · Updated this week
- ☆308 · Updated 10 months ago
- ☆267 · Updated 8 months ago