RenzeLou / AAAR-1.0
The source code for running LLMs on the AAAR-1.0 benchmark.
☆11 · Updated this week
Related projects
Alternatives and complementary repositories for AAAR-1.0
- MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following ☆12 · Updated last week
- Codebase for Instruction Following without Instruction Tuning ☆30 · Updated last month
- [ICLR'24 spotlight] Tool-Augmented Reward Modeling ☆36 · Updated 8 months ago
- [COLM'24] "How Easily do Irrelevant Inputs Skew the Responses of Large Language Models?" ☆19 · Updated 3 weeks ago
- Improving Language Understanding from Screenshots. Paper: https://arxiv.org/abs/2402.14073 ☆26 · Updated 4 months ago
- The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism ☆25 · Updated 3 months ago
- The official repository for the paper "From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning" ☆61 · Updated last year
- Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs ☆22 · Updated last month
- GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems ☆46 · Updated 4 months ago
- Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large L… ☆37 · Updated 4 months ago
- This is the implementation of LeCo ☆27 · Updated 3 months ago
- Official repository for the paper "Weak-to-Strong Extrapolation Expedites Alignment" ☆67 · Updated 5 months ago
- InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales ☆51 · Updated 3 weeks ago
- This is the official repository of the paper "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI" ☆85 · Updated last month
- This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models" ☆43 · Updated last week
- [ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning ☆20 · Updated 8 months ago
- Lightweight tool to identify data contamination in LLM evaluation ☆40 · Updated 8 months ago
- Source code of "Reasons to Reject? Aligning Language Models with Judgments" ☆56 · Updated 8 months ago
- Repository for "Propagating Knowledge Updates to LMs Through Distillation" (NeurIPS 2023) ☆24 · Updated 2 months ago
- Code implementation of synthetic continued pretraining ☆54 · Updated last month
- [NeurIPS 2023] Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective ☆29 · Updated last year
- The official repository of the Omni-MATH benchmark ☆47 · Updated last week
- Code for "Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective" ☆29 · Updated 6 months ago
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆48 · Updated 7 months ago
- Evaluate the Quality of Critique ☆35 · Updated 5 months ago
- Code and data for the paper "Context-faithful Prompting for Large Language Models" ☆39 · Updated last year