dvlab-research / MR-GSM8K

Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs

☆51

Alternatives and similar repositories for MR-GSM8K

Users who are interested in MR-GSM8K are comparing it to the repositories listed below.

sanyalsunny111 / LLM-Inheritune
This is the official repository for Inheritune.
☆117 Updated 10 months ago
microsoft / simulated-trial-and-error
☆122 Updated last year
GAIR-NLP / OlympicArena
[NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
☆108 Updated 9 months ago
clinicalml / co-llm
Co-LLM: Learning to Decode Collaboratively with Multiple Language Models
☆123 Updated last year
wuhy68 / Parameter-Efficient-MoE
Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks (EMNLP'24)
☆147 Updated last year
open-compass / Ada-LEval
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
☆55 Updated 7 months ago
architsharma97 / dpo-rlaif
☆100 Updated last year
SALT-NLP / demonstrated-feedback
☆129 Updated last year
allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆246 Updated last year
IBM / SALMON
Self-Alignment with Principle-Following Reward Models
☆169 Updated 3 months ago
SalesforceAIResearch / GemFilter
☆85 Updated last month
jeffreysijuntan / lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
☆118 Updated last year
GAIR-NLP / ReAlign
Reformatted Alignment
☆112 Updated last year
xufangzhi / phi-Decoding
[ACL 2025] An inference-time decoding strategy with adaptive foresight sampling
☆105 Updated 7 months ago
dwzhu-pku / PoSE
Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Length (ICLR 2024)
☆205 Updated last year
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆84 Updated last year
lz1oceani / verify_cot
☆137 Updated 2 years ago
GAIR-NLP / Entropy-ABF
Official implementation for 'Extending LLMs’ Context Window with 100 Samples'
☆81 Updated last year
TIGER-AI-Lab / MAmmoTH2
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆149 Updated last year
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆170 Updated last year
facebookresearch / llm-cross-capabilities
Official implementation for "Law of the Weakest Link: Cross capabilities of Large Language Models"
☆43 Updated last year
casmlab / NPHardEval
Repository for NPHardEval, a quantified-dynamic benchmark of LLMs
☆61 Updated last year
18907305772 / FuseAI
FuseAI Project
☆87 Updated 11 months ago
da03 / implicit_chain_of_thought
☆139 Updated last year
ContextualAI / CLAIR_and_APO
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
☆61 Updated last year
hkust-nlp / llm-compression-intelligence
Official GitHub repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆146 Updated last year
dwzhu-pku / LongEmbed
LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)
☆145 Updated last year
google / sycophancy-intervention
Scripts for generating synthetic finetuning data for reducing sycophancy.
☆117 Updated 2 years ago
SalesforceAIResearch / LaTRO
☆125 Updated 10 months ago
bigcode-project / astraios
Astraios: Parameter-Efficient Instruction Tuning Code Language Models
☆63 Updated last year