felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆151 Updated last year
Alternatives and similar repositories for tinyBenchmarks:
Users who are interested in tinyBenchmarks compare it to the repositories listed below.
- Benchmarking LLMs with Challenging Tasks from Real Users ☆221 Updated 6 months ago
- ☆120 Updated 7 months ago
- Functional Benchmarks and the Reasoning Gap ☆85 Updated 7 months ago
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025] ☆105 Updated 2 months ago
- ☆170 Updated 2 weeks ago
- Code accompanying "How I learned to start worrying about prompt formatting". ☆104 Updated 7 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆107 Updated 7 months ago
- Replicating O1 inference-time scaling laws ☆84 Updated 5 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated last year
- A simple unified framework for evaluating LLMs ☆209 Updated 3 weeks ago
- The HELMET Benchmark ☆142 Updated 2 weeks ago
- ☆97 Updated 10 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆172 Updated 3 months ago
- Code for the EMNLP 2024 paper "Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps" ☆120 Updated 8 months ago
- ☆114 Updated 2 months ago
- ☆60 Updated last year
- ☆72 Updated 5 months ago
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering ☆171 Updated 2 months ago
- Function Vectors in Large Language Models (ICLR 2024) ☆163 Updated 2 weeks ago
- The official evaluation suite and dynamic data release for MixEval. ☆238 Updated 5 months ago
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆189 Updated 5 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆196 Updated last week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆172 Updated last month
- This is the official repository for Inheritune. ☆111 Updated 2 months ago
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆75 Updated last year
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆139 Updated 6 months ago
- Official GitHub repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆133 Updated 7 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆104 Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆198 Updated 7 months ago
- Critique-out-Loud Reward Models ☆63 Updated 6 months ago