felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆141 Updated 9 months ago
Alternatives and similar repositories for tinyBenchmarks:
Users who are interested in tinyBenchmarks are comparing it to the libraries listed below.
- Benchmarking LLMs with Challenging Tasks from Real Users ☆206 Updated 2 months ago
- ☆135 Updated 3 months ago
- Code accompanying "How I learned to start worrying about prompt formatting". ☆97 Updated 3 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated 11 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆233 Updated 2 months ago
- ☆115 Updated 3 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆129 Updated 2 months ago
- Reproducible, flexible LLM evaluations ☆118 Updated last month
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆66 Updated last year
- Evaluating LLMs with CommonGen-Lite ☆87 Updated 9 months ago
- PASTA: Post-hoc Attention Steering for LLMs ☆109 Updated last month
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆126 Updated 2 months ago
- ☆93 Updated 6 months ago
- A simple unified framework for evaluating LLMs ☆164 Updated 3 weeks ago
- Functional Benchmarks and the Reasoning Gap ☆82 Updated 3 months ago
- The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models" ☆108 Updated last year
- Code for PHATGOOSE introduced in "Learning to Route Among Specialized Experts for Zero-Shot Generalization" ☆80 Updated 10 months ago
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" ☆97 Updated 6 months ago
- Official implementation for 'Extending LLMs’ Context Window with 100 Samples' ☆76 Updated last year
- ☆50 Updated 2 months ago
- ☆89 Updated this week
- Improving Alignment and Robustness with Circuit Breakers ☆174 Updated 3 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆164 Updated 2 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆120 Updated last month
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆97 Updated 4 months ago
- This is the official repository for Inheritune. ☆108 Updated 3 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆75 Updated 3 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆134 Updated last month
- This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji… ☆215 Updated last year
- Code for the paper "Fishing for Magikarp" ☆139 Updated this week