microsoft / eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
☆142 · Updated this week
Alternatives and similar repositories for eureka-ml-insights
Users interested in eureka-ml-insights are comparing it to the repositories listed below.
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file (see the hypothetical config sketch after this list).☆173 · Updated 2 months ago
- Code for MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents (https://www.arxiv.org/pdf/2503.01935)☆100 · Updated last month
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".☆199 · Updated last week
- ☆56 · Updated last week
- ☆38 · Updated 10 months ago
- The code for the paper "RouterBench: A Benchmark for Multi-LLM Routing System"☆119 · Updated 11 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users☆222 · Updated 6 months ago
- Code accompanying "How I learned to start worrying about prompt formatting".☆105 · Updated 7 months ago
- Public code repo for the paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales"☆105 · Updated 7 months ago
- ☆129 · Updated last month
- A method for steering LLMs to better follow instructions☆37 · Updated 2 weeks ago
- Code for the EMNLP 2024 paper "Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps"☆122 · Updated 9 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners"☆110 · Updated 8 months ago
- ☆143 · Updated 9 months ago
- PyTorch library for Active Fine-Tuning☆72 · Updated 2 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers☆146 · Updated 2 months ago
- Systematic evaluation framework that automatically rates overthinking behavior in large language models.☆89 · Updated last month
- [arXiv] EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees☆18 · Updated 2 months ago
- Codebase accompanying the "Summary of a Haystack" paper.☆78 · Updated 7 months ago
- Evaluating LLMs with fewer examples☆153 · Updated last year
- ☆78 · Updated this week
- Reproducible, flexible LLM evaluations☆200 · Updated last week
- Code for the paper 🌳 "Tree Search for Language Model Agents"☆199 · Updated 9 months ago
- Functional Benchmarks and the Reasoning Gap☆86 · Updated 7 months ago
- Verifiers for LLM Reinforcement Learning☆50 · Updated last month
- Repository for the paper "Stream of Search: Learning to Search in Language"☆146 · Updated 3 months ago
- ☆120 · Updated 7 months ago
- ☆27 · Updated this week
- Source code for the collaborative reasoner research project at Meta FAIR.☆74 · Updated 3 weeks ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"☆105 · Updated last year
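
For context on the config-driven style the Archon entry above advertises, here is a minimal, runnable Python sketch of an inference pipeline assembled from a JSON config. The schema (`layers`, `generate`, `fuse`), the model names, and the stub components are illustrative assumptions for this sketch only, not Archon's actual JSON format or API.

```python
import json

# Hypothetical config: the "layers"/"generate"/"fuse" schema and model names
# are invented for this sketch and are NOT Archon's real format.
CONFIG = json.loads("""
{
  "layers": [
    {"type": "generate", "models": ["model-a", "model-b"], "samples": 2},
    {"type": "fuse"}
  ]
}
""")

def generate(model: str, prompt: str) -> str:
    # Stub generator: a real pipeline would call the model's API here.
    return f"[{model}] draft answer to: {prompt}"

def fuse(candidates: list[str]) -> str:
    # Stub fuser: picks the longest candidate as a stand-in for a learned fuser.
    return max(candidates, key=len)

def run(config: dict, prompt: str) -> str:
    # Interpret the config layer by layer, passing candidates downstream.
    candidates = [prompt]
    for layer in config["layers"]:
        if layer["type"] == "generate":
            candidates = [
                generate(model, prompt)
                for model in layer["models"]
                for _ in range(layer.get("samples", 1))
            ]
        elif layer["type"] == "fuse":
            candidates = [fuse(candidates)]
    return candidates[0]

if __name__ == "__main__":
    print(run(CONFIG, "What is the capital of France?"))
```

The point of the design is that the pipeline topology lives entirely in data: swapping models, adding sampling, or stacking another fusion layer means editing the JSON, not the code.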