microsoft / eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
☆109Updated this week
Alternatives and similar repositories for eureka-ml-insights:
Users who are interested in eureka-ml-insights are comparing it to the libraries listed below.
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".☆196Updated this week
- Evaluating LLMs with fewer examples☆147Updated 11 months ago
- The first dense retrieval model that can be prompted like an LM☆68Updated 6 months ago
- Code accompanying "How I learned to start worrying about prompt formatting".☆102Updated 5 months ago
- Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales"☆102Updated 6 months ago
- Codebase accompanying the Summary of a Haystack paper.☆75Updated 6 months ago
- ☆34Updated 8 months ago
- ☆143Updated 8 months ago
- Functional Benchmarks and the Reasoning Gap☆84Updated 5 months ago
- Code for MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents https://www.arxiv.org/pdf/2503.01935☆87Updated last week
- CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments☆49Updated last month
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System☆111Updated 9 months ago
- Maya: An Instruction Finetuned Multilingual Multimodal Model using Aya☆107Updated last month
- ☆81Updated last year
- PyTorch library for Active Fine-Tuning☆62Updated last month
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners"☆104Updated 6 months ago
- RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker☆107Updated 2 weeks ago
- The Granite Guardian models are designed to detect risks in prompts and responses.☆72Updated last week
- ☆37Updated last month
- ☆67Updated 7 months ago
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering☆167Updated last month
- Interaction-first method for generating demonstrations for web-agents on any website☆31Updated 3 weeks ago
- Code for ExploreTom☆79Updated 3 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators☆42Updated last year
- Benchmarking LLMs with Challenging Tasks from Real Users☆219Updated 4 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers☆135Updated last month
- Complex Function Calling Benchmark.☆85Updated 2 months ago
- ☆142Updated 11 months ago
- ☆284Updated 9 months ago
- ☆68Updated last year