microsoft / eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
☆124Updated this week
Alternatives and similar repositories for eureka-ml-insights:
Users who are interested in eureka-ml-insights are comparing it to the libraries listed below:
- Functional Benchmarks and the Reasoning Gap☆85Updated 6 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users☆221Updated 5 months ago
- Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation☆29Updated 2 months ago
- Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales"☆104Updated 6 months ago
- Codebase accompanying the Summary of a Haystack paper.☆77Updated 7 months ago
- Code for MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents https://www.arxiv.org/pdf/2503.01935☆96Updated last month
- The Granite Guardian models are designed to detect risks in prompts and responses.☆78Updated last month
- Large language models for document ranking.☆48Updated last week
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System☆117Updated 10 months ago
- A method for steering LLMs to better follow instructions☆30Updated last week
- Systematic evaluation framework that automatically rates overthinking behavior in large language models.☆86Updated 2 weeks ago
- ☆51Updated last week
- Complex Function Calling Benchmark.☆98Updated 3 months ago
- Code for the EMNLP 2024 paper "Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps"☆120Updated 8 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment☆55Updated 7 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners"☆106Updated 7 months ago
- PyTorch library for Active Fine-Tuning☆64Updated 2 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.☆170Updated last month
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".☆196Updated 2 weeks ago
- Code for training & evaluating Contextual Document Embedding models☆181Updated last week
- ☆196Updated 2 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)☆133Updated 5 months ago
- ☆55Updated 2 weeks ago
- DSBench: How Far are Data Science Agents from Becoming Data Science Experts?☆50Updated 2 months ago
- Codes and datasets for the paper Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Ref…☆48Updated last month
- ☆120Updated 6 months ago
- Code for EMNLP 2024 paper "Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning"☆53Updated 6 months ago
- ☆19Updated last week
- ☆114Updated 2 months ago
- The first dense retrieval model that can be prompted like an LM☆71Updated 7 months ago