microsoft / eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
☆105 Updated this week
Alternatives and similar repositories for eureka-ml-insights:
Users who are interested in eureka-ml-insights are comparing it to the libraries listed below:
- Automatic Evals for Instruction-Tuned Models ☆100 Updated this week
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆101 Updated 7 months ago
- Functional Benchmarks and the Reasoning Gap ☆82 Updated 3 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆75 Updated 3 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆100 Updated last month
- Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales" ☆97 Updated 3 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆153 Updated last month
- Banishing LLM Hallucinations Requires Rethinking Generalization ☆268 Updated 6 months ago
- Code for the EMNLP 2024 paper "Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps" ☆117 Updated 5 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆126 Updated 2 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆154 Updated 2 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated 11 months ago
- Evaluating LLMs with fewer examples ☆141 Updated 9 months ago
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆182 Updated this week
- Code accompanying "How I learned to start worrying about prompt formatting". ☆97 Updated 3 months ago
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering ☆154 Updated 3 months ago
- Maya: An Instruction Finetuned Multilingual Multimodal Model using Aya ☆99 Updated last week
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆164 Updated 2 months ago
- ☆137 Updated 5 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆134 Updated last month
- This is the reproduction repository for my 🤗 Hugging Face blog post on synthetic data ☆63 Updated 10 months ago
- Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning ☆43 Updated last year
- ☆89 Updated this week
- Accelerating your LLM training to full speed! Made with ❤️ by ServiceNow Research ☆121 Updated this week
- Benchmarking LLMs with Challenging Tasks from Real Users ☆206 Updated 2 months ago
- ☆30 Updated 6 months ago
- The first dense retrieval model that can be prompted like an LM ☆65 Updated 4 months ago
- ☆115 Updated this week
- ☆115 Updated 3 months ago